changes since v1:
- rebase to v5.17
- erofs: In the chunk-based layout, the logical file offset and the
corresponding physical address inside the data blob file have the same
remainder modulo PAGE_SIZE, so the file page cache can be handed
directly to the netfs library to receive the data read from the data
blob file. (patch 15) (Gao Xiang)
- netfs,cachefiles: manage logical/physical offset separately. (patch 2)
(It is used by erofs_begin_cache_operation() in patch 15.)
- cachefiles: introduce a new devnode specifically for on-demand reading.
(patch 6)
- netfs,fscache,cachefiles: add new CONFIG_* for on-demand reading.
(patch 3/5)
- You could start a quick test by
https://github.com/lostjeffle/demand-read-cachefilesd
- add more background information (mainly introduction to nydus) in the
"Background" part of this cover letter
[Important Issues]
==================
The following issues still need further discussion. Thanks for your time
and patience.
1. I noticed that there's refactoring of netfs library[1], and patch 1
is not needed since [2].
2. The current implementation conflicts severely with the refactoring of
the netfs library[1][2]. 'struct netfs_i_context' [2] assumes that every
file in the upper netfs corresponds to exactly one backing file. In our
scenario, however, one file in erofs can correspond to multiple backing
files: the content of one file can be divided into multiple chunks that
are distributed over multiple blob files, i.e. multiple backing files.
I currently have no good idea how to resolve this conflict.
Besides, there are still two questions:
- What's the plan for [1]? When is it planned to be merged?
- It seems that every upper fs using fscache is going to use the netfs
API, while APIs like fscache_read_or_alloc_page() are deprecated. Is
that true?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
[2] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=netfs-lib&id=087d913752522fb9aa6d3effdb9a8c7908c779dd
RFC: https://lore.kernel.org/all/[email protected]/t/
v1: https://lore.kernel.org/lkml/[email protected]/T/
[Background]
============
Nydus is a remote container snapshotter optimised for distributing
container images over the network. It has recently been accepted as a
sub-project of containerd[1]. Nydus accelerates container image
distribution because it pulls data from the remote only when the data
is actually needed, a.k.a. on-demand reading.
erofs (Enhanced Read-Only File System) is a filesystem optimised for
read-only scenarios. (Documentation/filesystems/erofs.rst)
Recently we have been focusing on erofs in the container image
distribution scenario [2], trying to combine it with nydus. In this case,
erofs can be mounted from one bootstrap file (metadata) with (optional)
multiple data blob files (data) stored on another local filesystem. (All
these files are actually image files in erofs disk format.)
To accelerate container startup (fetching the container image from the
remote and then starting the container), we want the blob files to
support on-demand reading. That is, erofs can be mounted and accessed
even when the bootstrap/data blob files have not been fully downloaded.
That means we have to manage the cache state of the bootstrap/data blob
files (on a cache hit, read directly from the local cache; on a cache
miss, fetch the data somehow). It would be painful and unwise for erofs
to implement the cache management itself, so we prefer to let
fscache/cachefiles do it. Besides, the demand-read feature is generic
and can benefit other use cases if it is implemented at the fscache
level.
[1] https://d7y.io/en-us/blog/containerd_accepted_nydus-snapshotter.html
[2] https://sched.co/pcdL
[Overall Design]
================
The upper fs uses a backing file on the local fs as the local cache
(exactly the "cachefiles" way), and relies on fscache to detect whether
the data is ready (cache hit/miss). Since fscache currently detects a
cache hit/miss by looking for holes in the backing file, our demand-read
mechanism also relies on hole detection.
1. initial phase
At the very beginning, the user daemon creates the backing files
(bootstrap/data blob files) under the corresponding directory (under
<root>/cache/<volume>/<fan>/) in advance. These backing files are
completely sparse (with zero disk usage). Since they are all read-only
and their sizes are known before mounting, the user daemon sets the
file size of each one when creating it.
2. cache miss
When a file range (of bootstrap/data blob file) is accessed for the
first time, a cache miss will be triggered and then .issue_op() will be
called to fetch the data somehow.
In the demand-read case, we rely on a user daemon to fetch the data
from local/remote. In this case, .issue_op() just packages the file
range into a message and informs the user daemon. The user daemon needs
to poll and wait on the devnode (/dev/cachefiles_ondemand). Once woken
up, the user daemon reads the devnode to get the file range information,
and then fetches the data corresponding to the file range somehow, e.g.
by downloading it from the remote over the network. Once the data is
ready, the user daemon writes the fetched data into the backing file and
then informs the cachefiles backend by writing to the devnode. The
cachefiles backend, blocked in the earlier .issue_op() call, is then
woken up. By then the data is ready in the backing file, and the netfs
API re-initiates a read request from the backing file.
3. cache hit
Once the data is present in the backing file, the netfs API reads from
the backing file directly.
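
A minimal userspace sketch of the hole-based hit/miss check described
above, assuming a local filesystem that reports holes through SEEK_DATA
(e.g. ext4, xfs, tmpfs). The helper names create_sparse_backing_file()
and range_is_hole() are hypothetical and not part of this series; the
in-kernel check is the one in cachefiles_prepare_read().

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3	/* Linux value, in case the libc headers hide it */
#endif

/* Create a backing file and set its size without allocating any blocks. */
static int create_sparse_backing_file(const char *path, off_t size)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Return 1 if [start, start + len) contains no data at all (cache miss),
 * 0 if some data is present in the range, and -1 on error.
 * Note: lseek() moves the file offset of fd as a side effect.
 */
static int range_is_hole(int fd, off_t start, off_t len)
{
	off_t off = lseek(fd, start, SEEK_DATA);

	if (off < 0)
		return errno == ENXIO ? 1 : -1;	/* ENXIO: no data after start */
	return off >= start + len;
}
```

A daemon following step 1 would call create_sparse_backing_file() once
per blob; range_is_hole() only mirrors the first SEEK_DATA test that the
cache performs and ignores the partial-range cases.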
[Advantage of fscache-based demand-read]
========================================
1. Asynchronous Prefetch
In the current mechanism, fscache is responsible for cache state
management, while the data plane (fetching data from local/remote on a
cache miss) is handled by the user daemon.
If the data is already present in the backing file, the netfs API reads
from the backing file directly and no longer traps to user space. Thus
the user daemon can fetch data (from the remote) asynchronously in the
background, which speeds up later accesses to the backing file.
2. Support massive blob files
Besides, this mechanism supports a large number of backing files, and
thus can benefit densely deployed scenarios.
In our use case, one container image can correspond to one bootstrap
file (required) and multiple data blob files (optional). For example,
one container image for node.js corresponds to ~20 files in total. In a
densely deployed environment, there could be hundreds of containers and
thus thousands of backing files on one machine.
[Test]
======
You could start a quick test by
https://github.com/lostjeffle/demand-read-cachefilesd
Jeffle Xu (20):
netfs: make @file optional in netfs_alloc_read_request()
netfs,cachefiles: manage logical/physical offset separately
netfs,fscache: support on-demand reading
cachefiles: extract generic daemon write function
cachefiles: detect backing file size in on-demand read mode
cachefiles: introduce new devnode for on-demand read mode
erofs: use meta buffers for erofs_read_superblock()
erofs: export erofs_map_blocks()
erofs: add mode checking helper
erofs: register global fscache volume
erofs: add cookie context helper functions
erofs: add anonymous inode managing page cache of blob file
erofs: register cookie context for bootstrap blob
erofs: implement fscache-based metadata read
erofs: implement fscache-based data read for non-inline layout
erofs: implement fscache-based data read for inline layout
erofs: register cookie context for data blobs
erofs: implement fscache-based data read for data blobs
erofs: add 'uuid' mount option
erofs: support on-demand reading
fs/cachefiles/Kconfig | 8 +
fs/cachefiles/daemon.c | 147 ++++++++++++++++-
fs/cachefiles/internal.h | 23 +++
fs/cachefiles/io.c | 82 +++++++++-
fs/cachefiles/main.c | 27 ++++
fs/cachefiles/namei.c | 60 ++++++-
fs/erofs/Kconfig | 2 +-
fs/erofs/Makefile | 3 +-
fs/erofs/data.c | 18 ++-
fs/erofs/fscache.c | 339 +++++++++++++++++++++++++++++++++++++++
fs/erofs/inode.c | 6 +-
fs/erofs/internal.h | 30 ++++
fs/erofs/super.c | 101 +++++++++---
fs/fscache/Kconfig | 8 +
fs/netfs/Kconfig | 8 +
fs/netfs/read_helper.c | 65 ++++++--
include/linux/netfs.h | 10 ++
17 files changed, 886 insertions(+), 51 deletions(-)
create mode 100644 fs/erofs/fscache.c
--
2.27.0
Add ondemand_read() callback to netfs_cache_ops to implement on-demand
reading.
The precondition for implementing on-demand reading semantics is that
all blob files have been placed under the corresponding directory with
the correct file size (as sparse files) at the very beginning. When the
upper fs starts to access a blob file, it will "cache miss" (hit the
hole) and then the .issue_op() callback will be called to prepare the
data.
The working flow is as follows. The .issue_op() callback can be
implemented by the netfs_ondemand_read() helper, which in turn calls
the .ondemand_read() callback of the corresponding fscache backend to
prepare the data.
The implementation of the .ondemand_read() callback can be backend
specific. The following patch introduces an implementation of the
.ondemand_read() callback for cachefiles, which notifies the user daemon
of the requested file range to read. The .ondemand_read() callback
blocks until the user daemon has prepared the corresponding data.
Once the .ondemand_read() callback returns 0, the requested data is
guaranteed to be ready. In this case, switch the IO request to the
NETFS_READ_FROM_CACHE state, terminate the subrequest with 0 bytes
transferred so that it is treated as incomplete, and then retry the read
from the backing file.
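
The retry flow above can be modelled in plain userspace C. This is only
a sketch under stated assumptions: struct subreq_model, issue_read() and
the two daemon stubs are hypothetical stand-ins, not kernel types, and
the loop replaces the real netfs resubmission machinery.

```c
#include <errno.h>
#include <stddef.h>

enum read_source { READ_FROM_CACHE, READ_ON_DEMAND };

/* Hypothetical stand-in for a netfs read subrequest. */
struct subreq_model {
	enum read_source source;
	size_t len;
	size_t transferred;
	int error;
	int data_ready;	/* models "the hole has been filled" */
};

/* Models the backend hook: returns 0 once the data has been prepared. */
typedef int (*ondemand_read_fn)(struct subreq_model *sr);

static int daemon_fills_hole(struct subreq_model *sr)
{
	sr->data_ready = 1;
	return 0;
}

static int daemon_fails(struct subreq_model *sr)
{
	(void)sr;
	return -EIO;
}

/* Drive the subrequest until it completes or fails; return the passes. */
static int issue_read(struct subreq_model *sr, ondemand_read_fn ondemand_read)
{
	int passes = 0;

	while (sr->transferred < sr->len) {
		passes++;
		if (sr->source == READ_ON_DEMAND) {
			int ret = ondemand_read ? ondemand_read(sr) : -EOPNOTSUPP;

			if (ret) {
				sr->error = ret;
				break;
			}
			/* Success: 0 bytes transferred so far, retry from cache. */
			sr->source = READ_FROM_CACHE;
			continue;
		}
		if (!sr->data_ready) {
			sr->error = -ENODATA;
			break;
		}
		sr->transferred = sr->len;	/* read served by the backing file */
	}
	return passes;
}
```

With daemon_fills_hole() the model needs two passes: one that waits for
the data and one that reads it from the cache. With daemon_fails() or a
missing hook, the error is reported and nothing is transferred.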
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/fscache/Kconfig | 8 ++++++++
fs/netfs/Kconfig | 8 ++++++++
fs/netfs/read_helper.c | 37 +++++++++++++++++++++++++++++++++++++
include/linux/netfs.h | 8 ++++++++
4 files changed, 61 insertions(+)
diff --git a/fs/fscache/Kconfig b/fs/fscache/Kconfig
index 76316c4a3fb7..f6b5396759ee 100644
--- a/fs/fscache/Kconfig
+++ b/fs/fscache/Kconfig
@@ -41,3 +41,11 @@ config FSCACHE_DEBUG
config FSCACHE_OLD_API
bool
+
+config FSCACHE_ONDEMAND
+ bool "Support for on-demand reading"
+ depends on FSCACHE
+ select NETFS_ONDEMAND
+ help
+ This permits on-demand reading with fscache.
+ If unsure, say N.
diff --git a/fs/netfs/Kconfig b/fs/netfs/Kconfig
index b4db21022cb4..c4bdd0b032dd 100644
--- a/fs/netfs/Kconfig
+++ b/fs/netfs/Kconfig
@@ -21,3 +21,11 @@ config NETFS_STATS
multi-CPU system these may be on cachelines that keep bouncing
between CPUs. On the other hand, the stats are very useful for
debugging purposes. Saying 'Y' here is recommended.
+
+config NETFS_ONDEMAND
+ bool "Support for on-demand reading"
+ depends on NETFS_SUPPORT
+ default n
+ help
+ This enables on-demand reading with netfs API.
+ If unsure, say N.
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 077c0ca96612..b84c184c365d 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -1013,6 +1013,43 @@ int netfs_readpage(struct file *file,
}
EXPORT_SYMBOL(netfs_readpage);
+#ifdef CONFIG_NETFS_ONDEMAND
+void netfs_ondemand_read(struct netfs_read_subrequest *subreq)
+{
+ struct netfs_read_request *rreq = subreq->rreq;
+ struct netfs_cache_resources *cres = &rreq->cache_resources;
+ loff_t start_pos;
+ size_t len;
+ int ret = -ENOBUFS;
+
+ /* The cache backend may not be accessible at this moment. */
+ if (!cres->ops)
+ goto out;
+
+ if (!cres->ops->ondemand_read) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
+ start_pos = subreq->p_start + subreq->transferred;
+ len = subreq->len - subreq->transferred;
+
+ /*
+ * In success case (ret == 0), user daemon has prepared data for
+ * us, thus transform to NETFS_READ_FROM_CACHE state and
+ * advertise that 0 bytes were read, so that the request will enter
+ * into INCOMPLETE state and retry to read from backing file.
+ */
+ ret = cres->ops->ondemand_read(cres, start_pos, len);
+ if (!ret) {
+ subreq->source = NETFS_READ_FROM_CACHE;
+ __clear_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags);
+ }
+out:
+ netfs_subreq_terminated(subreq, ret, false);
+}
+#endif
+
/*
* Prepare a folio for writing without reading first
* @folio: The folio being prepared
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a17740b3b9d6..d6e041293dcc 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -246,6 +246,11 @@ struct netfs_cache_ops {
int (*prepare_write)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size,
bool no_space_allocated_yet);
+
+#ifdef CONFIG_NETFS_ONDEMAND
+ int (*ondemand_read)(struct netfs_cache_resources *cres,
+ loff_t start_pos, size_t len);
+#endif
};
struct readahead_control;
@@ -261,6 +266,9 @@ extern int netfs_write_begin(struct file *, struct address_space *,
void **,
const struct netfs_read_request_ops *,
void *);
+#ifdef CONFIG_NETFS_ONDEMAND
+extern void netfs_ondemand_read(struct netfs_read_subrequest *);
+#endif
extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
extern void netfs_stats_show(struct seq_file *);
--
2.27.0
Currently fscache is used in a style where every file in the upper fs
has a corresponding backing file in fscache, and the file offset in the
upper file (logical) always equals that in the backing file (physical).
When the upper fs implements a different backing strategy, this
assumption no longer holds, e.g. multiple upper files can be packed into
one single backing file.
Thus this patch separates these two offsets and manages them
independently, so that the upper fs can implement a different backing
strategy. For the original users, where the two offsets are always
equal, no change is needed. For scenarios where the two offsets can
differ, the upper fs can set a separate logical/physical offset in
ops->begin_cache_operation() if needed.
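
To illustrate the logical/physical split, here is a small userspace
sketch, assuming a hypothetical extent table that records where each
logical range of an upper file lives inside a blob. The names
struct extent_model, logical_to_physical() and slice_offsets() are made
up for this example; only the "start + submitted" arithmetic mirrors
the netfs_rreq_submit_slice() hunk below.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/*
 * Hypothetical extent: logical range [l_start, l_start + len) of the
 * upper file is stored at p_start inside blob number 'blob'.
 */
struct extent_model {
	uint64_t l_start;
	uint64_t p_start;
	uint64_t len;
	unsigned int blob;
};

/* Translate a logical offset; return 0 on success, -1 if unmapped. */
static int logical_to_physical(const struct extent_model *ext, size_t n,
			       uint64_t la, uint64_t *pa, unsigned int *blob)
{
	for (size_t i = 0; i < n; i++) {
		if (la >= ext[i].l_start && la - ext[i].l_start < ext[i].len) {
			*pa = ext[i].p_start + (la - ext[i].l_start);
			*blob = ext[i].blob;
			return 0;
		}
	}
	return -1;
}

/* Both offsets of a slice advance by the same submitted byte count. */
static void slice_offsets(uint64_t rreq_start, uint64_t rreq_p_start,
			  uint64_t submitted,
			  uint64_t *start, uint64_t *p_start)
{
	*start = rreq_start + submitted;
	*p_start = rreq_p_start + submitted;
}
```

When every extent is page aligned on both sides, as in this example, a
logical offset and its physical counterpart share the same remainder
modulo the page size, so a page of the upper file maps onto exactly one
page-sized region of the blob.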
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/cachefiles/io.c | 14 +++++++-------
fs/netfs/read_helper.c | 16 ++++++++++++----
include/linux/netfs.h | 2 ++
3 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 60b1eac2ce78..5da0bfd78188 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -370,7 +370,7 @@ static enum netfs_read_source cachefiles_prepare_read(struct netfs_read_subreque
off = cachefiles_inject_read_error();
if (off == 0)
- off = vfs_llseek(file, subreq->start, SEEK_DATA);
+ off = vfs_llseek(file, subreq->p_start, SEEK_DATA);
if (off < 0 && off >= (loff_t)-MAX_ERRNO) {
if (off == (loff_t)-ENXIO) {
why = cachefiles_trace_read_seek_nxio;
@@ -382,21 +382,21 @@ static enum netfs_read_source cachefiles_prepare_read(struct netfs_read_subreque
goto out;
}
- if (off >= subreq->start + subreq->len) {
+ if (off >= subreq->p_start + subreq->len) {
why = cachefiles_trace_read_found_hole;
goto download_and_store;
}
- if (off > subreq->start) {
+ if (off > subreq->p_start) {
off = round_up(off, cache->bsize);
- subreq->len = off - subreq->start;
+ subreq->len = off - subreq->p_start;
why = cachefiles_trace_read_found_part;
goto download_and_store;
}
to = cachefiles_inject_read_error();
if (to == 0)
- to = vfs_llseek(file, subreq->start, SEEK_HOLE);
+ to = vfs_llseek(file, subreq->p_start, SEEK_HOLE);
if (to < 0 && to >= (loff_t)-MAX_ERRNO) {
trace_cachefiles_io_error(object, file_inode(file), to,
cachefiles_trace_seek_error);
@@ -404,12 +404,12 @@ static enum netfs_read_source cachefiles_prepare_read(struct netfs_read_subreque
goto out;
}
- if (to < subreq->start + subreq->len) {
+ if (to < subreq->p_start + subreq->len) {
if (subreq->start + subreq->len >= i_size)
to = round_up(to, cache->bsize);
else
to = round_down(to, cache->bsize);
- subreq->len = to - subreq->start;
+ subreq->len = to - subreq->p_start;
}
why = cachefiles_trace_read_have_data;
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index ca84918b6b5d..077c0ca96612 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -181,7 +181,7 @@ static void netfs_read_from_cache(struct netfs_read_request *rreq,
subreq->start + subreq->transferred,
subreq->len - subreq->transferred);
- cres->ops->read(cres, subreq->start, &iter, read_hole,
+ cres->ops->read(cres, subreq->p_start, &iter, read_hole,
netfs_cache_read_terminated, subreq);
}
@@ -323,7 +323,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
netfs_put_subrequest(next, false);
}
- ret = cres->ops->prepare_write(cres, &subreq->start, &subreq->len,
+ ret = cres->ops->prepare_write(cres, &subreq->p_start, &subreq->len,
rreq->i_size, true);
if (ret < 0) {
trace_netfs_failure(rreq, subreq, ret, netfs_fail_prepare_write);
@@ -338,7 +338,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
netfs_stat(&netfs_n_rh_write);
netfs_get_read_subrequest(subreq);
trace_netfs_sreq(subreq, netfs_sreq_trace_write);
- cres->ops->write(cres, subreq->start, &iter,
+ cres->ops->write(cres, subreq->p_start, &iter,
netfs_rreq_copy_terminated, subreq);
}
@@ -760,6 +760,7 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
subreq->debug_index = (*_debug_index)++;
subreq->start = rreq->start + rreq->submitted;
+ subreq->p_start = rreq->p_start + rreq->submitted;
subreq->len = rreq->len - rreq->submitted;
_debug("slice %llx,%zx,%zx", subreq->start, subreq->len, rreq->submitted);
@@ -818,8 +819,12 @@ static void netfs_rreq_expand(struct netfs_read_request *rreq,
{
/* Give the cache a chance to change the request parameters. The
* resultant request must contain the original region.
+ * Skip expanding if there may be multi-to-multi mapping between
+ * backing file and backed file.
*/
- netfs_cache_expand_readahead(rreq, &rreq->start, &rreq->len, rreq->i_size);
+ if (rreq->start == rreq->p_start)
+ netfs_cache_expand_readahead(rreq, &rreq->start, &rreq->len,
+ rreq->i_size);
/* Give the netfs a chance to change the request parameters. The
* resultant request must contain the original region.
@@ -884,6 +889,7 @@ void netfs_readahead(struct readahead_control *ractl,
goto cleanup;
rreq->mapping = ractl->mapping;
rreq->start = readahead_pos(ractl);
+ rreq->p_start = rreq->start;
rreq->len = readahead_length(ractl);
if (ops->begin_cache_operation) {
@@ -964,6 +970,7 @@ int netfs_readpage(struct file *file,
}
rreq->mapping = folio_file_mapping(folio);
rreq->start = folio_file_pos(folio);
+ rreq->p_start = rreq->start;
rreq->len = folio_size(folio);
if (ops->begin_cache_operation) {
@@ -1129,6 +1136,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
goto error;
rreq->mapping = folio_file_mapping(folio);
rreq->start = folio_file_pos(folio);
+ rreq->p_start = rreq->start;
rreq->len = folio_size(folio);
rreq->no_unlock_folio = folio_index(folio);
__set_bit(NETFS_RREQ_NO_UNLOCK_FOLIO, &rreq->flags);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index b46c39d98bbd..a17740b3b9d6 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -134,6 +134,7 @@ struct netfs_read_subrequest {
struct netfs_read_request *rreq; /* Supervising read request */
struct list_head rreq_link; /* Link in rreq->subrequests */
loff_t start; /* Where to start the I/O */
+ loff_t p_start; /* Start position of backing file */
size_t len; /* Size of the I/O */
size_t transferred; /* Amount of data transferred */
refcount_t usage;
@@ -167,6 +168,7 @@ struct netfs_read_request {
short error; /* 0 or error that occurred */
loff_t i_size; /* Size of the file */
loff_t start; /* Start position */
+ loff_t p_start; /* Start position of backing file */
pgoff_t no_unlock_folio; /* Don't unlock this folio after read */
refcount_t usage;
unsigned long flags;
--
2.27.0
Make the @file parameter optional, and derive inode from the @folio
parameter instead in order to support file system internal requests.
@file parameter can't be removed completely, since it also works as
the private data of ops->init_rreq().
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/netfs/read_helper.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 8c58cff420ba..ca84918b6b5d 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -39,7 +39,7 @@ static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
static struct netfs_read_request *netfs_alloc_read_request(
const struct netfs_read_request_ops *ops, void *netfs_priv,
- struct file *file)
+ struct inode *inode, struct file *file)
{
static atomic_t debug_ids;
struct netfs_read_request *rreq;
@@ -48,7 +48,7 @@ static struct netfs_read_request *netfs_alloc_read_request(
if (rreq) {
rreq->netfs_ops = ops;
rreq->netfs_priv = netfs_priv;
- rreq->inode = file_inode(file);
+ rreq->inode = inode;
rreq->i_size = i_size_read(rreq->inode);
rreq->debug_id = atomic_inc_return(&debug_ids);
INIT_LIST_HEAD(&rreq->subrequests);
@@ -870,6 +870,7 @@ void netfs_readahead(struct readahead_control *ractl,
void *netfs_priv)
{
struct netfs_read_request *rreq;
+ struct inode *inode = file_inode(ractl->file);
unsigned int debug_index = 0;
int ret;
@@ -878,7 +879,7 @@ void netfs_readahead(struct readahead_control *ractl,
if (readahead_count(ractl) == 0)
goto cleanup;
- rreq = netfs_alloc_read_request(ops, netfs_priv, ractl->file);
+ rreq = netfs_alloc_read_request(ops, netfs_priv, inode, ractl->file);
if (!rreq)
goto cleanup;
rreq->mapping = ractl->mapping;
@@ -948,12 +949,13 @@ int netfs_readpage(struct file *file,
void *netfs_priv)
{
struct netfs_read_request *rreq;
+ struct inode *inode = folio_file_mapping(folio)->host;
unsigned int debug_index = 0;
int ret;
_enter("%lx", folio_index(folio));
- rreq = netfs_alloc_read_request(ops, netfs_priv, file);
+ rreq = netfs_alloc_read_request(ops, netfs_priv, inode, file);
if (!rreq) {
if (netfs_priv)
ops->cleanup(folio_file_mapping(folio), netfs_priv);
@@ -1122,7 +1124,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
}
ret = -ENOMEM;
- rreq = netfs_alloc_read_request(ops, netfs_priv, file);
+ rreq = netfs_alloc_read_request(ops, netfs_priv, inode, file);
if (!rreq)
goto error;
rreq->mapping = folio_file_mapping(folio);
--
2.27.0
... so that the following new devnode can reuse most of the code when
implementing its .write() callback.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/cachefiles/daemon.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
index 7ac04ee2c0a0..aa2e5e354afb 100644
--- a/fs/cachefiles/daemon.c
+++ b/fs/cachefiles/daemon.c
@@ -209,10 +209,11 @@ static ssize_t cachefiles_daemon_read(struct file *file, char __user *_buffer,
/*
* Take a command from cachefilesd, parse it and act on it.
*/
-static ssize_t cachefiles_daemon_write(struct file *file,
- const char __user *_data,
- size_t datalen,
- loff_t *pos)
+static ssize_t cachefiles_daemon_do_write(struct file *file,
+ const char __user *_data,
+ size_t datalen,
+ loff_t *pos,
+ const struct cachefiles_daemon_cmd *cmds)
{
const struct cachefiles_daemon_cmd *cmd;
struct cachefiles_cache *cache = file->private_data;
@@ -261,7 +262,7 @@ static ssize_t cachefiles_daemon_write(struct file *file,
}
/* run the appropriate command handler */
- for (cmd = cachefiles_daemon_cmds; cmd->name[0]; cmd++)
+ for (cmd = cmds; cmd->name[0]; cmd++)
if (strcmp(cmd->name, data) == 0)
goto found_command;
@@ -284,6 +285,15 @@ static ssize_t cachefiles_daemon_write(struct file *file,
goto error;
}
+static ssize_t cachefiles_daemon_write(struct file *file,
+ const char __user *_data,
+ size_t datalen,
+ loff_t *pos)
+{
+ return cachefiles_daemon_do_write(file, _data, datalen, pos,
+ cachefiles_daemon_cmds);
+}
+
/*
* Poll for culling state
* - use EPOLLOUT to indicate culling state
--
2.27.0
This patch implements the data plane of reading data from bootstrap blob
file over fscache for inline layout.
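
A userspace sketch of the address arithmetic used for the inline case,
assuming a 4 KiB block size and an image held in memory. model_blknr(),
model_blkoff() and copy_inline_tail() are hypothetical helpers that only
imitate erofs_blknr()/erofs_blkoff() and the memcpy() in the hunk below.
Unlike the patch, the sketch also zeroes the rest of the page; that is
an illustration choice, not a statement about what the patch does.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_BLKSIZ 4096u	/* assumed block size */

static uint64_t model_blknr(uint64_t pa)
{
	return pa / MODEL_BLKSIZ;
}

static uint32_t model_blkoff(uint64_t pa)
{
	return (uint32_t)(pa % MODEL_BLKSIZ);
}

/*
 * Copy 'len' bytes of inline tail data found at physical address 'pa'
 * of an in-memory image to the start of 'page', zero-filling the rest.
 * Return 0 on success, -1 if the data would cross a block boundary or
 * run past the end of the image.
 */
static int copy_inline_tail(const uint8_t *image, size_t image_size,
			    uint64_t pa, size_t len,
			    uint8_t page[MODEL_BLKSIZ])
{
	uint64_t blknr = model_blknr(pa);
	uint32_t off = model_blkoff(pa);

	if (off + len > MODEL_BLKSIZ)
		return -1;
	if (pa + len > image_size)
		return -1;

	memcpy(page, image + blknr * MODEL_BLKSIZ + off, len);
	memset(page + len, 0, MODEL_BLKSIZ - len);
	return 0;
}
```

For example, pa = 4196 splits into block 1 and in-block offset 100, so
the tail data is copied from byte 100 of the second block.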
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 588c33ab6a90..8c56bd54b2af 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -100,6 +100,45 @@ static int erofs_fscache_readpage_noinline(struct page *page,
return netfs_readpage(NULL, folio, &erofs_req_ops, &priv);
}
+static int erofs_fscache_readpage_inline(struct page *page,
+ struct erofs_fscache_map *fsmap)
+{
+ struct inode *inode = page->mapping->host;
+ struct super_block *sb = inode->i_sb;
+ struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
+ erofs_blk_t blknr;
+ size_t offset, len;
+ void *src, *dst;
+
+ /*
+ * For inline (tail packing) layout, the offset may be non-zero, while
+ * the offset can be calculated from corresponding physical address
+ * directly.
+ * Currently only flat layout supports inline (FLAT_INLINE), and the
+ * output map.m_pa is exactly the physical address of o_la in this case.
+ */
+ offset = erofs_blkoff(fsmap->m_pa);
+ blknr = erofs_blknr(fsmap->m_pa);
+ len = fsmap->m_llen;
+
+ src = erofs_read_metabuf(&buf, sb, blknr, EROFS_KMAP);
+ if (IS_ERR(src)) {
+ SetPageError(page);
+ unlock_page(page);
+ return PTR_ERR(src);
+ }
+
+ dst = kmap(page);
+ memcpy(dst, src + offset, len);
+ kunmap(page);
+
+ erofs_put_metabuf(&buf);
+
+ SetPageUptodate(page);
+ unlock_page(page);
+ return 0;
+}
+
static int erofs_fscache_readpage(struct file *file, struct page *page)
{
struct inode *inode = page->mapping->host;
@@ -138,6 +177,8 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
case EROFS_INODE_FLAT_PLAIN:
case EROFS_INODE_CHUNK_BASED:
return erofs_fscache_readpage_noinline(page, &fsmap);
+ case EROFS_INODE_FLAT_INLINE:
+ return erofs_fscache_readpage_inline(page, &fsmap);
default:
DBG_BUGON(1);
ret = -EOPNOTSUPP;
--
2.27.0
Introduce the 'uuid' mount option to enable the nodev mode, in which
erofs can be mounted from blob files instead of a blkdev. In this mode,
users can specify the path of the bootstrap blob file containing the
complete erofs image.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/Kconfig | 2 +-
fs/erofs/super.c | 43 ++++++++++++++++++++++++++++++++++++-------
2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index f57255ab88ed..37a2cc82ecc2 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -2,7 +2,7 @@
config EROFS_FS
tristate "EROFS filesystem support"
- depends on BLOCK
+ depends on BLOCK && FSCACHE_ONDEMAND
select FS_IOMAP
select LIBCRC32C
help
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index f058a04a00c7..3f8557bac786 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -400,6 +400,7 @@ enum {
Opt_dax,
Opt_dax_enum,
Opt_device,
+ Opt_uuid,
Opt_err
};
@@ -424,6 +425,7 @@ static const struct fs_parameter_spec erofs_fs_parameters[] = {
fsparam_flag("dax", Opt_dax),
fsparam_enum("dax", Opt_dax_enum, erofs_dax_param_enums),
fsparam_string("device", Opt_device),
+ fsparam_string("uuid", Opt_uuid),
{}
};
@@ -519,6 +521,12 @@ static int erofs_fc_parse_param(struct fs_context *fc,
}
++ctx->devs->extra_devices;
break;
+ case Opt_uuid:
+ kfree(ctx->opt.uuid);
+ ctx->opt.uuid = kstrdup(param->string, GFP_KERNEL);
+ if (!ctx->opt.uuid)
+ return -ENOMEM;
+ break;
default:
return -ENOPARAM;
}
@@ -593,9 +601,14 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_magic = EROFS_SUPER_MAGIC;
- if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
- erofs_err(sb, "failed to set erofs blksize");
- return -EINVAL;
+ if (erofs_bdev_mode(sb)) {
+ if (!sb_set_blocksize(sb, EROFS_BLKSIZ)) {
+ erofs_err(sb, "failed to set erofs blksize");
+ return -EINVAL;
+ }
+ } else {
+ sb->s_blocksize = EROFS_BLKSIZ;
+ sb->s_blocksize_bits = LOG_BLOCK_SIZE;
}
sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
@@ -604,11 +617,12 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_fs_info = sbi;
sbi->opt = ctx->opt;
- sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->dax_part_off);
sbi->devs = ctx->devs;
ctx->devs = NULL;
- if (!erofs_bdev_mode(sb)) {
+ if (erofs_bdev_mode(sb)) {
+ sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->dax_part_off);
+ } else {
struct erofs_fscache_context *bootstrap;
bootstrap = erofs_fscache_get_ctx(sb, ctx->opt.uuid, true);
@@ -616,6 +630,7 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
return PTR_ERR(bootstrap);
sbi->bootstrap = bootstrap;
+ sbi->dax_dev = NULL;
}
err = erofs_read_superblock(sb);
@@ -678,6 +693,11 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
static int erofs_fc_get_tree(struct fs_context *fc)
{
+ struct erofs_fs_context *ctx = fc->fs_private;
+
+ if (ctx->opt.uuid)
+ return get_tree_nodev(fc, erofs_fc_fill_super);
+
return get_tree_bdev(fc, erofs_fc_fill_super);
}
@@ -727,6 +747,7 @@ static void erofs_fc_free(struct fs_context *fc)
struct erofs_fs_context *ctx = fc->fs_private;
erofs_free_dev_context(ctx->devs);
+ kfree(ctx->opt.uuid);
kfree(ctx);
}
@@ -767,7 +788,10 @@ static void erofs_kill_sb(struct super_block *sb)
WARN_ON(sb->s_magic != EROFS_SUPER_MAGIC);
- kill_block_super(sb);
+ if (erofs_bdev_mode(sb))
+ kill_block_super(sb);
+ else
+ generic_shutdown_super(sb);
sbi = EROFS_SB(sb);
if (!sbi)
@@ -885,7 +909,12 @@ static int erofs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct erofs_sb_info *sbi = EROFS_SB(sb);
- u64 id = huge_encode_dev(sb->s_bdev->bd_dev);
+ u64 id;
+
+ if (erofs_bdev_mode(sb))
+ id = huge_encode_dev(sb->s_bdev->bd_dev);
+ else
+ id = 0; /* TODO */
buf->f_type = sb->s_magic;
buf->f_bsize = EROFS_BLKSIZ;
--
2.27.0
This patch introduces a new devnode 'cachefiles_ondemand' to support the
newly introduced on-demand read mode.
The precondition for on-demand reading semantics is that all blob files
have been placed under the corresponding directory with the correct file
size (as sparse files) at the very beginning. When the upper fs starts
to access a blob file, it will "cache miss" (hit the hole) and then turn
to the user daemon to prepare the data.
The interaction between the kernel and the user daemon is as follows.
1. On a cache miss, the .ondemand_read() callback of the corresponding
fscache backend is called to prepare the data. For cachefiles, it just
packages the related metadata (file range to read, etc.) into a pending
read request, and then the process that triggered the cache miss sleeps
until the corresponding data has been fetched.
2. The user daemon needs to poll on the devnode ('cachefiles_ondemand'),
waiting for pending read requests.
3. Once there is a pending read request, the user daemon is notified and
shall read the devnode ('cachefiles_ondemand') to fetch one pending
read request to process.
4. For the fetched read request, the user daemon needs to somehow
prepare the data (e.g. download it from the remote over the network) and
then write the fetched data into the backing file to fill the hole.
5. After that, the user daemon needs to notify the cachefiles backend by
writing a 'done' command to the devnode ('cachefiles_ondemand'). This
also wakes up the sleeping process that triggered the cache miss.
6. By the time the process is woken up, the data is ready in the
backing file. fscache then re-initiates a read request from the backing
file.
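
A rough userspace sketch of steps 2 and 4 from the daemon's side, using
only poll(2) and pwrite(2). The helper names wait_for_request() and
fill_hole() are hypothetical. The layout of the request read in step 3
and the arguments of the 'done' command in step 5 are not described in
this message, so they are deliberately left out rather than guessed.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Step 2: wait until the devnode has a pending read request.
 * Return 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_for_request(int devfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = devfd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * Step 4: write the fetched data into the backing file at the requested
 * offset, retrying on short writes. Return 0 on success, -1 on error.
 */
static int fill_hole(int backing_fd, const void *data, size_t len, off_t off)
{
	const char *p = data;

	while (len > 0) {
		ssize_t n = pwrite(backing_fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1;	/* no progress, give up */
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A daemon loop would call wait_for_request() on the opened devnode, read
one request (step 3), fetch the bytes, call fill_hole() on the matching
backing file, and finally write the 'done' command (step 5).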
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/cachefiles/daemon.c | 127 +++++++++++++++++++++++++++++++++++++++
fs/cachefiles/internal.h | 22 +++++++
fs/cachefiles/io.c | 68 +++++++++++++++++++++
fs/cachefiles/main.c | 27 +++++++++
4 files changed, 244 insertions(+)
diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
index aa2e5e354afb..7af3e17e04c8 100644
--- a/fs/cachefiles/daemon.c
+++ b/fs/cachefiles/daemon.c
@@ -108,6 +108,10 @@ static int cachefiles_daemon_open(struct inode *inode, struct file *file)
INIT_LIST_HEAD(&cache->volumes);
INIT_LIST_HEAD(&cache->object_list);
spin_lock_init(&cache->object_list_lock);
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+ idr_init(&cache->reqs);
+ set_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags);
+#endif
/* set default caching limits
* - limit at 1% free space and/or free files
@@ -142,6 +146,9 @@ static int cachefiles_daemon_release(struct inode *inode, struct file *file)
cachefiles_daemon_unbind(cache);
/* clean up the control file interface */
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+ idr_destroy(&cache->reqs);
+#endif
cache->cachefilesd = NULL;
file->private_data = NULL;
cachefiles_open = 0;
@@ -747,3 +754,123 @@ static void cachefiles_daemon_unbind(struct cachefiles_cache *cache)
_leave("");
}
+
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+static ssize_t cachefiles_ondemand_write(struct file *, const char __user *,
+ size_t, loff_t *);
+static ssize_t cachefiles_ondemand_read(struct file *, char __user *, size_t,
+ loff_t *);
+static __poll_t cachefiles_ondemand_poll(struct file *,
+ struct poll_table_struct *);
+static int cachefiles_daemon_done(struct cachefiles_cache *, char *);
+
+const struct file_operations cachefiles_ondemand_fops = {
+ .owner = THIS_MODULE,
+ .open = cachefiles_daemon_open,
+ .release = cachefiles_daemon_release,
+ .read = cachefiles_ondemand_read,
+ .write = cachefiles_ondemand_write,
+ .poll = cachefiles_ondemand_poll,
+ .llseek = noop_llseek,
+};
+
+static const struct cachefiles_daemon_cmd cachefiles_ondemand_cmds[] = {
+ { "bind", cachefiles_daemon_bind },
+ { "brun", cachefiles_daemon_brun },
+ { "bcull", cachefiles_daemon_bcull },
+ { "bstop", cachefiles_daemon_bstop },
+ { "cull", cachefiles_daemon_cull },
+ { "debug", cachefiles_daemon_debug },
+ { "dir", cachefiles_daemon_dir },
+ { "frun", cachefiles_daemon_frun },
+ { "fcull", cachefiles_daemon_fcull },
+ { "fstop", cachefiles_daemon_fstop },
+ { "inuse", cachefiles_daemon_inuse },
+ { "secctx", cachefiles_daemon_secctx },
+ { "tag", cachefiles_daemon_tag },
+ { "done", cachefiles_daemon_done },
+ { "", NULL }
+};
+
+static ssize_t cachefiles_ondemand_write(struct file *file,
+ const char __user *_data,
+ size_t datalen,
+ loff_t *pos)
+{
+ return cachefiles_daemon_do_write(file, _data, datalen, pos,
+ cachefiles_ondemand_cmds);
+}
+
+static ssize_t cachefiles_ondemand_read(struct file *file, char __user *_buffer,
+ size_t buflen, loff_t *pos)
+{
+ struct cachefiles_cache *cache = file->private_data;
+ struct cachefiles_req *req;
+ int n, id = 0;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags))
+ return 0;
+
+ idr_lock(&cache->reqs);
+ req = idr_get_next(&cache->reqs, &id);
+ idr_unlock(&cache->reqs);
+ if (!req)
+ return 0;
+
+ n = sizeof(req->req_in);
+ if (n > buflen)
+ return -EMSGSIZE;
+
+ if (copy_to_user(_buffer, &req->req_in, n) != 0)
+ return -EFAULT;
+
+ return n;
+}
+
+static __poll_t cachefiles_ondemand_poll(struct file *file,
+ struct poll_table_struct *poll)
+{
+ struct cachefiles_cache *cache = file->private_data;
+ __poll_t mask;
+
+ poll_wait(file, &cache->daemon_pollwq, poll);
+ mask = 0;
+
+ if (!idr_is_empty(&cache->reqs))
+ mask |= EPOLLIN;
+
+ return mask;
+}
+
+/*
+ * Request completion
+ * - command: "done <id>"
+ */
+static int cachefiles_daemon_done(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long id;
+ int ret;
+ struct cachefiles_req *req;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ pr_err("Empty id specified\n");
+ return -EINVAL;
+ }
+
+ ret = kstrtoul(args, 0, &id);
+ if (ret)
+ return ret;
+
+ idr_lock(&cache->reqs);
+ req = idr_remove(&cache->reqs, id);
+ idr_unlock(&cache->reqs);
+ if (!req)
+ return -EINVAL;
+
+ complete(&req->done);
+
+ return 0;
+}
+#endif
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index 2bb441197106..aa622b966802 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -15,6 +15,7 @@
#include <linux/fscache-cache.h>
#include <linux/cred.h>
#include <linux/security.h>
+#include <linux/idr.h>
#define CACHEFILES_DIO_BLOCK_SIZE 4096
@@ -60,6 +61,20 @@ struct cachefiles_object {
#define CACHEFILES_OBJECT_USING_TMPFILE 0 /* Have an unlinked tmpfile */
};
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+struct cachefiles_req_in {
+ uint64_t id;
+ uint64_t off;
+ uint64_t len;
+ char path[NAME_MAX];
+};
+
+struct cachefiles_req {
+ struct completion done;
+ struct cachefiles_req_in req_in;
+};
+#endif
+
/*
* Cache files cache definition
*/
@@ -102,6 +117,10 @@ struct cachefiles_cache {
char *rootdirname; /* name of cache root directory */
char *secctx; /* LSM security context */
char *tag; /* cache binding tag */
+
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+ struct idr reqs;
+#endif
};
#include <trace/events/cachefiles.h>
@@ -146,6 +165,9 @@ extern int cachefiles_has_space(struct cachefiles_cache *cache,
* daemon.c
*/
extern const struct file_operations cachefiles_daemon_fops;
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+extern const struct file_operations cachefiles_ondemand_fops;
+#endif
/*
* error_inject.c
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 5da0bfd78188..f7418d02fde1 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -539,12 +539,80 @@ static void cachefiles_end_operation(struct netfs_cache_resources *cres)
fscache_end_cookie_access(fscache_cres_cookie(cres), fscache_access_io_end);
}
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+static struct cachefiles_req *cachefiles_alloc_req(struct cachefiles_object *object,
+ loff_t start_pos,
+ size_t len)
+{
+ struct cachefiles_req *req;
+ struct cachefiles_req_in *req_in;
+
+ req = kzalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ return NULL;
+
+ req_in = &req->req_in;
+
+ req_in->off = start_pos;
+ req_in->len = len;
+ strncpy(req_in->path, object->d_name, sizeof(req_in->path) - 1);
+
+ init_completion(&req->done);
+
+ return req;
+}
+
+int cachefiles_ondemand_read(struct netfs_cache_resources *cres,
+ loff_t start_pos, size_t len)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct cachefiles_req *req;
+ int ret;
+
+ object = cachefiles_cres_object(cres);
+ cache = object->volume->cache;
+
+ if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
+ return -EOPNOTSUPP;
+
+ req = cachefiles_alloc_req(object, start_pos, len);
+ if (!req)
+ return -ENOMEM;
+
+ idr_preload(GFP_KERNEL);
+ idr_lock(&cache->reqs);
+
+ ret = idr_alloc(&cache->reqs, req, 0, 0, GFP_ATOMIC);
+ if (ret >= 0)
+ req->req_in.id = ret;
+
+ idr_unlock(&cache->reqs);
+ idr_preload_end();
+
+ if (ret < 0) {
+ kfree(req);
+ return -ENOMEM;
+ }
+
+ wake_up_all(&cache->daemon_pollwq);
+
+ wait_for_completion(&req->done);
+ kfree(req);
+
+ return 0;
+}
+#endif
+
static const struct netfs_cache_ops cachefiles_netfs_cache_ops = {
.end_operation = cachefiles_end_operation,
.read = cachefiles_read,
.write = cachefiles_write,
.prepare_read = cachefiles_prepare_read,
.prepare_write = cachefiles_prepare_write,
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+ .ondemand_read = cachefiles_ondemand_read,
+#endif
};
/*
diff --git a/fs/cachefiles/main.c b/fs/cachefiles/main.c
index 3f369c6f816d..eab17c3140d9 100644
--- a/fs/cachefiles/main.c
+++ b/fs/cachefiles/main.c
@@ -39,6 +39,27 @@ static struct miscdevice cachefiles_dev = {
.fops = &cachefiles_daemon_fops,
};
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+static struct miscdevice cachefiles_ondemand_dev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "cachefiles_ondemand",
+ .fops = &cachefiles_ondemand_fops,
+};
+
+static inline int cachefiles_init_ondemand(void)
+{
+ return misc_register(&cachefiles_ondemand_dev);
+}
+
+static inline void cachefiles_exit_ondemand(void)
+{
+ misc_deregister(&cachefiles_ondemand_dev);
+}
+#else
+static inline int cachefiles_init_ondemand(void) { return 0; }
+static inline void cachefiles_exit_ondemand(void) {}
+#endif
+
/*
* initialise the fs caching module
*/
@@ -52,6 +73,9 @@ static int __init cachefiles_init(void)
ret = misc_register(&cachefiles_dev);
if (ret < 0)
goto error_dev;
+ ret = cachefiles_init_ondemand();
+ if (ret < 0)
+ goto error_ondemand_dev;
/* create an object jar */
ret = -ENOMEM;
@@ -68,6 +92,8 @@ static int __init cachefiles_init(void)
return 0;
error_object_jar:
+ cachefiles_exit_ondemand();
+error_ondemand_dev:
misc_deregister(&cachefiles_dev);
error_dev:
cachefiles_unregister_error_injection();
@@ -86,6 +112,7 @@ static void __exit cachefiles_exit(void)
pr_info("Unloading\n");
kmem_cache_destroy(cachefiles_object_jar);
+ cachefiles_exit_ondemand();
misc_deregister(&cachefiles_dev);
cachefiles_unregister_error_injection();
}
--
2.27.0
Introduce 'struct erofs_fscache_context' for managing the cookie of a
backing file; it will also be used by the API for reading from the
backing file that is introduced later.
Besides, introduce two helper functions for initializing and cleaning
up an erofs_fscache_context.
struct erofs_fscache_context *
erofs_fscache_get_ctx(struct super_block *sb, char *path);
void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 78 +++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/internal.h | 8 +++++
2 files changed, 86 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 9c32f42e1056..10c3f5ea9e24 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -6,6 +6,84 @@
static struct fscache_volume *volume;
+static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
+ char *path)
+{
+ struct fscache_cookie *cookie;
+
+ /*
+ * @object_size shall be non-zero to avoid
+ * FSCACHE_COOKIE_NO_DATA_TO_READ.
+ */
+ cookie = fscache_acquire_cookie(volume, 0,
+ path, strlen(path),
+ NULL, 0, -1);
+ if (!cookie)
+ return -EINVAL;
+
+ fscache_use_cookie(cookie, false);
+ ctx->cookie = cookie;
+ return 0;
+}
+
+static inline
+void erofs_fscache_cleanup_cookie(struct erofs_fscache_context *ctx)
+{
+ struct fscache_cookie *cookie = ctx->cookie;
+
+ fscache_unuse_cookie(cookie, NULL, NULL);
+ fscache_relinquish_cookie(cookie, false);
+ ctx->cookie = NULL;
+}
+
+static int erofs_fscahce_init_ctx(struct erofs_fscache_context *ctx,
+ struct super_block *sb, char *path)
+{
+ int ret;
+
+ ret = erofs_fscache_init_cookie(ctx, path);
+ if (ret) {
+ erofs_err(sb, "failed to init cookie");
+ return ret;
+ }
+
+ return 0;
+}
+
+static inline
+void erofs_fscache_cleanup_ctx(struct erofs_fscache_context *ctx)
+{
+ erofs_fscache_cleanup_cookie(ctx);
+}
+
+struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
+ char *path)
+{
+ struct erofs_fscache_context *ctx;
+ int ret;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ ret = erofs_fscahce_init_ctx(ctx, sb, path);
+ if (ret) {
+ kfree(ctx);
+ return ERR_PTR(ret);
+ }
+
+ return ctx;
+}
+
+void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx)
+{
+ if (!ctx)
+ return;
+
+ erofs_fscache_cleanup_ctx(ctx);
+ kfree(ctx);
+}
+
int __init erofs_init_fscache(void)
{
volume = fscache_acquire_volume("erofs", NULL, NULL, 0);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index c2608a469107..1f5bc69e8e9f 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -97,6 +97,10 @@ struct erofs_sb_lz4_info {
u16 max_pclusterblks;
};
+struct erofs_fscache_context {
+ struct fscache_cookie *cookie;
+};
+
struct erofs_sb_info {
struct erofs_mount_opts opt; /* options */
#ifdef CONFIG_EROFS_FS_ZIP
@@ -621,6 +625,10 @@ static inline int z_erofs_load_lzma_config(struct super_block *sb,
int erofs_init_fscache(void);
void erofs_exit_fscache(void);
+struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
+ char *path);
+void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
+
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#endif /* __EROFS_INTERNAL_H */
--
2.27.0
fscache/cachefiles used to serve as a local cache for remote fs. The
following patches will introduce a new use case, in which a local
read-only fs can implement on-demand reading with fscache. In this
case, the upper read-only fs may have no idea of the size of the
backed file.
It is also worth noting that, in this scenario, the user daemon is
responsible for preparing all backing files with the correct file size
(backing files are all sparse files in this case). And since the fs is
read-only, we can trust the backing file size as the backed file size.
With this precondition, cachefiles can detect the actual size of the
backing file and set it as the size of the backed file.
This patch also adds one flag bit to distinguish the newly introduced
on-demand read mode from the original mode. The following patch will
make it configurable by users.
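For illustration, the daemon-side precondition described above (a sparse
backing file whose size already equals the size of the backed file)
could be set up as sketched below. prepare_backing_file() and
backing_file_size() are hypothetical helper names, not part of this
series; the sketch rejects a zero size because cachefiles_recheck_size()
in this patch fails such files with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or reuse) @path and set its size to exactly @size bytes without
 * writing any data, so a newly created file is one large hole.
 * Returns 0 on success, -1 on error with errno set.
 */
int prepare_backing_file(const char *path, off_t size)
{
	int fd, ret, saved;

	assert(path != NULL);
	if (size <= 0) {
		errno = EINVAL;
		return -1;
	}

	fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	ret = ftruncate(fd, size);	/* extends the file with a hole */
	saved = errno;
	close(fd);
	errno = saved;
	return ret;
}

/* Size the kernel would pick up as the object size, or -1 on error. */
off_t backing_file_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	return st.st_size;
}
```

Since no data is written, reads of the new file hit the hole, which is
what triggers the on-demand path described in this series.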
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/cachefiles/Kconfig | 8 ++++++
fs/cachefiles/internal.h | 1 +
fs/cachefiles/namei.c | 60 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/fs/cachefiles/Kconfig b/fs/cachefiles/Kconfig
index 719faeeda168..0aaef4dd3866 100644
--- a/fs/cachefiles/Kconfig
+++ b/fs/cachefiles/Kconfig
@@ -26,3 +26,11 @@ config CACHEFILES_ERROR_INJECTION
help
This permits error injection to be enabled in cachefiles whilst a
cache is in service.
+
+config CACHEFILES_ONDEMAND
+ bool "Support for on-demand reading"
+ depends on CACHEFILES && FSCACHE_ONDEMAND
+ default n
+ help
+ This permits on-demand read mode of cachefiles.
+ If unsure, say N.
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index 421423819d63..2bb441197106 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -98,6 +98,7 @@ struct cachefiles_cache {
#define CACHEFILES_DEAD 1 /* T if cache dead */
#define CACHEFILES_CULLING 2 /* T if cull engaged */
#define CACHEFILES_STATE_CHANGED 3 /* T if state changed (poll trigger) */
+#define CACHEFILES_ONDEMAND_MODE 4 /* T if in on-demand read mode */
char *rootdirname; /* name of cache root directory */
char *secctx; /* LSM security context */
char *tag; /* cache binding tag */
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 9399153e1c99..1469f94cb229 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -506,15 +506,69 @@ struct file *cachefiles_create_tmpfile(struct cachefiles_object *object)
return file;
}
+#ifdef CONFIG_CACHEFILES_ONDEMAND
+static inline bool cachefiles_can_create_file(struct cachefiles_cache *cache)
+{
+ /*
+ * On-demand read mode requires that backing files have been prepared
+ * with correct file size under corresponding directory. We can get here
+ * when the backing file doesn't exist under corresponding directory, or
+ * the file size is unexpected 0.
+ */
+ return !test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags);
+
+}
+
+/*
+ * Fs using fscache for on-demand reading may have no idea of the file size of
+ * backing files. Thus the on-demand read mode requires that backing files have
+ * been prepared with correct file size under corresponding directory. Then
+ * fscache backend is responsible for taking the file size of the backing file
+ * as the object size.
+ */
+static int cachefiles_recheck_size(struct cachefiles_object *object,
+ struct file *file)
+{
+ loff_t size;
+ struct cachefiles_cache *cache = object->volume->cache;
+
+ if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
+ return 0;
+
+ size = i_size_read(file_inode(file));
+ if (!size)
+ return -EINVAL;
+
+ object->cookie->object_size = size;
+ return 0;
+}
+#else
+static inline bool cachefiles_can_create_file(struct cachefiles_cache *cache)
+{
+ return true;
+}
+
+static int cachefiles_recheck_size(struct cachefiles_object *object,
+ struct file *file)
+{
+ return 0;
+}
+#endif
+
+
/*
* Create a new file.
*/
static bool cachefiles_create_file(struct cachefiles_object *object)
{
+ struct cachefiles_cache *cache = object->volume->cache;
struct file *file;
int ret;
- ret = cachefiles_has_space(object->volume->cache, 1, 0,
+ if (!cachefiles_can_create_file(cache))
+ return false;
+
+ ret = cachefiles_has_space(cache, 1, 0,
cachefiles_has_space_for_create);
if (ret < 0)
return false;
@@ -569,6 +623,10 @@ static bool cachefiles_open_file(struct cachefiles_object *object,
}
_debug("file -> %pd positive", dentry);
+ ret = cachefiles_recheck_size(object, file);
+ if (ret < 0)
+ goto check_failed;
+
ret = cachefiles_check_auxdata(object, file);
if (ret < 0)
goto check_failed;
--
2.27.0
This patch implements the data plane of reading data from data blob
files over fscache.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/data.c | 3 +++
fs/erofs/fscache.c | 15 ++++++++++++---
fs/erofs/internal.h | 1 +
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 51ccbc02dd73..56db391a3411 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -200,6 +200,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = sb->s_bdev;
map->m_daxdev = EROFS_SB(sb)->dax_dev;
map->m_dax_part_off = EROFS_SB(sb)->dax_part_off;
+ map->m_ctx = EROFS_SB(sb)->bootstrap;
if (map->m_deviceid) {
down_read(&devs->rwsem);
@@ -211,6 +212,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = dif->bdev;
map->m_daxdev = dif->dax_dev;
map->m_dax_part_off = dif->dax_part_off;
+ map->m_ctx = dif->ctx;
up_read(&devs->rwsem);
} else if (devs->extra_devices) {
down_read(&devs->rwsem);
@@ -228,6 +230,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
map->m_bdev = dif->bdev;
map->m_daxdev = dif->dax_dev;
map->m_dax_part_off = dif->dax_part_off;
+ map->m_ctx = dif->ctx;
break;
}
}
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 8c56bd54b2af..e8df35ee4ba8 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -144,8 +144,8 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
struct inode *inode = page->mapping->host;
struct erofs_inode *vi = EROFS_I(inode);
struct super_block *sb = inode->i_sb;
- struct erofs_sb_info *sbi = EROFS_SB(sb);
struct erofs_map_blocks map;
+ struct erofs_map_dev mdev;
struct erofs_fscache_map fsmap;
int ret;
@@ -168,9 +168,18 @@ static int erofs_fscache_readpage(struct file *file, struct page *page)
return 0;
}
- fsmap.m_ctx = sbi->bootstrap;
+ mdev = (struct erofs_map_dev) {
+ .m_deviceid = map.m_deviceid,
+ .m_pa = map.m_pa,
+ };
+
+ ret = erofs_map_dev(sb, &mdev);
+ if (ret)
+ return ret;
+
+ fsmap.m_ctx = mdev.m_ctx;
fsmap.m_la = map.m_la;
- fsmap.m_pa = map.m_pa;
+ fsmap.m_pa = mdev.m_pa;
fsmap.m_llen = map.m_llen;
switch (vi->datalayout) {
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 5d514c7b73cc..6ccf14952b2d 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -486,6 +486,7 @@ struct erofs_map_dev {
struct block_device *m_bdev;
struct dax_device *m_daxdev;
u64 m_dax_part_off;
+ struct erofs_fscache_context *m_ctx;
erofs_off_t m_pa;
unsigned int m_deviceid;
--
2.27.0
All erofs instances will share one global fscache volume.
In this usage scenario, one erofs instance can be mounted from one (or
multiple) blob files instead of a blkdev. The number of blob files that
each erofs instance can correspond to is limited, since these blob
files are quite large in size. For example, when used for container
image distribution, one erofs instance used for a node.js container
image will correspond to ~20 blob files in total. Thus in a densely
deployed environment, there could be as many as hundreds of containers
and thus thousands of fscache cookies under one fscache volume.
As for the cachefiles backend, the hash table managing all cookies
under one volume contains 32K slots, so the hashing shall scale well
in this case. Besides, the cachefiles backend scatters backing files
under 256 fan sub-directories, and thus the scalability of looking up
backing files shall not be an issue either.
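As a back-of-the-envelope check of the estimate above, the sketch below
computes the cookie count and the resulting hash table load factor. The
figure of 500 containers is only an assumed stand-in for "hundreds", and
the calculation assumes the worst case in which no blob file is shared
between images; the function names are hypothetical.

```c
#include <assert.h>

/* "32K slots" as stated in the text above */
#define COOKIE_HASH_SLOTS (32UL * 1024UL)

/* Worst case: every container image brings its own set of blob files. */
unsigned long total_cookies(unsigned long containers,
			    unsigned long blobs_per_image)
{
	return containers * blobs_per_image;
}

/* Average number of cookies per hash slot. */
double cookie_load_factor(unsigned long cookies, unsigned long slots)
{
	assert(slots > 0);
	return (double)cookies / (double)slots;
}
```

With 500 containers of ~20 blob files each, this gives 10000 cookies
and a load factor of roughly 0.31 over COOKIE_HASH_SLOTS, i.e. well
under one cookie per slot on average.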
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/Makefile | 3 ++-
fs/erofs/fscache.c | 21 +++++++++++++++++++++
fs/erofs/internal.h | 5 +++++
fs/erofs/super.c | 7 +++++++
4 files changed, 35 insertions(+), 1 deletion(-)
create mode 100644 fs/erofs/fscache.c
diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 8a3317e38e5a..21999e8a4728 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -1,7 +1,8 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_EROFS_FS) += erofs.o
-erofs-objs := super.o inode.o data.o namei.o dir.o utils.o pcpubuf.o sysfs.o
+erofs-objs := super.o inode.o data.o namei.o dir.o utils.o pcpubuf.o sysfs.o \
+ fscache.o
erofs-$(CONFIG_EROFS_FS_XATTR) += xattr.o
erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o zdata.o
erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
new file mode 100644
index 000000000000..9c32f42e1056
--- /dev/null
+++ b/fs/erofs/fscache.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2021, Alibaba Cloud
+ */
+#include "internal.h"
+
+static struct fscache_volume *volume;
+
+int __init erofs_init_fscache(void)
+{
+ volume = fscache_acquire_volume("erofs", NULL, NULL, 0);
+ if (!volume)
+ return -EINVAL;
+
+ return 0;
+}
+
+void erofs_exit_fscache(void)
+{
+ fscache_relinquish_volume(volume, NULL, false);
+}
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 2b9337d385ce..c2608a469107 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -17,6 +17,7 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/iomap.h>
+#include <linux/fscache.h>
#include "erofs_fs.h"
/* redefine pr_fmt "erofs: " */
@@ -616,6 +617,10 @@ static inline int z_erofs_load_lzma_config(struct super_block *sb,
}
#endif /* !CONFIG_EROFS_FS_ZIP */
+/* fscache.c */
+int erofs_init_fscache(void);
+void erofs_exit_fscache(void);
+
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#endif /* __EROFS_INTERNAL_H */
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 12755217631f..798f0c379e35 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -814,6 +814,10 @@ static int __init erofs_module_init(void)
if (err)
goto sysfs_err;
+ err = erofs_init_fscache();
+ if (err)
+ goto fscache_err;
+
err = register_filesystem(&erofs_fs_type);
if (err)
goto fs_err;
@@ -821,6 +825,8 @@ static int __init erofs_module_init(void)
return 0;
fs_err:
+ erofs_exit_fscache();
+fscache_err:
erofs_exit_sysfs();
sysfs_err:
z_erofs_exit_zip_subsystem();
@@ -841,6 +847,7 @@ static void __exit erofs_module_exit(void)
/* Ensure all RCU free inodes / pclusters are safe to be destroyed. */
rcu_barrier();
+ erofs_exit_fscache();
erofs_exit_sysfs();
z_erofs_exit_zip_subsystem();
z_erofs_lzma_exit();
--
2.27.0
This patch implements the data plane of reading data from the bootstrap
blob file over fscache for the non-inline layouts.
Note that the compressed layout is not supported yet.
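The address arithmetic used by this patch can be illustrated in user
space as follows. This is a stand-alone model, not the kernel code:
'struct fsmap_model' keeps only the three offsets of 'struct
erofs_fscache_map', BLKSIZ is an assumed stand-in for EROFS_BLKSIZ, and
the addresses in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLKSIZ 4096ULL	/* assumption: stands in for EROFS_BLKSIZ */

struct fsmap_model {
	uint64_t m_pa;	/* physical address returned by the mapping */
	uint64_t m_la;	/* logical address returned by the mapping */
	uint64_t o_la;	/* logical address originally requested */
};

/*
 * Offset to read inside the blob file. For FLAT_PLAIN, m_la == o_la and
 * the result is simply m_pa. For CHUNK_BASED, m_la/m_pa describe the
 * chunk boundary, so the distance of o_la into the chunk is added.
 */
uint64_t blob_offset(const struct fsmap_model *m)
{
	assert(m->o_la >= m->m_la);
	return m->m_pa + (m->o_la - m->m_la);
}

/* Round a subrequest length up to a whole number of blocks. */
uint64_t clamp_len(uint64_t len)
{
	return (len + BLKSIZ - 1) / BLKSIZ * BLKSIZ;
}
```

For example, with a chunk starting at logical 1MiB and physical 8MiB, a
page at logical 1MiB + 12288 is read from blob offset 8MiB + 12288, and
a 100-byte tail is extended to a full 4096-byte block.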
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 111 ++++++++++++++++++++++++++++++++++++++++++++
fs/erofs/inode.c | 6 ++-
fs/erofs/internal.h | 1 +
3 files changed, 117 insertions(+), 1 deletion(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 5a25ae523e5e..588c33ab6a90 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -4,6 +4,17 @@
*/
#include "internal.h"
+struct erofs_fscache_map {
+ struct erofs_fscache_context *m_ctx;
+ erofs_off_t m_pa, m_la, o_la;
+ u64 m_llen;
+};
+
+struct erofs_fscache_priv {
+ struct fscache_cookie *cookie;
+ loff_t offset;
+};
+
static struct fscache_volume *volume;
static int erofs_blob_begin_cache_operation(struct netfs_read_request *rreq)
@@ -22,6 +33,33 @@ static const struct netfs_read_request_ops erofs_blob_req_ops = {
.cleanup = erofs_noop_cleanup,
};
+static int erofs_begin_cache_operation(struct netfs_read_request *rreq)
+{
+ struct erofs_fscache_priv *priv = rreq->netfs_priv;
+
+ rreq->p_start = priv->offset;
+ return fscache_begin_read_operation(&rreq->cache_resources,
+ priv->cookie);
+}
+
+static bool erofs_clamp_length(struct netfs_read_subrequest *subreq)
+{
+ /*
+ * For non-inline layout, rreq->i_size is actually the size of upper
+ * file in erofs rather than that of blob file. Thus when cache miss,
+ * subreq->len can be restricted to the upper file size, while we hope
+ * blob file can be filled in a EROFS_BLKSIZ granule.
+ */
+ subreq->len = round_up(subreq->len, EROFS_BLKSIZ);
+ return true;
+}
+
+static const struct netfs_read_request_ops erofs_req_ops = {
+ .begin_cache_operation = erofs_begin_cache_operation,
+ .cleanup = erofs_noop_cleanup,
+ .clamp_length = erofs_clamp_length,
+};
+
static int erofs_fscache_blob_readpage(struct file *data, struct page *page)
{
struct folio *folio = page_folio(page);
@@ -42,6 +80,79 @@ struct page *erofs_fscache_read_cache_page(struct erofs_fscache_context *ctx,
return read_mapping_page(ctx->inode->i_mapping, index, ctx);
}
+static int erofs_fscache_readpage_noinline(struct page *page,
+ struct erofs_fscache_map *fsmap)
+{
+ struct folio *folio = page_folio(page);
+ struct erofs_fscache_priv priv;
+
+ /*
+ * 1) For FLAT_PLAIN layout, the output map.m_la shall be equal to o_la,
+ * and the output map.m_pa is exactly the physical address of o_la.
+ * 2) For CHUNK_BASED layout, the output map.m_la is rounded down to the
+ * nearest chunk boundary, and the output map.m_pa is actually the
+ * physical address of this chunk boundary. So we need to recalculate
+ * the actual physical address of o_la.
+ */
+ priv.offset = fsmap->m_pa + fsmap->o_la - fsmap->m_la;
+ priv.cookie = fsmap->m_ctx->cookie;
+
+ return netfs_readpage(NULL, folio, &erofs_req_ops, &priv);
+}
+
+static int erofs_fscache_readpage(struct file *file, struct page *page)
+{
+ struct inode *inode = page->mapping->host;
+ struct erofs_inode *vi = EROFS_I(inode);
+ struct super_block *sb = inode->i_sb;
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
+ struct erofs_map_blocks map;
+ struct erofs_fscache_map fsmap;
+ int ret;
+
+ if (erofs_inode_is_data_compressed(vi->datalayout)) {
+ erofs_info(sb, "compressed layout not supported yet");
+ ret = -EOPNOTSUPP;
+ goto err_out;
+ }
+
+ map.m_la = fsmap.o_la = page_offset(page);
+
+ ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+ if (ret)
+ goto err_out;
+
+ if (!(map.m_flags & EROFS_MAP_MAPPED)) {
+ zero_user(page, 0, PAGE_SIZE);
+ SetPageUptodate(page);
+ unlock_page(page);
+ return 0;
+ }
+
+ fsmap.m_ctx = sbi->bootstrap;
+ fsmap.m_la = map.m_la;
+ fsmap.m_pa = map.m_pa;
+ fsmap.m_llen = map.m_llen;
+
+ switch (vi->datalayout) {
+ case EROFS_INODE_FLAT_PLAIN:
+ case EROFS_INODE_CHUNK_BASED:
+ return erofs_fscache_readpage_noinline(page, &fsmap);
+ default:
+ DBG_BUGON(1);
+ ret = -EOPNOTSUPP;
+ }
+
+err_out:
+ SetPageError(page);
+ unlock_page(page);
+ return ret;
+}
+
+const struct address_space_operations erofs_fscache_access_aops = {
+ .readpage = erofs_fscache_readpage,
+};
+
static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
char *path)
{
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index ff62f84f47d3..2f450cb3a7b9 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -296,7 +296,11 @@ static int erofs_fill_inode(struct inode *inode, int isdir)
err = z_erofs_fill_inode(inode);
goto out_unlock;
}
- inode->i_mapping->a_ops = &erofs_raw_access_aops;
+
+ if (erofs_bdev_mode(inode->i_sb))
+ inode->i_mapping->a_ops = &erofs_raw_access_aops;
+ else
+ inode->i_mapping->a_ops = &erofs_fscache_access_aops;
out_unlock:
erofs_put_metabuf(&buf);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index fca706cfaf72..548f928b0ded 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -393,6 +393,7 @@ struct page *erofs_grab_cache_page_nowait(struct address_space *mapping,
extern const struct super_operations erofs_sops;
extern const struct address_space_operations erofs_raw_access_aops;
+extern const struct address_space_operations erofs_fscache_access_aops;
extern const struct address_space_operations z_erofs_aops;
/*
--
2.27.0
Register an fscache_cookie for the bootstrap blob file. The bootstrap
blob file can be specified by a new mount option, which is going to be
introduced by a following patch.
Two things are worth mentioning about the cleanup routine.
1. The init routine runs before the root inode gets initialized,
and thus the corresponding cleanup routine shall be placed under the
.kill_sb() callback.
2. The init routine will instantiate anonymous inodes under the
super_block, and thus the .put_super() callback shall also contain the
cleanup routine. Otherwise we'll get a "VFS: Busy inodes after unmount."
warning.
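Calling the cleanup routine from both callbacks is safe only because the
put helper tolerates NULL and .put_super() clears the pointer. The
user-space model below sketches that reasoning; all model_* names and
the live_contexts counter are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct model_ctx { int cookie; };
struct model_sbi { struct model_ctx *bootstrap; };

int live_contexts;	/* number of contexts not yet released */

struct model_ctx *model_get_ctx(void)
{
	struct model_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		live_contexts++;
	return ctx;
}

/* Like erofs_fscache_put_ctx(): a NULL context is silently ignored. */
void model_put_ctx(struct model_ctx *ctx)
{
	if (!ctx)
		return;
	assert(live_contexts > 0);
	live_contexts--;
	free(ctx);
}

/* Runs only if the mount got far enough to have a root inode. */
void model_put_super(struct model_sbi *sbi)
{
	model_put_ctx(sbi->bootstrap);
	sbi->bootstrap = NULL;	/* makes the later kill_sb call a no-op */
}

/* Always runs, including after a failed mount. */
void model_kill_sb(struct model_sbi *sbi)
{
	model_put_ctx(sbi->bootstrap);
}
```

On a normal unmount both callbacks run and the context is released
once; on a failed mount only model_kill_sb() runs and still releases it.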
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/internal.h | 3 +++
fs/erofs/super.c | 13 +++++++++++++
2 files changed, 16 insertions(+)
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index bb5e992fe0df..277dcd5888ea 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -75,6 +75,7 @@ struct erofs_mount_opts {
unsigned int max_sync_decompress_pages;
#endif
unsigned int mount_opt;
+ char *uuid;
};
struct erofs_dev_context {
@@ -152,6 +153,8 @@ struct erofs_sb_info {
/* sysfs support */
struct kobject s_kobj; /* /sys/fs/erofs/<devname> */
struct completion s_kobj_unregister;
+
+ struct erofs_fscache_context *bootstrap;
};
#define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 798f0c379e35..8c5783c6f71f 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -598,6 +598,16 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
sbi->devs = ctx->devs;
ctx->devs = NULL;
+ if (!erofs_bdev_mode(sb)) {
+ struct erofs_fscache_context *bootstrap;
+
+ bootstrap = erofs_fscache_get_ctx(sb, ctx->opt.uuid, true);
+ if (IS_ERR(bootstrap))
+ return PTR_ERR(bootstrap);
+
+ sbi->bootstrap = bootstrap;
+ }
+
err = erofs_read_superblock(sb);
if (err)
return err;
@@ -753,6 +763,7 @@ static void erofs_kill_sb(struct super_block *sb)
return;
erofs_free_dev_context(sbi->devs);
+ erofs_fscache_put_ctx(sbi->bootstrap);
fs_put_dax(sbi->dax_dev);
kfree(sbi);
sb->s_fs_info = NULL;
@@ -771,6 +782,8 @@ static void erofs_put_super(struct super_block *sb)
iput(sbi->managed_cache);
sbi->managed_cache = NULL;
#endif
+ erofs_fscache_put_ctx(sbi->bootstrap);
+ sbi->bootstrap = NULL;
}
static struct file_system_type erofs_fs_type = {
--
2.27.0
This patch implements the data plane of reading metadata from the
bootstrap blob file over fscache.
Note that currently it only supports the scenario where the backing
file has no holes. Once it hits a hole in the backing file, erofs will
fail the IO with -EOPNOTSUPP for now. The following patch will fix this
issue by implementing the on-demand reading mode.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/data.c | 11 +++++++++--
fs/erofs/fscache.c | 33 +++++++++++++++++++++++++++++++++
fs/erofs/internal.h | 3 +++
3 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index f3aa133866e5..51ccbc02dd73 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -31,15 +31,22 @@ void erofs_put_metabuf(struct erofs_buf *buf)
void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
erofs_blk_t blkaddr, enum erofs_kmap_type type)
{
- struct address_space *const mapping = sb->s_bdev->bd_inode->i_mapping;
+ struct address_space *mapping;
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
erofs_off_t offset = blknr_to_addr(blkaddr);
pgoff_t index = offset >> PAGE_SHIFT;
struct page *page = buf->page;
if (!page || page->index != index) {
erofs_put_metabuf(buf);
- page = read_cache_page_gfp(mapping, index,
+ if (erofs_bdev_mode(sb)) {
+ mapping = sb->s_bdev->bd_inode->i_mapping;
+ page = read_cache_page_gfp(mapping, index,
mapping_gfp_constraint(mapping, ~__GFP_FS));
+ } else {
+ page = erofs_fscache_read_cache_page(sbi->bootstrap,
+ index);
+ }
if (IS_ERR(page))
return page;
/* should already be PageUptodate, no need to lock page */
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 74683df6144d..5a25ae523e5e 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -6,9 +6,42 @@
static struct fscache_volume *volume;
+static int erofs_blob_begin_cache_operation(struct netfs_read_request *rreq)
+{
+ return fscache_begin_read_operation(&rreq->cache_resources,
+ rreq->netfs_priv);
+}
+
+/* .cleanup() is needed if rreq->netfs_priv is non-NULL */
+static void erofs_noop_cleanup(struct address_space *mapping, void *netfs_priv)
+{
+}
+
+static const struct netfs_read_request_ops erofs_blob_req_ops = {
+ .begin_cache_operation = erofs_blob_begin_cache_operation,
+ .cleanup = erofs_noop_cleanup,
+};
+
+static int erofs_fscache_blob_readpage(struct file *data, struct page *page)
+{
+ struct folio *folio = page_folio(page);
+ struct erofs_fscache_context *ctx =
+ (struct erofs_fscache_context *)data;
+
+ return netfs_readpage(NULL, folio, &erofs_blob_req_ops, ctx->cookie);
+}
+
static const struct address_space_operations erofs_fscache_blob_aops = {
+ .readpage = erofs_fscache_blob_readpage,
};
+struct page *erofs_fscache_read_cache_page(struct erofs_fscache_context *ctx,
+ pgoff_t index)
+{
+ DBG_BUGON(!ctx->inode);
+ return read_mapping_page(ctx->inode->i_mapping, index, ctx);
+}
+
static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
char *path)
{
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 277dcd5888ea..fca706cfaf72 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -633,6 +633,9 @@ struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
char *path, bool need_inode);
void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
+struct page *erofs_fscache_read_cache_page(struct erofs_fscache_context *ctx,
+ pgoff_t index);
+
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#endif /* __EROFS_INTERNAL_H */
--
2.27.0
Implement the .issue_op() callback; all the work is done by
netfs_ondemand_read().
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index e8df35ee4ba8..9ba668c42098 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -28,9 +28,15 @@ static void erofs_noop_cleanup(struct address_space *mapping, void *netfs_priv)
{
}
+static void erofs_issue_op(struct netfs_read_subrequest *subreq)
+{
+ netfs_ondemand_read(subreq);
+}
+
static const struct netfs_read_request_ops erofs_blob_req_ops = {
.begin_cache_operation = erofs_blob_begin_cache_operation,
.cleanup = erofs_noop_cleanup,
+ .issue_op = erofs_issue_op,
};
static int erofs_begin_cache_operation(struct netfs_read_request *rreq)
@@ -58,6 +64,7 @@ static const struct netfs_read_request_ops erofs_req_ops = {
.begin_cache_operation = erofs_begin_cache_operation,
.cleanup = erofs_noop_cleanup,
.clamp_length = erofs_clamp_length,
+ .issue_op = erofs_issue_op,
};
static int erofs_fscache_blob_readpage(struct file *data, struct page *page)
--
2.27.0
Introduce one anonymous inode to manage the page cache of the
corresponding blob file. erofs can then read directly from the address
space of the anonymous inode on a cache hit.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/fscache.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
fs/erofs/internal.h | 3 ++-
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 10c3f5ea9e24..74683df6144d 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -6,6 +6,9 @@
static struct fscache_volume *volume;
+static const struct address_space_operations erofs_fscache_blob_aops = {
+};
+
static int erofs_fscache_init_cookie(struct erofs_fscache_context *ctx,
char *path)
{
@@ -36,8 +39,34 @@ void erofs_fscache_cleanup_cookie(struct erofs_fscache_context *ctx)
ctx->cookie = NULL;
}
+static int erofs_fscache_get_inode(struct erofs_fscache_context *ctx,
+ struct super_block *sb)
+{
+ struct inode *const inode = new_inode(sb);
+
+ if (!inode)
+ return -ENOMEM;
+
+ set_nlink(inode, 1);
+ inode->i_size = OFFSET_MAX;
+
+ inode->i_mapping->a_ops = &erofs_fscache_blob_aops;
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_NOFS | __GFP_HIGHMEM | __GFP_MOVABLE);
+ ctx->inode = inode;
+ return 0;
+}
+
+static inline
+void erofs_fscache_put_inode(struct erofs_fscache_context *ctx)
+{
+ iput(ctx->inode);
+ ctx->inode = NULL;
+}
+
static int erofs_fscahce_init_ctx(struct erofs_fscache_context *ctx,
- struct super_block *sb, char *path)
+ struct super_block *sb, char *path,
+ bool need_inode)
{
int ret;
@@ -47,6 +76,15 @@ static int erofs_fscahce_init_ctx(struct erofs_fscache_context *ctx,
return ret;
}
+ if (need_inode) {
+ ret = erofs_fscache_get_inode(ctx, sb);
+ if (ret) {
+ erofs_err(sb, "failed to get anonymous inode");
+ erofs_fscache_cleanup_cookie(ctx);
+ return ret;
+ }
+ }
+
return 0;
}
@@ -54,10 +92,11 @@ static inline
void erofs_fscache_cleanup_ctx(struct erofs_fscache_context *ctx)
{
erofs_fscache_cleanup_cookie(ctx);
+ erofs_fscache_put_inode(ctx);
}
struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
- char *path)
+ char *path, bool need_inode)
{
struct erofs_fscache_context *ctx;
int ret;
@@ -66,7 +105,7 @@ struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
if (!ctx)
return ERR_PTR(-ENOMEM);
- ret = erofs_fscahce_init_ctx(ctx, sb, path);
+ ret = erofs_fscahce_init_ctx(ctx, sb, path, need_inode);
if (ret) {
kfree(ctx);
return ERR_PTR(ret);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 1f5bc69e8e9f..bb5e992fe0df 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -99,6 +99,7 @@ struct erofs_sb_lz4_info {
struct erofs_fscache_context {
struct fscache_cookie *cookie;
+ struct inode *inode;
};
struct erofs_sb_info {
@@ -626,7 +627,7 @@ int erofs_init_fscache(void);
void erofs_exit_fscache(void);
struct erofs_fscache_context *erofs_fscache_get_ctx(struct super_block *sb,
- char *path);
+ char *path, bool need_inode);
void erofs_fscache_put_ctx(struct erofs_fscache_context *ctx);
#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
--
2.27.0
Until now erofs has been exclusively a blockdev-based filesystem. In other
usage scenarios (e.g. container images), erofs needs to run on top of files.
This patch set introduces a new nodev mode, in which erofs
can be mounted from a bootstrap blob file containing the complete erofs
image.
Add a helper to check which mode erofs is working in.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/internal.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index f9f94d63d40f..2b9337d385ce 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -161,6 +161,11 @@ struct erofs_sb_info {
#define set_opt(opt, option) ((opt)->mount_opt |= EROFS_MOUNT_##option)
#define test_opt(opt, option) ((opt)->mount_opt & EROFS_MOUNT_##option)
+static inline bool erofs_bdev_mode(struct super_block *sb)
+{
+ return sb->s_bdev;
+}
+
enum {
EROFS_ZIP_CACHE_DISABLED,
EROFS_ZIP_CACHE_READAHEAD,
--
2.27.0
Similar to the multi-device mode, erofs can be mounted from multiple
blob files (one bootstrap blob file and, optionally, multiple data blob
files). In this case, each device slot contains the path of the
corresponding data blob file.
This patch registers the corresponding cookie context for each data blob
file.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/internal.h | 1 +
fs/erofs/super.c | 27 +++++++++++++++++++--------
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 548f928b0ded..5d514c7b73cc 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -53,6 +53,7 @@ struct erofs_device_info {
struct block_device *bdev;
struct dax_device *dax_dev;
u64 dax_part_off;
+ struct erofs_fscache_context *ctx;
u32 blocks;
u32 mapped_blkaddr;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 8c5783c6f71f..f058a04a00c7 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -250,6 +250,7 @@ static int erofs_init_devices(struct super_block *sb,
down_read(&sbi->devs->rwsem);
idr_for_each_entry(&sbi->devs->tree, dif, id) {
struct block_device *bdev;
+ struct erofs_fscache_context *ctx;
ptr = erofs_read_metabuf(&buf, sb, erofs_blknr(pos),
EROFS_KMAP);
@@ -259,15 +260,24 @@ static int erofs_init_devices(struct super_block *sb,
}
dis = ptr + erofs_blkoff(pos);
- bdev = blkdev_get_by_path(dif->path,
- FMODE_READ | FMODE_EXCL,
- sb->s_type);
- if (IS_ERR(bdev)) {
- err = PTR_ERR(bdev);
- break;
+ if (erofs_bdev_mode(sb)) {
+ bdev = blkdev_get_by_path(dif->path,
+ FMODE_READ | FMODE_EXCL,
+ sb->s_type);
+ if (IS_ERR(bdev)) {
+ err = PTR_ERR(bdev);
+ break;
+ }
+ dif->bdev = bdev;
+ dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
+ } else {
+ ctx = erofs_fscache_get_ctx(sb, dif->path, false);
+ if (IS_ERR(ctx)) {
+ err = PTR_ERR(ctx);
+ break;
+ }
+ dif->ctx = ctx;
}
- dif->bdev = bdev;
- dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off);
dif->blocks = le32_to_cpu(dis->blocks);
dif->mapped_blkaddr = le32_to_cpu(dis->mapped_blkaddr);
sbi->total_blocks += dif->blocks;
@@ -694,6 +704,7 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
{
struct erofs_device_info *dif = ptr;
+ erofs_fscache_put_ctx(dif->ctx);
fs_put_dax(dif->dax_dev);
if (dif->bdev)
blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
--
2.27.0
The only change is that meta buffers now read the cache page without the
__GFP_FS flag, which should not matter.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/super.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 915eefe0d7e2..12755217631f 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -281,21 +281,19 @@ static int erofs_init_devices(struct super_block *sb,
static int erofs_read_superblock(struct super_block *sb)
{
struct erofs_sb_info *sbi;
- struct page *page;
+ struct erofs_buf buf = __EROFS_BUF_INITIALIZER;
struct erofs_super_block *dsb;
unsigned int blkszbits;
void *data;
int ret;
- page = read_mapping_page(sb->s_bdev->bd_inode->i_mapping, 0, NULL);
- if (IS_ERR(page)) {
+ data = erofs_read_metabuf(&buf, sb, 0, EROFS_KMAP);
+ if (IS_ERR(data)) {
erofs_err(sb, "cannot read erofs superblock");
- return PTR_ERR(page);
+ return PTR_ERR(data);
}
sbi = EROFS_SB(sb);
-
- data = kmap(page);
dsb = (struct erofs_super_block *)(data + EROFS_SUPER_OFFSET);
ret = -EINVAL;
@@ -365,8 +363,7 @@ static int erofs_read_superblock(struct super_block *sb)
if (erofs_sb_has_ztailpacking(sbi))
erofs_info(sb, "EXPERIMENTAL compressed inline data feature in use. Use at your own risk!");
out:
- kunmap(page);
- put_page(page);
+ erofs_put_metabuf(&buf);
return ret;
}
--
2.27.0
... so that it can be used in fs/erofs/fscache.c, which is introduced in a following patch.
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/erofs/data.c | 4 ++--
fs/erofs/internal.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index fa7ddb7ad980..f3aa133866e5 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -104,8 +104,8 @@ static int erofs_map_blocks_flatmode(struct inode *inode,
return 0;
}
-static int erofs_map_blocks(struct inode *inode,
- struct erofs_map_blocks *map, int flags)
+int erofs_map_blocks(struct inode *inode,
+ struct erofs_map_blocks *map, int flags)
{
struct super_block *sb = inode->i_sb;
struct erofs_inode *vi = EROFS_I(inode);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index b8272fb95fd6..f9f94d63d40f 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -484,6 +484,8 @@ void *erofs_read_metabuf(struct erofs_buf *buf, struct super_block *sb,
int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *dev);
int erofs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+int erofs_map_blocks(struct inode *inode,
+ struct erofs_map_blocks *map, int flags);
/* inode.c */
static inline unsigned long erofs_inode_hash(erofs_nid_t nid)
--
2.27.0
Hi David,
On Tue, Jan 18, 2022 at 09:11:56PM +0800, Jeffle Xu wrote:
> changes since v1:
> - rebase to v5.17
> - erofs: In chunk based layout, since the logical file offset has the
> same remainder over PAGE_SIZE with the corresponding physical address
> inside the data blob file, the file page cache can be directly
> transferred to netfs library to contain the data from data blob file.
> (patch 15) (Gao Xiang)
> - netfs,cachefiles: manage logical/physical offset separately. (patch 2)
> (It is used by erofs_begin_cache_operation() in patch 15.)
> - cachefiles: introduce a new devnode specifically for on-demand reading.
> (patch 6)
> - netfs,fscache,cachefiles: add new CONFIG_* for on-demand reading.
> (patch 3/5)
> - You could start a quick test by
> https://github.com/lostjeffle/demand-read-cachefilesd
> - add more background information (mainly introduction to nydus) in the
> "Background" part of this cover letter
>
> [Important Issues]
> The following issues still need further discussion. Thanks for your time
> and patience.
>
> 1. I noticed that there's refactoring of netfs library[1], and patch 1
> is not needed since [2].
>
> 2. The current implementation will severely conflict with the
> refactoring of netfs library[1][2]. The assumption of 'struct
> netfs_i_context' [2] is that, every file in the upper netfs will
> correspond to only one backing file. While in our scenario, one file in
> erofs can correspond to multiple backing files. That is, the content of
> one file can be divided into multiple chunks, and are distributed over
> multiple blob files, i.e. multiple backing files. Currently I have no
> good idea solving this conflict.
>
Would you mind giving more hints on this? Personally, I still think fscache
is a useful and clean way to handle on-demand loading for image distribution,
in addition to caching network fs data, as a more generic in-kernel caching
framework. Judging from the current diffstat, it only slightly
modifies netfslib and cachefiles (apart from requiring a new daemon):
fs/netfs/Kconfig | 8 +
fs/netfs/read_helper.c | 65 ++++++--
include/linux/netfs.h | 10 ++
fs/cachefiles/Kconfig | 8 +
fs/cachefiles/daemon.c | 147 ++++++++++++++++-
fs/cachefiles/internal.h | 23 +++
fs/cachefiles/io.c | 82 +++++++++-
fs/cachefiles/main.c | 27 ++++
fs/cachefiles/namei.c | 60 ++++++-
Besides, I think that setting cookies according to the data mapping
(instead of a fixed cookie per file) would benefit the following scenario in
addition to our on-demand load use cases:
It would benefit file cache data deduplication. As far as I can see,
netfslib may see some follow-on development in order to support
encryption and compression. However, I think cache data deduplication
is also potentially useful for minimizing cache storage, since many local
fses already support reflink. That said, I'm not sure it's a great
idea for cachefiles to rely on the abilities of the underlying fs for cache
deduplication.
So for cache deduplication scenarios, I'm not sure a per-file cookie is
still a good idea for us (the alternative, maintaining a more complicated
mapping per cookie inside fscache besides the filesystem mapping, is
unnecessary IMO).
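
To make the "cookie per data mapping" idea concrete, here is a minimal
userspace sketch. It is not code from this patch set: the chunk_map
structure, the lookup_chunk() helper and the integer blob_id (standing in
for a per-blob fscache cookie) are hypothetical names used only for
illustration, and a linear scan is assumed for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: one chunk of a file stored in one blob. */
struct chunk_map {
	uint64_t logical;   /* start offset inside the file */
	uint64_t length;    /* chunk length in bytes */
	int      blob_id;   /* stands in for a per-blob cookie */
	uint64_t physical;  /* start offset inside the blob */
};

/*
 * Translate a logical file offset into (blob, offset inside the blob).
 * Returns 0 on success, -1 if no chunk covers the offset.
 */
static int lookup_chunk(const struct chunk_map *maps, size_t nr,
			uint64_t pos, int *blob_id, uint64_t *blob_pos)
{
	assert(blob_id && blob_pos);
	assert(maps || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		const struct chunk_map *m = &maps[i];

		if (pos >= m->logical && pos - m->logical < m->length) {
			*blob_id = m->blob_id;
			*blob_pos = m->physical + (pos - m->logical);
			return 0;
		}
	}
	return -1;
}
```

With such a table, two files sharing the same chunk would simply point at
the same (blob_id, physical) pair, which is where the deduplication comes
from.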
By the way, in general, I'm not sure it's a great idea to cache on a
per-file basis (especially with many small files), which is why we
introduced deduplicated data blobs. At least, it's simpler for read-only
fses. Recently, I found another good article that summarizes this:
http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
Thanks,
Gao Xiang
Hi David,
Sincerely, would you mind sharing whether you like this patch set or not? It
seems that the use case of file-based on-demand loading is quite general.
And as Gao Xiang noted, we still prefer to use fscache to implement this
scenario, since fscache already works well as the local cache for remote
netfs.
I'd humbly like to know whether this potential new requirement for fscache
meets your expectations or future plans for fscache. If it does, then we can
improve the patch set in later versions. Also, please let me know if it
indeed deviates from the roadmap of fscache.
Thanks,
Jeffle
On 1/19/22 2:40 PM, Gao Xiang wrote:
> Hi David,
>
> On Tue, Jan 18, 2022 at 09:11:56PM +0800, Jeffle Xu wrote:
>> changes since v1:
>> - rebase to v5.17
>> - erofs: In chunk based layout, since the logical file offset has the
>> same remainder over PAGE_SIZE with the corresponding physical address
>> inside the data blob file, the file page cache can be directly
>> transferred to netfs library to contain the data from data blob file.
>> (patch 15) (Gao Xiang)
>> - netfs,cachefiles: manage logical/physical offset separately. (patch 2)
>> (It is used by erofs_begin_cache_operation() in patch 15.)
>> - cachefiles: introduce a new devnode specifically for on-demand reading.
>> (patch 6)
>> - netfs,fscache,cachefiles: add new CONFIG_* for on-demand reading.
>> (patch 3/5)
>> - You could start a quick test by
>> https://github.com/lostjeffle/demand-read-cachefilesd
>> - add more background information (mainly introduction to nydus) in the
>> "Background" part of this cover letter
>>
>> [Important Issues]
>> The following issues still need further discussion. Thanks for your time
>> and patience.
>>
>> 1. I noticed that there's refactoring of netfs library[1], and patch 1
>> is not needed since [2].
>>
>> 2. The current implementation will severely conflict with the
>> refactoring of netfs library[1][2]. The assumption of 'struct
>> netfs_i_context' [2] is that, every file in the upper netfs will
>> correspond to only one backing file. While in our scenario, one file in
>> erofs can correspond to multiple backing files. That is, the content of
>> one file can be divided into multiple chunks, and are distributed over
>> multiple blob files, i.e. multiple backing files. Currently I have no
>> good idea solving this conflict.
>>
>
> Would you mind give more hints on this? Personally, I still think fscache
> is useful and clean way for image distribution on-demand load use cases
> in addition to cache network fs data as a more generic in-kernel caching
> framework. From the point view of current codestat, it has slight
> modification of netfslib and cachefiles (except for a new daemon):
> fs/netfs/Kconfig | 8 +
> fs/netfs/read_helper.c | 65 ++++++--
> include/linux/netfs.h | 10 ++
>
> fs/cachefiles/Kconfig | 8 +
> fs/cachefiles/daemon.c | 147 ++++++++++++++++-
> fs/cachefiles/internal.h | 23 +++
> fs/cachefiles/io.c | 82 +++++++++-
> fs/cachefiles/main.c | 27 ++++
> fs/cachefiles/namei.c | 60 ++++++-
>
> Besides, I think that cookies can be set according to data mapping
> (instead of fixed per file) will benefit the following scenario in
> addition to our on-demand load use cases:
> It will benefit file cache data deduplication. What I can see is that
> netfslib may have some follow-on development in order to support
> encryption and compression. However, I think cache data deduplication
> is also potentially useful to minimize cache storage since many local
> fses already support reflink. However, I'm not sure if it's a great
> idea that cachefile relies on underlayfs abilities for cache deduplication.
> So for cache deduplication scenarios, I'm not sure per-file cookie is
> still a good idea for us (or alternatively, maintain more complicated
> mapping per cookie inside fscache besides filesystem mapping, too
> unnecessary IMO).
>
> By the way, in general, I'm not sure if it's a great idea to cache in
> per-file basis (especially for too many small files), that is why we
> introduced data deduplicated blobs. At least, it's simpler for read-only
> fses. Recently, I found another good article to summarize this:
> http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html
>
> Thanks,
> Gao Xiang
>
--
Thanks,
Jeffle
Jeffle Xu <[email protected]> wrote:
> You could start a quick test by
> https://github.com/lostjeffle/demand-read-cachefilesd
Can you pull this up to v5.17-rc1 or my netfs-lib branch?
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
I'll do my best to have a look at it tomorrow.
Thanks,
David
On 1/25/22 1:23 AM, David Howells wrote:
> Jeffle Xu <[email protected]> wrote:
>
>> You could start a quick test by
>> https://github.com/lostjeffle/demand-read-cachefilesd
There is a quick test script in this repo in addition to the daemon
(temporarily named cachefilesd2).
>
> Can you pull this up to v5.17-rc1 or my netfs-lib branch?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
This kernel patch set is basically rebased to v5.17-rc1, rather than
to the netfs-lib branch. I can see there is quite a lot of refactoring of
the netfs lib in the netfs-lib branch.
>
> I'll do my best to have a look at it tomorrow.
>
Thanks a lot.
--
Thanks,
Jeffle
On 1/25/22 1:23 AM, David Howells wrote:
> Jeffle Xu <[email protected]> wrote:
>
>> You could start a quick test by
>> https://github.com/lostjeffle/demand-read-cachefilesd
>
> Can you pull this up to v5.17-rc1 or my netfs-lib branch?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
>
Hi, you can check this kernel patch set on the following branch, which
has been rebased to v5.17-rc1.
https://github.com/lostjeffle/linux/commits/master
--
Thanks,
Jeffle
Jeffle Xu <[email protected]> wrote:
> +static int erofs_fscahce_init_ctx(struct erofs_fscache_context *ctx,
fscahce => fscache?
David
Jeffle Xu <[email protected]> wrote:
> The following issues still need further discussion. Thanks for your time
> and patience.
>
> 1. I noticed that there's refactoring of netfs library[1],
> ...
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
Yes. I'm working towards getting netfslib to handle writes and dio as
well as reads, along with content crypto/compression, and the idea I'm aiming
towards is that you just point your address_space_ops at netfs directly if
possible - but it's going to require its own context now to manage pending
writes.
See my netfs-experimental branch for more of that - it's still a work in
progress, though.
Btw, you could set rreq->netfs_priv in ->init_rreq() rather than passing it in
to netfs_readpage().
> 2. The current implementation will severely conflict with the
> refactoring of netfs library[1][2]. The assumption of 'struct
> netfs_i_context' [2] is that, every file in the upper netfs will
> correspond to only one backing file. While in our scenario, one file in
> erofs can correspond to multiple backing files. That is, the content of
> one file can be divided into multiple chunks, and are distributed over
> multiple blob files, i.e. multiple backing files. Currently I have no
> good idea solving this conflict.
I can think of a couple of options to explore:
(1) Duplicate the cachefiles backend. You can discard a lot of it, since
much of it is concerned with managing local modifications - which you're
not going to do since you have a R/O filesystem and you're looking at
importing files into the cache externally to the kernel.
I would suggest looking to see if you can do the blob mapping in the
backend rather than passing the offset down. Maybe make the cookie index
key hold the index too, e.g. "/path/to/file+offset".
Btw, do you still need cachefilesd for its culling duties?
(2) Do you actually need to go through netfslib? Might it be easier to call
fscache_read() directly? Have a look at fs/nfs/fscache.c
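
As an illustration of the "/path/to/file+offset" key format suggested in
option (1), here is a minimal userspace sketch. The make_blob_key() helper
and its exact "<path>+<decimal offset>" encoding are assumptions made for
this example only; how such a key would be passed to fscache when acquiring
a cookie is not shown.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build a cookie index key of the form "<path>+<offset>".
 * Returns the key length, or -1 if the key does not fit in buf.
 */
static int make_blob_key(char *buf, size_t size,
			 const char *path, unsigned long long offset)
{
	int n;

	assert(buf && path);

	n = snprintf(buf, size, "%s+%llu", path, offset);
	if (n < 0 || (size_t)n >= size)
		return -1;	/* encoding error or truncated key */
	return n;
}
```

Since the offset is part of the key, two regions of the same blob file get
distinct keys, so the backend could resolve each one without the offset
being passed down separately.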
> Besides there are still two questions:
> - What's the plan of [1]? When is it planned to be merged?
Hopefully next merge window, but that's going to depend on a number of things.
> - It seems that all upper fs using fscache is going to use netfs API,
> while the APIs like fscache_read_or_alloc_page() are deprecated. Is
> that true?
fscache_read_or_alloc_page() is gone completely.
You don't have to use the netfs API. You can talk to fscache directly,
doing DIO from the cache to an xarray-class iov_iter constructed from your
inode's pagecache.
netfslib provides/will provide a number of services, such as multipage
folios, transparent caching, crypto, compression and hiding the existence of
pages/folios from the filesystem as entirely as possible. However, you
already have some of these implemented on top of iomap for the blockdev
interface, it would appear.
David
David Howells <[email protected]> wrote:
> (1) Duplicate the cachefiles backend. You can discard a lot of it, since
> much of it is concerned with managing local modifications - which you're
> not going to do since you have a R/O filesystem and you're looking at
> importing files into the cache externally to the kernel.
Take the attached as a start. It's completely untested. I've stripped out
anything to do with writing to the cache, making directories, etc. as that can
probably be delegated to the on-demand creation. You could drive on-demand
creation from the points where it would create files. I've put some "TODO"
comments in there as markers.
You could also strip out everything to do with invalidation and also make it
just fail if it encounters a file type that it doesn't like or a file that is
not correctly labelled for a coherency attribute.
Also, since you aren't intending to write anything or create new files here,
there's no need to do the space checking - so I've got rid of all that too.
I've also made it open the backing files read only and got rid of the trimming
to I/O blocksize for DIO purposes. The userspace side can take care of that -
and, besides, you want to have multiple files within a backing file, right?
You might want to stop it from marking cache *files* in use (but only mark
directories). It doesn't matter so much as you aren't going to get coherency
issues from having multiple writers to the same file.
You then need to add a file offset member to the erofscache_object struct, set
that when the backing file is looked up and add it to the file position in
erofscache_read(). You also need to look at erofscache_prepare_read(). If
your files are contiguous complete blobs, that can be a lot simpler.
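
The offset handling described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: struct
blob_object and blob_object_map() are hypothetical stand-ins for the
erofscache_object member and the adjustment in erofscache_read(), and each
object is assumed to be one contiguous region of the backing file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-object state described above. */
struct blob_object {
	int64_t file_offset; /* start of this object inside the backing file */
	int64_t length;      /* number of bytes belonging to the object */
};

/*
 * Map an object-relative read (pos, *len) onto the backing file.
 * Clamps *len to the end of the object and returns the backing-file
 * position to read from, or -1 if pos lies outside the object.
 */
static int64_t blob_object_map(const struct blob_object *obj,
			       int64_t pos, size_t *len)
{
	assert(obj && len);

	if (pos < 0 || pos >= obj->length)
		return -1;
	if (*len > (uint64_t)(obj->length - pos))
		*len = (size_t)(obj->length - pos);
	return obj->file_offset + pos;
}
```

A caller would pass the returned position and the clamped length to the
actual read of the backing file, e.g. pread(2) in userspace.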
Also, you might want to rename erofscache to something more suitable.
David
---
commit 6fb0e557451e1cd909679fea183822ae92eed67f
Author: David Howells <[email protected]>
Date: Tue Jan 25 16:44:10 2022 +0000
erofs: Create specialised cache backend
diff --git a/fs/Kconfig b/fs/Kconfig
index 7a2b11c0b803..4ed1e704c0d4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -133,6 +133,7 @@ menu "Caches"
source "fs/netfs/Kconfig"
source "fs/fscache/Kconfig"
source "fs/cachefiles/Kconfig"
+source "fs/erofscache/Kconfig"
endmenu
diff --git a/fs/Makefile b/fs/Makefile
index dab324aea08f..4780b9c919a8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_NILFS2_FS) += nilfs2/
obj-$(CONFIG_BEFS_FS) += befs/
obj-$(CONFIG_HOSTFS) += hostfs/
obj-$(CONFIG_CACHEFILES) += cachefiles/
+obj-$(CONFIG_EROFSCACHE) += erofscache/
obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_TRACING) += tracefs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
diff --git a/fs/erofscache/Kconfig b/fs/erofscache/Kconfig
new file mode 100644
index 000000000000..29503879dce4
--- /dev/null
+++ b/fs/erofscache/Kconfig
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config EROFSCACHE
+ tristate "Filesystem caching on files"
+ depends on FSCACHE && BLOCK
+ help
+ This permits use of a mounted filesystem as a cache for other
+ filesystems - primarily networking filesystems - thus allowing fast
+ local disk to enhance the speed of slower devices.
+
+ See Documentation/filesystems/caching/erofscache.rst for more
+ information.
+
+config EROFSCACHE_DEBUG
+ bool "Debug Erofscache"
+ depends on EROFSCACHE
+ help
+ This permits debugging to be dynamically enabled in the filesystem
+ caching on files module. If this is set, the debugging output may be
+ enabled by setting bits in /sys/module/erofscache/parameters/debug or
+ by including a debugging specifier in /etc/erofscached.conf.
+
+config EROFSCACHE_ERROR_INJECTION
+ bool "Provide error injection for erofscache"
+ depends on EROFSCACHE && SYSCTL
+ help
+ This permits error injection to be enabled in erofscache whilst a
+ cache is in service.
diff --git a/fs/erofscache/Makefile b/fs/erofscache/Makefile
new file mode 100644
index 000000000000..e22df4eb400b
--- /dev/null
+++ b/fs/erofscache/Makefile
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for on-demand caching for erofs
+#
+
+erofscache-y := \
+ cache.o \
+ daemon.o \
+ interface.o \
+ io.o \
+ key.o \
+ main.o \
+ namei.o \
+ security.o \
+ volume.o \
+ xattr.o
+
+erofscache-$(CONFIG_EROFSCACHE_ERROR_INJECTION) += error_inject.o
+
+obj-$(CONFIG_EROFSCACHE) := erofscache.o
diff --git a/fs/erofscache/cache.c b/fs/erofscache/cache.c
new file mode 100644
index 000000000000..fad00abbc218
--- /dev/null
+++ b/fs/erofscache/cache.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Manage high-level VFS aspects of a cache.
+ *
+ * Copyright (C) 2007, 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/slab.h>
+#include <linux/statfs.h>
+#include <linux/namei.h>
+#include "internal.h"
+
+/*
+ * Bring a cache online.
+ */
+int erofscache_add_cache(struct erofscache_cache *cache)
+{
+ struct fscache_cache *cache_cookie;
+ struct path path;
+ struct dentry *graveyard, *cachedir, *root;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter("");
+
+ cache_cookie = fscache_acquire_cache(cache->tag);
+ if (IS_ERR(cache_cookie))
+ return PTR_ERR(cache_cookie);
+
+ /* we want to work under the module's security ID */
+ ret = erofscache_get_security_ID(cache);
+ if (ret < 0)
+ goto error_getsec;
+
+ erofscache_begin_secure(cache, &saved_cred);
+
+ /* look up the directory at the root of the cache */
+ ret = kern_path(cache->rootdirname, LOOKUP_DIRECTORY, &path);
+ if (ret < 0)
+ goto error_open_root;
+
+ cache->mnt = path.mnt;
+ root = path.dentry;
+
+ cache->bsize = EROFSCACHE_DIO_BLOCK_SIZE;
+ cache->bshift = ilog2(cache->bsize);
+
+ ret = -EINVAL;
+ if (is_idmapped_mnt(path.mnt)) {
+ pr_warn("File cache on idmapped mounts not supported");
+ goto error_unsupported;
+ }
+
+ /* Check features of the backing filesystem:
+ * - Directories must support looking up
+ * - We use xattrs to store metadata
+ * - We use DIO to pages, so the blocksize mustn't be too big.
+ */
+ ret = -EOPNOTSUPP;
+ if (d_is_negative(root) ||
+ !d_backing_inode(root)->i_op->lookup ||
+ !(d_backing_inode(root)->i_opflags & IOP_XATTR) ||
+ root->d_sb->s_blocksize > PAGE_SIZE)
+ goto error_unsupported;
+
+ /* determine the security of the on-disk cache as this governs
+ * security ID of files we create */
+ ret = erofscache_determine_cache_security(cache, root, &saved_cred);
+ if (ret < 0)
+ goto error_unsupported;
+
+ /* get the cache directory and check its type */
+ cachedir = erofscache_get_directory(cache, root, "cache");
+ if (IS_ERR(cachedir)) {
+ ret = PTR_ERR(cachedir);
+ goto error_unsupported;
+ }
+
+ cache->store = cachedir;
+
+ /* get the graveyard directory */
+ graveyard = erofscache_get_directory(cache, root, "graveyard");
+ if (IS_ERR(graveyard)) {
+ ret = PTR_ERR(graveyard);
+ goto error_unsupported;
+ }
+
+ cache->graveyard = graveyard;
+ cache->cache = cache_cookie;
+
+ ret = fscache_add_cache(cache_cookie, &erofscache_cache_ops, cache);
+ if (ret < 0)
+ goto error_add_cache;
+
+ /* done */
+ set_bit(EROFSCACHE_READY, &cache->flags);
+ dput(root);
+
+ pr_info("File cache on %s registered\n", cache_cookie->name);
+
+ erofscache_end_secure(cache, saved_cred);
+ _leave(" = 0 [%px]", cache->cache);
+ return 0;
+
+error_add_cache:
+ erofscache_put_directory(cache->graveyard);
+ cache->graveyard = NULL;
+error_unsupported:
+ erofscache_put_directory(cache->store);
+ cache->store = NULL;
+ mntput(cache->mnt);
+ cache->mnt = NULL;
+ dput(root);
+error_open_root:
+ erofscache_end_secure(cache, saved_cred);
+error_getsec:
+ fscache_relinquish_cache(cache_cookie);
+ cache->cache = NULL;
+ pr_err("Failed to register: %d\n", ret);
+ return ret;
+}
+
+/*
+ * Mark all the objects as being out of service and queue them all for cleanup.
+ */
+static void erofscache_withdraw_objects(struct erofscache_cache *cache)
+{
+ struct erofscache_object *object;
+ unsigned int count = 0;
+
+ _enter("");
+
+ spin_lock(&cache->object_list_lock);
+
+ while (!list_empty(&cache->object_list)) {
+ object = list_first_entry(&cache->object_list,
+ struct erofscache_object, cache_link);
+ erofscache_see_object(object, erofscache_obj_see_withdrawal);
+ list_del_init(&object->cache_link);
+ fscache_withdraw_cookie(object->cookie);
+ count++;
+ if ((count & 63) == 0) {
+ spin_unlock(&cache->object_list_lock);
+ cond_resched();
+ spin_lock(&cache->object_list_lock);
+ }
+ }
+
+ spin_unlock(&cache->object_list_lock);
+ _leave(" [%u objs]", count);
+}
+
+/*
+ * Withdraw volumes.
+ */
+static void erofscache_withdraw_volumes(struct erofscache_cache *cache)
+{
+ _enter("");
+
+ for (;;) {
+ struct erofscache_volume *volume = NULL;
+
+ spin_lock(&cache->object_list_lock);
+ if (!list_empty(&cache->volumes)) {
+ volume = list_first_entry(&cache->volumes,
+ struct erofscache_volume, cache_link);
+ list_del_init(&volume->cache_link);
+ }
+ spin_unlock(&cache->object_list_lock);
+ if (!volume)
+ break;
+
+ erofscache_withdraw_volume(volume);
+ }
+
+ _leave("");
+}
+
+/*
+ * Withdraw cache objects.
+ */
+void erofscache_withdraw_cache(struct erofscache_cache *cache)
+{
+ struct fscache_cache *fscache = cache->cache;
+
+ pr_info("File cache on %s unregistering\n", fscache->name);
+
+ fscache_withdraw_cache(fscache);
+
+ /* we now have to destroy all the active objects pertaining to this
+ * cache - which we do by passing them off to the thread pool to be
+ * disposed of */
+ erofscache_withdraw_objects(cache);
+ fscache_wait_for_objects(fscache);
+
+ erofscache_withdraw_volumes(cache);
+ cache->cache = NULL;
+ fscache_relinquish_cache(fscache);
+}
diff --git a/fs/erofscache/daemon.c b/fs/erofscache/daemon.c
new file mode 100644
index 000000000000..863db9a61f37
--- /dev/null
+++ b/fs/erofscache/daemon.c
@@ -0,0 +1,525 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Daemon interface
+ *
+ * Copyright (C) 2007, 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/ctype.h>
+#include <linux/string.h>
+#include <linux/fs_struct.h>
+#include "internal.h"
+
+static int erofscache_daemon_open(struct inode *, struct file *);
+static int erofscache_daemon_release(struct inode *, struct file *);
+static ssize_t erofscache_daemon_read(struct file *, char __user *, size_t,
+ loff_t *);
+static ssize_t erofscache_daemon_write(struct file *, const char __user *,
+ size_t, loff_t *);
+static __poll_t erofscache_daemon_poll(struct file *,
+ struct poll_table_struct *);
+static int erofscache_daemon_cull(struct erofscache_cache *, char *);
+static int erofscache_daemon_debug(struct erofscache_cache *, char *);
+static int erofscache_daemon_dir(struct erofscache_cache *, char *);
+static int erofscache_daemon_inuse(struct erofscache_cache *, char *);
+static int erofscache_daemon_secctx(struct erofscache_cache *, char *);
+static int erofscache_daemon_tag(struct erofscache_cache *, char *);
+static int erofscache_daemon_bind(struct erofscache_cache *, char *);
+static void erofscache_daemon_unbind(struct erofscache_cache *);
+
+static unsigned long erofscache_open;
+
+const struct file_operations erofscache_daemon_fops = {
+ .owner = THIS_MODULE,
+ .open = erofscache_daemon_open,
+ .release = erofscache_daemon_release,
+ .read = erofscache_daemon_read,
+ .write = erofscache_daemon_write,
+ .poll = erofscache_daemon_poll,
+ .llseek = noop_llseek,
+};
+
+struct erofscache_daemon_cmd {
+ char name[8];
+ int (*handler)(struct erofscache_cache *cache, char *args);
+};
+
+static const struct erofscache_daemon_cmd erofscache_daemon_cmds[] = {
+ { "bind", erofscache_daemon_bind },
+ { "cull", erofscache_daemon_cull },
+ { "debug", erofscache_daemon_debug },
+ { "dir", erofscache_daemon_dir },
+ { "inuse", erofscache_daemon_inuse },
+ { "secctx", erofscache_daemon_secctx },
+ { "tag", erofscache_daemon_tag },
+ { "", NULL }
+};
+
+
+/*
+ * Prepare a cache for caching.
+ */
+static int erofscache_daemon_open(struct inode *inode, struct file *file)
+{
+ struct erofscache_cache *cache;
+
+ _enter("");
+
+ /* only the superuser may do this */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* the erofscache device may only be open once at a time */
+ if (xchg(&erofscache_open, 1) == 1)
+ return -EBUSY;
+
+ /* allocate a cache record */
+ cache = kzalloc(sizeof(struct erofscache_cache), GFP_KERNEL);
+ if (!cache) {
+ erofscache_open = 0;
+ return -ENOMEM;
+ }
+
+ mutex_init(&cache->daemon_mutex);
+ init_waitqueue_head(&cache->daemon_pollwq);
+ INIT_LIST_HEAD(&cache->volumes);
+ INIT_LIST_HEAD(&cache->object_list);
+ spin_lock_init(&cache->object_list_lock);
+
+ file->private_data = cache;
+ cache->erofscached = file;
+ return 0;
+}
+
+/*
+ * Release a cache.
+ */
+static int erofscache_daemon_release(struct inode *inode, struct file *file)
+{
+ struct erofscache_cache *cache = file->private_data;
+
+ _enter("");
+
+ ASSERT(cache);
+
+ set_bit(EROFSCACHE_DEAD, &cache->flags);
+
+ erofscache_daemon_unbind(cache);
+
+ /* clean up the control file interface */
+ cache->erofscached = NULL;
+ file->private_data = NULL;
+ erofscache_open = 0;
+
+ kfree(cache);
+
+ _leave("");
+ return 0;
+}
+
+/*
+ * Read the cache state.
+ */
+static ssize_t erofscache_daemon_read(struct file *file, char __user *_buffer,
+ size_t buflen, loff_t *pos)
+{
+ struct erofscache_cache *cache = file->private_data;
+ unsigned long long b_released;
+ unsigned f_released;
+ char buffer[256];
+ int n;
+
+ //_enter(",,%zu,", buflen);
+
+ if (!test_bit(EROFSCACHE_READY, &cache->flags))
+ return 0;
+
+ /* summarise */
+ f_released = atomic_xchg(&cache->f_released, 0);
+ b_released = atomic_long_xchg(&cache->b_released, 0);
+ clear_bit(EROFSCACHE_STATE_CHANGED, &cache->flags);
+
+ n = snprintf(buffer, sizeof(buffer),
+ "cull=%c"
+ " freleased=%x"
+ " breleased=%llx",
+ test_bit(EROFSCACHE_CULLING, &cache->flags) ? '1' : '0',
+ f_released,
+ b_released);
+
+ if (n > buflen)
+ return -EMSGSIZE;
+
+ if (copy_to_user(_buffer, buffer, n) != 0)
+ return -EFAULT;
+
+ return n;
+}
+
+/*
+ * Take a command from erofscached, parse it and act on it.
+ */
+static ssize_t erofscache_daemon_write(struct file *file,
+ const char __user *_data,
+ size_t datalen,
+ loff_t *pos)
+{
+ const struct erofscache_daemon_cmd *cmd;
+ struct erofscache_cache *cache = file->private_data;
+ ssize_t ret;
+ char *data, *args, *cp;
+
+ //_enter(",,%zu,", datalen);
+
+ ASSERT(cache);
+
+ if (test_bit(EROFSCACHE_DEAD, &cache->flags))
+ return -EIO;
+
+ if (datalen > PAGE_SIZE - 1)
+ return -EOPNOTSUPP;
+
+ /* drag the command string into the kernel so we can parse it */
+ data = memdup_user_nul(_data, datalen);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+
+ ret = -EINVAL;
+ if (memchr(data, '\0', datalen))
+ goto error;
+
+ /* strip any newline */
+ cp = memchr(data, '\n', datalen);
+ if (cp) {
+ if (cp == data)
+ goto error;
+
+ *cp = '\0';
+ }
+
+ /* parse the command */
+ ret = -EOPNOTSUPP;
+
+ for (args = data; *args; args++)
+ if (isspace(*args))
+ break;
+ if (*args) {
+ if (args == data)
+ goto error;
+ *args = '\0';
+ args = skip_spaces(++args);
+ }
+
+ /* run the appropriate command handler */
+ for (cmd = erofscache_daemon_cmds; cmd->name[0]; cmd++)
+ if (strcmp(cmd->name, data) == 0)
+ goto found_command;
+
+error:
+ kfree(data);
+ //_leave(" = %zd", ret);
+ return ret;
+
+found_command:
+ mutex_lock(&cache->daemon_mutex);
+
+ ret = -EIO;
+ if (!test_bit(EROFSCACHE_DEAD, &cache->flags))
+ ret = cmd->handler(cache, args);
+
+ mutex_unlock(&cache->daemon_mutex);
+
+ if (ret == 0)
+ ret = datalen;
+ goto error;
+}
+
+/*
+ * Poll for culling state
+ * - use EPOLLOUT to indicate culling state
+ */
+static __poll_t erofscache_daemon_poll(struct file *file,
+ struct poll_table_struct *poll)
+{
+ struct erofscache_cache *cache = file->private_data;
+ __poll_t mask;
+
+ poll_wait(file, &cache->daemon_pollwq, poll);
+ mask = 0;
+
+ if (test_bit(EROFSCACHE_STATE_CHANGED, &cache->flags))
+ mask |= EPOLLIN;
+
+ if (test_bit(EROFSCACHE_CULLING, &cache->flags))
+ mask |= EPOLLOUT;
+
+ return mask;
+}
+
+/*
+ * Set the cache directory
+ * - command: "dir <name>"
+ */
+static int erofscache_daemon_dir(struct erofscache_cache *cache, char *args)
+{
+ char *dir;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ pr_err("Empty directory specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->rootdirname) {
+ pr_err("Second cache directory specified\n");
+ return -EEXIST;
+ }
+
+ dir = kstrdup(args, GFP_KERNEL);
+ if (!dir)
+ return -ENOMEM;
+
+ cache->rootdirname = dir;
+ return 0;
+}
+
+/*
+ * Set the cache security context
+ * - command: "secctx <ctx>"
+ */
+static int erofscache_daemon_secctx(struct erofscache_cache *cache, char *args)
+{
+ char *secctx;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ pr_err("Empty security context specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->secctx) {
+ pr_err("Second security context specified\n");
+ return -EINVAL;
+ }
+
+ secctx = kstrdup(args, GFP_KERNEL);
+ if (!secctx)
+ return -ENOMEM;
+
+ cache->secctx = secctx;
+ return 0;
+}
+
+/*
+ * Set the cache tag
+ * - command: "tag <name>"
+ */
+static int erofscache_daemon_tag(struct erofscache_cache *cache, char *args)
+{
+ char *tag;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ pr_err("Empty tag specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->tag)
+ return -EEXIST;
+
+ tag = kstrdup(args, GFP_KERNEL);
+ if (!tag)
+ return -ENOMEM;
+
+ cache->tag = tag;
+ return 0;
+}
+
+/*
+ * Request a node in the cache be culled from the current working directory
+ * - command: "cull <name>"
+ */
+static int erofscache_daemon_cull(struct erofscache_cache *cache, char *args)
+{
+ struct path path;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter(",%s", args);
+
+ if (strchr(args, '/'))
+ goto inval;
+
+ if (!test_bit(EROFSCACHE_READY, &cache->flags)) {
+ pr_err("cull applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(EROFSCACHE_DEAD, &cache->flags)) {
+ pr_err("cull applied to dead cache\n");
+ return -EIO;
+ }
+
+ get_fs_pwd(current->fs, &path);
+
+ if (!d_can_lookup(path.dentry))
+ goto notdir;
+
+ erofscache_begin_secure(cache, &saved_cred);
+ ret = erofscache_cull(cache, path.dentry, args);
+ erofscache_end_secure(cache, saved_cred);
+
+ path_put(&path);
+ _leave(" = %d", ret);
+ return ret;
+
+notdir:
+ path_put(&path);
+ pr_err("cull command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+inval:
+ pr_err("cull command requires dirfd and filename\n");
+ return -EINVAL;
+}
+
+/*
+ * Set debugging mode
+ * - command: "debug <mask>"
+ */
+static int erofscache_daemon_debug(struct erofscache_cache *cache, char *args)
+{
+ unsigned long mask;
+
+ _enter(",%s", args);
+
+ mask = simple_strtoul(args, &args, 0);
+ if (args[0] != '\0')
+ goto inval;
+
+ erofscache_debug = mask;
+ _leave(" = 0");
+ return 0;
+
+inval:
+ pr_err("debug command requires mask\n");
+ return -EINVAL;
+}
+
+/*
+ * Find out whether an object in the current working directory is in use or not
+ * - command: "inuse <name>"
+ */
+static int erofscache_daemon_inuse(struct erofscache_cache *cache, char *args)
+{
+ struct path path;
+ const struct cred *saved_cred;
+ int ret;
+
+ //_enter(",%s", args);
+
+ if (strchr(args, '/'))
+ goto inval;
+
+ if (!test_bit(EROFSCACHE_READY, &cache->flags)) {
+ pr_err("inuse applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(EROFSCACHE_DEAD, &cache->flags)) {
+ pr_err("inuse applied to dead cache\n");
+ return -EIO;
+ }
+
+ get_fs_pwd(current->fs, &path);
+
+ if (!d_can_lookup(path.dentry))
+ goto notdir;
+
+ erofscache_begin_secure(cache, &saved_cred);
+ ret = erofscache_check_in_use(cache, path.dentry, args);
+ erofscache_end_secure(cache, saved_cred);
+
+ path_put(&path);
+ //_leave(" = %d", ret);
+ return ret;
+
+notdir:
+ path_put(&path);
+ pr_err("inuse command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+inval:
+ pr_err("inuse command requires dirfd and filename\n");
+ return -EINVAL;
+}
+
+/*
+ * Bind a directory as a cache
+ */
+static int erofscache_daemon_bind(struct erofscache_cache *cache, char *args)
+{
+ if (*args) {
+ pr_err("'bind' command doesn't take an argument\n");
+ return -EINVAL;
+ }
+
+ if (!cache->rootdirname) {
+ pr_err("No cache directory specified\n");
+ return -EINVAL;
+ }
+
+ /* Don't permit already bound caches to be re-bound */
+ if (test_bit(EROFSCACHE_READY, &cache->flags)) {
+ pr_err("Cache already bound\n");
+ return -EBUSY;
+ }
+
+ /* Make sure we have a copy of the tag string */
+ if (!cache->tag) {
+ /*
+ * The tag string is released by the fops->release()
+ * function, so we don't release it on error here
+ */
+ cache->tag = kstrdup("Erofscache", GFP_KERNEL);
+ if (!cache->tag)
+ return -ENOMEM;
+ }
+
+ return erofscache_add_cache(cache);
+}
+
+/*
+ * Unbind a cache.
+ */
+static void erofscache_daemon_unbind(struct erofscache_cache *cache)
+{
+ _enter("");
+
+ if (test_bit(EROFSCACHE_READY, &cache->flags))
+ erofscache_withdraw_cache(cache);
+
+ erofscache_put_directory(cache->graveyard);
+ erofscache_put_directory(cache->store);
+ mntput(cache->mnt);
+
+ kfree(cache->rootdirname);
+ kfree(cache->secctx);
+ kfree(cache->tag);
+
+ _leave("");
+}
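
Note (not part of the patch): a minimal userspace sketch of how a daemon
might talk to the control file implemented by daemon.c above. It assumes
only what the handlers show: commands are written as "<name>" or
"<name> <arg>" with no embedded NUL, and a read returns
"cull=%c freleased=%x breleased=%llx" without a trailing NUL. The helper
names parse_status() and format_cmd() and struct cache_status are
hypothetical; the device node path is not visible in this part of the
series, so the sketch works on buffers only and leaves open()/read()/write()
to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded form of the line built by erofscache_daemon_read(). */
struct cache_status {
	int culling;                   /* 1 if the kernel reported "cull=1" */
	unsigned int f_released;       /* objects released lately (hex on the wire) */
	unsigned long long b_released; /* blocks released lately (hex on the wire) */
};

/*
 * Parse "cull=%c freleased=%x breleased=%llx".
 * The kernel copies exactly n bytes and no NUL, so the caller passes the
 * length returned by read() and a terminated private copy is made here.
 * Returns 0 on success, -1 on malformed or oversized input.
 */
static int parse_status(const char *raw, size_t len, struct cache_status *st)
{
	char line[256];
	char cull;

	if (len >= sizeof(line))
		return -1;
	memcpy(line, raw, len);
	line[len] = '\0';

	if (sscanf(line, "cull=%c freleased=%x breleased=%llx",
		   &cull, &st->f_released, &st->b_released) != 3)
		return -1;
	if (cull != '0' && cull != '1')
		return -1;
	st->culling = (cull == '1');
	return 0;
}

/*
 * Build one command for write(): "<name>" or "<name> <arg>".
 * The write handler returns -EINVAL if a NUL lies within the written
 * length, so the returned byte count excludes the terminator.
 * Returns the number of bytes to write, or -1 if the buffer is too small.
 */
static int format_cmd(char *buf, size_t size, const char *name, const char *arg)
{
	int n;

	if (arg && *arg)
		n = snprintf(buf, size, "%s %s", name, arg);
	else
		n = snprintf(buf, size, "%s", name);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

A daemon following this sketch would typically send "dir <path>", optionally
"tag <name>" and "secctx <ctx>", and then "bind", since
erofscache_daemon_bind() returns -EINVAL until a directory has been set.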
diff --git a/fs/erofscache/error_inject.c b/fs/erofscache/error_inject.c
new file mode 100644
index 000000000000..958b61198b36
--- /dev/null
+++ b/fs/erofscache/error_inject.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Error injection handling.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/sysctl.h>
+#include "internal.h"
+
+unsigned int erofscache_error_injection_state;
+
+static struct ctl_table_header *erofscache_sysctl;
+static struct ctl_table erofscache_sysctls[] = {
+ {
+ .procname = "error_injection",
+ .data = &erofscache_error_injection_state,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec,
+ },
+ {}
+};
+
+static struct ctl_table erofscache_sysctls_root[] = {
+ {
+ .procname = "erofscache",
+ .mode = 0555,
+ .child = erofscache_sysctls,
+ },
+ {}
+};
+
+int __init erofscache_register_error_injection(void)
+{
+ erofscache_sysctl = register_sysctl_table(erofscache_sysctls_root);
+ if (!erofscache_sysctl)
+ return -ENOMEM;
+ return 0;
+
+}
+
+void erofscache_unregister_error_injection(void)
+{
+ unregister_sysctl_table(erofscache_sysctl);
+}
diff --git a/fs/erofscache/interface.c b/fs/erofscache/interface.c
new file mode 100644
index 000000000000..a4cb182dacdd
--- /dev/null
+++ b/fs/erofscache/interface.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* FS-Cache interface to Erofscache
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/slab.h>
+#include <linux/mount.h>
+#include <linux/xattr.h>
+#include <linux/file.h>
+#include <linux/falloc.h>
+#include <trace/events/fscache.h>
+#include "internal.h"
+
+static atomic_t erofscache_object_debug_id;
+
+/*
+ * Allocate a cache object record.
+ */
+static
+struct erofscache_object *erofscache_alloc_object(struct fscache_cookie *cookie)
+{
+ struct fscache_volume *vcookie = cookie->volume;
+ struct erofscache_volume *volume = vcookie->cache_priv;
+ struct erofscache_object *object;
+
+ _enter("{%s},%x,", vcookie->key, cookie->debug_id);
+
+ object = kmem_cache_zalloc(erofscache_object_jar, GFP_KERNEL);
+ if (!object)
+ return NULL;
+
+ refcount_set(&object->ref, 1);
+
+ spin_lock_init(&object->lock);
+ INIT_LIST_HEAD(&object->cache_link);
+ object->volume = volume;
+ object->debug_id = atomic_inc_return(&erofscache_object_debug_id);
+ object->cookie = fscache_get_cookie(cookie, fscache_cookie_get_attach_object);
+
+ fscache_count_object(vcookie->cache);
+ trace_erofscache_ref(object->debug_id, cookie->debug_id, 1,
+ erofscache_obj_new);
+ return object;
+}
+
+/*
+ * Note that an object has been seen.
+ */
+void erofscache_see_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why)
+{
+ trace_erofscache_ref(object->debug_id, object->cookie->debug_id,
+ refcount_read(&object->ref), why);
+}
+
+/*
+ * Increment the usage count on an object.
+ */
+struct erofscache_object *erofscache_grab_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why)
+{
+ int r;
+
+ __refcount_inc(&object->ref, &r);
+ trace_erofscache_ref(object->debug_id, object->cookie->debug_id, r, why);
+ return object;
+}
+
+/*
+ * Dispose of a reference to an object.
+ */
+void erofscache_put_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why)
+{
+ unsigned int object_debug_id = object->debug_id;
+ unsigned int cookie_debug_id = object->cookie->debug_id;
+ struct fscache_cache *cache;
+ bool done;
+ int r;
+
+ done = __refcount_dec_and_test(&object->ref, &r);
+ trace_erofscache_ref(object_debug_id, cookie_debug_id, r, why);
+ if (done) {
+ _debug("- kill object OBJ%x", object_debug_id);
+
+ ASSERTCMP(object->file, ==, NULL);
+
+ kfree(object->d_name);
+
+ cache = object->volume->cache->cache;
+ fscache_put_cookie(object->cookie, fscache_cookie_put_object);
+ object->cookie = NULL;
+ kmem_cache_free(erofscache_object_jar, object);
+ fscache_uncount_object(cache);
+ }
+
+ _leave("");
+}
+
+/*
+ * Attempt to look up the nominated node in this cache
+ */
+static bool erofscache_lookup_cookie(struct fscache_cookie *cookie)
+{
+ struct erofscache_object *object;
+ struct erofscache_cache *cache = cookie->volume->cache->cache_priv;
+ const struct cred *saved_cred;
+ bool success;
+
+ object = erofscache_alloc_object(cookie);
+ if (!object)
+ goto fail;
+
+ _enter("{OBJ%x}", object->debug_id);
+
+ if (!erofscache_cook_key(object))
+ goto fail_put;
+
+ cookie->cache_priv = object;
+
+ erofscache_begin_secure(cache, &saved_cred);
+
+ success = erofscache_look_up_object(object);
+ if (!success)
+ goto fail_withdraw;
+
+ erofscache_see_object(object, erofscache_obj_see_lookup_cookie);
+
+ spin_lock(&cache->object_list_lock);
+ list_add(&object->cache_link, &cache->object_list);
+ spin_unlock(&cache->object_list_lock);
+ // TODO: Do we need erofscache_adjust_size(object)?
+
+ erofscache_end_secure(cache, saved_cred);
+ _leave(" = t");
+ return true;
+
+fail_withdraw:
+ erofscache_end_secure(cache, saved_cred);
+ erofscache_see_object(object, erofscache_obj_see_lookup_failed);
+ fscache_caching_failed(cookie);
+ _debug("failed c=%08x o=%08x", cookie->debug_id, object->debug_id);
+ /* The caller holds an access count on the cookie, so we need them to
+ * drop it before we can withdraw the object.
+ */
+ return false;
+
+fail_put:
+ erofscache_put_object(object, erofscache_obj_put_alloc_fail);
+fail:
+ return false;
+}
+
+/*
+ * Finalise an object and close the VFS structs that we have.
+ */
+static void erofscache_clean_up_object(struct erofscache_object *object,
+ struct erofscache_cache *cache)
+{
+ if (test_bit(FSCACHE_COOKIE_RETIRED, &object->cookie->flags)) {
+ erofscache_see_object(object, erofscache_obj_see_clean_delete);
+ _debug("- inval object OBJ%x", object->debug_id);
+ erofscache_delete_object(object, FSCACHE_OBJECT_WAS_RETIRED);
+ }
+
+ erofscache_unmark_inode_in_use(object, object->file);
+ if (object->file) {
+ fput(object->file);
+ object->file = NULL;
+ }
+}
+
+/*
+ * Withdraw caching for a cookie.
+ */
+static void erofscache_withdraw_cookie(struct fscache_cookie *cookie)
+{
+ struct erofscache_object *object = cookie->cache_priv;
+ struct erofscache_cache *cache = object->volume->cache;
+ const struct cred *saved_cred;
+
+ _enter("o=%x", object->debug_id);
+ erofscache_see_object(object, erofscache_obj_see_withdraw_cookie);
+
+ if (!list_empty(&object->cache_link)) {
+ spin_lock(&cache->object_list_lock);
+ erofscache_see_object(object, erofscache_obj_see_withdrawal);
+ list_del_init(&object->cache_link);
+ spin_unlock(&cache->object_list_lock);
+ }
+
+ if (object->file) {
+ erofscache_begin_secure(cache, &saved_cred);
+ erofscache_clean_up_object(object, cache);
+ erofscache_end_secure(cache, saved_cred);
+ }
+
+ cookie->cache_priv = NULL;
+ erofscache_put_object(object, erofscache_obj_put_detach);
+}
+
+/*
+ * Invalidate the storage associated with a cookie.
+ */
+static bool erofscache_invalidate_cookie(struct fscache_cookie *cookie)
+{
+ struct erofscache_object *object = cookie->cache_priv;
+ struct erofscache_volume *volume = object->volume;
+ struct dentry *fan = volume->fanout[(u8)cookie->key_hash];
+ struct file *file;
+
+ _enter("o=%x,[%llu]", object->debug_id, object->cookie->object_size);
+
+ if (!object->file) {
+ fscache_resume_after_invalidation(cookie);
+ _leave(" = t [light]");
+ return true;
+ }
+
+ /* Remove the VFS target and mark disabled */
+ spin_lock(&object->lock);
+
+ file = object->file;
+ object->file = NULL;
+ set_bit(FSCACHE_COOKIE_DISABLED, &object->cookie->flags);
+
+ spin_unlock(&object->lock);
+
+ /* Allow I/O to take place again */
+ fscache_resume_after_invalidation(cookie);
+
+ if (file) {
+ inode_lock_nested(d_inode(fan), I_MUTEX_PARENT);
+ erofscache_bury_object(volume->cache, object, fan,
+ file->f_path.dentry,
+ FSCACHE_OBJECT_INVALIDATED);
+ fput(file);
+ }
+
+ _leave(" = t");
+ return true;
+}
+
+const struct fscache_cache_ops erofscache_cache_ops = {
+ .name = "erofscache",
+ .acquire_volume = erofscache_acquire_volume,
+ .free_volume = erofscache_free_volume,
+ .lookup_cookie = erofscache_lookup_cookie,
+ .withdraw_cookie = erofscache_withdraw_cookie,
+ .invalidate_cookie = erofscache_invalidate_cookie,
+ .begin_operation = erofscache_begin_operation,
+};
diff --git a/fs/erofscache/internal.h b/fs/erofscache/internal.h
new file mode 100644
index 000000000000..f7f00ba42f11
--- /dev/null
+++ b/fs/erofscache/internal.h
@@ -0,0 +1,349 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Internal defs for erofs demand-load netfs cache on cache files.
+ *
+ * Copyright (C) 2022 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) "Erofscache: " fmt
+
+
+#include <linux/fscache-cache.h>
+#include <linux/cred.h>
+#include <linux/security.h>
+
+#define EROFSCACHE_DIO_BLOCK_SIZE 4096
+
+struct erofscache_cache;
+struct erofscache_object;
+
+/*
+ * Cached volume representation.
+ */
+struct erofscache_volume {
+ struct erofscache_cache *cache;
+ struct list_head cache_link; /* Link in cache->volumes */
+ struct fscache_volume *vcookie; /* The netfs's representation */
+ struct dentry *dentry; /* The volume dentry */
+ struct dentry *fanout[256]; /* Fanout subdirs */
+};
+
+/*
+ * Backing file state.
+ */
+struct erofscache_object {
+ struct fscache_cookie *cookie; /* Netfs data storage object cookie */
+ struct erofscache_volume *volume; /* Cache volume that holds this object */
+ struct list_head cache_link; /* Link in cache->*_list */
+ struct file *file; /* The file representing this object */
+ char *d_name; /* Backing file name */
+ int debug_id;
+ spinlock_t lock;
+ refcount_t ref;
+ u8 d_name_len; /* Length of filename */
+};
+
+/*
+ * Cache files cache definition
+ */
+struct erofscache_cache {
+ struct fscache_cache *cache; /* Cache cookie */
+ struct vfsmount *mnt; /* mountpoint holding the cache */
+ struct dentry *store; /* Directory into which live objects go */
+ struct dentry *graveyard; /* directory into which dead objects go */
+ struct file *erofscached; /* manager daemon handle */
+ struct list_head volumes; /* List of volume objects */
+ struct list_head object_list; /* List of active objects */
+ spinlock_t object_list_lock; /* Lock for volumes and object_list */
+ const struct cred *cache_cred; /* security override for accessing cache */
+ struct mutex daemon_mutex; /* command serialisation mutex */
+ wait_queue_head_t daemon_pollwq; /* poll waitqueue for daemon */
+ unsigned bsize; /* cache's block size */
+ unsigned bshift; /* ilog2(bsize) */
+ atomic_t gravecounter; /* graveyard uniquifier */
+ atomic_t f_released; /* number of objects released lately */
+ atomic_long_t b_released; /* number of blocks released lately */
+ unsigned long flags;
+#define EROFSCACHE_READY 0 /* T if cache prepared */
+#define EROFSCACHE_DEAD 1 /* T if cache dead */
+#define EROFSCACHE_CULLING 2 /* T if cull engaged */
+#define EROFSCACHE_STATE_CHANGED 3 /* T if state changed (poll trigger) */
+ char *rootdirname; /* name of cache root directory */
+ char *secctx; /* LSM security context */
+ char *tag; /* cache binding tag */
+};
+
+#include <trace/events/erofscache.h>
+
+static inline
+struct file *erofscache_cres_file(struct netfs_cache_resources *cres)
+{
+ return cres->cache_priv2;
+}
+
+static inline
+struct erofscache_object *erofscache_cres_object(struct netfs_cache_resources *cres)
+{
+ return fscache_cres_cookie(cres)->cache_priv;
+}
+
+/*
+ * note change of state for daemon
+ */
+static inline void erofscache_state_changed(struct erofscache_cache *cache)
+{
+ set_bit(EROFSCACHE_STATE_CHANGED, &cache->flags);
+ wake_up_all(&cache->daemon_pollwq);
+}
+
+/*
+ * cache.c
+ */
+extern int erofscache_add_cache(struct erofscache_cache *cache);
+extern void erofscache_withdraw_cache(struct erofscache_cache *cache);
+
+/*
+ * daemon.c
+ */
+extern const struct file_operations erofscache_daemon_fops;
+
+/*
+ * error_inject.c
+ */
+#ifdef CONFIG_EROFSCACHE_ERROR_INJECTION
+extern unsigned int erofscache_error_injection_state;
+extern int erofscache_register_error_injection(void);
+extern void erofscache_unregister_error_injection(void);
+
+#else
+#define erofscache_error_injection_state 0
+
+static inline int erofscache_register_error_injection(void)
+{
+ return 0;
+}
+
+static inline void erofscache_unregister_error_injection(void)
+{
+}
+#endif
+
+
+static inline int erofscache_inject_read_error(void)
+{
+ return erofscache_error_injection_state & 2 ? -EIO : 0;
+}
+
+static inline int erofscache_inject_remove_error(void)
+{
+ return erofscache_error_injection_state & 2 ? -EIO : 0;
+}
+
+/*
+ * interface.c
+ */
+extern const struct fscache_cache_ops erofscache_cache_ops;
+extern void erofscache_see_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why);
+extern struct erofscache_object *erofscache_grab_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why);
+extern void erofscache_put_object(struct erofscache_object *object,
+ enum erofscache_obj_ref_trace why);
+
+/*
+ * io.c
+ */
+extern bool erofscache_begin_operation(struct netfs_cache_resources *cres,
+ enum fscache_want_state want_state);
+
+/*
+ * key.c
+ */
+extern bool erofscache_cook_key(struct erofscache_object *object);
+
+/*
+ * main.c
+ */
+extern struct kmem_cache *erofscache_object_jar;
+
+/*
+ * namei.c
+ */
+extern void erofscache_unmark_inode_in_use(struct erofscache_object *object,
+ struct file *file);
+extern int erofscache_bury_object(struct erofscache_cache *cache,
+ struct erofscache_object *object,
+ struct dentry *dir,
+ struct dentry *rep,
+ enum fscache_why_object_killed why);
+extern int erofscache_delete_object(struct erofscache_object *object,
+ enum fscache_why_object_killed why);
+extern bool erofscache_look_up_object(struct erofscache_object *object);
+extern struct dentry *erofscache_get_directory(struct erofscache_cache *cache,
+ struct dentry *dir,
+ const char *name);
+extern void erofscache_put_directory(struct dentry *dir);
+
+extern int erofscache_cull(struct erofscache_cache *cache, struct dentry *dir,
+ char *filename);
+
+extern int erofscache_check_in_use(struct erofscache_cache *cache,
+ struct dentry *dir, char *filename);
+
+/*
+ * security.c
+ */
+extern int erofscache_get_security_ID(struct erofscache_cache *cache);
+extern int erofscache_determine_cache_security(struct erofscache_cache *cache,
+ struct dentry *root,
+ const struct cred **_saved_cred);
+
+static inline void erofscache_begin_secure(struct erofscache_cache *cache,
+ const struct cred **_saved_cred)
+{
+ *_saved_cred = override_creds(cache->cache_cred);
+}
+
+static inline void erofscache_end_secure(struct erofscache_cache *cache,
+ const struct cred *saved_cred)
+{
+ revert_creds(saved_cred);
+}
+
+/*
+ * volume.c
+ */
+void erofscache_acquire_volume(struct fscache_volume *volume);
+void erofscache_free_volume(struct fscache_volume *volume);
+void erofscache_withdraw_volume(struct erofscache_volume *volume);
+
+/*
+ * xattr.c
+ */
+extern int erofscache_check_auxdata(struct erofscache_object *object,
+ struct file *file);
+extern int erofscache_remove_object_xattr(struct erofscache_cache *cache,
+ struct erofscache_object *object,
+ struct dentry *dentry);
+extern int erofscache_check_volume_xattr(struct erofscache_volume *volume);
+
+/*
+ * Error handling
+ */
+#define erofscache_io_error(___cache, FMT, ...) \
+do { \
+ pr_err("I/O Error: " FMT"\n", ##__VA_ARGS__); \
+ fscache_io_error((___cache)->cache); \
+ set_bit(EROFSCACHE_DEAD, &(___cache)->flags); \
+} while (0)
+
+#define erofscache_io_error_obj(object, FMT, ...) \
+do { \
+ struct erofscache_cache *___cache; \
+ \
+ ___cache = (object)->volume->cache; \
+ erofscache_io_error(___cache, FMT " [o=%08x]", ##__VA_ARGS__, \
+ (object)->debug_id); \
+} while (0)
+
+
+/*
+ * Debug tracing
+ */
+extern unsigned erofscache_debug;
+#define EROFSCACHE_DEBUG_KENTER 1
+#define EROFSCACHE_DEBUG_KLEAVE 2
+#define EROFSCACHE_DEBUG_KDEBUG 4
+
+#define dbgprintk(FMT, ...) \
+ printk(KERN_DEBUG "[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+
+#if defined(__KDEBUG)
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_EROFSCACHE_DEBUG)
+#define _enter(FMT, ...) \
+do { \
+ if (erofscache_debug & EROFSCACHE_DEBUG_KENTER) \
+ kenter(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _leave(FMT, ...) \
+do { \
+ if (erofscache_debug & EROFSCACHE_DEBUG_KLEAVE) \
+ kleave(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _debug(FMT, ...) \
+do { \
+ if (erofscache_debug & EROFSCACHE_DEBUG_KDEBUG) \
+ kdebug(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#else
+#define _enter(FMT, ...) no_printk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define _leave(FMT, ...) no_printk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define _debug(FMT, ...) no_printk(FMT, ##__VA_ARGS__)
+#endif
+
+#if 1 /* defined(__KDEBUGALL) */
+
+#define ASSERT(X) \
+do { \
+ if (unlikely(!(X))) { \
+ pr_err("\n"); \
+ pr_err("Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTCMP(X, OP, Y) \
+do { \
+ if (unlikely(!((X) OP (Y)))) { \
+ pr_err("\n"); \
+ pr_err("Assertion failed\n"); \
+ pr_err("%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIF(C, X) \
+do { \
+ if (unlikely((C) && !(X))) { \
+ pr_err("\n"); \
+ pr_err("Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y) \
+do { \
+ if (unlikely((C) && !((X) OP (Y)))) { \
+ pr_err("\n"); \
+ pr_err("Assertion failed\n"); \
+ pr_err("%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#else
+
+#define ASSERT(X) do {} while (0)
+#define ASSERTCMP(X, OP, Y) do {} while (0)
+#define ASSERTIF(C, X) do {} while (0)
+#define ASSERTIFCMP(C, X, OP, Y) do {} while (0)
+
+#endif
diff --git a/fs/erofscache/io.c b/fs/erofscache/io.c
new file mode 100644
index 000000000000..0234e9a2a992
--- /dev/null
+++ b/fs/erofscache/io.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* kiocb-using read/write
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/uio.h>
+#include <linux/falloc.h>
+#include <linux/sched/mm.h>
+#include <trace/events/fscache.h>
+#include "internal.h"
+
+struct erofscache_kiocb {
+ struct kiocb iocb;
+ refcount_t ki_refcnt;
+ loff_t start;
+ union {
+ size_t skipped;
+ size_t len;
+ };
+ struct erofscache_object *object;
+ netfs_io_terminated_t term_func;
+ void *term_func_priv;
+ bool was_async;
+ unsigned int inval_counter; /* Copy of cookie->inval_counter */
+ u64 b_writing;
+};
+
+static inline void erofscache_put_kiocb(struct erofscache_kiocb *ki)
+{
+ if (refcount_dec_and_test(&ki->ki_refcnt)) {
+ erofscache_put_object(ki->object, erofscache_obj_put_ioreq);
+ fput(ki->iocb.ki_filp);
+ kfree(ki);
+ }
+}
+
+/*
+ * Handle completion of a read from the cache.
+ */
+static void erofscache_read_complete(struct kiocb *iocb, long ret)
+{
+ struct erofscache_kiocb *ki = container_of(iocb, struct erofscache_kiocb, iocb);
+ struct inode *inode = file_inode(ki->iocb.ki_filp);
+
+ _enter("%ld", ret);
+
+ if (ret < 0)
+ trace_erofscache_io_error(ki->object, inode, ret,
+ erofscache_trace_read_error);
+
+ if (ki->term_func) {
+ if (ret >= 0) {
+ if (ki->object->cookie->inval_counter == ki->inval_counter)
+ ki->skipped += ret;
+ else
+ ret = -ESTALE;
+ }
+
+ ki->term_func(ki->term_func_priv, ret, ki->was_async);
+ }
+
+ erofscache_put_kiocb(ki);
+}
+
+/*
+ * Initiate a read from the cache.
+ */
+static int erofscache_read(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ enum netfs_read_from_hole read_hole,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv)
+{
+ struct erofscache_object *object;
+ struct erofscache_kiocb *ki;
+ struct file *file;
+ unsigned int old_nofs;
+ ssize_t ret = -ENOBUFS;
+ size_t len = iov_iter_count(iter), skipped = 0;
+
+ if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ))
+ goto presubmission_error;
+
+ fscache_count_read();
+ object = erofscache_cres_object(cres);
+ file = erofscache_cres_file(cres);
+
+ _enter("%pD,%li,%llx,%zx/%llx",
+ file, file_inode(file)->i_ino, start_pos, len,
+ i_size_read(file_inode(file)));
+
+ /* If the caller asked us to seek for data before doing the read, then
+ * we should do that now. If we find a gap, we fill it with zeros.
+ */
+ if (read_hole != NETFS_READ_HOLE_IGNORE) {
+ loff_t off = start_pos, off2;
+
+ off2 = erofscache_inject_read_error();
+ if (off2 == 0)
+ off2 = vfs_llseek(file, off, SEEK_DATA);
+ if (off2 < 0 && off2 >= (loff_t)-MAX_ERRNO && off2 != -ENXIO) {
+ skipped = 0;
+ ret = off2;
+ goto presubmission_error;
+ }
+
+ if (off2 == -ENXIO || off2 >= start_pos + len) {
+ /* The region is beyond the EOF or there's no more data
+ * in the region, so clear the rest of the buffer and
+ * return success.
+ */
+ ret = -ENODATA;
+ if (read_hole == NETFS_READ_HOLE_FAIL)
+ goto presubmission_error;
+
+ iov_iter_zero(len, iter);
+ skipped = len;
+ ret = 0;
+ goto presubmission_error;
+ }
+
+ skipped = off2 - off;
+ iov_iter_zero(skipped, iter);
+ }
+
+ ret = -ENOMEM;
+ ki = kzalloc(sizeof(struct erofscache_kiocb), GFP_KERNEL);
+ if (!ki)
+ goto presubmission_error;
+
+ refcount_set(&ki->ki_refcnt, 2);
+ ki->iocb.ki_filp = file;
+ ki->iocb.ki_pos = start_pos + skipped;
+ ki->iocb.ki_flags = IOCB_DIRECT;
+ ki->iocb.ki_hint = ki_hint_validate(file_write_hint(file));
+ ki->iocb.ki_ioprio = get_current_ioprio();
+ ki->skipped = skipped;
+ ki->object = object;
+ ki->inval_counter = cres->inval_counter;
+ ki->term_func = term_func;
+ ki->term_func_priv = term_func_priv;
+ ki->was_async = true;
+
+ if (ki->term_func)
+ ki->iocb.ki_complete = erofscache_read_complete;
+
+ get_file(ki->iocb.ki_filp);
+ erofscache_grab_object(object, erofscache_obj_get_ioreq);
+
+ trace_erofscache_read(object, file_inode(file), ki->iocb.ki_pos, len - skipped);
+ old_nofs = memalloc_nofs_save();
+ ret = erofscache_inject_read_error();
+ if (ret == 0)
+ ret = vfs_iocb_iter_read(file, &ki->iocb, iter);
+ memalloc_nofs_restore(old_nofs);
+ switch (ret) {
+ case -EIOCBQUEUED:
+ goto in_progress;
+
+ case -ERESTARTSYS:
+ case -ERESTARTNOINTR:
+ case -ERESTARTNOHAND:
+ case -ERESTART_RESTARTBLOCK:
+ /* There's no easy way to restart the syscall since other AIO's
+ * may be already running. Just fail this IO with EINTR.
+ */
+ ret = -EINTR;
+ fallthrough;
+ default:
+ ki->was_async = false;
+ erofscache_read_complete(&ki->iocb, ret);
+ if (ret > 0)
+ ret = 0;
+ break;
+ }
+
+in_progress:
+ erofscache_put_kiocb(ki);
+ _leave(" = %zd", ret);
+ return ret;
+
+presubmission_error:
+ if (term_func)
+ term_func(term_func_priv, ret < 0 ? ret : skipped, false);
+ return ret;
+}
+
+/*
+ * Prepare a read operation, shortening it to a cached/uncached
+ * boundary as appropriate.
+ */
+static enum netfs_read_source erofscache_prepare_read(struct netfs_read_subrequest *subreq,
+ loff_t i_size)
+{
+ enum erofscache_prepare_read_trace why;
+ struct netfs_read_request *rreq = subreq->rreq;
+ struct netfs_cache_resources *cres = &rreq->cache_resources;
+ struct erofscache_object *object;
+ struct erofscache_cache *cache;
+ struct fscache_cookie *cookie = fscache_cres_cookie(cres);
+ const struct cred *saved_cred;
+ struct file *file = erofscache_cres_file(cres);
+ enum netfs_read_source ret = NETFS_DOWNLOAD_FROM_SERVER;
+ loff_t off, to;
+ ino_t ino = file ? file_inode(file)->i_ino : 0;
+
+ _enter("%zx @%llx/%llx", subreq->len, subreq->start, i_size);
+
+ if (subreq->start >= i_size) {
+ ret = NETFS_FILL_WITH_ZEROES;
+ why = erofscache_trace_read_after_eof;
+ goto out_no_object;
+ }
+
+ if (test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) {
+ __set_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags);
+ why = erofscache_trace_read_no_data;
+ goto out_no_object;
+ }
+
+ /* The object and the file may be being created in the background. */
+ if (!file) {
+ why = erofscache_trace_read_no_file;
+ if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ))
+ goto out_no_object;
+ file = erofscache_cres_file(cres);
+ if (!file)
+ goto out_no_object;
+ ino = file_inode(file)->i_ino;
+ }
+
+ object = erofscache_cres_object(cres);
+ cache = object->volume->cache;
+ erofscache_begin_secure(cache, &saved_cred);
+
+ off = erofscache_inject_read_error();
+ if (off == 0)
+ off = vfs_llseek(file, subreq->start, SEEK_DATA);
+ if (off < 0 && off >= (loff_t)-MAX_ERRNO) {
+ if (off == (loff_t)-ENXIO) {
+ why = erofscache_trace_read_seek_nxio;
+ goto out;
+ }
+ trace_erofscache_io_error(object, file_inode(file), off,
+ erofscache_trace_seek_error);
+ why = erofscache_trace_read_seek_error;
+ goto out;
+ }
+
+ if (off >= subreq->start + subreq->len) {
+ why = erofscache_trace_read_found_hole;
+ goto out;
+ }
+
+ if (off > subreq->start) {
+ off = round_up(off, cache->bsize);
+ subreq->len = off - subreq->start;
+ why = erofscache_trace_read_found_part;
+ goto out;
+ }
+
+ to = erofscache_inject_read_error();
+ if (to == 0)
+ to = vfs_llseek(file, subreq->start, SEEK_HOLE);
+ if (to < 0 && to >= (loff_t)-MAX_ERRNO) {
+ trace_erofscache_io_error(object, file_inode(file), to,
+ erofscache_trace_seek_error);
+ why = erofscache_trace_read_seek_error;
+ goto out;
+ }
+
+ if (to < subreq->start + subreq->len) {
+ if (subreq->start + subreq->len >= i_size)
+ to = round_up(to, cache->bsize);
+ else
+ to = round_down(to, cache->bsize);
+ subreq->len = to - subreq->start;
+ }
+
+ why = erofscache_trace_read_have_data;
+ ret = NETFS_READ_FROM_CACHE;
+ goto out;
+
+out:
+ erofscache_end_secure(cache, saved_cred);
+out_no_object:
+ trace_erofscache_prep_read(subreq, ret, why, ino);
+ return ret;
+}
+
+/*
+ * Clean up an operation.
+ */
+static void erofscache_end_operation(struct netfs_cache_resources *cres)
+{
+ struct file *file = erofscache_cres_file(cres);
+
+ if (file)
+ fput(file);
+ fscache_end_cookie_access(fscache_cres_cookie(cres), fscache_access_io_end);
+}
+
+static const struct netfs_cache_ops erofscache_netfs_cache_ops = {
+ .end_operation = erofscache_end_operation,
+ .read = erofscache_read,
+ .prepare_read = erofscache_prepare_read,
+};
+
+/*
+ * Open the cache file when beginning a cache operation.
+ */
+bool erofscache_begin_operation(struct netfs_cache_resources *cres,
+ enum fscache_want_state want_state)
+{
+ struct erofscache_object *object = erofscache_cres_object(cres);
+
+ if (!erofscache_cres_file(cres)) {
+ cres->ops = &erofscache_netfs_cache_ops;
+ if (object->file) {
+ spin_lock(&object->lock);
+ if (!cres->cache_priv2 && object->file)
+ cres->cache_priv2 = get_file(object->file);
+ spin_unlock(&object->lock);
+ }
+ }
+
+ if (!erofscache_cres_file(cres) && want_state != FSCACHE_WANT_PARAMS) {
+ pr_err("failed to get cres->file\n");
+ return false;
+ }
+
+ return true;
+}
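
Note on erofscache_prepare_read() above: the range trimming it performs can be modelled outside the kernel. The following is a minimal userspace sketch, not the patch's code. It assumes the two vfs_llseek() results (SEEK_DATA, SEEK_HOLE) are passed in as plain integers, with a negative value standing for any seek error including -ENXIO. The names trim_read, round_up_to, round_down_to and enum read_source are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the netfs_read_source values used in the patch. */
enum read_source { DOWNLOAD_FROM_SERVER, READ_FROM_CACHE, FILL_WITH_ZEROES };

/* Generic rounding helpers; unlike the kernel macros, b need not be a power of two. */
static uint64_t round_up_to(uint64_t x, uint64_t b)
{
	return (x + b - 1) / b * b;
}

static uint64_t round_down_to(uint64_t x, uint64_t b)
{
	return x / b * b;
}

/*
 * Decide where [start, start + *len) should be read from and shorten *len
 * to the nearest cached/uncached boundary.
 *   data_off: result of seeking for data from start (negative on error)
 *   hole_off: result of seeking for a hole from start (negative on error)
 */
static enum read_source trim_read(uint64_t start, uint64_t *len, uint64_t i_size,
				  int64_t data_off, int64_t hole_off,
				  uint64_t bsize)
{
	uint64_t end = start + *len;

	if (start >= i_size)
		return FILL_WITH_ZEROES;

	/* Seek failed, or the whole range is a hole in the cache file. */
	if (data_off < 0 || (uint64_t)data_off >= end)
		return DOWNLOAD_FROM_SERVER;

	/* A leading hole: fetch only up to the block holding the first data. */
	if ((uint64_t)data_off > start) {
		*len = round_up_to((uint64_t)data_off, bsize) - start;
		return DOWNLOAD_FROM_SERVER;
	}

	if (hole_off < 0)
		return DOWNLOAD_FROM_SERVER;

	/* Data starts at start: read from the cache up to the next hole. */
	if ((uint64_t)hole_off < end) {
		uint64_t to = end >= i_size ?
			round_up_to((uint64_t)hole_off, bsize) :
			round_down_to((uint64_t)hole_off, bsize);
		*len = to - start;
	}
	return READ_FROM_CACHE;
}
```

With a 4096-byte block size, a 16384-byte request at offset 0 whose first cached data sits at offset 5000 is shortened to 8192 bytes and sent to the server; the remainder is then handled by a later subrequest.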
diff --git a/fs/erofscache/key.c b/fs/erofscache/key.c
new file mode 100644
index 000000000000..6bad2d461d42
--- /dev/null
+++ b/fs/erofscache/key.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Key to pathname encoder
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/slab.h>
+#include "internal.h"
+
+static const char erofscache_charmap[64] =
+ "0123456789" /* 0 - 9 */
+ "abcdefghijklmnopqrstuvwxyz" /* 10 - 35 */
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZ" /* 36 - 61 */
+ "_-" /* 62 - 63 */
+ ;
+
+static const char erofscache_filecharmap[256] = {
+ /* we skip space and tab and control chars */
+ [33 ... 46] = 1, /* '!' -> '.' */
+ /* we skip '/' as it's significant to pathwalk */
+ [48 ... 127] = 1, /* '0' -> '~' */
+};
+
+static inline unsigned int how_many_hex_digits(unsigned int x)
+{
+ return x ? round_up(ilog2(x) + 1, 4) / 4 : 0;
+}
+
+/*
+ * turn the raw key into something cooked
+ * - the key may be up to NAME_MAX in length (including the length word)
+ * - "base64" encode the strange keys, mapping 3 bytes of raw to four of
+ * cooked
+ * - need to cut the cooked key into 252 char lengths (189 raw bytes)
+ */
+bool erofscache_cook_key(struct erofscache_object *object)
+{
+ const u8 *key = fscache_get_key(object->cookie), *kend;
+ unsigned char ch;
+ unsigned int acc, i, n, nle, nbe, keylen = object->cookie->key_len;
+ unsigned int b64len, len, print, pad;
+ char *name, sep;
+
+ _enter(",%u,%*phN", keylen, keylen, key);
+
+ BUG_ON(keylen > NAME_MAX - 3);
+
+ print = 1;
+ for (i = 0; i < keylen; i++) {
+ ch = key[i];
+ print &= erofscache_filecharmap[ch];
+ }
+
+ /* If the path is usable ASCII, then we render it directly */
+ if (print) {
+ len = 1 + keylen;
+ name = kmalloc(len + 1, GFP_KERNEL);
+ if (!name)
+ return false;
+
+ name[0] = 'D'; /* Data object type, string encoding */
+ memcpy(name + 1, key, keylen);
+ goto success;
+ }
+
+ /* See if it makes sense to encode it as "hex,hex,hex" for each 32-bit
+ * chunk. We rely on the key having been padded out to a whole number
+ * of 32-bit words.
+ */
+ n = round_up(keylen, 4);
+ nbe = nle = 0;
+ for (i = 0; i < n; i += 4) {
+ u32 be = be32_to_cpu(*(__be32 *)(key + i));
+ u32 le = le32_to_cpu(*(__le32 *)(key + i));
+
+ nbe += 1 + how_many_hex_digits(be);
+ nle += 1 + how_many_hex_digits(le);
+ }
+
+ b64len = DIV_ROUND_UP(keylen, 3);
+ pad = b64len * 3 - keylen;
+ b64len = 2 + b64len * 4; /* Length if we base64-encode it */
+ _debug("len=%u nbe=%u nle=%u b64=%u", keylen, nbe, nle, b64len);
+ if (nbe < b64len || nle < b64len) {
+ unsigned int nlen = min(nbe, nle) + 1;
+ name = kmalloc(nlen, GFP_KERNEL);
+ if (!name)
+ return false;
+ sep = (nbe <= nle) ? 'S' : 'T'; /* Encoding indicator */
+ len = 0;
+ for (i = 0; i < n; i += 4) {
+ u32 x;
+ if (nbe <= nle)
+ x = be32_to_cpu(*(__be32 *)(key + i));
+ else
+ x = le32_to_cpu(*(__le32 *)(key + i));
+ name[len++] = sep;
+ if (x != 0)
+ len += snprintf(name + len, nlen - len, "%x", x);
+ sep = ',';
+ }
+ goto success;
+ }
+
+ /* We need to base64-encode it */
+ name = kmalloc(b64len + 1, GFP_KERNEL);
+ if (!name)
+ return false;
+
+ name[0] = 'E';
+ name[1] = '0' + pad;
+ len = 2;
+ kend = key + keylen;
+ do {
+ acc = *key++;
+ if (key < kend) {
+ acc |= *key++ << 8;
+ if (key < kend)
+ acc |= *key++ << 16;
+ }
+
+ name[len++] = erofscache_charmap[acc & 63];
+ acc >>= 6;
+ name[len++] = erofscache_charmap[acc & 63];
+ acc >>= 6;
+ name[len++] = erofscache_charmap[acc & 63];
+ acc >>= 6;
+ name[len++] = erofscache_charmap[acc & 63];
+ } while (key < kend);
+
+success:
+ name[len] = 0;
+ object->d_name = name;
+ object->d_name_len = len;
+ _leave(" = %s", object->d_name);
+ return true;
+}
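
Note on the 'E' branch of erofscache_cook_key() above: the encoding is base64-like but not RFC 4648 base64. Each group of up to three bytes is accumulated little-endian, and the four 6-bit digits are emitted lowest first, using the alphabet from erofscache_charmap. The following is a minimal userspace sketch under those assumptions. The helper name cook_key_b64 is hypothetical, and unlike the kernel loop it also tolerates a zero-length key.

```c
#include <stddef.h>

/* 64-character alphabet, copied from erofscache_charmap in the patch. */
static const char charmap[] =
	"0123456789"
	"abcdefghijklmnopqrstuvwxyz"
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	"_-";

/*
 * Hypothetical userspace helper: encode key[0..keylen) into out.
 * out must have room for 2 + 4 * ceil(keylen / 3) + 1 bytes.
 * Returns the encoded length, excluding the terminating NUL.
 */
static size_t cook_key_b64(const unsigned char *key, size_t keylen, char *out)
{
	size_t groups = (keylen + 2) / 3;
	size_t pad = groups * 3 - keylen;	/* 0, 1 or 2 missing bytes */
	size_t len = 0, i = 0;
	int j;

	out[len++] = 'E';			/* encoding indicator */
	out[len++] = (char)('0' + pad);		/* pad count as a digit */

	while (i < keylen) {
		/* Accumulate up to three bytes, first byte in the low bits. */
		unsigned int acc = key[i++];

		if (i < keylen)
			acc |= (unsigned int)key[i++] << 8;
		if (i < keylen)
			acc |= (unsigned int)key[i++] << 16;

		/* Emit four 6-bit digits, least significant first. */
		for (j = 0; j < 4; j++) {
			out[len++] = charmap[acc & 63];
			acc >>= 6;
		}
	}

	out[len] = '\0';
	return len;
}
```

Because the pad count is stored in the second character, a decoder can recover the exact key length without a trailing '=' marker.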
diff --git a/fs/erofscache/main.c b/fs/erofscache/main.c
new file mode 100644
index 000000000000..8daa4f06d09f
--- /dev/null
+++ b/fs/erofscache/main.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Network filesystem caching backend to use cache files on a premounted
+ * filesystem
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/sysctl.h>
+#include <linux/miscdevice.h>
+#include <linux/netfs.h>
+#include <trace/events/netfs.h>
+#define CREATE_TRACE_POINTS
+#include "internal.h"
+
+unsigned erofscache_debug;
+module_param_named(debug, erofscache_debug, uint, S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(debug, "Erofscache debugging mask");
+
+MODULE_DESCRIPTION("Mounted-filesystem based cache");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+struct kmem_cache *erofscache_object_jar;
+
+static struct miscdevice erofscache_dev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "erofscache",
+ .fops = &erofscache_daemon_fops,
+};
+
+/*
+ * initialise the fs caching module
+ */
+static int __init erofscache_init(void)
+{
+ int ret;
+
+ ret = erofscache_register_error_injection();
+ if (ret < 0)
+ goto error_einj;
+ ret = misc_register(&erofscache_dev);
+ if (ret < 0)
+ goto error_dev;
+
+ /* create an object jar */
+ ret = -ENOMEM;
+ erofscache_object_jar =
+ kmem_cache_create("erofscache_object_jar",
+ sizeof(struct erofscache_object),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+ if (!erofscache_object_jar) {
+ pr_notice("Failed to allocate an object jar\n");
+ goto error_object_jar;
+ }
+
+ pr_info("Loaded\n");
+ return 0;
+
+error_object_jar:
+ misc_deregister(&erofscache_dev);
+error_dev:
+ erofscache_unregister_error_injection();
+error_einj:
+ pr_err("failed to register: %d\n", ret);
+ return ret;
+}
+
+fs_initcall(erofscache_init);
+
+/*
+ * clean up on module removal
+ */
+static void __exit erofscache_exit(void)
+{
+ pr_info("Unloading\n");
+
+ kmem_cache_destroy(erofscache_object_jar);
+ misc_deregister(&erofscache_dev);
+ erofscache_unregister_error_injection();
+}
+
+module_exit(erofscache_exit);
diff --git a/fs/erofscache/namei.c b/fs/erofscache/namei.c
new file mode 100644
index 000000000000..28c3f11c0fae
--- /dev/null
+++ b/fs/erofscache/namei.c
@@ -0,0 +1,635 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Erofscache path walking and related routines
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include "internal.h"
+
+/*
+ * Mark the backing file as being a cache file if it's not already in use. The
+ * mark tells the culling request command that it's not allowed to cull the
+ * file or directory. The caller must hold the inode lock.
+ */
+static bool __erofscache_mark_inode_in_use(struct erofscache_object *object,
+ struct dentry *dentry)
+{
+ struct inode *inode = d_backing_inode(dentry);
+ bool can_use = false;
+
+ if (!(inode->i_flags & S_KERNEL_FILE)) {
+ inode->i_flags |= S_KERNEL_FILE;
+ trace_erofscache_mark_active(object, inode);
+ can_use = true;
+ } else {
+ trace_erofscache_mark_failed(object, inode);
+ pr_notice("erofscache: Inode already in use: %pd (B=%lx)\n",
+ dentry, inode->i_ino);
+ }
+
+ return can_use;
+}
+
+static bool erofscache_mark_inode_in_use(struct erofscache_object *object,
+ struct dentry *dentry)
+{
+ struct inode *inode = d_backing_inode(dentry);
+ bool can_use;
+
+ inode_lock(inode);
+ can_use = __erofscache_mark_inode_in_use(object, dentry);
+ inode_unlock(inode);
+ return can_use;
+}
+
+/*
+ * Unmark a backing inode. The caller must hold the inode lock.
+ */
+static void __erofscache_unmark_inode_in_use(struct erofscache_object *object,
+ struct dentry *dentry)
+{
+ struct inode *inode = d_backing_inode(dentry);
+
+ inode->i_flags &= ~S_KERNEL_FILE;
+ trace_erofscache_mark_inactive(object, inode);
+}
+
+/*
+ * Unmark a backing inode and tell erofscached that there's something that can
+ * be culled.
+ */
+void erofscache_unmark_inode_in_use(struct erofscache_object *object,
+ struct file *file)
+{
+ struct erofscache_cache *cache = object->volume->cache;
+ struct inode *inode = file_inode(file);
+
+ if (inode) {
+ inode_lock(inode);
+ __erofscache_unmark_inode_in_use(object, file->f_path.dentry);
+ inode_unlock(inode);
+
+ atomic_long_add(inode->i_blocks, &cache->b_released);
+ if (atomic_inc_return(&cache->f_released))
+ erofscache_state_changed(cache);
+ }
+}
+
+/*
+ * get a subdirectory
+ */
+struct dentry *erofscache_get_directory(struct erofscache_cache *cache,
+ struct dentry *dir,
+ const char *dirname)
+{
+ struct dentry *subdir;
+ int ret;
+
+ _enter(",,%s", dirname);
+
+ /* search the current directory for the element name */
+ inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
+
+ ret = erofscache_inject_read_error();
+ if (ret == 0)
+ subdir = lookup_one_len(dirname, dir, strlen(dirname));
+ else
+ subdir = ERR_PTR(ret);
+ trace_erofscache_lookup(NULL, dir, subdir);
+ if (IS_ERR(subdir)) {
+ trace_erofscache_vfs_error(NULL, d_backing_inode(dir),
+ PTR_ERR(subdir),
+ erofscache_trace_lookup_error);
+ if (PTR_ERR(subdir) == -ENOMEM)
+ goto nomem_d_alloc;
+ goto lookup_error;
+ }
+
+ _debug("subdir -> %pd %s",
+ subdir, d_backing_inode(subdir) ? "positive" : "negative");
+
+ /* TODO: Do we need to create the subdir if it doesn't exist? */
+ if (d_is_negative(subdir))
+ goto enoent;
+
+ /* Tell rmdir() it's not allowed to delete the subdir */
+ inode_lock(d_inode(subdir));
+ inode_unlock(d_inode(dir));
+
+ if (!__erofscache_mark_inode_in_use(NULL, subdir))
+ goto mark_error;
+
+ inode_unlock(d_inode(subdir));
+
+ /* we need to make sure the subdir is a directory */
+ ASSERT(d_backing_inode(subdir));
+
+ if (!d_can_lookup(subdir)) {
+ pr_err("%s is not a directory\n", dirname);
+ ret = -EIO;
+ goto check_error;
+ }
+
+ ret = -EPERM;
+ if (!(d_backing_inode(subdir)->i_opflags & IOP_XATTR) ||
+ !d_backing_inode(subdir)->i_op->lookup ||
+ !d_backing_inode(subdir)->i_op->rename ||
+ !d_backing_inode(subdir)->i_op->rmdir ||
+ !d_backing_inode(subdir)->i_op->unlink)
+ goto check_error;
+
+ _leave(" = [%lu]", d_backing_inode(subdir)->i_ino);
+ return subdir;
+
+check_error:
+ erofscache_put_directory(subdir);
+ _leave(" = %d [check]", ret);
+ return ERR_PTR(ret);
+
+mark_error:
+ inode_unlock(d_inode(subdir));
+ dput(subdir);
+ return ERR_PTR(-EBUSY);
+
+enoent:
+ inode_unlock(d_inode(dir));
+ dput(subdir);
+ pr_err("No such directory %s\n", dirname);
+ return ERR_PTR(-ENOENT);
+
+lookup_error:
+ inode_unlock(d_inode(dir));
+ ret = PTR_ERR(subdir);
+ pr_err("Lookup %s failed with error %d\n", dirname, ret);
+ return ERR_PTR(ret);
+
+nomem_d_alloc:
+ inode_unlock(d_inode(dir));
+ _leave(" = -ENOMEM");
+ return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * Put a subdirectory.
+ */
+void erofscache_put_directory(struct dentry *dir)
+{
+ if (dir) {
+ inode_lock(dir->d_inode);
+ __erofscache_unmark_inode_in_use(NULL, dir);
+ inode_unlock(dir->d_inode);
+ dput(dir);
+ }
+}
+
+/*
+ * Remove a regular file from the cache.
+ */
+static int erofscache_unlink(struct erofscache_cache *cache,
+ struct erofscache_object *object,
+ struct dentry *dir, struct dentry *dentry,
+ enum fscache_why_object_killed why)
+{
+ struct path path = {
+ .mnt = cache->mnt,
+ .dentry = dir,
+ };
+ int ret;
+
+ trace_erofscache_unlink(object, d_inode(dentry)->i_ino, why);
+ ret = security_path_unlink(&path, dentry);
+ if (ret < 0) {
+ erofscache_io_error(cache, "Unlink security error");
+ return ret;
+ }
+
+ ret = erofscache_inject_remove_error();
+ if (ret == 0) {
+ ret = vfs_unlink(&init_user_ns, d_backing_inode(dir), dentry, NULL);
+ if (ret == -EIO)
+ erofscache_io_error(cache, "Unlink failed");
+ }
+ if (ret != 0)
+ trace_erofscache_vfs_error(object, d_backing_inode(dir), ret,
+ erofscache_trace_unlink_error);
+ return ret;
+}
+
+/*
+ * Delete an object representation from the cache
+ * - File backed objects are unlinked
+ * - Directory backed objects are stuffed into the graveyard for userspace to
+ * delete
+ */
+int erofscache_bury_object(struct erofscache_cache *cache,
+ struct erofscache_object *object,
+ struct dentry *dir,
+ struct dentry *rep,
+ enum fscache_why_object_killed why)
+{
+ struct dentry *grave, *trap;
+ struct path path, path_to_graveyard;
+ char nbuffer[8 + 8 + 1];
+ int ret;
+
+ _enter(",'%pd','%pd'", dir, rep);
+
+ if (rep->d_parent != dir) {
+ inode_unlock(d_inode(dir));
+ _leave(" = -ESTALE");
+ return -ESTALE;
+ }
+
+ /* non-directories can just be unlinked */
+ if (!d_is_dir(rep)) {
+ dget(rep); /* Stop the dentry being negated if it's only pinned
+ * by a file struct.
+ */
+ ret = erofscache_unlink(cache, object, dir, rep, why);
+ dput(rep);
+
+ inode_unlock(d_inode(dir));
+ _leave(" = %d", ret);
+ return ret;
+ }
+
+ /* directories have to be moved to the graveyard */
+ _debug("move stale object to graveyard");
+ inode_unlock(d_inode(dir));
+
+try_again:
+ /* first step is to make up a grave dentry in the graveyard */
+ sprintf(nbuffer, "%08x%08x",
+ (uint32_t) ktime_get_real_seconds(),
+ (uint32_t) atomic_inc_return(&cache->gravecounter));
+
+ /* do the multiway lock magic */
+ trap = lock_rename(cache->graveyard, dir);
+
+ /* do some checks before getting the grave dentry */
+ if (rep->d_parent != dir || IS_DEADDIR(d_inode(rep))) {
+ /* the entry was probably culled when we dropped the parent dir
+ * lock */
+ unlock_rename(cache->graveyard, dir);
+ _leave(" = 0 [culled?]");
+ return 0;
+ }
+
+ if (!d_can_lookup(cache->graveyard)) {
+ unlock_rename(cache->graveyard, dir);
+ erofscache_io_error(cache, "Graveyard no longer a directory");
+ return -EIO;
+ }
+
+ if (trap == rep) {
+ unlock_rename(cache->graveyard, dir);
+ erofscache_io_error(cache, "May not make directory loop");
+ return -EIO;
+ }
+
+ if (d_mountpoint(rep)) {
+ unlock_rename(cache->graveyard, dir);
+ erofscache_io_error(cache, "Mountpoint in cache");
+ return -EIO;
+ }
+
+ grave = lookup_one_len(nbuffer, cache->graveyard, strlen(nbuffer));
+ if (IS_ERR(grave)) {
+ unlock_rename(cache->graveyard, dir);
+ trace_erofscache_vfs_error(object, d_inode(cache->graveyard),
+ PTR_ERR(grave),
+ erofscache_trace_lookup_error);
+
+ if (PTR_ERR(grave) == -ENOMEM) {
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ erofscache_io_error(cache, "Lookup error %ld", PTR_ERR(grave));
+ return -EIO;
+ }
+
+ if (d_is_positive(grave)) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ grave = NULL;
+ cond_resched();
+ goto try_again;
+ }
+
+ if (d_mountpoint(grave)) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ erofscache_io_error(cache, "Mountpoint in graveyard");
+ return -EIO;
+ }
+
+ /* target should not be an ancestor of source */
+ if (trap == grave) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ erofscache_io_error(cache, "May not make directory loop");
+ return -EIO;
+ }
+
+ /* attempt the rename */
+ path.mnt = cache->mnt;
+ path.dentry = dir;
+ path_to_graveyard.mnt = cache->mnt;
+ path_to_graveyard.dentry = cache->graveyard;
+ ret = security_path_rename(&path, rep, &path_to_graveyard, grave, 0);
+ if (ret < 0) {
+ erofscache_io_error(cache, "Rename security error %d", ret);
+ } else {
+ struct renamedata rd = {
+ .old_mnt_userns = &init_user_ns,
+ .old_dir = d_inode(dir),
+ .old_dentry = rep,
+ .new_mnt_userns = &init_user_ns,
+ .new_dir = d_inode(cache->graveyard),
+ .new_dentry = grave,
+ };
+ trace_erofscache_rename(object, d_inode(rep)->i_ino, why);
+ ret = erofscache_inject_read_error();
+ if (ret == 0)
+ ret = vfs_rename(&rd);
+ if (ret != 0)
+ trace_erofscache_vfs_error(object, d_inode(dir), ret,
+ erofscache_trace_rename_error);
+ if (ret != 0 && ret != -ENOMEM)
+ erofscache_io_error(cache,
+ "Rename failed with error %d", ret);
+ }
+
+ __erofscache_unmark_inode_in_use(object, rep);
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ _leave(" = 0");
+ return 0;
+}
+
+/*
+ * Delete a cache file.
+ */
+int erofscache_delete_object(struct erofscache_object *object,
+ enum fscache_why_object_killed why)
+{
+ struct erofscache_volume *volume = object->volume;
+ struct dentry *dentry = object->file->f_path.dentry;
+ struct dentry *fan = volume->fanout[(u8)object->cookie->key_hash];
+ int ret;
+
+ _enter(",OBJ%x{%pD}", object->debug_id, object->file);
+
+ /* Stop the dentry being negated if it's only pinned by a file struct. */
+ dget(dentry);
+
+ inode_lock_nested(d_backing_inode(fan), I_MUTEX_PARENT);
+ ret = erofscache_unlink(volume->cache, object, fan, dentry, why);
+ inode_unlock(d_backing_inode(fan));
+ dput(dentry);
+ return ret;
+}
+
+/*
+ * Open an existing file and check its attributes. A stale file is not
+ * replaced here; the open fails (on-demand load is still a TODO).
+ */
+static bool erofscache_open_file(struct erofscache_object *object,
+ struct dentry *dentry)
+{
+ struct erofscache_cache *cache = object->volume->cache;
+ struct file *file;
+ struct path path;
+ int ret;
+
+ _enter("%pd", dentry);
+
+ if (!erofscache_mark_inode_in_use(object, dentry))
+ return false;
+
+ /* We need to open a file interface onto a data file now as we can't do
+ * it on demand because writeback called from do_exit() sees
+ * current->fs == NULL - which breaks d_path() called from ext4 open.
+ */
+ path.mnt = cache->mnt;
+ path.dentry = dentry;
+ file = open_with_fake_path(&path, O_RDONLY | O_LARGEFILE | O_DIRECT,
+ d_backing_inode(dentry), cache->cache_cred);
+ if (IS_ERR(file)) {
+ trace_erofscache_vfs_error(object, d_backing_inode(dentry),
+ PTR_ERR(file),
+ erofscache_trace_open_error);
+ goto error;
+ }
+
+ if (unlikely(!file->f_op->read_iter)) {
+ pr_notice("Cache does not support read_iter\n");
+ goto error_fput;
+ }
+ _debug("file -> %pd positive", dentry);
+
+ ret = erofscache_check_auxdata(object, file);
+ if (ret < 0)
+ goto check_failed;
+
+ object->file = file;
+
+ /* Always update the atime on an object we've just looked up (this is
+ * used to keep track of culling, and atimes are only updated by read,
+ * write and readdir but not lookup or open).
+ */
+ touch_atime(&file->f_path);
+ dput(dentry);
+ return true;
+
+check_failed:
+ fscache_cookie_lookup_negative(object->cookie);
+ erofscache_unmark_inode_in_use(object, file);
+ if (ret == -ESTALE) {
+ fput(file);
+ dput(dentry);
+ // TODO: Do on-demand load
+ return false;
+ }
+error_fput:
+ fput(file);
+error:
+ dput(dentry);
+ return false;
+}
+
+/*
+ * Look up the backing file for an object in its volume's fanout directory.
+ * Nothing is created here; a missing file fails the lookup (see TODO below).
+ */
+bool erofscache_look_up_object(struct erofscache_object *object)
+{
+ struct erofscache_volume *volume = object->volume;
+ struct dentry *dentry, *fan = volume->fanout[(u8)object->cookie->key_hash];
+ int ret;
+
+ _enter("OBJ%x,%s,", object->debug_id, object->d_name);
+
+ /* Look up path "cache/vol/fanout/file". */
+ ret = erofscache_inject_read_error();
+ if (ret == 0)
+ dentry = lookup_positive_unlocked(object->d_name, fan,
+ object->d_name_len);
+ else
+ dentry = ERR_PTR(ret);
+ trace_erofscache_lookup(object, fan, dentry);
+ if (IS_ERR(dentry)) {
+ if (dentry == ERR_PTR(-ENOENT))
+ goto new_file;
+ if (dentry == ERR_PTR(-EIO))
+ erofscache_io_error_obj(object, "Lookup failed");
+ return false;
+ }
+
+ if (!d_is_reg(dentry)) {
+ pr_err("%pd is not a file\n", dentry);
+ inode_lock_nested(d_inode(fan), I_MUTEX_PARENT);
+ ret = erofscache_bury_object(volume->cache, object, fan, dentry,
+ FSCACHE_OBJECT_IS_WEIRD);
+ dput(dentry);
+ if (ret < 0)
+ return false;
+ goto new_file;
+ }
+
+ if (!erofscache_open_file(object, dentry))
+ return false;
+
+ _leave(" = t [%lu]", file_inode(object->file)->i_ino);
+ return true;
+
+new_file:
+ fscache_cookie_lookup_negative(object->cookie);
+ return false; // TODO: Trigger on-demand file creation
+}
+
+/*
+ * Look up an inode to be checked or culled. Return -EBUSY if the inode is
+ * marked in use.
+ */
+static struct dentry *erofscache_lookup_for_cull(struct erofscache_cache *cache,
+ struct dentry *dir,
+ char *filename)
+{
+ struct dentry *victim;
+ int ret = -ENOENT;
+
+ inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
+
+ victim = lookup_one_len(filename, dir, strlen(filename));
+ if (IS_ERR(victim))
+ goto lookup_error;
+ if (d_is_negative(victim))
+ goto lookup_put;
+ if (d_inode(victim)->i_flags & S_KERNEL_FILE)
+ goto lookup_busy;
+ return victim;
+
+lookup_busy:
+ ret = -EBUSY;
+lookup_put:
+ inode_unlock(d_inode(dir));
+ dput(victim);
+ return ERR_PTR(ret);
+
+lookup_error:
+ inode_unlock(d_inode(dir));
+ ret = PTR_ERR(victim);
+ if (ret == -ENOENT)
+ return ERR_PTR(-ESTALE); /* Probably got retired by the netfs */
+
+ if (ret == -EIO) {
+ erofscache_io_error(cache, "Lookup failed");
+ } else if (ret != -ENOMEM) {
+ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+ return ERR_PTR(ret);
+}
+
+/*
+ * Cull an object if it's not in use
+ * - called only by cache manager daemon
+ */
+int erofscache_cull(struct erofscache_cache *cache, struct dentry *dir,
+ char *filename)
+{
+ struct dentry *victim;
+ struct inode *inode;
+ int ret;
+
+ _enter(",%pd/,%s", dir, filename);
+
+ victim = erofscache_lookup_for_cull(cache, dir, filename);
+ if (IS_ERR(victim))
+ return PTR_ERR(victim);
+
+ /* check to see if someone is using this object */
+ inode = d_inode(victim);
+ inode_lock(inode);
+ if (inode->i_flags & S_KERNEL_FILE) {
+ ret = -EBUSY;
+ } else {
+ /* Stop the cache from picking it back up */
+ inode->i_flags |= S_KERNEL_FILE;
+ ret = 0;
+ }
+ inode_unlock(inode);
+ if (ret < 0)
+ goto error_unlock;
+
+ ret = erofscache_bury_object(cache, NULL, dir, victim,
+ FSCACHE_OBJECT_WAS_CULLED);
+ if (ret < 0)
+ goto error;
+
+ fscache_count_culled();
+ dput(victim);
+ _leave(" = 0");
+ return 0;
+
+error_unlock:
+ inode_unlock(d_inode(dir));
+error:
+ dput(victim);
+ if (ret == -ENOENT)
+ return -ESTALE; /* Probably got retired by the netfs */
+
+ if (ret != -ENOMEM) {
+ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * Find out if an object is in use or not
+ * - called only by cache manager daemon
+ * - returns -EBUSY or 0 to indicate whether an object is in use or not
+ */
+int erofscache_check_in_use(struct erofscache_cache *cache, struct dentry *dir,
+ char *filename)
+{
+ struct dentry *victim;
+ int ret = 0;
+
+ victim = erofscache_lookup_for_cull(cache, dir, filename);
+ if (IS_ERR(victim))
+ return PTR_ERR(victim);
+
+ inode_unlock(d_inode(dir));
+ dput(victim);
+ return ret;
+}
diff --git a/fs/erofscache/security.c b/fs/erofscache/security.c
new file mode 100644
index 000000000000..b642f106e761
--- /dev/null
+++ b/fs/erofscache/security.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Erofscache security management
+ *
+ * Copyright (C) 2007, 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/fs.h>
+#include <linux/cred.h>
+#include "internal.h"
+
+/*
+ * determine the security context within which we access the cache from within
+ * the kernel
+ */
+int erofscache_get_security_ID(struct erofscache_cache *cache)
+{
+ struct cred *new;
+ int ret;
+
+ _enter("{%s}", cache->secctx);
+
+ new = prepare_kernel_cred(current);
+ if (!new) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (cache->secctx) {
+ ret = set_security_override_from_ctx(new, cache->secctx);
+ if (ret < 0) {
+ put_cred(new);
+ pr_err("Security denies permission to nominate security context: error %d\n",
+ ret);
+ goto error;
+ }
+ }
+
+ cache->cache_cred = new;
+ ret = 0;
+error:
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * check the security details of the on-disk cache
+ * - must be called with security override in force
+ * - must return with a security override in force - even in the case of an
+ * error
+ */
+int erofscache_determine_cache_security(struct erofscache_cache *cache,
+ struct dentry *root,
+ const struct cred **_saved_cred)
+{
+ struct cred *new;
+ int ret;
+
+ _enter("");
+
+ /* duplicate the cache creds for COW (the override is currently in
+ * force, so we can use prepare_creds() to do this) */
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+
+ erofscache_end_secure(cache, *_saved_cred);
+
+ /* use the cache root dir's security context as the basis with
+ * which create files */
+ ret = set_create_files_as(new, d_backing_inode(root));
+ if (ret < 0) {
+ abort_creds(new);
+ erofscache_begin_secure(cache, _saved_cred);
+ _leave(" = %d [cfa]", ret);
+ return ret;
+ }
+
+ put_cred(cache->cache_cred);
+ cache->cache_cred = new;
+
+ erofscache_begin_secure(cache, _saved_cred);
+ _leave(" = %d", ret);
+ return ret;
+}
diff --git a/fs/erofscache/volume.c b/fs/erofscache/volume.c
new file mode 100644
index 000000000000..8c18281b02fe
--- /dev/null
+++ b/fs/erofscache/volume.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Volume handling.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include "internal.h"
+#include <trace/events/fscache.h>
+
+/*
+ * Allocate and set up a volume representation. We make sure all the fanout
+ * directories are created and pinned.
+ */
+void erofscache_acquire_volume(struct fscache_volume *vcookie)
+{
+ struct erofscache_volume *volume;
+ struct erofscache_cache *cache = vcookie->cache->cache_priv;
+ const struct cred *saved_cred;
+ struct dentry *vdentry, *fan;
+ size_t len;
+ char *name;
+ int ret, n_accesses, i;
+
+ _enter("");
+
+ volume = kzalloc(sizeof(struct erofscache_volume), GFP_KERNEL);
+ if (!volume)
+ return;
+ volume->vcookie = vcookie;
+ volume->cache = cache;
+ INIT_LIST_HEAD(&volume->cache_link);
+
+ erofscache_begin_secure(cache, &saved_cred);
+
+ len = vcookie->key[0];
+ name = kmalloc(len + 3, GFP_NOFS);
+ if (!name)
+ goto error_vol;
+ name[0] = 'I';
+ memcpy(name + 1, vcookie->key + 1, len);
+ name[len + 1] = 0;
+
+ vdentry = erofscache_get_directory(cache, cache->store, name);
+ if (IS_ERR(vdentry))
+ goto error_name;
+ volume->dentry = vdentry;
+
+ ret = erofscache_check_volume_xattr(volume);
+ if (ret < 0) {
+ if (ret != -ESTALE)
+ goto error_dir;
+ inode_lock_nested(d_inode(cache->store), I_MUTEX_PARENT);
+ erofscache_bury_object(cache, NULL, cache->store, vdentry,
+ FSCACHE_VOLUME_IS_WEIRD);
+ goto error_dir;
+ }
+
+ for (i = 0; i < 256; i++) {
+ sprintf(name, "@%02x", i);
+ fan = erofscache_get_directory(cache, vdentry, name);
+ if (IS_ERR(fan))
+ goto error_fan;
+ volume->fanout[i] = fan;
+ }
+
+ erofscache_end_secure(cache, saved_cred);
+
+ vcookie->cache_priv = volume;
+ n_accesses = atomic_inc_return(&vcookie->n_accesses); /* Stop wakeups on dec-to-0 */
+ trace_fscache_access_volume(vcookie->debug_id, 0,
+ refcount_read(&vcookie->ref),
+ n_accesses, fscache_access_cache_pin);
+
+ spin_lock(&cache->object_list_lock);
+ list_add(&volume->cache_link, &volume->cache->volumes);
+ spin_unlock(&cache->object_list_lock);
+
+ kfree(name);
+ return;
+
+error_fan:
+ for (i = 0; i < 256; i++)
+ erofscache_put_directory(volume->fanout[i]);
+error_dir:
+ erofscache_put_directory(volume->dentry);
+error_name:
+ kfree(name);
+error_vol:
+ kfree(volume);
+ erofscache_end_secure(cache, saved_cred);
+}
+
+/*
+ * Release a volume representation.
+ */
+static void __erofscache_free_volume(struct erofscache_volume *volume)
+{
+ int i;
+
+ _enter("");
+
+ volume->vcookie->cache_priv = NULL;
+
+ for (i = 0; i < 256; i++)
+ erofscache_put_directory(volume->fanout[i]);
+ erofscache_put_directory(volume->dentry);
+ kfree(volume);
+}
+
+void erofscache_free_volume(struct fscache_volume *vcookie)
+{
+ struct erofscache_volume *volume = vcookie->cache_priv;
+
+ if (volume) {
+ spin_lock(&volume->cache->object_list_lock);
+ list_del_init(&volume->cache_link);
+ spin_unlock(&volume->cache->object_list_lock);
+ __erofscache_free_volume(volume);
+ }
+}
+
+void erofscache_withdraw_volume(struct erofscache_volume *volume)
+{
+ fscache_withdraw_volume(volume->vcookie);
+ __erofscache_free_volume(volume);
+}
diff --git a/fs/erofscache/xattr.c b/fs/erofscache/xattr.c
new file mode 100644
index 000000000000..1f2408131c9e
--- /dev/null
+++ b/fs/erofscache/xattr.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Erofscache extended attribute management
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+#define EROFSCACHE_COOKIE_TYPE_DATA 1
+
+struct erofscache_xattr {
+ __be64 object_size; /* Actual size of the object */
+ __u8 type; /* Type of object */
+ __u8 data[]; /* netfs coherency data */
+} __packed;
+
+static const char erofscache_xattr_cache[] =
+ XATTR_USER_PREFIX "Erofscache.cache";
+
+/*
+ * check the consistency between the backing cache and the FS-Cache cookie
+ */
+int erofscache_check_auxdata(struct erofscache_object *object, struct file *file)
+{
+ struct erofscache_xattr *buf;
+ struct dentry *dentry = file->f_path.dentry;
+ unsigned int len = object->cookie->aux_len, tlen;
+ const void *p = fscache_get_aux(object->cookie);
+ enum erofscache_coherency_trace why;
+ ssize_t xlen;
+ int ret = -ESTALE;
+
+ tlen = sizeof(struct erofscache_xattr) + len;
+ buf = kmalloc(tlen, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ xlen = erofscache_inject_read_error();
+ if (xlen == 0)
+ xlen = vfs_getxattr(&init_user_ns, dentry, erofscache_xattr_cache, buf, tlen);
+ if (xlen != tlen) {
+ if (xlen < 0)
+ trace_erofscache_vfs_error(object, file_inode(file), xlen,
+ erofscache_trace_getxattr_error);
+ if (xlen == -EIO)
+ erofscache_io_error_obj(
+ object,
+ "Failed to read aux with error %zd", xlen);
+ why = erofscache_coherency_check_xattr;
+ } else if (buf->type != EROFSCACHE_COOKIE_TYPE_DATA) {
+ why = erofscache_coherency_check_type;
+ } else if (memcmp(buf->data, p, len) != 0) {
+ why = erofscache_coherency_check_aux;
+ } else if (be64_to_cpu(buf->object_size) != object->cookie->object_size) {
+ why = erofscache_coherency_check_objsize;
+ } else {
+ why = erofscache_coherency_check_ok;
+ ret = 0;
+ }
+
+ trace_erofscache_coherency(object, file_inode(file)->i_ino, why);
+ kfree(buf);
+ return ret;
+}
+
+/*
+ * remove the object's xattr to mark it stale
+ */
+int erofscache_remove_object_xattr(struct erofscache_cache *cache,
+ struct erofscache_object *object,
+ struct dentry *dentry)
+{
+ int ret;
+
+ ret = erofscache_inject_remove_error();
+ if (ret == 0)
+ ret = vfs_removexattr(&init_user_ns, dentry, erofscache_xattr_cache);
+ if (ret < 0) {
+ trace_erofscache_vfs_error(object, d_inode(dentry), ret,
+ erofscache_trace_remxattr_error);
+ if (ret == -ENOENT || ret == -ENODATA)
+ ret = 0;
+ else if (ret != -ENOMEM)
+ erofscache_io_error(cache,
+ "Can't remove xattr from %lu"
+ " (error %d)",
+ d_backing_inode(dentry)->i_ino, -ret);
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * Check the consistency between the backing cache and the volume cookie.
+ */
+int erofscache_check_volume_xattr(struct erofscache_volume *volume)
+{
+ struct erofscache_xattr *buf;
+ struct dentry *dentry = volume->dentry;
+ unsigned int len = volume->vcookie->coherency_len;
+ const void *p = volume->vcookie->coherency;
+ enum erofscache_coherency_trace why;
+ ssize_t xlen;
+ int ret = -ESTALE;
+
+ _enter("");
+
+ buf = kmalloc(len, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ xlen = erofscache_inject_read_error();
+ if (xlen == 0)
+ xlen = vfs_getxattr(&init_user_ns, dentry, erofscache_xattr_cache, buf, len);
+ if (xlen != len) {
+ if (xlen < 0) {
+ trace_erofscache_vfs_error(NULL, d_inode(dentry), xlen,
+ erofscache_trace_getxattr_error);
+ if (xlen == -EIO)
+ erofscache_io_error(
+ volume->cache,
+ "Failed to read xattr with error %zd", xlen);
+ }
+ why = erofscache_coherency_vol_check_xattr;
+ } else if (memcmp(buf, p, len) != 0) {
+ why = erofscache_coherency_vol_check_cmp;
+ } else {
+ why = erofscache_coherency_vol_check_ok;
+ ret = 0;
+ }
+
+ trace_erofscache_vol_coherency(volume, d_inode(dentry)->i_ino, why);
+ kfree(buf);
+ _leave(" = %d", ret);
+ return ret;
+}
diff --git a/include/trace/events/erofscache.h b/include/trace/events/erofscache.h
new file mode 100644
index 000000000000..96974fd097f0
--- /dev/null
+++ b/include/trace/events/erofscache.h
@@ -0,0 +1,515 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Erofscache tracepoints
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM erofscache
+
+#if !defined(_TRACE_EROFSCACHE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_EROFSCACHE_H
+
+#include <linux/tracepoint.h>
+
+/*
+ * Define enums for tracing information.
+ */
+#ifndef __EROFSCACHE_DECLARE_TRACE_ENUMS_ONCE_ONLY
+#define __EROFSCACHE_DECLARE_TRACE_ENUMS_ONCE_ONLY
+
+enum erofscache_obj_ref_trace {
+ erofscache_obj_get_ioreq,
+ erofscache_obj_new,
+ erofscache_obj_put_alloc_fail,
+ erofscache_obj_put_detach,
+ erofscache_obj_put_ioreq,
+ erofscache_obj_see_clean_commit,
+ erofscache_obj_see_clean_delete,
+ erofscache_obj_see_clean_drop_tmp,
+ erofscache_obj_see_lookup_cookie,
+ erofscache_obj_see_lookup_failed,
+ erofscache_obj_see_withdraw_cookie,
+ erofscache_obj_see_withdrawal,
+};
+
+enum fscache_why_object_killed {
+ FSCACHE_OBJECT_IS_STALE,
+ FSCACHE_OBJECT_IS_WEIRD,
+ FSCACHE_OBJECT_INVALIDATED,
+ FSCACHE_OBJECT_NO_SPACE,
+ FSCACHE_OBJECT_WAS_RETIRED,
+ FSCACHE_OBJECT_WAS_CULLED,
+ FSCACHE_VOLUME_IS_WEIRD,
+};
+
+enum erofscache_coherency_trace {
+ erofscache_coherency_check_aux,
+ erofscache_coherency_check_content,
+ erofscache_coherency_check_dirty,
+ erofscache_coherency_check_len,
+ erofscache_coherency_check_objsize,
+ erofscache_coherency_check_ok,
+ erofscache_coherency_check_type,
+ erofscache_coherency_check_xattr,
+ erofscache_coherency_vol_check_cmp,
+ erofscache_coherency_vol_check_ok,
+ erofscache_coherency_vol_check_xattr,
+};
+
+enum erofscache_prepare_read_trace {
+ erofscache_trace_read_after_eof,
+ erofscache_trace_read_found_hole,
+ erofscache_trace_read_found_part,
+ erofscache_trace_read_have_data,
+ erofscache_trace_read_no_data,
+ erofscache_trace_read_no_file,
+ erofscache_trace_read_seek_error,
+ erofscache_trace_read_seek_nxio,
+};
+
+enum erofscache_error_trace {
+ erofscache_trace_getxattr_error,
+ erofscache_trace_link_error,
+ erofscache_trace_lookup_error,
+ erofscache_trace_open_error,
+ erofscache_trace_read_error,
+ erofscache_trace_remxattr_error,
+ erofscache_trace_rename_error,
+ erofscache_trace_seek_error,
+ erofscache_trace_unlink_error,
+};
+
+#endif
+
+/*
+ * Define enum -> string mappings for display.
+ */
+#define erofscache_obj_kill_traces \
+ EM(FSCACHE_OBJECT_IS_STALE, "stale") \
+ EM(FSCACHE_OBJECT_IS_WEIRD, "weird") \
+ EM(FSCACHE_OBJECT_INVALIDATED, "inval") \
+ EM(FSCACHE_OBJECT_NO_SPACE, "no_space") \
+ EM(FSCACHE_OBJECT_WAS_RETIRED, "was_retired") \
+ EM(FSCACHE_OBJECT_WAS_CULLED, "was_culled") \
+ E_(FSCACHE_VOLUME_IS_WEIRD, "volume_weird")
+
+#define erofscache_obj_ref_traces \
+ EM(erofscache_obj_get_ioreq, "GET ioreq") \
+ EM(erofscache_obj_new, "NEW obj") \
+ EM(erofscache_obj_put_alloc_fail, "PUT alloc_fail") \
+ EM(erofscache_obj_put_detach, "PUT detach") \
+ EM(erofscache_obj_put_ioreq, "PUT ioreq") \
+ EM(erofscache_obj_see_clean_commit, "SEE clean_commit") \
+ EM(erofscache_obj_see_clean_delete, "SEE clean_delete") \
+ EM(erofscache_obj_see_clean_drop_tmp, "SEE clean_drop_tmp") \
+ EM(erofscache_obj_see_lookup_cookie, "SEE lookup_cookie") \
+ EM(erofscache_obj_see_lookup_failed, "SEE lookup_failed") \
+ EM(erofscache_obj_see_withdraw_cookie, "SEE withdraw_cookie") \
+ E_(erofscache_obj_see_withdrawal, "SEE withdrawal")
+
+#define erofscache_coherency_traces \
+ EM(erofscache_coherency_check_aux, "BAD aux ") \
+ EM(erofscache_coherency_check_content, "BAD cont") \
+ EM(erofscache_coherency_check_dirty, "BAD dirt") \
+ EM(erofscache_coherency_check_len, "BAD len ") \
+ EM(erofscache_coherency_check_objsize, "BAD osiz") \
+ EM(erofscache_coherency_check_ok, "OK ") \
+ EM(erofscache_coherency_check_type, "BAD type") \
+ EM(erofscache_coherency_check_xattr, "BAD xatt") \
+ EM(erofscache_coherency_vol_check_cmp, "VOL BAD cmp ") \
+ EM(erofscache_coherency_vol_check_ok, "VOL OK ") \
+ E_(erofscache_coherency_vol_check_xattr,"VOL BAD xatt")
+
+#define erofscache_prepare_read_traces \
+ EM(erofscache_trace_read_after_eof, "after-eof ") \
+ EM(erofscache_trace_read_found_hole, "found-hole") \
+ EM(erofscache_trace_read_found_part, "found-part") \
+ EM(erofscache_trace_read_have_data, "have-data ") \
+ EM(erofscache_trace_read_no_data, "no-data ") \
+ EM(erofscache_trace_read_no_file, "no-file ") \
+ EM(erofscache_trace_read_seek_error, "seek-error") \
+ E_(erofscache_trace_read_seek_nxio, "seek-enxio")
+
+#define erofscache_error_traces \
+ EM(erofscache_trace_getxattr_error, "getxattr") \
+ EM(erofscache_trace_lookup_error, "lookup") \
+ EM(erofscache_trace_open_error, "open") \
+ EM(erofscache_trace_read_error, "read") \
+ EM(erofscache_trace_remxattr_error, "remxattr") \
+ EM(erofscache_trace_rename_error, "rename") \
+ EM(erofscache_trace_seek_error, "seek") \
+ E_(erofscache_trace_unlink_error, "unlink")
+
+
+/*
+ * Export enum symbols via userspace.
+ */
+#undef EM
+#undef E_
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define E_(a, b) TRACE_DEFINE_ENUM(a);
+
+erofscache_obj_kill_traces;
+erofscache_obj_ref_traces;
+erofscache_coherency_traces;
+erofscache_prepare_read_traces;
+erofscache_error_traces;
+
+/*
+ * Now redefine the EM() and E_() macros to map the enums to the strings that
+ * will be printed in the output.
+ */
+#undef EM
+#undef E_
+#define EM(a, b) { a, b },
+#define E_(a, b) { a, b }
+
+
+TRACE_EVENT(erofscache_ref,
+ TP_PROTO(unsigned int object_debug_id,
+ unsigned int cookie_debug_id,
+ int usage,
+ enum erofscache_obj_ref_trace why),
+
+ TP_ARGS(object_debug_id, cookie_debug_id, usage, why),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, cookie )
+ __field(enum erofscache_obj_ref_trace, why )
+ __field(int, usage )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = object_debug_id;
+ __entry->cookie = cookie_debug_id;
+ __entry->usage = usage;
+ __entry->why = why;
+ ),
+
+ TP_printk("c=%08x o=%08x u=%d %s",
+ __entry->cookie, __entry->obj, __entry->usage,
+ __print_symbolic(__entry->why, erofscache_obj_ref_traces))
+ );
+
+TRACE_EVENT(erofscache_lookup,
+ TP_PROTO(struct erofscache_object *obj,
+ struct dentry *dir,
+ struct dentry *de),
+
+ TP_ARGS(obj, dir, de),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(short, error )
+ __field(unsigned long, dino )
+ __field(unsigned long, ino )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->dino = d_backing_inode(dir)->i_ino;
+ __entry->ino = (!IS_ERR(de) && d_backing_inode(de) ?
+ d_backing_inode(de)->i_ino : 0);
+ __entry->error = IS_ERR(de) ? PTR_ERR(de) : 0;
+ ),
+
+ TP_printk("o=%08x dB=%lx B=%lx e=%d",
+ __entry->obj, __entry->dino, __entry->ino, __entry->error)
+ );
+
+TRACE_EVENT(erofscache_unlink,
+ TP_PROTO(struct erofscache_object *obj,
+ ino_t ino,
+ enum fscache_why_object_killed why),
+
+ TP_ARGS(obj, ino, why),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, ino )
+ __field(enum fscache_why_object_killed, why )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : UINT_MAX;
+ __entry->ino = ino;
+ __entry->why = why;
+ ),
+
+ TP_printk("o=%08x B=%x w=%s",
+ __entry->obj, __entry->ino,
+ __print_symbolic(__entry->why, erofscache_obj_kill_traces))
+ );
+
+TRACE_EVENT(erofscache_rename,
+ TP_PROTO(struct erofscache_object *obj,
+ ino_t ino,
+ enum fscache_why_object_killed why),
+
+ TP_ARGS(obj, ino, why),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, ino )
+ __field(enum fscache_why_object_killed, why )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : UINT_MAX;
+ __entry->ino = ino;
+ __entry->why = why;
+ ),
+
+ TP_printk("o=%08x B=%x w=%s",
+ __entry->obj, __entry->ino,
+ __print_symbolic(__entry->why, erofscache_obj_kill_traces))
+ );
+
+TRACE_EVENT(erofscache_coherency,
+ TP_PROTO(struct erofscache_object *obj,
+ ino_t ino,
+ enum erofscache_coherency_trace why),
+
+ TP_ARGS(obj, ino, why),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(enum erofscache_coherency_trace, why )
+ __field(u64, ino )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj->debug_id;
+ __entry->why = why;
+ __entry->ino = ino;
+ ),
+
+ TP_printk("o=%08x %s B=%llx",
+ __entry->obj,
+ __print_symbolic(__entry->why, erofscache_coherency_traces),
+ __entry->ino)
+ );
+
+TRACE_EVENT(erofscache_vol_coherency,
+ TP_PROTO(struct erofscache_volume *volume,
+ ino_t ino,
+ enum erofscache_coherency_trace why),
+
+ TP_ARGS(volume, ino, why),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, vol )
+ __field(enum erofscache_coherency_trace, why )
+ __field(u64, ino )
+ ),
+
+ TP_fast_assign(
+ __entry->vol = volume->vcookie->debug_id;
+ __entry->why = why;
+ __entry->ino = ino;
+ ),
+
+ TP_printk("V=%08x %s B=%llx",
+ __entry->vol,
+ __print_symbolic(__entry->why, erofscache_coherency_traces),
+ __entry->ino)
+ );
+
+TRACE_EVENT(erofscache_prep_read,
+ TP_PROTO(struct netfs_read_subrequest *sreq,
+ enum netfs_read_source source,
+ enum erofscache_prepare_read_trace why,
+ ino_t cache_inode),
+
+ TP_ARGS(sreq, source, why, cache_inode),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, rreq )
+ __field(unsigned short, index )
+ __field(unsigned short, flags )
+ __field(enum netfs_read_source, source )
+ __field(enum erofscache_prepare_read_trace, why )
+ __field(size_t, len )
+ __field(loff_t, start )
+ __field(unsigned int, netfs_inode )
+ __field(unsigned int, cache_inode )
+ ),
+
+ TP_fast_assign(
+ __entry->rreq = sreq->rreq->debug_id;
+ __entry->index = sreq->debug_index;
+ __entry->flags = sreq->flags;
+ __entry->source = source;
+ __entry->why = why;
+ __entry->len = sreq->len;
+ __entry->start = sreq->start;
+ __entry->netfs_inode = sreq->rreq->inode->i_ino;
+ __entry->cache_inode = cache_inode;
+ ),
+
+ TP_printk("R=%08x[%u] %s %s f=%02x s=%llx %zx ni=%x B=%x",
+ __entry->rreq, __entry->index,
+ __print_symbolic(__entry->source, netfs_sreq_sources),
+ __print_symbolic(__entry->why, erofscache_prepare_read_traces),
+ __entry->flags,
+ __entry->start, __entry->len,
+ __entry->netfs_inode, __entry->cache_inode)
+ );
+
+TRACE_EVENT(erofscache_read,
+ TP_PROTO(struct erofscache_object *obj,
+ struct inode *backer,
+ loff_t start,
+ size_t len),
+
+ TP_ARGS(obj, backer, start, len),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, backer )
+ __field(size_t, len )
+ __field(loff_t, start )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj->debug_id;
+ __entry->backer = backer->i_ino;
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk("o=%08x B=%x s=%llx l=%zx",
+ __entry->obj,
+ __entry->backer,
+ __entry->start,
+ __entry->len)
+ );
+
+TRACE_EVENT(erofscache_mark_active,
+ TP_PROTO(struct erofscache_object *obj,
+ struct inode *inode),
+
+ TP_ARGS(obj, inode),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(ino_t, inode )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->inode = inode->i_ino;
+ ),
+
+ TP_printk("o=%08x B=%lx",
+ __entry->obj, __entry->inode)
+ );
+
+TRACE_EVENT(erofscache_mark_failed,
+ TP_PROTO(struct erofscache_object *obj,
+ struct inode *inode),
+
+ TP_ARGS(obj, inode),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(ino_t, inode )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->inode = inode->i_ino;
+ ),
+
+ TP_printk("o=%08x B=%lx",
+ __entry->obj, __entry->inode)
+ );
+
+TRACE_EVENT(erofscache_mark_inactive,
+ TP_PROTO(struct erofscache_object *obj,
+ struct inode *inode),
+
+ TP_ARGS(obj, inode),
+
+ /* Note that obj may be NULL */
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(ino_t, inode )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->inode = inode->i_ino;
+ ),
+
+ TP_printk("o=%08x B=%lx",
+ __entry->obj, __entry->inode)
+ );
+
+TRACE_EVENT(erofscache_vfs_error,
+ TP_PROTO(struct erofscache_object *obj, struct inode *backer,
+ int error, enum erofscache_error_trace where),
+
+ TP_ARGS(obj, backer, error, where),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, backer )
+ __field(enum erofscache_error_trace, where )
+ __field(short, error )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->backer = backer->i_ino;
+ __entry->error = error;
+ __entry->where = where;
+ ),
+
+ TP_printk("o=%08x B=%x %s e=%d",
+ __entry->obj,
+ __entry->backer,
+ __print_symbolic(__entry->where, erofscache_error_traces),
+ __entry->error)
+ );
+
+TRACE_EVENT(erofscache_io_error,
+ TP_PROTO(struct erofscache_object *obj, struct inode *backer,
+ int error, enum erofscache_error_trace where),
+
+ TP_ARGS(obj, backer, error, where),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, backer )
+ __field(enum erofscache_error_trace, where )
+ __field(short, error )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->backer = backer->i_ino;
+ __entry->error = error;
+ __entry->where = where;
+ ),
+
+ TP_printk("o=%08x B=%x %s e=%d",
+ __entry->obj,
+ __entry->backer,
+ __print_symbolic(__entry->where, erofscache_error_traces),
+ __entry->error)
+ );
+
+#endif /* _TRACE_EROFSCACHE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
On 1/26/22 4:27 AM, David Howells wrote:
> David Howells <[email protected]> wrote:
>
>> (1) Duplicate the cachefiles backend. You can discard a lot of it, since
>> much of it is concerned with managing local modifications - which you're
>> not going to do since you have a R/O filesystem and you're looking at
>> importing files into the cache externally to the kernel.
>
> Take the attached as a start. It's completely untested. I've stripped out
> anything to do with writing to the cache, making directories, etc. as that can
> probably be delegated to the on-demand creation. You could drive on-demand
> creation from the points where it would create files. I've put some "TODO"
> comments in there as markers.
Thanks for your inspiring work. Some questions below.
>
> You could also strip out everything to do with invalidation and also make it
> just fail if it encounters a file type that it doesn't like or a file that is
> not correctly labelled for a coherency attribute.
>
> Also, since you aren't intending to write anything or create new files here,
> there's no need to do the space checking - so I've got rid of all that too.
>
> I've also made it open the backing files read only and got rid of the trimming
> to I/O blocksize for DIO purposes. The userspace side can take care of that -
> and, besides, you want to have multiple files within a backing file, right?
>
> You might want to stop it from marking cache *files* in use (but only mark
> directories). It doesn't matter so much as you aren't going to get coherency
> issues from having multiple writers to the same file.
>
> You then need to add a file offset member to the erofscache_object struct, set
> that when the backing file is looked up and add it to the file position in
> erofscache_read(). You also need to look at erofscache_prepare_read(). If
> your files are contiguous complete blobs, that can be a lot simpler.
To be honest, I'm not sure I understand your point correctly. Do you mean
that each file in erofs has only one chunk (and thus corresponds to only
one backing blob file), so that netfs lib can work with the single cookie
associated with the netfs file?
By the way, let me explain the blob mapping in erofs further. To
implement deduplication, one erofs file can be divided into multiple
chunks, and these chunks can be distributed over several backing blob
files in an arbitrary pattern (rather than round-robin). Each erofs file
maintains an on-disk map describing the mapping between chunks and
backing blob files, something like an extent map. Thus there is a
many-to-many relationship between erofs files and backing blob files.
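To make the mapping concrete, here is a minimal userspace sketch of such a
lookup. The struct layout and the names (chunk_map_entry, chunk_lookup)
are made up for illustration and do not reflect the actual erofs on-disk
format; it only assumes fixed-size chunks and one map entry per chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one chunk-map entry (not the erofs format). */
struct chunk_map_entry {
	uint16_t blob_index;	/* which backing blob file holds the chunk */
	uint64_t blob_offset;	/* byte offset of the chunk inside that blob */
};

/* Result of a lookup: where the requested byte lives. */
struct chunk_location {
	uint16_t blob_index;
	uint64_t blob_offset;
};

/*
 * Translate a logical file offset into (blob, offset inside blob).
 * Returns 0 on success, -1 if the offset is not covered by the map.
 */
int chunk_lookup(const struct chunk_map_entry *map, size_t nr_chunks,
		 uint64_t chunk_size, uint64_t file_offset,
		 struct chunk_location *loc)
{
	uint64_t idx;

	if (chunk_size == 0 || file_offset / chunk_size >= nr_chunks)
		return -1;

	idx = file_offset / chunk_size;
	loc->blob_index = map[idx].blob_index;
	loc->blob_offset = map[idx].blob_offset + file_offset % chunk_size;
	return 0;
}
```

Adjacent chunks of the same file may resolve to different blobs, which is
why a single cookie per file is not enough here.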
Thus each erofs file can correspond to multiple cookies, i.e. one
'struct netfs_i_context' can correspond to multiple cookies.
Besides, the mapping between chunks and backing blob files is
implemented entirely in the upper fs (i.e. erofs), so I have no idea how
we can "do the blob mapping in the backend" [1]. So I don't think we can
use netfs lib **directly** even with this R/O fscache backend
implemented. Please correct me if I have misunderstood.
[1]
https://lore.kernel.org/lkml/[email protected]/T/#mfbb2053476760d8fac723c57dad529192a5084c6
Besides, IMHO implementing a new R/O backend may be quite challenging,
since there is a lot of code duplication. I know it's just a starting
version from scratch, but I'm not sure if it's worth it.
--
Thanks,
Jeffle
On 1/26/22 12:15 AM, David Howells wrote:
> Jeffle Xu <[email protected]> wrote:
>
>> The following issues still need further discussion. Thanks for your time
>> and patience.
>>
>> 1. I noticed that there's refactoring of netfs library[1],
>> ...
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-lib
>
> Yes. I'm working towards getting netfslib to handle writes and dio as
> well as reads, along with content crypto/compression, and the idea I'm aiming
> towards is that you just point your address_space_ops at netfs directly if
> possible - but it's going to require its own context now to manage pending
> writes.
>
> See my netfs-experimental branch for more of that - it's still a work in
> progress, though.
Got it.
>
> Btw, you could set rreq->netfs_priv in ->init_rreq() rather than passing it in
> to netfs_readpage().
>
>> 2. The current implementation will severely conflict with the
>> refactoring of netfs library[1][2]. The assumption of 'struct
>> netfs_i_context' [2] is that, every file in the upper netfs will
>> correspond to only one backing file. While in our scenario, one file in
>> erofs can correspond to multiple backing files. That is, the content of
>> one file can be divided into multiple chunks, which are distributed over
>> multiple blob files, i.e. multiple backing files. Currently I have no
>> good idea for solving this conflict.
>
> I can think of a couple of options to explore:
>
> (1) Duplicate the cachefiles backend. You can discard a lot of it, since
> much of it is concerned with managing local modifications - which you're
> not going to do since you have a R/O filesystem and you're looking at
> importing files into the cache externally to the kernel.
>
> I would suggest looking to see if you can do the blob mapping in the
> backend rather than passing the offset down. Maybe make the cookie index
> key hold the index too, e.g. "/path/to/file+offset".
This has been discussed in [1].
[1]
https://lore.kernel.org/lkml/[email protected]/T/#m25b1229f96bf24929fb73746a07e9996e8222ac6
"/path/to/file+offset"
^
Besides, what does the 'offset' mean?
>
> Btw, do you still need cachefilesd for its culling duties?
Yes, we still need cache management in this on-demand scenario, in case
the backing files exhaust the available blocks. (Though these backing
files are prepared by the daemon in advance, they can all be sparse
files.) Similarly, the actual culling work should be done under the
protection of S_KERNEL_FILE, so that the culled backing file can't be
picked back up.
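The claim step can be modelled in userspace roughly as below. This is only
a sketch: MODEL_S_KERNEL_FILE and struct model_inode are stand-ins I made
up, and an atomic fetch-or replaces the inode lock that the kernel code
takes around the flag check.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Stand-in for the kernel's S_KERNEL_FILE inode flag (value is arbitrary). */
#define MODEL_S_KERNEL_FILE 0x1u

/* Stand-in for the backing file's inode; only the flags word is modelled. */
struct model_inode {
	atomic_uint flags;
};

/*
 * Try to claim a backing file before culling it.
 * Returns 0 if the caller now owns the file, -EBUSY if someone (the cache
 * or another culler) has already marked it in use.
 */
int model_claim_for_cull(struct model_inode *inode)
{
	unsigned int old = atomic_fetch_or(&inode->flags, MODEL_S_KERNEL_FILE);

	return (old & MODEL_S_KERNEL_FILE) ? -EBUSY : 0;
}

/* Drop the claim again, e.g. when the owner has finished with the file. */
void model_release(struct model_inode *inode)
{
	atomic_fetch_and(&inode->flags, ~MODEL_S_KERNEL_FILE);
}
```

Since the test and the set happen as one step, a file claimed for culling
cannot be picked up by the cache in between, and vice versa.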
>
> (2) Do you actually need to go through netfslib? Might it be easier to call
> fscache_read() directly? Have a look at fs/nfs/fscache.c
It would be great if we can use fscache_read() directly.
>
>> Besides there are still two questions:
>> - What's the plan of [1]? When is it planned to be merged?
>
> Hopefully next merge window, but that's going to depend on a number of things.
>
>> - It seems that all upper fs using fscache is going to use netfs API,
>> while the APIs like fscache_read_or_alloc_page() are deprecated. Is
>> that true?
>
> fscache_read_or_alloc_page() is gone completely.
>
> You don't have to use the netfs API. You can talk to fscache directly,
> doing DIO from the cache to an xarray-class iov_iter constructed from your
> inode's pagecache.
>
> netfslib provides/will provide a number of services, such as multipage
> folios, transparent caching, crypto, compression and hiding the existence of
> pages/folios from the filesystem as entirely as possible. However, you
> already have some of these implemented on top of iomap for the blockdev
> interface, it would appear.
>
Got it.
In summary,
1) I prefer option 2, i.e. calling fscache_read() directly, as the
approach for now. In this case, the conflict with the netfs lib
refactoring can be avoided. Besides, fewer modifications to
cachefiles/netfs are needed. Patches 1-3 are no longer required, while
patches 4-6, which mainly introduce the new devnode, are still needed.
2) Later we can change to option 1, i.e. using netfs lib together with a
potential new R/O backend, if the issues in [1] can be clarified or solved.
[1]
https://lore.kernel.org/lkml/[email protected]/T/#m25b1229f96bf24929fb73746a07e9996e8222ac6
--
Thanks,
Jeffle
On 1/25/22 11:34 PM, David Howells wrote:
> Jeffle Xu <[email protected]> wrote:
>
>> +static int erofs_fscahce_init_ctx(struct erofs_fscache_context *ctx,
>
> fscahce => fscache?
>
Right. Thanks.
--
Thanks,
Jeffle
JeffleXu <[email protected]> wrote:
> "/path/to/file+offset"
> ^
>
> Besides, what does the 'offset' mean?
Assuming you're storing multiple erofs files within the same backend file, you
need to tell the cache backend how to find the data. Assuming the erofs
data is arranged such that each erofs file is a single contiguous region, then
you need a pathname and a file offset to find one of them.
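As a rough illustration only (userspace C, hypothetical names, decimal
offset assumed; nothing here is an existing fscache interface), such a key
could be split and applied like this:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of a "/path/to/file+offset" index key. */
struct blob_key {
	char path[256];			/* backing file pathname */
	unsigned long long offset;	/* start of the region in that file */
};

/*
 * Split a key at its last '+' into pathname and decimal offset.
 * Returns 0 on success, -EINVAL if the key is malformed or too long.
 */
int parse_blob_key(const char *key, struct blob_key *out)
{
	const char *plus = strrchr(key, '+');
	size_t path_len;
	char *end;

	/* Need a non-empty path and an offset that starts with a digit. */
	if (!plus || plus == key || !isdigit((unsigned char)plus[1]))
		return -EINVAL;

	path_len = (size_t)(plus - key);
	if (path_len >= sizeof(out->path))
		return -EINVAL;

	errno = 0;
	out->offset = strtoull(plus + 1, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;

	memcpy(out->path, key, path_len);
	out->path[path_len] = '\0';
	return 0;
}

/* Position in the backing file for a read at @pos within the erofs file. */
unsigned long long backing_pos(const struct blob_key *key,
			       unsigned long long pos)
{
	return key->offset + pos;
}
```

This only works when each erofs file occupies one contiguous region of the
backing file, as assumed above.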
David
On 1/26/22 4:51 PM, David Howells wrote:
> JeffleXu <[email protected]> wrote:
>
>> "/path/to/file+offset"
>> ^
>>
>> Besides, what does the 'offset' mean?
>
> Assuming you're storing multiple erofs files within the same backend file, you
> need to tell the cache backend how to find the data. Assuming the erofs
> data is arranged such that each erofs file is a single contiguous region, then
> you need a pathname and a file offset to find one of them.
>
Alright. In fact, one erofs file can contain multiple chunks, which can
correspond to different backing blob files. Thus, for now, I will use
fscache_read() directly to push this feature forward.
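Roughly, a read that spans several chunks then has to be split at chunk
boundaries, with one read issued per piece against whichever blob holds
that chunk. A minimal sketch of the splitting step, assuming fixed-size
chunks (split_read and struct read_segment are hypothetical names, not
part of the patch set):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a request that lies entirely within a single chunk. */
struct read_segment {
	uint64_t file_offset;	/* logical start within the erofs file */
	uint64_t len;		/* number of bytes in this piece */
};

/*
 * Split [offset, offset + len) at chunk boundaries.
 * Returns the number of segments stored in @segs, or -1 if @chunk_size is
 * zero or @segs has room for fewer than the required number of segments.
 */
int split_read(uint64_t offset, uint64_t len, uint64_t chunk_size,
	       struct read_segment *segs, size_t max_segs)
{
	size_t n = 0;

	if (chunk_size == 0)
		return -1;

	while (len > 0) {
		/* Bytes left before the next chunk boundary. */
		uint64_t room = chunk_size - offset % chunk_size;
		uint64_t take = len < room ? len : room;

		if (n == max_segs)
			return -1;
		segs[n].file_offset = offset;
		segs[n].len = take;
		n++;

		offset += take;
		len -= take;
	}
	return (int)n;
}
```

Each resulting segment would then be looked up in the chunk map and read
from the corresponding blob.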
Thanks a lot.
--
Thanks,
Jeffle