2019-08-15 04:45:57

by Gao Xiang

Subject: [PATCH v8 00/24] erofs: promote erofs from staging v8

[I've stripped the previous cover letter; the old one can be found in v6:
https://lore.kernel.org/r/[email protected]/]

We'd like to submit a formal patch moving EROFS out of the staging
tree for 5.4. Before that, we'd like to hear any ACKs, suggestions,
NAKs, or objections to EROFS, so that we can either improve it in
this round or rethink the whole approach.

As the related materials [1] [2] mention, the goal of EROFS is to
save storage space with guaranteed end-to-end performance for
read-only files. Based on fixed-sized output compression and
in-place decompression, it performs better than the existing Linux
compression filesystems. With proper CPU-storage combinations, it
even outperforms generic uncompressed filesystems across a wide
range of compression ratios. We think this direction is correct:
a dedicated kernel team is continuously and actively working on
improving it, and there are enough testers and beta / end users
relying on it.
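
To make the compression model concrete, here is a tiny standalone
sketch (not erofs code; compress_upto() is a made-up stub standing in
for a real compressor): with fixed-sized output, every physical
cluster occupies exactly one 4KiB block on disk, while the amount of
logical (uncompressed) data each cluster carries varies. By contrast,
fixed-sized input schemes fix the uncompressed chunk size and emit
variable-sized compressed results.

#include <stdio.h>

/*
 * hypothetical helper (an assumption, not a real API): consume as
 * much input as compresses into exactly `outsz' bytes, stubbed here
 * as a constant 2:1 ratio.
 */
static size_t compress_upto(const unsigned char *in, size_t insz,
			    size_t outsz)
{
	size_t consumed = outsz * 2;

	(void)in;	/* a real compressor would read the input here */
	return consumed < insz ? consumed : insz;
}

int main(void)
{
	static unsigned char data[64 * 1024];
	size_t pos = 0, pcluster = 0;

	while (pos < sizeof(data)) {
		/* each iteration fills exactly one 4KiB on-disk block */
		size_t n = compress_upto(data + pos, sizeof(data) - pos,
					 4096);

		printf("pcluster %zu: logical [%zu, %zu) -> one 4KiB block\n",
		       pcluster++, pos, pos + n);
		pos += n;
	}
	return 0;
}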

EROFS has been deployed on almost all in-service HUAWEI smartphones
(yes, the number is still increasing over time) and it appears to be
a success. It can also be used in much wider scenarios. We think it's
useful for the Linux / Android OS community and that it's time to
move it out of staging.

To get started, the latest stable mkfs.erofs is available at

git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b dev

with a README in the repository.

We are still tuning sequential read performance for ultra-fast
NVMe SSDs such as the Samsung 970 PRO, but you can already try it
on your PC with some data of a reasonable compression ratio, the
latest Linux kernel, a USB stick for convenience's sake, and a
not-too-old CPU. Benchmarks are also available in the materials
mentioned above.

EROFS is a self-contained filesystem driver. Although there are
still some TODOs to make it more generic, we will actively keep
developing / tuning EROFS as the Linux kernel evolves, just like
the other in-kernel filesystems.

As I mentioned at LSF/MM 2019, in the future we'd like to
generalize the decompression engine into a library for other
filesystems to use once the whole system is mature, much as was
done for fscrypt. However, such metadata would still have to be
designed separately for each filesystem, and their synchronous
metadata read cost would be larger than EROFS's because of those
on-disk limitations. Therefore EROFS remains a better choice for
read-only scenarios.

EROFS is now ready for review and for the move, and the code has
already been cleaned up until it shines like a polished floor...
Please kindly take some precious time to share your comments about
EROFS and let us know your opinion. It's really important to us
since, generally speaking, we prefer Linux _in-tree_ code to
poorly supported out-of-tree / orphan code.

Thank you in advance,
Gao Xiang

[1] https://kccncosschn19eng.sched.com/event/Nru2/erofs-an-introduction-and-our-smartphone-practice-xiang-gao-huawei
[2] https://www.usenix.org/conference/atc19/presentation/gao

Changelog from v7:
o keep up with the latest staging tree in addition to
the latest staging patch:
https://lore.kernel.org/r/[email protected]/
- use EUCLEAN for fs corruption cases suggested by Pavel;
- turn EIO into EOPNOTSUPP for unsupported on-disk format;
- turn all misused ENOTSUPP into EOPNOTSUPP as pointed out by Chao;
o update cover letter

It can also be found in git, tagged "erofs_2019-08-15" (to be pushed shortly), at:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/

and the latest fs code is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/tree/fs/erofs?h=erofs-outofstaging

Changelog from v6:
o keep up with the latest staging patchset
https://lore.kernel.org/linux-fsdevel/[email protected]/
in order to fix the following cases:
- inline erofs_inode_is_data_compressed() in erofs_fs.h;
- remove incomplete cleancache;
- remove all BUG_ON in EROFS.
o the removal of file names from the comments at the top of each
file, as suggested by Stephen, will be applied in the real moving
patch later.

Changelog from v5:
o keep up with "[PATCH v2] staging: erofs: updates according to erofs-outofstaging v4"
https://lore.kernel.org/lkml/[email protected]/
which mainly addresses review comments from Chao:
- keep the macro EROFS_IO_MAX_RETRIES_NOFAIL in internal.h;
- kill a redundant NULL check in "__stagingpage_alloc";
- add some descriptions in document about "use_vmap";
- rearrange erofs_vmap of "staging: erofs: kill CONFIG_EROFS_FS_USE_VM_MAP_RAM";

o all changes have been merged into the staging tree and are now in staging-testing:
https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git/log/?h=staging-testing

Changelog from v4:
o rebase on Linux 5.3-rc1;

o keep up with "staging: erofs: updates according to erofs-outofstaging v4"
in order to keep the main code bit-for-bit identical with the staging tree:
https://lore.kernel.org/lkml/[email protected]/

Changelog from v3:
o use GPL-2.0-only for SPDX-License-Identifier suggested by Stephen;

o kill all kconfig cache strategies and turn them into the mount
option "cache_strategy={disable|readahead|readaround}" as suggested
by Ted. As a first step, cached pages remain usable after the cache
is disabled by remounting, and these pages will fall out over time,
which can be refined in a later version if some requirement arises.
The related documentation is updated as well;

o turn on CONFIG_EROFS_FS_SECURITY by default suggested by David;

o kill CONFIG_EROFS_FS_IO_MAX_RETRIES and fold it into the code;
turn EROFS_FS_USE_VM_MAP_RAM into a module parameter ("use_vmap")
as suggested by David.

Changelog from v2:
o kill sbi->dev_name and clean up all failure handling in
fill_super() as suggested by Al.
Note that the initialization of managed_cache is now moved after
s_root is assigned, since it's preferable to iput() it in
.put_super(), and all inodes should be evicted before the end of
generic_shutdown_super(sb);

o fold in the following staging patches (and thanks):
staging: erofs: converting all 'unsigned' to 'unsigned int'
staging: erofs: Remove function erofs_kill_sb()
- however, it was reverted since erofs_kill_sb() is reused...
staging: erofs: avoid opened loop codes
staging: erofs: support bmap

o move EROFS_SUPER_MAGIC_V1 from linux/fs/erofs/erofs_fs.h to
include/uapi/linux/magic.h for userspace utilities.

Changelog from v1:
o resend the whole filesystem as a patchset, as suggested by Greg;
o the code is cleaner, especially the decompression frontend.

Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Darrick J . Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: Li Guifu <[email protected]>
Cc: Fang Wei <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>


Gao Xiang (24):
erofs: add on-disk layout
erofs: add erofs in-memory stuffs
erofs: add super block operations
erofs: add raw address_space operations
erofs: add inode operations
erofs: support special inode
erofs: add directory operations
erofs: add namei functions
erofs: support tracepoint
erofs: update Kconfig and Makefile
erofs: introduce xattr & posixacl support
erofs: introduce tagged pointer
erofs: add compression indexes support
erofs: introduce superblock registration
erofs: introduce erofs shrinker
erofs: introduce workstation for decompression
erofs: introduce per-CPU buffers implementation
erofs: introduce pagevec for decompression subsystem
erofs: add erofs_allocpage()
erofs: introduce generic decompression backend
erofs: introduce LZ4 decompression inplace
erofs: introduce the decompression frontend
erofs: introduce cached decompression
erofs: add document

Documentation/filesystems/erofs.txt | 225 +++++
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/erofs/Kconfig | 98 ++
fs/erofs/Makefile | 11 +
fs/erofs/compress.h | 62 ++
fs/erofs/data.c | 425 ++++++++
fs/erofs/decompressor.c | 360 +++++++
fs/erofs/dir.c | 148 +++
fs/erofs/erofs_fs.h | 316 ++++++
fs/erofs/inode.c | 333 +++++++
fs/erofs/internal.h | 555 +++++++++++
fs/erofs/namei.c | 253 +++++
fs/erofs/super.c | 666 +++++++++++++
fs/erofs/tagptr.h | 110 +++
fs/erofs/utils.c | 335 +++++++
fs/erofs/xattr.c | 705 ++++++++++++++
fs/erofs/xattr.h | 94 ++
fs/erofs/zdata.c | 1405 +++++++++++++++++++++++++++
fs/erofs/zdata.h | 195 ++++
fs/erofs/zmap.c | 463 +++++++++
fs/erofs/zpvec.h | 159 +++
include/trace/events/erofs.h | 256 +++++
include/uapi/linux/magic.h | 1 +
24 files changed, 7177 insertions(+)
create mode 100644 Documentation/filesystems/erofs.txt
create mode 100644 fs/erofs/Kconfig
create mode 100644 fs/erofs/Makefile
create mode 100644 fs/erofs/compress.h
create mode 100644 fs/erofs/data.c
create mode 100644 fs/erofs/decompressor.c
create mode 100644 fs/erofs/dir.c
create mode 100644 fs/erofs/erofs_fs.h
create mode 100644 fs/erofs/inode.c
create mode 100644 fs/erofs/internal.h
create mode 100644 fs/erofs/namei.c
create mode 100644 fs/erofs/super.c
create mode 100644 fs/erofs/tagptr.h
create mode 100644 fs/erofs/utils.c
create mode 100644 fs/erofs/xattr.c
create mode 100644 fs/erofs/xattr.h
create mode 100644 fs/erofs/zdata.c
create mode 100644 fs/erofs/zdata.h
create mode 100644 fs/erofs/zmap.c
create mode 100644 fs/erofs/zpvec.h
create mode 100644 include/trace/events/erofs.h

--
2.17.1


2019-08-15 04:46:00

by Gao Xiang

Subject: [PATCH v8 22/24] erofs: introduce the decompression frontend

This patch introduces the basic in-place, fixed-sized output
decompression implementation for the erofs filesystem.

In contrast to fixed-sized input compression, it gives each
compressed cluster a fixed-sized capacity to hold compressed
data, with the following advantages:
1) improved storage density;
2) support for in-place decompression;
3) all data in a compressed physical cluster can be
decompressed and utilized.

The key point of "in-place" refers to one of the erofs
decompression strategies: instead of allocating extra
compressed pages and data management structures, it reuses
the allocated file cache pages as much as possible to store
the compressed data (called in-place I/O) and the corresponding
pagevec in a time-sharing manner, which is particularly useful
for low-memory scenarios.

In addition, the in-place decompression technique is built on
in-place I/O, which eliminates the extra page allocations and
all extra memcpy of compressed data.
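
As a rough illustration of the idea (a minimal sketch only; a toy
RLE codec stands in for the real LZ4 path in this patch): the
compressed bytes are parked at the tail of the very buffer that
receives the output, and expansion proceeds from the front, so no
separate compressed buffer or extra memcpy is needed as long as the
write position never catches up with the unread input:

/*
 * Toy in-place expansion (RLE stands in for LZ4; this is only an
 * illustration, not the erofs code): the compressed (count, byte)
 * pairs live in the tail of the buffer that receives the output.
 */
#include <stdio.h>
#include <string.h>

static size_t rle_expand_inplace(unsigned char *buf, size_t bufsz,
				 size_t compsz)
{
	const unsigned char *src = buf + bufsz - compsz;
	size_t out = 0, i;

	for (i = 0; i + 1 < compsz; i += 2) {
		unsigned char cnt = src[i], c = src[i + 1];

		/* writes must stay behind the not-yet-read input */
		if (out + cnt > bufsz - compsz + i + 2)
			return 0;
		memset(buf + out, c, cnt);
		out += cnt;
	}
	return out;
}

int main(void)
{
	unsigned char buf[16];
	const unsigned char comp[] = { 5, 'a', 3, 'b' };
	size_t n;

	memcpy(buf + sizeof(buf) - sizeof(comp), comp, sizeof(comp));
	n = rle_expand_inplace(buf, sizeof(buf), sizeof(comp));
	printf("%zu bytes: %.*s\n", n, (int)n, buf); /* 8 bytes: aaaaabbb */
	return 0;
}

The real code in this patch applies the same invariant at page
granularity: try_inplace_io() reuses file pages as compressed_pages[]
slots, and the decompressor is informed through the inplace_io flag.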

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/Kconfig | 1 +
fs/erofs/Makefile | 2 +-
fs/erofs/internal.h | 14 +-
fs/erofs/super.c | 11 +
fs/erofs/zdata.c | 1254 +++++++++++++++++++++++++++++++++++++++++++
fs/erofs/zdata.h | 192 +++++++
fs/erofs/zmap.c | 4 +-
7 files changed, 1474 insertions(+), 4 deletions(-)
create mode 100644 fs/erofs/zdata.c
create mode 100644 fs/erofs/zdata.h

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index 5f8787c0cf89..16316d1adca3 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -76,6 +76,7 @@ config EROFS_FS_ZIP
bool "EROFS Data Compression Support"
depends on EROFS_FS
select LZ4_DECOMPRESS
+ default y
help
Enable fixed-sized output compression for EROFS.

diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 5594abca6f95..46f2aa4ba46c 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -7,5 +7,5 @@ ccflags-y += -DEROFS_VERSION=\"$(EROFS_VERSION)\"
obj-$(CONFIG_EROFS_FS) += erofs.o
erofs-objs := super.o inode.o data.o namei.o dir.o utils.o
erofs-$(CONFIG_EROFS_FS_XATTR) += xattr.o
-erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o
+erofs-$(CONFIG_EROFS_FS_ZIP) += decompressor.o zmap.o zdata.o

diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 9dc3d47347db..2be1ae700aca 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -68,6 +68,9 @@ struct erofs_sb_info {
/* the dedicated workstation for compression */
struct radix_tree_root workstn_tree;

+ /* threshold for decompression synchronously */
+ unsigned int max_sync_decompress_pages;
+
unsigned int shrinker_run_no;
#endif /* CONFIG_EROFS_FS_ZIP */
u32 blocks;
@@ -327,6 +330,9 @@ static inline bool is_inode_flat_inline(struct inode *inode)
extern const struct super_operations erofs_sops;

extern const struct address_space_operations erofs_raw_access_aops;
+#ifdef CONFIG_EROFS_FS_ZIP
+extern const struct address_space_operations z_erofs_vle_normalaccess_aops;
+#endif

/*
* Logical to physical block mapping, used by erofs_map_blocks()
@@ -487,7 +493,7 @@ int erofs_namei(struct inode *dir, struct qstr *name,
/* dir.c */
extern const struct file_operations erofs_dir_fops;

-/* utils.c */
+/* utils.c / zdata.c */
struct page *erofs_allocpage(struct list_head *pool, gfp_t gfp, bool nofail);

#if (EROFS_PCPUBUF_NR_PAGES > 0)
@@ -511,16 +517,20 @@ struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
pgoff_t index, bool *tag);
int erofs_register_workgroup(struct super_block *sb,
struct erofs_workgroup *grp, bool tag);
-static inline void erofs_workgroup_free_rcu(struct erofs_workgroup *grp) {}
+void erofs_workgroup_free_rcu(struct erofs_workgroup *grp);
void erofs_shrinker_register(struct super_block *sb);
void erofs_shrinker_unregister(struct super_block *sb);
int __init erofs_init_shrinker(void);
void erofs_exit_shrinker(void);
+int __init z_erofs_init_zip_subsystem(void);
+void z_erofs_exit_zip_subsystem(void);
#else
static inline void erofs_shrinker_register(struct super_block *sb) {}
static inline void erofs_shrinker_unregister(struct super_block *sb) {}
static inline int erofs_init_shrinker(void) { return 0; }
static inline void erofs_exit_shrinker(void) {}
+static inline int z_erofs_init_zip_subsystem(void) { return 0; }
+static inline void z_erofs_exit_zip_subsystem(void) {}
#endif /* !CONFIG_EROFS_FS_ZIP */

#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index ea8d065068fa..bdac8abf3aa7 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -200,6 +200,9 @@ static unsigned int erofs_get_fault_rate(struct erofs_sb_info *sbi)
/* set up default EROFS parameters */
static void default_options(struct erofs_sb_info *sbi)
{
+#ifdef CONFIG_EROFS_FS_ZIP
+ sbi->max_sync_decompress_pages = 3;
+#endif
#ifdef CONFIG_EROFS_FS_XATTR
set_opt(sbi, XATTR_USER);
#endif
@@ -420,6 +423,11 @@ static int __init erofs_module_init(void)
err = erofs_init_shrinker();
if (err)
goto shrinker_err;
+
+ err = z_erofs_init_zip_subsystem();
+ if (err)
+ goto zip_err;
+
err = register_filesystem(&erofs_fs_type);
if (err)
goto fs_err;
@@ -428,6 +436,8 @@ static int __init erofs_module_init(void)
return 0;

fs_err:
+ z_erofs_exit_zip_subsystem();
+zip_err:
erofs_exit_shrinker();
shrinker_err:
erofs_exit_inode_cache();
@@ -438,6 +448,7 @@ static int __init erofs_module_init(void)
static void __exit erofs_module_exit(void)
{
unregister_filesystem(&erofs_fs_type);
+ z_erofs_exit_zip_subsystem();
erofs_exit_shrinker();
erofs_exit_inode_cache();
infoln("successfully finalize erofs");
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
new file mode 100644
index 000000000000..ffa64b1d8908
--- /dev/null
+++ b/fs/erofs/zdata.c
@@ -0,0 +1,1254 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/fs/erofs/zdata.c
+ *
+ * Copyright (C) 2018 HUAWEI, Inc.
+ * http://www.huawei.com/
+ * Created by Gao Xiang <[email protected]>
+ */
+#include "zdata.h"
+#include "compress.h"
+#include <linux/prefetch.h>
+
+#include <trace/events/erofs.h>
+
+/*
+ * a compressed_pages[] placeholder in order to avoid
+ * being filled with file pages for in-place decompression.
+ */
+#define PAGE_UNALLOCATED ((void *)0x5F0E4B1D)
+
+/* how to allocate cached pages for a pcluster */
+enum z_erofs_cache_alloctype {
+ DONTALLOC, /* don't allocate any cached pages */
+ DELAYEDALLOC, /* delayed allocation (at the time of submitting io) */
+};
+
+/*
+ * tagged pointer with 1-bit tag for all compressed pages
+ * tag 0 - the page is just found with an extra page reference
+ */
+typedef tagptr1_t compressed_page_t;
+
+#define tag_compressed_page_justfound(page) \
+ tagptr_fold(compressed_page_t, page, 1)
+
+static struct workqueue_struct *z_erofs_workqueue __read_mostly;
+static struct kmem_cache *pcluster_cachep __read_mostly;
+
+void z_erofs_exit_zip_subsystem(void)
+{
+ destroy_workqueue(z_erofs_workqueue);
+ kmem_cache_destroy(pcluster_cachep);
+}
+
+static inline int init_unzip_workqueue(void)
+{
+ const unsigned int onlinecpus = num_possible_cpus();
+ const unsigned int flags = WQ_UNBOUND | WQ_HIGHPRI | WQ_CPU_INTENSIVE;
+
+ /*
+ * no need to spawn too many threads; limiting threads can minimize
+ * scheduling overhead, though perhaps per-CPU threads would be better?
+ */
+ z_erofs_workqueue = alloc_workqueue("erofs_unzipd", flags,
+ onlinecpus + onlinecpus / 4);
+ return z_erofs_workqueue ? 0 : -ENOMEM;
+}
+
+static void init_once(void *ptr)
+{
+ struct z_erofs_pcluster *pcl = ptr;
+ struct z_erofs_collection *cl = z_erofs_primarycollection(pcl);
+ unsigned int i;
+
+ mutex_init(&cl->lock);
+ cl->nr_pages = 0;
+ cl->vcnt = 0;
+ for (i = 0; i < Z_EROFS_CLUSTER_MAX_PAGES; ++i)
+ pcl->compressed_pages[i] = NULL;
+}
+
+static void init_always(struct z_erofs_pcluster *pcl)
+{
+ struct z_erofs_collection *cl = z_erofs_primarycollection(pcl);
+
+ atomic_set(&pcl->obj.refcount, 1);
+
+ DBG_BUGON(cl->nr_pages);
+ DBG_BUGON(cl->vcnt);
+}
+
+int __init z_erofs_init_zip_subsystem(void)
+{
+ pcluster_cachep = kmem_cache_create("erofs_compress",
+ Z_EROFS_WORKGROUP_SIZE, 0,
+ SLAB_RECLAIM_ACCOUNT, init_once);
+ if (pcluster_cachep) {
+ if (!init_unzip_workqueue())
+ return 0;
+
+ kmem_cache_destroy(pcluster_cachep);
+ }
+ return -ENOMEM;
+}
+
+enum z_erofs_collectmode {
+ COLLECT_SECONDARY,
+ COLLECT_PRIMARY,
+ /*
+ * The current collection is the tail of an existing chain; in
+ * addition, the previously processed chained collections have all
+ * been decided to be hooked up to it.
+ * A new chain will be created for the remaining collections which are
+ * not processed yet; therefore, unlike COLLECT_PRIMARY_FOLLOWED,
+ * the next collection cannot reuse the whole page safely in
+ * the following scenario:
+ * ________________________________________________________________
+ * | tail (partial) page | head (partial) page |
+ * | (belongs to the next cl) | (belongs to the current cl) |
+ * |_______PRIMARY_FOLLOWED_______|________PRIMARY_HOOKED___________|
+ */
+ COLLECT_PRIMARY_HOOKED,
+ COLLECT_PRIMARY_FOLLOWED_NOINPLACE,
+ /*
+ * The current collection has been linked with the owned chain, and
+ * could also be linked with the remaining collections, which means,
+ * if the processing page is the tail page of the collection,
+ * the current collection can safely use the whole page (since
+ * the previous collection is under control) for in-place I/O, as
+ * illustrated below:
+ * ________________________________________________________________
+ * | tail (partial) page | head (partial) page |
+ * | (of the current cl) | (of the previous collection) |
+ * | PRIMARY_FOLLOWED or | |
+ * |_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|
+ *
+ * [ (*) the above page can be used as inplace I/O. ]
+ */
+ COLLECT_PRIMARY_FOLLOWED,
+};
+
+struct z_erofs_collector {
+ struct z_erofs_pagevec_ctor vector;
+
+ struct z_erofs_pcluster *pcl;
+ struct z_erofs_collection *cl;
+ struct page **compressedpages;
+ z_erofs_next_pcluster_t owned_head;
+
+ enum z_erofs_collectmode mode;
+};
+
+struct z_erofs_decompress_frontend {
+ struct inode *const inode;
+
+ struct z_erofs_collector clt;
+ struct erofs_map_blocks map;
+
+ /* used for applying cache strategy on the fly */
+ bool backmost;
+ erofs_off_t headoffset;
+};
+
+#define COLLECTOR_INIT() { \
+ .owned_head = Z_EROFS_PCLUSTER_TAIL, \
+ .mode = COLLECT_PRIMARY_FOLLOWED }
+
+#define DECOMPRESS_FRONTEND_INIT(__i) { \
+ .inode = __i, .clt = COLLECTOR_INIT(), \
+ .backmost = true, }
+
+static struct page *z_pagemap_global[Z_EROFS_VMAP_GLOBAL_PAGES];
+static DEFINE_MUTEX(z_pagemap_global_lock);
+
+static void preload_compressed_pages(struct z_erofs_collector *clt,
+ struct address_space *mc,
+ enum z_erofs_cache_alloctype type,
+ struct list_head *pagepool)
+{
+ /* nowhere to load compressed pages from */
+}
+
+/* page_type must be Z_EROFS_PAGE_TYPE_EXCLUSIVE */
+static inline bool try_inplace_io(struct z_erofs_collector *clt,
+ struct page *page)
+{
+ struct z_erofs_pcluster *const pcl = clt->pcl;
+ const unsigned int clusterpages = BIT(pcl->clusterbits);
+
+ while (clt->compressedpages < pcl->compressed_pages + clusterpages) {
+ if (!cmpxchg(clt->compressedpages++, NULL, page))
+ return true;
+ }
+ return false;
+}
+
+/* callers must hold the collection lock */
+static int z_erofs_attach_page(struct z_erofs_collector *clt,
+ struct page *page,
+ enum z_erofs_page_type type)
+{
+ int ret;
+ bool occupied;
+
+ /* give priority to in-place I/O */
+ if (clt->mode >= COLLECT_PRIMARY &&
+ type == Z_EROFS_PAGE_TYPE_EXCLUSIVE &&
+ try_inplace_io(clt, page))
+ return 0;
+
+ ret = z_erofs_pagevec_enqueue(&clt->vector,
+ page, type, &occupied);
+ clt->cl->vcnt += (unsigned int)ret;
+
+ return ret ? 0 : -EAGAIN;
+}
+
+static enum z_erofs_collectmode
+try_to_claim_pcluster(struct z_erofs_pcluster *pcl,
+ z_erofs_next_pcluster_t *owned_head)
+{
+ /* let's claim the following types of pclusters */
+retry:
+ if (pcl->next == Z_EROFS_PCLUSTER_NIL) {
+ /* type 1, nil pcluster */
+ if (cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_NIL,
+ *owned_head) != Z_EROFS_PCLUSTER_NIL)
+ goto retry;
+
+ *owned_head = &pcl->next;
+ /* lucky, I am the followee :) */
+ return COLLECT_PRIMARY_FOLLOWED;
+ } else if (pcl->next == Z_EROFS_PCLUSTER_TAIL) {
+ /*
+ * type 2, link to the end of an existing open chain,
+ * be careful that its submission itself is governed
+ * by the original owned chain.
+ */
+ if (cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
+ *owned_head) != Z_EROFS_PCLUSTER_TAIL)
+ goto retry;
+ *owned_head = Z_EROFS_PCLUSTER_TAIL;
+ return COLLECT_PRIMARY_HOOKED;
+ }
+ return COLLECT_PRIMARY; /* :( better luck next time */
+}
+
+static struct z_erofs_collection *cllookup(struct z_erofs_collector *clt,
+ struct inode *inode,
+ struct erofs_map_blocks *map)
+{
+ struct erofs_workgroup *grp;
+ struct z_erofs_pcluster *pcl;
+ struct z_erofs_collection *cl;
+ unsigned int length;
+ bool tag;
+
+ grp = erofs_find_workgroup(inode->i_sb, map->m_pa >> PAGE_SHIFT, &tag);
+ if (!grp)
+ return NULL;
+
+ pcl = container_of(grp, struct z_erofs_pcluster, obj);
+
+ cl = z_erofs_primarycollection(pcl);
+ if (unlikely(cl->pageofs != (map->m_la & ~PAGE_MASK))) {
+ DBG_BUGON(1);
+ return ERR_PTR(-EIO);
+ }
+
+ length = READ_ONCE(pcl->length);
+ if (length & Z_EROFS_PCLUSTER_FULL_LENGTH) {
+ if ((map->m_llen << Z_EROFS_PCLUSTER_LENGTH_BIT) > length) {
+ DBG_BUGON(1);
+ return ERR_PTR(-EIO);
+ }
+ } else {
+ unsigned int llen = map->m_llen << Z_EROFS_PCLUSTER_LENGTH_BIT;
+
+ if (map->m_flags & EROFS_MAP_FULL_MAPPED)
+ llen |= Z_EROFS_PCLUSTER_FULL_LENGTH;
+
+ while (llen > length &&
+ length != cmpxchg_relaxed(&pcl->length, length, llen)) {
+ cpu_relax();
+ length = READ_ONCE(pcl->length);
+ }
+ }
+ mutex_lock(&cl->lock);
+ clt->mode = try_to_claim_pcluster(pcl, &clt->owned_head);
+ clt->pcl = pcl;
+ clt->cl = cl;
+ return cl;
+}
+
+static struct z_erofs_collection *clregister(struct z_erofs_collector *clt,
+ struct inode *inode,
+ struct erofs_map_blocks *map)
+{
+ struct z_erofs_pcluster *pcl;
+ struct z_erofs_collection *cl;
+ int err;
+
+ /* no available workgroup, let's allocate one */
+ pcl = kmem_cache_alloc(pcluster_cachep, GFP_NOFS);
+ if (unlikely(!pcl))
+ return ERR_PTR(-ENOMEM);
+
+ init_always(pcl);
+ pcl->obj.index = map->m_pa >> PAGE_SHIFT;
+
+ pcl->length = (map->m_llen << Z_EROFS_PCLUSTER_LENGTH_BIT) |
+ (map->m_flags & EROFS_MAP_FULL_MAPPED ?
+ Z_EROFS_PCLUSTER_FULL_LENGTH : 0);
+
+ if (map->m_flags & EROFS_MAP_ZIPPED)
+ pcl->algorithmformat = Z_EROFS_COMPRESSION_LZ4;
+ else
+ pcl->algorithmformat = Z_EROFS_COMPRESSION_SHIFTED;
+
+ pcl->clusterbits = EROFS_V(inode)->z_physical_clusterbits[0];
+ pcl->clusterbits -= PAGE_SHIFT;
+
+ /* new pclusters should be claimed as type 1, primary and followed */
+ pcl->next = clt->owned_head;
+ clt->mode = COLLECT_PRIMARY_FOLLOWED;
+
+ cl = z_erofs_primarycollection(pcl);
+ cl->pageofs = map->m_la & ~PAGE_MASK;
+
+ /*
+ * lock all primary followed works before they become visible to others
+ * and mutex_trylock *never* fails for a new pcluster.
+ */
+ mutex_trylock(&cl->lock);
+
+ err = erofs_register_workgroup(inode->i_sb, &pcl->obj, 0);
+ if (err) {
+ mutex_unlock(&cl->lock);
+ kmem_cache_free(pcluster_cachep, pcl);
+ return ERR_PTR(-EAGAIN);
+ }
+ clt->owned_head = &pcl->next;
+ clt->pcl = pcl;
+ clt->cl = cl;
+ return cl;
+}
+
+static int z_erofs_collector_begin(struct z_erofs_collector *clt,
+ struct inode *inode,
+ struct erofs_map_blocks *map)
+{
+ struct z_erofs_collection *cl;
+
+ DBG_BUGON(clt->cl);
+
+ /* must be Z_EROFS_PCLUSTER_TAIL or pointed to previous collection */
+ DBG_BUGON(clt->owned_head == Z_EROFS_PCLUSTER_NIL);
+ DBG_BUGON(clt->owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
+
+ if (!PAGE_ALIGNED(map->m_pa)) {
+ DBG_BUGON(1);
+ return -EINVAL;
+ }
+
+repeat:
+ cl = cllookup(clt, inode, map);
+ if (!cl) {
+ cl = clregister(clt, inode, map);
+
+ if (unlikely(cl == ERR_PTR(-EAGAIN)))
+ goto repeat;
+ }
+
+ if (IS_ERR(cl))
+ return PTR_ERR(cl);
+
+ z_erofs_pagevec_ctor_init(&clt->vector, Z_EROFS_NR_INLINE_PAGEVECS,
+ cl->pagevec, cl->vcnt);
+
+ clt->compressedpages = clt->pcl->compressed_pages;
+ if (clt->mode <= COLLECT_PRIMARY) /* cannot do in-place I/O */
+ clt->compressedpages += Z_EROFS_CLUSTER_MAX_PAGES;
+ return 0;
+}
+
+/*
+ * keep in mind that referenced pclusters will be freed
+ * only after an RCU grace period.
+ */
+static void z_erofs_rcu_callback(struct rcu_head *head)
+{
+ struct z_erofs_collection *const cl =
+ container_of(head, struct z_erofs_collection, rcu);
+
+ kmem_cache_free(pcluster_cachep,
+ container_of(cl, struct z_erofs_pcluster,
+ primary_collection));
+}
+
+void erofs_workgroup_free_rcu(struct erofs_workgroup *grp)
+{
+ struct z_erofs_pcluster *const pcl =
+ container_of(grp, struct z_erofs_pcluster, obj);
+ struct z_erofs_collection *const cl = z_erofs_primarycollection(pcl);
+
+ call_rcu(&cl->rcu, z_erofs_rcu_callback);
+}
+
+static void z_erofs_collection_put(struct z_erofs_collection *cl)
+{
+ struct z_erofs_pcluster *const pcl =
+ container_of(cl, struct z_erofs_pcluster, primary_collection);
+
+ erofs_workgroup_put(&pcl->obj);
+}
+
+static bool z_erofs_collector_end(struct z_erofs_collector *clt)
+{
+ struct z_erofs_collection *cl = clt->cl;
+
+ if (!cl)
+ return false;
+
+ z_erofs_pagevec_ctor_exit(&clt->vector, false);
+ mutex_unlock(&cl->lock);
+
+ /*
+ * if all pending pages are added, don't hold its reference
+ * any longer if the pcluster isn't hosted by ourselves.
+ */
+ if (clt->mode < COLLECT_PRIMARY_FOLLOWED_NOINPLACE)
+ z_erofs_collection_put(cl);
+
+ clt->cl = NULL;
+ return true;
+}
+
+static inline struct page *__stagingpage_alloc(struct list_head *pagepool,
+ gfp_t gfp)
+{
+ struct page *page = erofs_allocpage(pagepool, gfp, true);
+
+ page->mapping = Z_EROFS_MAPPING_STAGING;
+ return page;
+}
+
+static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
+ struct page *page,
+ struct list_head *pagepool)
+{
+ struct inode *const inode = fe->inode;
+ struct erofs_sb_info *const sbi __maybe_unused = EROFS_I_SB(inode);
+ struct erofs_map_blocks *const map = &fe->map;
+ struct z_erofs_collector *const clt = &fe->clt;
+ const loff_t offset = page_offset(page);
+ bool tight = (clt->mode >= COLLECT_PRIMARY_HOOKED);
+
+ enum z_erofs_cache_alloctype cache_strategy;
+ enum z_erofs_page_type page_type;
+ unsigned int cur, end, spiltted, index;
+ int err = 0;
+
+ /* register locked file pages as online pages in pack */
+ z_erofs_onlinepage_init(page);
+
+ spiltted = 0;
+ end = PAGE_SIZE;
+repeat:
+ cur = end - 1;
+
+ /* lucky, within the range of the current map_blocks */
+ if (offset + cur >= map->m_la &&
+ offset + cur < map->m_la + map->m_llen) {
+ /* didn't get a valid collection previously (very rare) */
+ if (!clt->cl)
+ goto restart_now;
+ goto hitted;
+ }
+
+ /* go ahead the next map_blocks */
+ debugln("%s: [out-of-range] pos %llu", __func__, offset + cur);
+
+ if (z_erofs_collector_end(clt))
+ fe->backmost = false;
+
+ map->m_la = offset + cur;
+ map->m_llen = 0;
+ err = z_erofs_map_blocks_iter(inode, map, 0);
+ if (unlikely(err))
+ goto err_out;
+
+restart_now:
+ if (unlikely(!(map->m_flags & EROFS_MAP_MAPPED)))
+ goto hitted;
+
+ err = z_erofs_collector_begin(clt, inode, map);
+ if (unlikely(err))
+ goto err_out;
+
+ /* preload all compressed pages (maybe downgrade role if necessary) */
+ preload_compressed_pages(clt, MNGD_MAPPING(sbi), DONTALLOC, pagepool);
+
+ tight &= (clt->mode >= COLLECT_PRIMARY_HOOKED);
+hitted:
+ cur = end - min_t(unsigned int, offset + end - map->m_la, end);
+ if (unlikely(!(map->m_flags & EROFS_MAP_MAPPED))) {
+ zero_user_segment(page, cur, end);
+ goto next_part;
+ }
+
+ /* let's derive page type */
+ page_type = cur ? Z_EROFS_VLE_PAGE_TYPE_HEAD :
+ (!spiltted ? Z_EROFS_PAGE_TYPE_EXCLUSIVE :
+ (tight ? Z_EROFS_PAGE_TYPE_EXCLUSIVE :
+ Z_EROFS_VLE_PAGE_TYPE_TAIL_SHARED));
+
+ if (cur)
+ tight &= (clt->mode >= COLLECT_PRIMARY_FOLLOWED);
+
+retry:
+ err = z_erofs_attach_page(clt, page, page_type);
+ /* should allocate an additional staging page for pagevec */
+ if (err == -EAGAIN) {
+ struct page *const newpage =
+ __stagingpage_alloc(pagepool, GFP_NOFS);
+
+ err = z_erofs_attach_page(clt, newpage,
+ Z_EROFS_PAGE_TYPE_EXCLUSIVE);
+ if (likely(!err))
+ goto retry;
+ }
+
+ if (unlikely(err))
+ goto err_out;
+
+ index = page->index - (map->m_la >> PAGE_SHIFT);
+
+ z_erofs_onlinepage_fixup(page, index, true);
+
+ /* bump up the number of spiltted parts of a page */
+ ++spiltted;
+ /* also update nr_pages */
+ clt->cl->nr_pages = max_t(pgoff_t, clt->cl->nr_pages, index + 1);
+next_part:
+ /* can be used for verification */
+ map->m_llen = offset + cur - map->m_la;
+
+ end = cur;
+ if (end > 0)
+ goto repeat;
+
+out:
+ z_erofs_onlinepage_endio(page);
+
+ debugln("%s, finish page: %pK spiltted: %u map->m_llen %llu",
+ __func__, page, spiltted, map->m_llen);
+ return err;
+
+ /* if some error occurred while processing this page */
+err_out:
+ SetPageError(page);
+ goto out;
+}
+
+static void z_erofs_vle_unzip_kickoff(void *ptr, int bios)
+{
+ tagptr1_t t = tagptr_init(tagptr1_t, ptr);
+ struct z_erofs_unzip_io *io = tagptr_unfold_ptr(t);
+ bool background = tagptr_unfold_tags(t);
+
+ if (!background) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&io->u.wait.lock, flags);
+ if (!atomic_add_return(bios, &io->pending_bios))
+ wake_up_locked(&io->u.wait);
+ spin_unlock_irqrestore(&io->u.wait.lock, flags);
+ return;
+ }
+
+ if (!atomic_add_return(bios, &io->pending_bios))
+ queue_work(z_erofs_workqueue, &io->u.work);
+}
+
+static inline void z_erofs_vle_read_endio(struct bio *bio)
+{
+ struct erofs_sb_info *sbi = NULL;
+ blk_status_t err = bio->bi_status;
+ struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
+
+ bio_for_each_segment_all(bvec, bio, iter_all) {
+ struct page *page = bvec->bv_page;
+ bool cachemngd = false;
+
+ DBG_BUGON(PageUptodate(page));
+ DBG_BUGON(!page->mapping);
+
+ if (unlikely(!sbi && !z_erofs_page_is_staging(page))) {
+ sbi = EROFS_SB(page->mapping->host->i_sb);
+
+ if (time_to_inject(sbi, FAULT_READ_IO)) {
+ erofs_show_injection_info(FAULT_READ_IO);
+ err = BLK_STS_IOERR;
+ }
+ }
+
+ /* sbi should already have been obtained if the page is managed */
+ if (sbi)
+ cachemngd = erofs_page_is_managed(sbi, page);
+
+ if (unlikely(err))
+ SetPageError(page);
+ else if (cachemngd)
+ SetPageUptodate(page);
+
+ if (cachemngd)
+ unlock_page(page);
+ }
+
+ z_erofs_vle_unzip_kickoff(bio->bi_private, -1);
+ bio_put(bio);
+}
+
+static int z_erofs_decompress_pcluster(struct super_block *sb,
+ struct z_erofs_pcluster *pcl,
+ struct list_head *pagepool)
+{
+ struct erofs_sb_info *const sbi = EROFS_SB(sb);
+ const unsigned int clusterpages = BIT(pcl->clusterbits);
+ struct z_erofs_pagevec_ctor ctor;
+ unsigned int i, outputsize, llen, nr_pages;
+ struct page *pages_onstack[Z_EROFS_VMAP_ONSTACK_PAGES];
+ struct page **pages, **compressed_pages, *page;
+
+ enum z_erofs_page_type page_type;
+ bool overlapped, partial;
+ struct z_erofs_collection *cl;
+ int err;
+
+ might_sleep();
+ cl = z_erofs_primarycollection(pcl);
+ DBG_BUGON(!READ_ONCE(cl->nr_pages));
+
+ mutex_lock(&cl->lock);
+ nr_pages = cl->nr_pages;
+
+ if (likely(nr_pages <= Z_EROFS_VMAP_ONSTACK_PAGES)) {
+ pages = pages_onstack;
+ } else if (nr_pages <= Z_EROFS_VMAP_GLOBAL_PAGES &&
+ mutex_trylock(&z_pagemap_global_lock)) {
+ pages = z_pagemap_global;
+ } else {
+ gfp_t gfp_flags = GFP_KERNEL;
+
+ if (nr_pages > Z_EROFS_VMAP_GLOBAL_PAGES)
+ gfp_flags |= __GFP_NOFAIL;
+
+ pages = kvmalloc_array(nr_pages, sizeof(struct page *),
+ gfp_flags);
+
+ /* fallback to global pagemap for the lowmem scenario */
+ if (unlikely(!pages)) {
+ mutex_lock(&z_pagemap_global_lock);
+ pages = z_pagemap_global;
+ }
+ }
+
+ for (i = 0; i < nr_pages; ++i)
+ pages[i] = NULL;
+
+ z_erofs_pagevec_ctor_init(&ctor, Z_EROFS_NR_INLINE_PAGEVECS,
+ cl->pagevec, 0);
+
+ for (i = 0; i < cl->vcnt; ++i) {
+ unsigned int pagenr;
+
+ page = z_erofs_pagevec_dequeue(&ctor, &page_type);
+
+ /* all pages in pagevec ought to be valid */
+ DBG_BUGON(!page);
+ DBG_BUGON(!page->mapping);
+
+ if (z_erofs_put_stagingpage(pagepool, page))
+ continue;
+
+ if (page_type == Z_EROFS_VLE_PAGE_TYPE_HEAD)
+ pagenr = 0;
+ else
+ pagenr = z_erofs_onlinepage_index(page);
+
+ DBG_BUGON(pagenr >= nr_pages);
+ DBG_BUGON(pages[pagenr]);
+
+ pages[pagenr] = page;
+ }
+ z_erofs_pagevec_ctor_exit(&ctor, true);
+
+ overlapped = false;
+ compressed_pages = pcl->compressed_pages;
+
+ err = 0;
+ for (i = 0; i < clusterpages; ++i) {
+ unsigned int pagenr;
+
+ page = compressed_pages[i];
+
+ /* all compressed pages ought to be valid */
+ DBG_BUGON(!page);
+ DBG_BUGON(!page->mapping);
+
+ if (!z_erofs_page_is_staging(page)) {
+ if (erofs_page_is_managed(sbi, page)) {
+ if (unlikely(!PageUptodate(page)))
+ err = -EIO;
+ continue;
+ }
+
+ /*
+ * only non-head pages can be selected
+ * for inplace decompression
+ */
+ pagenr = z_erofs_onlinepage_index(page);
+
+ DBG_BUGON(pagenr >= nr_pages);
+ DBG_BUGON(pages[pagenr]);
+ pages[pagenr] = page;
+
+ overlapped = true;
+ }
+
+ /* PG_error needs checking for inplaced and staging pages */
+ if (unlikely(PageError(page))) {
+ DBG_BUGON(PageUptodate(page));
+ err = -EIO;
+ }
+ }
+
+ if (unlikely(err))
+ goto out;
+
+ llen = pcl->length >> Z_EROFS_PCLUSTER_LENGTH_BIT;
+ if (nr_pages << PAGE_SHIFT >= cl->pageofs + llen) {
+ outputsize = llen;
+ partial = !(pcl->length & Z_EROFS_PCLUSTER_FULL_LENGTH);
+ } else {
+ outputsize = (nr_pages << PAGE_SHIFT) - cl->pageofs;
+ partial = true;
+ }
+
+ err = z_erofs_decompress(&(struct z_erofs_decompress_req) {
+ .sb = sb,
+ .in = compressed_pages,
+ .out = pages,
+ .pageofs_out = cl->pageofs,
+ .inputsize = PAGE_SIZE,
+ .outputsize = outputsize,
+ .alg = pcl->algorithmformat,
+ .inplace_io = overlapped,
+ .partial_decoding = partial
+ }, pagepool);
+
+out:
+ /* must handle all compressed pages before ending pages */
+ for (i = 0; i < clusterpages; ++i) {
+ page = compressed_pages[i];
+
+ if (erofs_page_is_managed(sbi, page))
+ continue;
+
+ /* recycle all individual staging pages */
+ (void)z_erofs_put_stagingpage(pagepool, page);
+
+ WRITE_ONCE(compressed_pages[i], NULL);
+ }
+
+ for (i = 0; i < nr_pages; ++i) {
+ page = pages[i];
+ if (!page)
+ continue;
+
+ DBG_BUGON(!page->mapping);
+
+ /* recycle all individual staging pages */
+ if (z_erofs_put_stagingpage(pagepool, page))
+ continue;
+
+ if (unlikely(err < 0))
+ SetPageError(page);
+
+ z_erofs_onlinepage_endio(page);
+ }
+
+ if (pages == z_pagemap_global)
+ mutex_unlock(&z_pagemap_global_lock);
+ else if (unlikely(pages != pages_onstack))
+ kvfree(pages);
+
+ cl->nr_pages = 0;
+ cl->vcnt = 0;
+
+ /* all cl locks MUST be taken before the following line */
+ WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_NIL);
+
+ /* all cl locks SHOULD be released right now */
+ mutex_unlock(&cl->lock);
+
+ z_erofs_collection_put(cl);
+ return err;
+}
+
+static void z_erofs_vle_unzip_all(struct super_block *sb,
+ struct z_erofs_unzip_io *io,
+ struct list_head *pagepool)
+{
+ z_erofs_next_pcluster_t owned = io->head;
+
+ while (owned != Z_EROFS_PCLUSTER_TAIL_CLOSED) {
+ struct z_erofs_pcluster *pcl;
+
+ /* impossible that 'owned' equals Z_EROFS_WORK_TPTR_TAIL */
+ DBG_BUGON(owned == Z_EROFS_PCLUSTER_TAIL);
+
+ /* impossible that 'owned' equals NULL */
+ DBG_BUGON(owned == Z_EROFS_PCLUSTER_NIL);
+
+ pcl = container_of(owned, struct z_erofs_pcluster, next);
+ owned = READ_ONCE(pcl->next);
+
+ z_erofs_decompress_pcluster(sb, pcl, pagepool);
+ }
+}
+
+static void z_erofs_vle_unzip_wq(struct work_struct *work)
+{
+ struct z_erofs_unzip_io_sb *iosb =
+ container_of(work, struct z_erofs_unzip_io_sb, io.u.work);
+ LIST_HEAD(pagepool);
+
+ DBG_BUGON(iosb->io.head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
+ z_erofs_vle_unzip_all(iosb->sb, &iosb->io, &pagepool);
+
+ put_pages_list(&pagepool);
+ kvfree(iosb);
+}
+
+static struct page *pickup_page_for_submission(struct z_erofs_pcluster *pcl,
+ unsigned int nr,
+ struct list_head *pagepool,
+ struct address_space *mc,
+ gfp_t gfp)
+{
+ /* determined at compile time to avoid too many #ifdefs */
+ const bool nocache = __builtin_constant_p(mc) ? !mc : false;
+ const pgoff_t index = pcl->obj.index;
+ bool tocache = false;
+
+ struct address_space *mapping;
+ struct page *oldpage, *page;
+
+ compressed_page_t t;
+ int justfound;
+
+repeat:
+ page = READ_ONCE(pcl->compressed_pages[nr]);
+ oldpage = page;
+
+ if (!page)
+ goto out_allocpage;
+
+ /*
+ * the cached page has not been allocated and
+ * a placeholder is out there, prepare it now.
+ */
+ if (!nocache && page == PAGE_UNALLOCATED) {
+ tocache = true;
+ goto out_allocpage;
+ }
+
+ /* process the target tagged pointer */
+ t = tagptr_init(compressed_page_t, page);
+ justfound = tagptr_unfold_tags(t);
+ page = tagptr_unfold_ptr(t);
+
+ mapping = READ_ONCE(page->mapping);
+
+ /*
+ * if managed cache is disabled, there is no way to
+ * get such a cached-like page.
+ */
+ if (nocache) {
+ /* if managed cache is disabled, `justfound' is impossible */
+ DBG_BUGON(justfound);
+
+ /* and it should be locked, not uptodate, and not truncated */
+ DBG_BUGON(!PageLocked(page));
+ DBG_BUGON(PageUptodate(page));
+ DBG_BUGON(!mapping);
+ goto out;
+ }
+
+ /*
+ * unmanaged (file) pages are all locked solidly,
+ * therefore it is impossible for `mapping' to be NULL.
+ */
+ if (mapping && mapping != mc)
+ /* ought to be unmanaged pages */
+ goto out;
+
+ lock_page(page);
+
+ /* only true if page reclaim goes wrong, should never happen */
+ DBG_BUGON(justfound && PagePrivate(page));
+
+ /* the page is still in manage cache */
+ if (page->mapping == mc) {
+ WRITE_ONCE(pcl->compressed_pages[nr], page);
+
+ ClearPageError(page);
+ if (!PagePrivate(page)) {
+ /*
+ * impossible to be !PagePrivate(page) for
+ * the current restriction as well if
+ * the page is already in compressed_pages[].
+ */
+ DBG_BUGON(!justfound);
+
+ justfound = 0;
+ set_page_private(page, (unsigned long)pcl);
+ SetPagePrivate(page);
+ }
+
+ /* no need to submit io if it is already up-to-date */
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ page = NULL;
+ }
+ goto out;
+ }
+
+ /*
+ * the managed page has been truncated, it's unsafe to
+ * reuse this one, let's allocate a new cache-managed page.
+ */
+ DBG_BUGON(page->mapping);
+ DBG_BUGON(!justfound);
+
+ tocache = true;
+ unlock_page(page);
+ put_page(page);
+out_allocpage:
+ page = __stagingpage_alloc(pagepool, gfp);
+ if (oldpage != cmpxchg(&pcl->compressed_pages[nr], oldpage, page)) {
+ list_add(&page->lru, pagepool);
+ cpu_relax();
+ goto repeat;
+ }
+ if (nocache || !tocache)
+ goto out;
+ if (add_to_page_cache_lru(page, mc, index + nr, gfp)) {
+ page->mapping = Z_EROFS_MAPPING_STAGING;
+ goto out;
+ }
+
+ set_page_private(page, (unsigned long)pcl);
+ SetPagePrivate(page);
+out: /* the only exit (for tracing and debugging) */
+ return page;
+}
+
+static struct z_erofs_unzip_io *jobqueue_init(struct super_block *sb,
+ struct z_erofs_unzip_io *io,
+ bool foreground)
+{
+ struct z_erofs_unzip_io_sb *iosb;
+
+ if (foreground) {
+ /* waitqueue available for foreground io */
+ DBG_BUGON(!io);
+
+ init_waitqueue_head(&io->u.wait);
+ atomic_set(&io->pending_bios, 0);
+ goto out;
+ }
+
+ iosb = kvzalloc(sizeof(*iosb), GFP_KERNEL | __GFP_NOFAIL);
+ DBG_BUGON(!iosb);
+
+ /* initialize fields in the allocated descriptor */
+ io = &iosb->io;
+ iosb->sb = sb;
+ INIT_WORK(&io->u.work, z_erofs_vle_unzip_wq);
+out:
+ io->head = Z_EROFS_PCLUSTER_TAIL_CLOSED;
+ return io;
+}
+
+/* define decompression jobqueue types */
+enum {
+ JQ_SUBMIT,
+ NR_JOBQUEUES,
+};
+
+static void *jobqueueset_init(struct super_block *sb,
+ z_erofs_next_pcluster_t qtail[],
+ struct z_erofs_unzip_io *q[],
+ struct z_erofs_unzip_io *fgq,
+ bool forcefg)
+{
+ q[JQ_SUBMIT] = jobqueue_init(sb, fgq + JQ_SUBMIT, forcefg);
+ qtail[JQ_SUBMIT] = &q[JQ_SUBMIT]->head;
+
+ return tagptr_cast_ptr(tagptr_fold(tagptr1_t, q[JQ_SUBMIT], !forcefg));
+}
+
+static void move_to_bypass_jobqueue(struct z_erofs_pcluster *pcl,
+ z_erofs_next_pcluster_t qtail[],
+ z_erofs_next_pcluster_t owned_head)
+{
+ /* impossible to bypass submission with managed cache disabled */
+ DBG_BUGON(1);
+}
+
+static bool postsubmit_is_all_bypassed(struct z_erofs_unzip_io *q[],
+ unsigned int nr_bios,
+ bool force_fg)
+{
+ /* bios should be >0 if managed cache is disabled */
+ DBG_BUGON(!nr_bios);
+ return false;
+}
+
+static bool z_erofs_vle_submit_all(struct super_block *sb,
+ z_erofs_next_pcluster_t owned_head,
+ struct list_head *pagepool,
+ struct z_erofs_unzip_io *fgq,
+ bool force_fg)
+{
+ struct erofs_sb_info *const sbi __maybe_unused = EROFS_SB(sb);
+ z_erofs_next_pcluster_t qtail[NR_JOBQUEUES];
+ struct z_erofs_unzip_io *q[NR_JOBQUEUES];
+ struct bio *bio;
+ void *bi_private;
+ /* since bio will be NULL, no need to initialize last_index */
+ pgoff_t uninitialized_var(last_index);
+ bool force_submit = false;
+ unsigned int nr_bios;
+
+ if (unlikely(owned_head == Z_EROFS_PCLUSTER_TAIL))
+ return false;
+
+ force_submit = false;
+ bio = NULL;
+ nr_bios = 0;
+ bi_private = jobqueueset_init(sb, qtail, q, fgq, force_fg);
+
+ /* by default, all need io submission */
+ q[JQ_SUBMIT]->head = owned_head;
+
+ do {
+ struct z_erofs_pcluster *pcl;
+ unsigned int clusterpages;
+ pgoff_t first_index;
+ struct page *page;
+ unsigned int i = 0, bypass = 0;
+ int err;
+
+ /* impossible that 'owned_head' equals the following */
+ DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
+ DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_NIL);
+
+ pcl = container_of(owned_head, struct z_erofs_pcluster, next);
+
+ clusterpages = BIT(pcl->clusterbits);
+
+ /* close the main owned chain at first */
+ owned_head = cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
+ Z_EROFS_PCLUSTER_TAIL_CLOSED);
+
+ first_index = pcl->obj.index;
+ force_submit |= (first_index != last_index + 1);
+
+repeat:
+ page = pickup_page_for_submission(pcl, i, pagepool,
+ MNGD_MAPPING(sbi),
+ GFP_NOFS);
+ if (!page) {
+ force_submit = true;
+ ++bypass;
+ goto skippage;
+ }
+
+ if (bio && force_submit) {
+submit_bio_retry:
+ __submit_bio(bio, REQ_OP_READ, 0);
+ bio = NULL;
+ }
+
+ if (!bio) {
+ bio = erofs_grab_bio(sb, first_index + i,
+ BIO_MAX_PAGES, bi_private,
+ z_erofs_vle_read_endio, true);
+ ++nr_bios;
+ }
+
+ err = bio_add_page(bio, page, PAGE_SIZE, 0);
+ if (err < PAGE_SIZE)
+ goto submit_bio_retry;
+
+ force_submit = false;
+ last_index = first_index + i;
+skippage:
+ if (++i < clusterpages)
+ goto repeat;
+
+ if (bypass < clusterpages)
+ qtail[JQ_SUBMIT] = &pcl->next;
+ else
+ move_to_bypass_jobqueue(pcl, qtail, owned_head);
+ } while (owned_head != Z_EROFS_PCLUSTER_TAIL);
+
+ if (bio)
+ __submit_bio(bio, REQ_OP_READ, 0);
+
+ if (postsubmit_is_all_bypassed(q, nr_bios, force_fg))
+ return true;
+
+ z_erofs_vle_unzip_kickoff(bi_private, nr_bios);
+ return true;
+}
+
+static void z_erofs_submit_and_unzip(struct super_block *sb,
+ struct z_erofs_collector *clt,
+ struct list_head *pagepool,
+ bool force_fg)
+{
+ struct z_erofs_unzip_io io[NR_JOBQUEUES];
+
+ if (!z_erofs_vle_submit_all(sb, clt->owned_head,
+ pagepool, io, force_fg))
+ return;
+
+ if (!force_fg)
+ return;
+
+ /* wait until all bios are completed */
+ wait_event(io[JQ_SUBMIT].u.wait,
+ !atomic_read(&io[JQ_SUBMIT].pending_bios));
+
+ /* let's do synchronous decompression */
+ z_erofs_vle_unzip_all(sb, &io[JQ_SUBMIT], pagepool);
+}
+
+static int z_erofs_vle_normalaccess_readpage(struct file *file,
+ struct page *page)
+{
+ struct inode *const inode = page->mapping->host;
+ struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
+ int err;
+ LIST_HEAD(pagepool);
+
+ trace_erofs_readpage(page, false);
+
+ f.headoffset = (erofs_off_t)page->index << PAGE_SHIFT;
+
+ err = z_erofs_do_read_page(&f, page, &pagepool);
+ (void)z_erofs_collector_end(&f.clt);
+
+ if (err) {
+ errln("%s, failed to read, err [%d]", __func__, err);
+ goto out;
+ }
+
+ z_erofs_submit_and_unzip(inode->i_sb, &f.clt, &pagepool, true);
+out:
+ if (f.map.mpage)
+ put_page(f.map.mpage);
+
+ /* clean up the remaining free pages */
+ put_pages_list(&pagepool);
+ return 0;
+}
+
+static bool should_decompress_synchronously(struct erofs_sb_info *sbi,
+ unsigned int nr)
+{
+ return nr <= sbi->max_sync_decompress_pages;
+}
+
+static int z_erofs_vle_normalaccess_readpages(struct file *filp,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned int nr_pages)
+{
+ struct inode *const inode = mapping->host;
+ struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
+
+ bool sync = should_decompress_synchronously(sbi, nr_pages);
+ struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
+ gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
+ struct page *head = NULL;
+ LIST_HEAD(pagepool);
+
+ trace_erofs_readpages(mapping->host, lru_to_page(pages),
+ nr_pages, false);
+
+ f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
+
+ for (; nr_pages; --nr_pages) {
+ struct page *page = lru_to_page(pages);
+
+ prefetchw(&page->flags);
+ list_del(&page->lru);
+
+ /*
+ * A pure asynchronous readahead is indicated if
+ * a PG_readahead marked page is hit first.
+ * Let's also do asynchronous decompression for this case.
+ */
+ sync &= !(PageReadahead(page) && !head);
+
+ if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
+ list_add(&page->lru, &pagepool);
+ continue;
+ }
+
+ set_page_private(page, (unsigned long)head);
+ head = page;
+ }
+
+ while (head) {
+ struct page *page = head;
+ int err;
+
+ /* traversal in reverse order */
+ head = (void *)page_private(page);
+
+ err = z_erofs_do_read_page(&f, page, &pagepool);
+ if (err) {
+ struct erofs_vnode *vi = EROFS_V(inode);
+
+ errln("%s, readahead error at page %lu of nid %llu",
+ __func__, page->index, vi->nid);
+ }
+ put_page(page);
+ }
+
+ (void)z_erofs_collector_end(&f.clt);
+
+ z_erofs_submit_and_unzip(inode->i_sb, &f.clt, &pagepool, sync);
+
+ if (f.map.mpage)
+ put_page(f.map.mpage);
+
+ /* clean up the remaining free pages */
+ put_pages_list(&pagepool);
+ return 0;
+}
+
+const struct address_space_operations z_erofs_vle_normalaccess_aops = {
+ .readpage = z_erofs_vle_normalaccess_readpage,
+ .readpages = z_erofs_vle_normalaccess_readpages,
+};
+
diff --git a/fs/erofs/zdata.h b/fs/erofs/zdata.h
new file mode 100644
index 000000000000..3a82ae933015
--- /dev/null
+++ b/fs/erofs/zdata.h
@@ -0,0 +1,192 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * linux/fs/erofs/zdata.h
+ *
+ * Copyright (C) 2018 HUAWEI, Inc.
+ * http://www.huawei.com/
+ * Created by Gao Xiang <[email protected]>
+ */
+#ifndef __EROFS_FS_ZDATA_H
+#define __EROFS_FS_ZDATA_H
+
+#include "internal.h"
+#include "zpvec.h"
+
+#define Z_EROFS_NR_INLINE_PAGEVECS 3
+
+/*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Modifiable by initialization/destruction paths and read-only
+ * for everyone else;
+ *
+ * L: Field should be protected by pageset lock;
+ *
+ * A: Field should be accessed / updated in atomic for parallelized code.
+ */
+struct z_erofs_collection {
+ struct mutex lock;
+
+ /* I: page offset of start position of decompression */
+ unsigned short pageofs;
+
+ /* L: maximum relative page index in pagevec[] */
+ unsigned short nr_pages;
+
+ /* L: total number of pages in pagevec[] */
+ unsigned int vcnt;
+
+ union {
+ /* L: inline a certain number of pagevecs for bootstrap */
+ erofs_vtptr_t pagevec[Z_EROFS_NR_INLINE_PAGEVECS];
+
+ /* I: can be used to free the pcluster by RCU. */
+ struct rcu_head rcu;
+ };
+};
+
+#define Z_EROFS_PCLUSTER_FULL_LENGTH 0x00000001
+#define Z_EROFS_PCLUSTER_LENGTH_BIT 1
+
+/*
+ * let's leave a type here in case another
+ * tagged pointer is introduced later.
+ */
+typedef void *z_erofs_next_pcluster_t;
+
+struct z_erofs_pcluster {
+ struct erofs_workgroup obj;
+ struct z_erofs_collection primary_collection;
+
+ /* A: point to next chained pcluster or TAILs */
+ z_erofs_next_pcluster_t next;
+
+ /* A: compressed pages (including multi-usage pages) */
+ struct page *compressed_pages[Z_EROFS_CLUSTER_MAX_PAGES];
+
+ /* A: lower limit of decompressed length and if full length or not */
+ unsigned int length;
+
+ /* I: compression algorithm format */
+ unsigned char algorithmformat;
+ /* I: bit shift of physical cluster size */
+ unsigned char clusterbits;
+};
+
+#define z_erofs_primarycollection(pcluster) (&(pcluster)->primary_collection)
+
+/* let's avoid the valid 32-bit kernel addresses */
+
+/* the chained workgroup hasn't submitted io (still open) */
+#define Z_EROFS_PCLUSTER_TAIL ((void *)0x5F0ECAFE)
+/* the chained workgroup has already submitted io */
+#define Z_EROFS_PCLUSTER_TAIL_CLOSED ((void *)0x5F0EDEAD)
+
+#define Z_EROFS_PCLUSTER_NIL (NULL)
+
+#define Z_EROFS_WORKGROUP_SIZE sizeof(struct z_erofs_pcluster)
+
+struct z_erofs_unzip_io {
+ atomic_t pending_bios;
+ z_erofs_next_pcluster_t head;
+
+ union {
+ wait_queue_head_t wait;
+ struct work_struct work;
+ } u;
+};
+
+struct z_erofs_unzip_io_sb {
+ struct z_erofs_unzip_io io;
+ struct super_block *sb;
+};
+
+#define MNGD_MAPPING(sbi) (NULL)
+static inline bool erofs_page_is_managed(const struct erofs_sb_info *sbi,
+ struct page *page) { return false; }
+
+#define Z_EROFS_ONLINEPAGE_COUNT_BITS 2
+#define Z_EROFS_ONLINEPAGE_COUNT_MASK ((1 << Z_EROFS_ONLINEPAGE_COUNT_BITS) - 1)
+#define Z_EROFS_ONLINEPAGE_INDEX_SHIFT (Z_EROFS_ONLINEPAGE_COUNT_BITS)
+
+/*
+ * waiters (aka. ongoing_packs): # to unlock the page
+ * sub-index: 0 - for partial page, >= 1 full page sub-index
+ */
+typedef atomic_t z_erofs_onlinepage_t;
+
+/* type punning */
+union z_erofs_onlinepage_converter {
+ z_erofs_onlinepage_t *o;
+ unsigned long *v;
+};
+
+static inline unsigned int z_erofs_onlinepage_index(struct page *page)
+{
+ union z_erofs_onlinepage_converter u;
+
+ DBG_BUGON(!PagePrivate(page));
+ u.v = &page_private(page);
+
+ return atomic_read(u.o) >> Z_EROFS_ONLINEPAGE_INDEX_SHIFT;
+}
+
+static inline void z_erofs_onlinepage_init(struct page *page)
+{
+ union {
+ z_erofs_onlinepage_t o;
+ unsigned long v;
+ /* keep from being unlocked in advance */
+ } u = { .o = ATOMIC_INIT(1) };
+
+ set_page_private(page, u.v);
+ smp_wmb();
+ SetPagePrivate(page);
+}
+
+static inline void z_erofs_onlinepage_fixup(struct page *page,
+ uintptr_t index, bool down)
+{
+ unsigned long *p, o, v, id;
+repeat:
+ p = &page_private(page);
+ o = READ_ONCE(*p);
+
+ id = o >> Z_EROFS_ONLINEPAGE_INDEX_SHIFT;
+ if (id) {
+ if (!index)
+ return;
+
+ DBG_BUGON(id != index);
+ }
+
+ v = (index << Z_EROFS_ONLINEPAGE_INDEX_SHIFT) |
+ ((o & Z_EROFS_ONLINEPAGE_COUNT_MASK) + (unsigned int)down);
+ if (cmpxchg(p, o, v) != o)
+ goto repeat;
+}
+
+static inline void z_erofs_onlinepage_endio(struct page *page)
+{
+ union z_erofs_onlinepage_converter u;
+ unsigned int v;
+
+ DBG_BUGON(!PagePrivate(page));
+ u.v = &page_private(page);
+
+ v = atomic_dec_return(u.o);
+ if (!(v & Z_EROFS_ONLINEPAGE_COUNT_MASK)) {
+ ClearPagePrivate(page);
+ if (!PageError(page))
+ SetPageUptodate(page);
+ unlock_page(page);
+ }
+ debugln("%s, page %p value %x", __func__, page, atomic_read(u.o));
+}
+
+#define Z_EROFS_VMAP_ONSTACK_PAGES \
+ min_t(unsigned int, THREAD_SIZE / 8 / sizeof(struct page *), 96U)
+#define Z_EROFS_VMAP_GLOBAL_PAGES 2048
+
+#endif
+
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index f1924b4e9f36..8d88cbc00a0d 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -23,7 +23,9 @@ int z_erofs_fill_inode(struct inode *inode)
vi->z_physical_clusterbits[1] = vi->z_logical_clusterbits;
set_bit(EROFS_V_Z_INITED_BIT, &vi->flags);
}
- return -EOPNOTSUPP;
+
+ inode->i_mapping->a_ops = &z_erofs_vle_normalaccess_aops;
+ return 0;
}

static int fill_inode_lazy(struct inode *inode)
--
2.17.1

2019-08-15 05:05:19

by Gao Xiang

Subject: [PATCH v8 02/24] erofs: add erofs in-memory stuffs

- erofs_sb_info:
contains erofs-specific in-memory information.

- erofs_vnode:
contains the vfs_inode and other fs-specific information.
As with the super block, only one in-memory definition exists.

- erofs_map_blocks:
Logical to physical block mapping, used by erofs_map_blocks().
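
As a quick worked example of how these pieces locate an on-disk
inode (a standalone sketch mirroring the iloc() helper in the diff
below; all sample numbers are made up):

#include <stdio.h>

typedef unsigned long long erofs_off_t;

int main(void)
{
	/* all sample values below are made up for illustration */
	unsigned int blkszbits = 12;	/* 4KiB blocks */
	unsigned int meta_blkaddr = 2;	/* start of the meta area */
	unsigned int islotbits = 5;	/* 32-byte inode slots */
	unsigned long long nid = 36;

	/* blknr_to_addr(sbi->meta_blkaddr) + (nid << sbi->islotbits) */
	erofs_off_t addr = ((erofs_off_t)meta_blkaddr << blkszbits) +
			   (nid << islotbits);

	printf("nid %llu -> byte offset %llu\n", nid, addr); /* 9344 */
	return 0;
}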

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/internal.h | 357 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 357 insertions(+)
create mode 100644 fs/erofs/internal.h

diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
new file mode 100644
index 000000000000..5b946eabc239
--- /dev/null
+++ b/fs/erofs/internal.h
@@ -0,0 +1,357 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * linux/fs/erofs/internal.h
+ *
+ * Copyright (C) 2017-2018 HUAWEI, Inc.
+ * http://www.huawei.com/
+ * Created by Gao Xiang <[email protected]>
+ */
+#ifndef __EROFS_INTERNAL_H
+#define __EROFS_INTERNAL_H
+
+#include <linux/fs.h>
+#include <linux/dcache.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/magic.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include "erofs_fs.h"
+
+/* redefine pr_fmt "erofs: " */
+#undef pr_fmt
+#define pr_fmt(fmt) "erofs: " fmt
+
+#define errln(x, ...) pr_err(x "\n", ##__VA_ARGS__)
+#define infoln(x, ...) pr_info(x "\n", ##__VA_ARGS__)
+#ifdef CONFIG_EROFS_FS_DEBUG
+#define debugln(x, ...) pr_debug(x "\n", ##__VA_ARGS__)
+#define DBG_BUGON BUG_ON
+#else
+#define debugln(x, ...) ((void)0)
+#define DBG_BUGON(x) ((void)(x))
+#endif /* !CONFIG_EROFS_FS_DEBUG */
+
+enum {
+ FAULT_KMALLOC,
+ FAULT_READ_IO,
+ FAULT_MAX,
+};
+
+#ifdef CONFIG_EROFS_FAULT_INJECTION
+extern const char *erofs_fault_name[FAULT_MAX];
+#define IS_FAULT_SET(fi, type) ((fi)->inject_type & (1 << (type)))
+
+struct erofs_fault_info {
+ atomic_t inject_ops;
+ unsigned int inject_rate;
+ unsigned int inject_type;
+};
+#endif /* CONFIG_EROFS_FAULT_INJECTION */
+
+/* EROFS_SUPER_MAGIC_V1 to represent the whole file system */
+#define EROFS_SUPER_MAGIC EROFS_SUPER_MAGIC_V1
+
+typedef u64 erofs_nid_t;
+typedef u64 erofs_off_t;
+/* data type for filesystem-wide blocks number */
+typedef u32 erofs_blk_t;
+
+struct erofs_sb_info {
+ u32 blocks;
+ u32 meta_blkaddr;
+
+ /* inode slot unit size in bit shift */
+ unsigned char islotbits;
+
+ u32 build_time_nsec;
+ u64 build_time;
+
+ /* what we really care about is nid, rather than ino.. */
+ erofs_nid_t root_nid;
+ /* used for statfs, f_files - f_favail */
+ u64 inos;
+
+ u8 uuid[16]; /* 128-bit uuid for volume */
+ u8 volume_name[16]; /* volume name */
+ u32 requirements;
+
+ unsigned int mount_opt;
+
+#ifdef CONFIG_EROFS_FAULT_INJECTION
+ struct erofs_fault_info fault_info; /* For fault injection */
+#endif
+};
+
+#ifdef CONFIG_EROFS_FAULT_INJECTION
+#define erofs_show_injection_info(type) \
+ infoln("inject %s in %s of %pS", erofs_fault_name[type], \
+ __func__, __builtin_return_address(0))
+
+static inline bool time_to_inject(struct erofs_sb_info *sbi, int type)
+{
+ struct erofs_fault_info *ffi = &sbi->fault_info;
+
+ if (!ffi->inject_rate)
+ return false;
+
+ if (!IS_FAULT_SET(ffi, type))
+ return false;
+
+ atomic_inc(&ffi->inject_ops);
+ if (atomic_read(&ffi->inject_ops) >= ffi->inject_rate) {
+ atomic_set(&ffi->inject_ops, 0);
+ return true;
+ }
+ return false;
+}
+#else
+static inline bool time_to_inject(struct erofs_sb_info *sbi, int type)
+{
+ return false;
+}
+
+static inline void erofs_show_injection_info(int type)
+{
+}
+#endif /* !CONFIG_EROFS_FAULT_INJECTION */
+
+static inline void *erofs_kmalloc(struct erofs_sb_info *sbi,
+ size_t size, gfp_t flags)
+{
+ if (time_to_inject(sbi, FAULT_KMALLOC)) {
+ erofs_show_injection_info(FAULT_KMALLOC);
+ return NULL;
+ }
+ return kmalloc(size, flags);
+}
+
+#define EROFS_SB(sb) ((struct erofs_sb_info *)(sb)->s_fs_info)
+#define EROFS_I_SB(inode) ((struct erofs_sb_info *)(inode)->i_sb->s_fs_info)
+
+/* Mount flags set via mount options or defaults */
+#define EROFS_MOUNT_FAULT_INJECTION 0x00000040
+
+#define clear_opt(sbi, option) ((sbi)->mount_opt &= ~EROFS_MOUNT_##option)
+#define set_opt(sbi, option) ((sbi)->mount_opt |= EROFS_MOUNT_##option)
+#define test_opt(sbi, option) ((sbi)->mount_opt & EROFS_MOUNT_##option)
+
+/* we strictly follow PAGE_SIZE and no buffer head yet */
+#define LOG_BLOCK_SIZE PAGE_SHIFT
+
+#undef LOG_SECTORS_PER_BLOCK
+#define LOG_SECTORS_PER_BLOCK (PAGE_SHIFT - 9)
+
+#undef SECTORS_PER_BLOCK
+#define SECTORS_PER_BLOCK (1 << LOG_SECTORS_PER_BLOCK)
+
+#define EROFS_BLKSIZ (1 << LOG_BLOCK_SIZE)
+
+#if (EROFS_BLKSIZ % 4096 || !EROFS_BLKSIZ)
+#error erofs cannot be used on this platform
+#endif
+
+#define EROFS_IO_MAX_RETRIES_NOFAIL 5
+
+#define ROOT_NID(sb) ((sb)->root_nid)
+
+#define erofs_blknr(addr) ((addr) / EROFS_BLKSIZ)
+#define erofs_blkoff(addr) ((addr) % EROFS_BLKSIZ)
+#define blknr_to_addr(nr) ((erofs_off_t)(nr) * EROFS_BLKSIZ)
+
+static inline erofs_off_t iloc(struct erofs_sb_info *sbi, erofs_nid_t nid)
+{
+ return blknr_to_addr(sbi->meta_blkaddr) + (nid << sbi->islotbits);
+}
+
+struct erofs_vnode {
+ erofs_nid_t nid;
+
+ unsigned char datamode;
+ unsigned char inode_isize;
+ unsigned short xattr_isize;
+
+ erofs_blk_t raw_blkaddr;
+ /* the corresponding vfs inode */
+ struct inode vfs_inode;
+};
+
+#define EROFS_V(ptr) \
+ container_of(ptr, struct erofs_vnode, vfs_inode)
+
+#define __inode_advise(x, bit, bits) \
+ (((x) >> (bit)) & ((1 << (bits)) - 1))
+
+#define __inode_version(advise) \
+ __inode_advise(advise, EROFS_I_VERSION_BIT, \
+ EROFS_I_VERSION_BITS)
+
+#define __inode_data_mapping(advise) \
+ __inode_advise(advise, EROFS_I_DATA_MAPPING_BIT,\
+ EROFS_I_DATA_MAPPING_BITS)
+
+static inline unsigned long inode_datablocks(struct inode *inode)
+{
+ /* since i_size cannot be changed */
+ return DIV_ROUND_UP(inode->i_size, EROFS_BLKSIZ);
+}
+
+static inline bool is_inode_layout_compression(struct inode *inode)
+{
+ return erofs_inode_is_data_compressed(EROFS_V(inode)->datamode);
+}
+
+static inline bool is_inode_flat_inline(struct inode *inode)
+{
+ return EROFS_V(inode)->datamode == EROFS_INODE_FLAT_INLINE;
+}
+
+extern const struct super_operations erofs_sops;
+
+extern const struct address_space_operations erofs_raw_access_aops;
+
+/*
+ * Logical to physical block mapping, used by erofs_map_blocks()
+ *
+ * Different from other filesystems, it is used for 2 access modes:
+ *
+ * 1) RAW access mode:
+ *
+ * Users pass a valid (m_lblk, m_lofs -- usually 0) pair,
+ * and get back the valid m_pblk, m_pofs and the longest m_len (in bytes).
+ *
+ * Note that m_lblk in the RAW access mode refers to the number of
+ * the compressed ondisk block rather than the uncompressed
+ * in-memory block for the compressed file.
+ *
+ * m_pofs equals m_lofs except for the inline data page.
+ *
+ * 2) Normal access mode:
+ *
+ * If the inode is not compressed, there is no difference from
+ * the RAW access mode. However, if the inode is compressed,
+ * users should pass a valid (m_lblk, m_lofs) pair, and get back
+ * the needed m_pblk, m_pofs, m_len to fetch the compressed data
+ * and the updated m_lblk, m_lofs which indicate the start
+ * of the corresponding uncompressed data in the file.
+ */
+enum {
+ BH_Zipped = BH_PrivateStart,
+ BH_FullMapped,
+};
+
+/* Has a disk mapping */
+#define EROFS_MAP_MAPPED (1 << BH_Mapped)
+/* Located in metadata (could be copied from bd_inode) */
+#define EROFS_MAP_META (1 << BH_Meta)
+
+struct erofs_map_blocks {
+ erofs_off_t m_pa, m_la;
+ u64 m_plen, m_llen;
+
+ unsigned int m_flags;
+
+ struct page *mpage;
+};
+
+/* Flags used by erofs_map_blocks() */
+#define EROFS_GET_BLOCKS_RAW 0x0001
+
+/* data.c */
+static inline struct bio *erofs_grab_bio(struct super_block *sb,
+ erofs_blk_t blkaddr,
+ unsigned int nr_pages,
+ void *bi_private, bio_end_io_t endio,
+ bool nofail)
+{
+ const gfp_t gfp = GFP_NOIO;
+ struct bio *bio;
+
+ do {
+ if (nr_pages == 1) {
+ bio = bio_alloc(gfp | (nofail ? __GFP_NOFAIL : 0), 1);
+ if (unlikely(!bio)) {
+ DBG_BUGON(nofail);
+ return ERR_PTR(-ENOMEM);
+ }
+ break;
+ }
+ bio = bio_alloc(gfp, nr_pages);
+ nr_pages /= 2;
+ } while (unlikely(!bio));
+
+ bio->bi_end_io = endio;
+ bio_set_dev(bio, sb->s_bdev);
+ bio->bi_iter.bi_sector = (sector_t)blkaddr << LOG_SECTORS_PER_BLOCK;
+ bio->bi_private = bi_private;
+ return bio;
+}
+
+static inline void __submit_bio(struct bio *bio, unsigned int op,
+ unsigned int op_flags)
+{
+ bio_set_op_attrs(bio, op, op_flags);
+ submit_bio(bio);
+}
+
+struct page *__erofs_get_meta_page(struct super_block *sb, erofs_blk_t blkaddr,
+ bool prio, bool nofail);
+
+static inline struct page *erofs_get_meta_page(struct super_block *sb,
+ erofs_blk_t blkaddr, bool prio)
+{
+ return __erofs_get_meta_page(sb, blkaddr, prio, false);
+}
+
+int erofs_map_blocks(struct inode *, struct erofs_map_blocks *, int);
+
+static inline struct page *erofs_get_inline_page(struct inode *inode,
+ erofs_blk_t blkaddr)
+{
+ return erofs_get_meta_page(inode->i_sb, blkaddr,
+ S_ISDIR(inode->i_mode));
+}
+
+/* inode.c */
+static inline unsigned long erofs_inode_hash(erofs_nid_t nid)
+{
+#if BITS_PER_LONG == 32
+ return (nid >> 32) ^ (nid & 0xffffffff);
+#else
+ return nid;
+#endif
+}
+
+extern const struct inode_operations erofs_generic_iops;
+extern const struct inode_operations erofs_symlink_iops;
+extern const struct inode_operations erofs_fast_symlink_iops;
+
+static inline void set_inode_fast_symlink(struct inode *inode)
+{
+ inode->i_op = &erofs_fast_symlink_iops;
+}
+
+static inline bool is_inode_fast_symlink(struct inode *inode)
+{
+ return inode->i_op == &erofs_fast_symlink_iops;
+}
+
+struct inode *erofs_iget(struct super_block *sb, erofs_nid_t nid, bool dir);
+int erofs_getattr(const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags);
+
+/* namei.c */
+extern const struct inode_operations erofs_dir_iops;
+
+int erofs_namei(struct inode *dir, struct qstr *name,
+ erofs_nid_t *nid, unsigned int *d_type);
+
+/* dir.c */
+extern const struct file_operations erofs_dir_fops;
+
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
+
+#endif /* __EROFS_INTERNAL_H */
+
--
2.17.1

2019-08-15 05:06:39

by Gao Xiang

[permalink] [raw]
Subject: [PATCH v8 12/24] erofs: introduce tagged pointer

Currently the kernel has scattered tagged pointer usages
hacked by hand in plain code, without a unified and
portable set of helpers to mark the tagged pointer
itself and wrap such open-coded operations, cleaning
up the meaningless magic masks spread all over.

This patch introduces simple generic methods to fold
tags into a pointer integer. Currently it supports
tags in the last n bits of the pointer, where n can be
selected by users.

In addition, it will also be used by the upcoming EROFS
filesystem, which heavily uses the tagged pointer approach
to reduce extra memory allocation.
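
For illustration only, a minimal usage sketch of the API added
below; `p' is an assumed pointer whose low bit is free (i.e. the
pointee is at least 2-byte aligned):

    /* fold a pointer and a 1-bit tag into a single word */
    tagptr1_t t = tagptr_fold(tagptr1_t, p, 1);

    /* decode them back */
    void *ptr = tagptr_unfold_ptr(t);        /* == p */
    uintptr_t tag = tagptr_unfold_tags(t);   /* == 1 */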

Link: https://en.wikipedia.org/wiki/Tagged_pointer

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/tagptr.h | 110 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 110 insertions(+)
create mode 100644 fs/erofs/tagptr.h

diff --git a/fs/erofs/tagptr.h b/fs/erofs/tagptr.h
new file mode 100644
index 000000000000..a72897c86744
--- /dev/null
+++ b/fs/erofs/tagptr.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * A tagged pointer implementation
+ *
+ * Copyright (C) 2018 Gao Xiang <[email protected]>
+ */
+#ifndef __EROFS_FS_TAGPTR_H
+#define __EROFS_FS_TAGPTR_H
+
+#include <linux/types.h>
+#include <linux/build_bug.h>
+
+/*
+ * the names of the tagged pointer types are tagptr{1, 2, 3...}_t;
+ * avoid directly using the internal structs __tagptr{1, 2, 3...}
+ */
+#define __MAKE_TAGPTR(n) \
+typedef struct __tagptr##n { \
+ uintptr_t v; \
+} tagptr##n##_t;
+
+__MAKE_TAGPTR(1)
+__MAKE_TAGPTR(2)
+__MAKE_TAGPTR(3)
+__MAKE_TAGPTR(4)
+
+#undef __MAKE_TAGPTR
+
+extern void __compiletime_error("bad tagptr tags")
+ __bad_tagptr_tags(void);
+
+extern void __compiletime_error("bad tagptr type")
+ __bad_tagptr_type(void);
+
+/* catch broken usage such as "#define tagptr2_t tagptr3_t" by users */
+#define __tagptr_mask_1(ptr, n) \
+ __builtin_types_compatible_p(typeof(ptr), struct __tagptr##n) ? \
+ (1UL << (n)) - 1 :
+
+#define __tagptr_mask(ptr) (\
+ __tagptr_mask_1(ptr, 1) ( \
+ __tagptr_mask_1(ptr, 2) ( \
+ __tagptr_mask_1(ptr, 3) ( \
+ __tagptr_mask_1(ptr, 4) ( \
+ __bad_tagptr_type(), 0)))))
+
+/* generate a tagged pointer from a raw value */
+#define tagptr_init(type, val) \
+ ((typeof(type)){ .v = (uintptr_t)(val) })
+
+/*
+ * directly cast a tagged pointer to the native pointer type, which
+ * could be used for backward compatibility of existing code.
+ */
+#define tagptr_cast_ptr(tptr) ((void *)(tptr).v)
+
+/* encode tagged pointers */
+#define tagptr_fold(type, ptr, _tags) ({ \
+ const typeof(_tags) tags = (_tags); \
+ if (__builtin_constant_p(tags) && (tags & ~__tagptr_mask(type))) \
+ __bad_tagptr_tags(); \
+tagptr_init(type, (uintptr_t)(ptr) | tags); })
+
+/* decode tagged pointers */
+#define tagptr_unfold_ptr(tptr) \
+ ((void *)((tptr).v & ~__tagptr_mask(tptr)))
+
+#define tagptr_unfold_tags(tptr) \
+ ((tptr).v & __tagptr_mask(tptr))
+
+/* operations for the tagged pointer */
+#define tagptr_eq(_tptr1, _tptr2) ({ \
+ typeof(_tptr1) tptr1 = (_tptr1); \
+ typeof(_tptr2) tptr2 = (_tptr2); \
+ (void)(&tptr1 == &tptr2); \
+(tptr1).v == (tptr2).v; })
+
+/* lock-free CAS operation */
+#define tagptr_cmpxchg(_ptptr, _o, _n) ({ \
+ typeof(_ptptr) ptptr = (_ptptr); \
+ typeof(_o) o = (_o); \
+ typeof(_n) n = (_n); \
+ (void)(&o == &n); \
+ (void)(&o == ptptr); \
+tagptr_init(o, cmpxchg(&ptptr->v, o.v, n.v)); })
+
+/* wrap WRITE_ONCE if atomic update is needed */
+#define tagptr_replace_tags(_ptptr, tags) ({ \
+ typeof(_ptptr) ptptr = (_ptptr); \
+ *ptptr = tagptr_fold(*ptptr, tagptr_unfold_ptr(*ptptr), tags); \
+*ptptr; })
+
+#define tagptr_set_tags(_ptptr, _tags) ({ \
+ typeof(_ptptr) ptptr = (_ptptr); \
+ const typeof(_tags) tags = (_tags); \
+ if (__builtin_constant_p(tags) && (tags & ~__tagptr_mask(*ptptr))) \
+ __bad_tagptr_tags(); \
+ ptptr->v |= tags; \
+*ptptr; })
+
+#define tagptr_clear_tags(_ptptr, _tags) ({ \
+ typeof(_ptptr) ptptr = (_ptptr); \
+ const typeof(_tags) tags = (_tags); \
+ if (__builtin_constant_p(tags) && (tags & ~__tagptr_mask(*ptptr))) \
+ __bad_tagptr_tags(); \
+ ptptr->v &= ~tags; \
+*ptptr; })
+
+#endif /* __EROFS_FS_TAGPTR_H */
+
--
2.17.1

2019-08-15 05:17:17

by Gao Xiang

[permalink] [raw]
Subject: [PATCH v8 17/24] erofs: introduce per-CPU buffers implementation

This patch introduces per-CPU buffers for the upcoming
generic decompression framework to use.

Note that I tried the in-kernel per-CPU buffer and
per-CPU page approaches to clean this up further, but a
noticeable performance regression (about 2% for
sequential read) was observed.

Let's leave it as-is for now.
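
For reference, a rough sketch of the intended calling convention
(assuming EROFS_PCPUBUF_NR_PAGES > 0; the returned buffer is only
valid while preemption stays disabled):

    void *buf = erofs_get_pcpubuf(0);   /* disables preemption */

    /* ... use up to PAGE_SIZE bytes at buf as scratch space ... */

    erofs_put_pcpubuf(buf);             /* re-enables preemption */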

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/Kconfig | 14 ++++++++++++++
fs/erofs/internal.h | 21 +++++++++++++++++++++
fs/erofs/utils.c | 12 ++++++++++++
3 files changed, 47 insertions(+)

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index a475fbebb831..5f8787c0cf89 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -81,3 +81,17 @@ config EROFS_FS_ZIP

If you don't want to enable the compression feature, say N.

+config EROFS_FS_CLUSTER_PAGE_LIMIT
+ int "EROFS Cluster Pages Hard Limit"
+ depends on EROFS_FS_ZIP
+ range 1 256
+ default "1"
+ help
+ Indicates the maximum number of pages in a compressed
+ physical cluster.
+
+ For example, if files in an image were compressed in
+ 8k units, the hard limit should not be configured to
+ less than 2. Otherwise, the image will refuse to mount
+ on this kernel.
+
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 6a2407fb3013..3222947c9bab 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -222,6 +222,12 @@ static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
return v;
}
#endif /* !CONFIG_SMP */
+
+/* hard limit of pages per compressed cluster */
+#define Z_EROFS_CLUSTER_MAX_PAGES (CONFIG_EROFS_FS_CLUSTER_PAGE_LIMIT)
+#define EROFS_PCPUBUF_NR_PAGES Z_EROFS_CLUSTER_MAX_PAGES
+#else
+#define EROFS_PCPUBUF_NR_PAGES 0
#endif /* !CONFIG_EROFS_FS_ZIP */

/* we strictly follow PAGE_SIZE and no buffer head yet */
@@ -482,6 +488,21 @@ int erofs_namei(struct inode *dir, struct qstr *name,
extern const struct file_operations erofs_dir_fops;

/* utils.c */
+#if (EROFS_PCPUBUF_NR_PAGES > 0)
+void *erofs_get_pcpubuf(unsigned int pagenr);
+#define erofs_put_pcpubuf(buf) do { \
+ (void)&(buf); \
+ preempt_enable(); \
+} while (0)
+#else
+static inline void *erofs_get_pcpubuf(unsigned int pagenr)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+#define erofs_put_pcpubuf(buf) do {} while (0)
+#endif
+
#ifdef CONFIG_EROFS_FS_ZIP
int erofs_workgroup_put(struct erofs_workgroup *grp);
struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
diff --git a/fs/erofs/utils.c b/fs/erofs/utils.c
index 628178261056..f3eed9af24d6 100644
--- a/fs/erofs/utils.c
+++ b/fs/erofs/utils.c
@@ -9,6 +9,18 @@
#include "internal.h"
#include <linux/pagevec.h>

+#if (EROFS_PCPUBUF_NR_PAGES > 0)
+static struct {
+ u8 data[PAGE_SIZE * EROFS_PCPUBUF_NR_PAGES];
+} ____cacheline_aligned_in_smp erofs_pcpubuf[NR_CPUS];
+
+void *erofs_get_pcpubuf(unsigned int pagenr)
+{
+ preempt_disable();
+ return &erofs_pcpubuf[smp_processor_id()].data[pagenr * PAGE_SIZE];
+}
+#endif
+
#ifdef CONFIG_EROFS_FS_ZIP
/* global shrink count (for all mounted EROFS instances) */
static atomic_long_t erofs_global_shrink_cnt;
--
2.17.1

2019-08-15 05:20:42

by Gao Xiang

[permalink] [raw]
Subject: [PATCH v8 16/24] erofs: introduce workstation for decompression

This patch introduces another concept used by the decompression
subsystem called 'workstation'. It can be seen as
a sparse array that stores pointers to the data
structures related to the corresponding physical clusters.

All lookups are protected by the RCU read lock. Besides,
a reference count and a spinlock are also introduced
to manage its lifetime and serialize all update
operations.

The workstation is currently implemented on top of the
in-kernel radix tree for backward compatibility. With the
evolution of the Linux kernel, it will be migrated to the
new XArray implementation in the future.
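
For illustration, a simplified lookup-or-register pattern built on
the helpers below (`index' is an assumed pgoff_t; alloc_workgroup()
is a hypothetical allocation helper; error handling is omitted):

    struct erofs_workgroup *grp;
    bool tag;

    grp = erofs_find_workgroup(sb, index, &tag);
    if (!grp) {
        grp = alloc_workgroup(index);  /* hypothetical helper */
        atomic_set(&grp->refcount, 1); /* required before registering */
        if (erofs_register_workgroup(sb, grp, false)) {
            /* on failure: free grp and retry the lookup */
        }
    }
    /* ... use grp, then drop the reference ... */
    erofs_workgroup_put(grp);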

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/internal.h | 80 +++++++++++++++++++++
fs/erofs/super.c | 4 ++
fs/erofs/utils.c | 166 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 248 insertions(+), 2 deletions(-)

diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 4bcdf32a45ad..6a2407fb3013 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -65,6 +65,9 @@ struct erofs_sb_info {
struct list_head list;
struct mutex umount_mutex;

+ /* the dedicated workstation for compression */
+ struct radix_tree_root workstn_tree;
+
unsigned int shrinker_run_no;
#endif /* CONFIG_EROFS_FS_ZIP */
u32 blocks;
@@ -150,6 +153,77 @@ static inline void *erofs_kmalloc(struct erofs_sb_info *sbi,
#define set_opt(sbi, option) ((sbi)->mount_opt |= EROFS_MOUNT_##option)
#define test_opt(sbi, option) ((sbi)->mount_opt & EROFS_MOUNT_##option)

+#ifdef CONFIG_EROFS_FS_ZIP
+#define EROFS_LOCKED_MAGIC (INT_MIN | 0xE0F510CCL)
+
+/* basic unit of the workstation of a super_block */
+struct erofs_workgroup {
+ /* the workgroup index in the workstation */
+ pgoff_t index;
+
+ /* overall workgroup reference count */
+ atomic_t refcount;
+};
+
+#if defined(CONFIG_SMP)
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+ int val)
+{
+ preempt_disable();
+ if (val != atomic_cmpxchg(&grp->refcount, val, EROFS_LOCKED_MAGIC)) {
+ preempt_enable();
+ return false;
+ }
+ return true;
+}
+
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+ int orig_val)
+{
+ /*
+ * other observers should notice all modifications
+ * in the freezing period.
+ */
+ smp_mb();
+ atomic_set(&grp->refcount, orig_val);
+ preempt_enable();
+}
+
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+ return atomic_cond_read_relaxed(&grp->refcount,
+ VAL != EROFS_LOCKED_MAGIC);
+}
+#else
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+ int val)
+{
+ preempt_disable();
+ /* no need to spin on UP platforms, let's just disable preemption. */
+ if (val != atomic_read(&grp->refcount)) {
+ preempt_enable();
+ return false;
+ }
+ return true;
+}
+
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+ int orig_val)
+{
+ preempt_enable();
+}
+
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+ int v = atomic_read(&grp->refcount);
+
+ /* a workgroup is never frozen on uniprocessor systems */
+ DBG_BUGON(v == EROFS_LOCKED_MAGIC);
+ return v;
+}
+#endif /* !CONFIG_SMP */
+#endif /* !CONFIG_EROFS_FS_ZIP */
+
/* we strictly follow PAGE_SIZE and no buffer head yet */
#define LOG_BLOCK_SIZE PAGE_SHIFT

@@ -409,6 +483,12 @@ extern const struct file_operations erofs_dir_fops;

/* utils.c */
#ifdef CONFIG_EROFS_FS_ZIP
+int erofs_workgroup_put(struct erofs_workgroup *grp);
+struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
+ pgoff_t index, bool *tag);
+int erofs_register_workgroup(struct super_block *sb,
+ struct erofs_workgroup *grp, bool tag);
+static inline void erofs_workgroup_free_rcu(struct erofs_workgroup *grp) {}
void erofs_shrinker_register(struct super_block *sb);
void erofs_shrinker_unregister(struct super_block *sb);
int __init erofs_init_shrinker(void);
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 09992cc3b2fd..ea8d065068fa 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -338,6 +338,10 @@ static int erofs_fill_super(struct super_block *sb, void *data, int silent)
else
sb->s_flags &= ~SB_POSIXACL;

+#ifdef CONFIG_EROFS_FS_ZIP
+ INIT_RADIX_TREE(&sbi->workstn_tree, GFP_ATOMIC);
+#endif
+
/* get the root inode */
inode = erofs_iget(sb, ROOT_NID(sbi), true);
if (IS_ERR(inode))
diff --git a/fs/erofs/utils.c b/fs/erofs/utils.c
index cab7d77c4e59..628178261056 100644
--- a/fs/erofs/utils.c
+++ b/fs/erofs/utils.c
@@ -7,11 +7,173 @@
* Created by Gao Xiang <[email protected]>
*/
#include "internal.h"
+#include <linux/pagevec.h>

#ifdef CONFIG_EROFS_FS_ZIP
/* global shrink count (for all mounted EROFS instances) */
static atomic_long_t erofs_global_shrink_cnt;

+#define __erofs_workgroup_get(grp) atomic_inc(&(grp)->refcount)
+#define __erofs_workgroup_put(grp) atomic_dec(&(grp)->refcount)
+
+static int erofs_workgroup_get(struct erofs_workgroup *grp)
+{
+ int o;
+
+repeat:
+ o = erofs_wait_on_workgroup_freezed(grp);
+ if (unlikely(o <= 0))
+ return -1;
+
+ if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))
+ goto repeat;
+
+ /* decrease the refcount paired with erofs_workgroup_put */
+ if (unlikely(o == 1))
+ atomic_long_dec(&erofs_global_shrink_cnt);
+ return 0;
+}
+
+struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
+ pgoff_t index, bool *tag)
+{
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
+ struct erofs_workgroup *grp;
+
+repeat:
+ rcu_read_lock();
+ grp = radix_tree_lookup(&sbi->workstn_tree, index);
+ if (grp) {
+ *tag = xa_pointer_tag(grp);
+ grp = xa_untag_pointer(grp);
+
+ if (erofs_workgroup_get(grp)) {
+ /* prefer to relax rcu read side */
+ rcu_read_unlock();
+ goto repeat;
+ }
+
+ DBG_BUGON(index != grp->index);
+ }
+ rcu_read_unlock();
+ return grp;
+}
+
+int erofs_register_workgroup(struct super_block *sb,
+ struct erofs_workgroup *grp,
+ bool tag)
+{
+ struct erofs_sb_info *sbi;
+ int err;
+
+ /* grp shouldn't be broken or used before */
+ if (unlikely(atomic_read(&grp->refcount) != 1)) {
+ DBG_BUGON(1);
+ return -EINVAL;
+ }
+
+ err = radix_tree_preload(GFP_NOFS);
+ if (err)
+ return err;
+
+ sbi = EROFS_SB(sb);
+ xa_lock(&sbi->workstn_tree);
+
+ grp = xa_tag_pointer(grp, tag);
+
+ /*
+ * Bump up the reference count before making this workgroup
+ * visible to other users in order to avoid potential UAF
+ * when it is not serialized by workstn_lock.
+ */
+ __erofs_workgroup_get(grp);
+
+ err = radix_tree_insert(&sbi->workstn_tree, grp->index, grp);
+ if (unlikely(err))
+ /*
+ * it's safe to decrease it since the workgroup isn't visible
+ * and its refcount >= 2 (it cannot be frozen).
+ */
+ __erofs_workgroup_put(grp);
+
+ xa_unlock(&sbi->workstn_tree);
+ radix_tree_preload_end();
+ return err;
+}
+
+static void __erofs_workgroup_free(struct erofs_workgroup *grp)
+{
+ atomic_long_dec(&erofs_global_shrink_cnt);
+ erofs_workgroup_free_rcu(grp);
+}
+
+int erofs_workgroup_put(struct erofs_workgroup *grp)
+{
+ int count = atomic_dec_return(&grp->refcount);
+
+ if (count == 1)
+ atomic_long_inc(&erofs_global_shrink_cnt);
+ else if (!count)
+ __erofs_workgroup_free(grp);
+ return count;
+}
+
+/* for nocache case, no customized reclaim path at all */
+static bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
+ struct erofs_workgroup *grp,
+ bool cleanup)
+{
+ int cnt = atomic_read(&grp->refcount);
+
+ DBG_BUGON(cnt <= 0);
+ DBG_BUGON(cleanup && cnt != 1);
+
+ if (cnt > 1)
+ return false;
+
+ DBG_BUGON(xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
+ grp->index)) != grp);
+
+ /* (rarely) could be grabbed again when freeing */
+ erofs_workgroup_put(grp);
+ return true;
+}
+
+static unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
+ unsigned long nr_shrink,
+ bool cleanup)
+{
+ pgoff_t first_index = 0;
+ void *batch[PAGEVEC_SIZE];
+ unsigned int freed = 0;
+
+ int i, found;
+repeat:
+ xa_lock(&sbi->workstn_tree);
+
+ found = radix_tree_gang_lookup(&sbi->workstn_tree,
+ batch, first_index, PAGEVEC_SIZE);
+
+ for (i = 0; i < found; ++i) {
+ struct erofs_workgroup *grp = xa_untag_pointer(batch[i]);
+
+ first_index = grp->index + 1;
+
+ /* try to shrink each valid workgroup */
+ if (!erofs_try_to_release_workgroup(sbi, grp, cleanup))
+ continue;
+
+ ++freed;
+ if (unlikely(!--nr_shrink))
+ break;
+ }
+ xa_unlock(&sbi->workstn_tree);
+
+ if (i && nr_shrink)
+ goto repeat;
+ return freed;
+}
+
/* protected by 'erofs_sb_list_lock' */
static unsigned int shrinker_run_no;

@@ -35,7 +197,7 @@ void erofs_shrinker_unregister(struct super_block *sb)
struct erofs_sb_info *const sbi = EROFS_SB(sb);

mutex_lock(&sbi->umount_mutex);
- /* will add shrink final handler here */
+ erofs_shrink_workstation(sbi, ~0UL, true);

spin_lock(&erofs_sb_list_lock);
list_del(&sbi->list);
@@ -84,7 +246,7 @@ static unsigned long erofs_shrink_scan(struct shrinker *shrink,
spin_unlock(&erofs_sb_list_lock);
sbi->shrinker_run_no = run_no;

- /* will add shrink handler here */
+ freed += erofs_shrink_workstation(sbi, nr, false);

spin_lock(&erofs_sb_list_lock);
/* Get the next list element before we move this one */
--
2.17.1

2019-08-15 05:40:21

by Gao Xiang

[permalink] [raw]
Subject: [PATCH v8 13/24] erofs: add compression indexes support

This patch adds EROFS compression indexes support,
including the legacy and compacted 2/4B index formats.

In addition, it introduces an iterable L2P
(logical-to-physical) mapping operation called
'z_erofs_map_blocks_iter'.

Compared with 'erofs_map_blocks', it avoids a number
of redundant 'release and re-grab' cycles when successive
requests hit the same meta page.
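
As an illustrative sketch only (not part of this patch), a caller
could walk a compressed file extent by extent roughly as follows;
map.mpage is kept across iterations so adjacent lookups can reuse
the same meta page:

    struct erofs_map_blocks map = { .m_la = 0, .mpage = NULL };
    int err = 0;

    while (map.m_la < inode->i_size) {
        err = z_erofs_map_blocks_iter(inode, &map, 0);
        if (err)
            break;
        /* [map.m_pa, map.m_pa + map.m_plen) holds the (compressed)
         * data of logical range [map.m_la, map.m_la + map.m_llen) */
        map.m_la += map.m_llen;
    }
    if (map.mpage)
        put_page(map.mpage);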

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/Kconfig | 9 +
fs/erofs/Makefile | 1 +
fs/erofs/data.c | 10 +-
fs/erofs/inode.c | 2 +-
fs/erofs/internal.h | 35 ++-
fs/erofs/zmap.c | 461 +++++++++++++++++++++++++++++++++++
include/trace/events/erofs.h | 17 +-
7 files changed, 530 insertions(+), 5 deletions(-)
create mode 100644 fs/erofs/zmap.c

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index c5e7c5ae026e..a475fbebb831 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -72,3 +72,12 @@ config EROFS_FS_SECURITY

If you are not using a security module, say N.

+config EROFS_FS_ZIP
+ bool "EROFS Data Compression Support"
+ depends on EROFS_FS
+ select LZ4_DECOMPRESS
+ help
+ Enable fixed-sized output compression for EROFS.
+
+ If you don't want to enable the compression feature, say N.
+
diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
index 190a73083f23..481a966caf06 100644
--- a/fs/erofs/Makefile
+++ b/fs/erofs/Makefile
@@ -7,4 +7,5 @@ ccflags-y += -DEROFS_VERSION=\"$(EROFS_VERSION)\"
obj-$(CONFIG_EROFS_FS) += erofs.o
erofs-objs := super.o inode.o data.o namei.o dir.o
erofs-$(CONFIG_EROFS_FS_XATTR) += xattr.o
+erofs-$(CONFIG_EROFS_FS_ZIP) += zmap.o

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 3d8f1511cacb..0850b4725196 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -172,9 +172,15 @@ static int erofs_map_blocks_flatmode(struct inode *inode,
int erofs_map_blocks(struct inode *inode,
struct erofs_map_blocks *map, int flags)
{
- if (is_inode_layout_compression(inode))
- return -EOPNOTSUPP;
+ if (unlikely(is_inode_layout_compression(inode))) {
+ int err = z_erofs_map_blocks_iter(inode, map, flags);

+ if (map->mpage) {
+ put_page(map->mpage);
+ map->mpage = NULL;
+ }
+ return err;
+ }
return erofs_map_blocks_flatmode(inode, map, flags);
}

diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index d4e5de383435..7a9383a2b0a5 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -212,7 +212,7 @@ static int fill_inode(struct inode *inode, int isdir)
}

if (is_inode_layout_compression(inode)) {
- err = -EOPNOTSUPP;
+ err = z_erofs_fill_inode(inode);
goto out_unlock;
}

diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index f42cdda2eebc..8432f488409d 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -173,9 +173,12 @@ static inline erofs_off_t iloc(struct erofs_sb_info *sbi, erofs_nid_t nid)

/* atomic flag definitions */
#define EROFS_V_EA_INITED_BIT 0
+#define EROFS_V_Z_INITED_BIT 1

/* bitlock definitions (arranged in reverse order) */
#define EROFS_V_BL_XATTR_BIT (BITS_PER_LONG - 1)
+#define EROFS_V_BL_Z_BIT (BITS_PER_LONG - 2)
+
struct erofs_vnode {
erofs_nid_t nid;

@@ -189,7 +192,17 @@ struct erofs_vnode {
unsigned int xattr_shared_count;
unsigned int *xattr_shared_xattrs;

- erofs_blk_t raw_blkaddr;
+ union {
+ erofs_blk_t raw_blkaddr;
+#ifdef CONFIG_EROFS_FS_ZIP
+ struct {
+ unsigned short z_advise;
+ unsigned char z_algorithmtype[2];
+ unsigned char z_logical_clusterbits;
+ unsigned char z_physical_clusterbits[2];
+ };
+#endif /* CONFIG_EROFS_FS_ZIP */
+ };
/* the corresponding vfs inode */
struct inode vfs_inode;
};
@@ -262,6 +275,10 @@ enum {
#define EROFS_MAP_MAPPED (1 << BH_Mapped)
/* Located in metadata (could be copied from bd_inode) */
#define EROFS_MAP_META (1 << BH_Meta)
+/* The extent has been compressed */
+#define EROFS_MAP_ZIPPED (1 << BH_Zipped)
+/* The length of extent is full */
+#define EROFS_MAP_FULL_MAPPED (1 << BH_FullMapped)

struct erofs_map_blocks {
erofs_off_t m_pa, m_la;
@@ -275,6 +292,22 @@ struct erofs_map_blocks {
/* Flags used by erofs_map_blocks() */
#define EROFS_GET_BLOCKS_RAW 0x0001

+/* zmap.c */
+#ifdef CONFIG_EROFS_FS_ZIP
+int z_erofs_fill_inode(struct inode *inode);
+int z_erofs_map_blocks_iter(struct inode *inode,
+ struct erofs_map_blocks *map,
+ int flags);
+#else
+static inline int z_erofs_fill_inode(struct inode *inode) { return -EOPNOTSUPP; }
+static inline int z_erofs_map_blocks_iter(struct inode *inode,
+ struct erofs_map_blocks *map,
+ int flags)
+{
+ return -EOPNOTSUPP;
+}
+#endif /* !CONFIG_EROFS_FS_ZIP */
+
/* data.c */
static inline struct bio *erofs_grab_bio(struct super_block *sb,
erofs_blk_t blkaddr,
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
new file mode 100644
index 000000000000..f1924b4e9f36
--- /dev/null
+++ b/fs/erofs/zmap.c
@@ -0,0 +1,461 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/fs/erofs/zmap.c
+ *
+ * Copyright (C) 2018-2019 HUAWEI, Inc.
+ * http://www.huawei.com/
+ * Created by Gao Xiang <[email protected]>
+ */
+#include "internal.h"
+#include <asm/unaligned.h>
+#include <trace/events/erofs.h>
+
+int z_erofs_fill_inode(struct inode *inode)
+{
+ struct erofs_vnode *const vi = EROFS_V(inode);
+
+ if (vi->datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY) {
+ vi->z_advise = 0;
+ vi->z_algorithmtype[0] = 0;
+ vi->z_algorithmtype[1] = 0;
+ vi->z_logical_clusterbits = LOG_BLOCK_SIZE;
+ vi->z_physical_clusterbits[0] = vi->z_logical_clusterbits;
+ vi->z_physical_clusterbits[1] = vi->z_logical_clusterbits;
+ set_bit(EROFS_V_Z_INITED_BIT, &vi->flags);
+ }
+ return -EOPNOTSUPP;
+}
+
+static int fill_inode_lazy(struct inode *inode)
+{
+ struct erofs_vnode *const vi = EROFS_V(inode);
+ struct super_block *const sb = inode->i_sb;
+ int err;
+ erofs_off_t pos;
+ struct page *page;
+ void *kaddr;
+ struct z_erofs_map_header *h;
+
+ if (test_bit(EROFS_V_Z_INITED_BIT, &vi->flags))
+ return 0;
+
+ if (wait_on_bit_lock(&vi->flags, EROFS_V_BL_Z_BIT, TASK_KILLABLE))
+ return -ERESTARTSYS;
+
+ err = 0;
+ if (test_bit(EROFS_V_Z_INITED_BIT, &vi->flags))
+ goto out_unlock;
+
+ DBG_BUGON(vi->datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY);
+
+ pos = ALIGN(iloc(EROFS_SB(sb), vi->nid) + vi->inode_isize +
+ vi->xattr_isize, 8);
+ page = erofs_get_meta_page(sb, erofs_blknr(pos), false);
+ if (IS_ERR(page)) {
+ err = PTR_ERR(page);
+ goto out_unlock;
+ }
+
+ kaddr = kmap_atomic(page);
+
+ h = kaddr + erofs_blkoff(pos);
+ vi->z_advise = le16_to_cpu(h->h_advise);
+ vi->z_algorithmtype[0] = h->h_algorithmtype & 15;
+ vi->z_algorithmtype[1] = h->h_algorithmtype >> 4;
+
+ if (vi->z_algorithmtype[0] >= Z_EROFS_COMPRESSION_MAX) {
+ errln("unknown compression format %u for nid %llu, please upgrade kernel",
+ vi->z_algorithmtype[0], vi->nid);
+ err = -EOPNOTSUPP;
+ goto unmap_done;
+ }
+
+ vi->z_logical_clusterbits = LOG_BLOCK_SIZE + (h->h_clusterbits & 7);
+ vi->z_physical_clusterbits[0] = vi->z_logical_clusterbits +
+ ((h->h_clusterbits >> 3) & 3);
+
+ if (vi->z_physical_clusterbits[0] != LOG_BLOCK_SIZE) {
+ errln("unsupported physical clusterbits %u for nid %llu, please upgrade kernel",
+ vi->z_physical_clusterbits[0], vi->nid);
+ err = -EOPNOTSUPP;
+ goto unmap_done;
+ }
+
+ vi->z_physical_clusterbits[1] = vi->z_logical_clusterbits +
+ ((h->h_clusterbits >> 5) & 7);
+unmap_done:
+ kunmap_atomic(kaddr);
+ unlock_page(page);
+ put_page(page);
+
+ set_bit(EROFS_V_Z_INITED_BIT, &vi->flags);
+out_unlock:
+ clear_and_wake_up_bit(EROFS_V_BL_Z_BIT, &vi->flags);
+ return err;
+}
+
+struct z_erofs_maprecorder {
+ struct inode *inode;
+ struct erofs_map_blocks *map;
+ void *kaddr;
+
+ unsigned long lcn;
+ /* compression extent information gathered */
+ u8 type;
+ u16 clusterofs;
+ u16 delta[2];
+ erofs_blk_t pblk;
+};
+
+static int z_erofs_reload_indexes(struct z_erofs_maprecorder *m,
+ erofs_blk_t eblk)
+{
+ struct super_block *const sb = m->inode->i_sb;
+ struct erofs_map_blocks *const map = m->map;
+ struct page *mpage = map->mpage;
+
+ if (mpage) {
+ if (mpage->index == eblk) {
+ if (!m->kaddr)
+ m->kaddr = kmap_atomic(mpage);
+ return 0;
+ }
+
+ if (m->kaddr) {
+ kunmap_atomic(m->kaddr);
+ m->kaddr = NULL;
+ }
+ put_page(mpage);
+ }
+
+ mpage = erofs_get_meta_page(sb, eblk, false);
+ if (IS_ERR(mpage)) {
+ map->mpage = NULL;
+ return PTR_ERR(mpage);
+ }
+ m->kaddr = kmap_atomic(mpage);
+ unlock_page(mpage);
+ map->mpage = mpage;
+ return 0;
+}
+
+static int vle_legacy_load_cluster_from_disk(struct z_erofs_maprecorder *m,
+ unsigned long lcn)
+{
+ struct inode *const inode = m->inode;
+ struct erofs_vnode *const vi = EROFS_V(inode);
+ const erofs_off_t ibase = iloc(EROFS_I_SB(inode), vi->nid);
+ const erofs_off_t pos =
+ Z_EROFS_VLE_LEGACY_INDEX_ALIGN(ibase + vi->inode_isize +
+ vi->xattr_isize) +
+ lcn * sizeof(struct z_erofs_vle_decompressed_index);
+ struct z_erofs_vle_decompressed_index *di;
+ unsigned int advise, type;
+ int err;
+
+ err = z_erofs_reload_indexes(m, erofs_blknr(pos));
+ if (err)
+ return err;
+
+ m->lcn = lcn;
+ di = m->kaddr + erofs_blkoff(pos);
+
+ advise = le16_to_cpu(di->di_advise);
+ type = (advise >> Z_EROFS_VLE_DI_CLUSTER_TYPE_BIT) &
+ ((1 << Z_EROFS_VLE_DI_CLUSTER_TYPE_BITS) - 1);
+ switch (type) {
+ case Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD:
+ m->clusterofs = 1 << vi->z_logical_clusterbits;
+ m->delta[0] = le16_to_cpu(di->di_u.delta[0]);
+ m->delta[1] = le16_to_cpu(di->di_u.delta[1]);
+ break;
+ case Z_EROFS_VLE_CLUSTER_TYPE_PLAIN:
+ case Z_EROFS_VLE_CLUSTER_TYPE_HEAD:
+ m->clusterofs = le16_to_cpu(di->di_clusterofs);
+ m->pblk = le32_to_cpu(di->di_u.blkaddr);
+ break;
+ default:
+ DBG_BUGON(1);
+ return -EOPNOTSUPP;
+ }
+ m->type = type;
+ return 0;
+}
+
+static unsigned int decode_compactedbits(unsigned int lobits,
+ unsigned int lomask,
+ u8 *in, unsigned int pos, u8 *type)
+{
+ const unsigned int v = get_unaligned_le32(in + pos / 8) >> (pos & 7);
+ const unsigned int lo = v & lomask;
+
+ *type = (v >> lobits) & 3;
+ return lo;
+}
+
+static int unpack_compacted_index(struct z_erofs_maprecorder *m,
+ unsigned int amortizedshift,
+ unsigned int eofs)
+{
+ struct erofs_vnode *const vi = EROFS_V(m->inode);
+ const unsigned int lclusterbits = vi->z_logical_clusterbits;
+ const unsigned int lomask = (1 << lclusterbits) - 1;
+ unsigned int vcnt, base, lo, encodebits, nblk;
+ int i;
+ u8 *in, type;
+
+ if (1 << amortizedshift == 4)
+ vcnt = 2;
+ else if (1 << amortizedshift == 2 && lclusterbits == 12)
+ vcnt = 16;
+ else
+ return -EOPNOTSUPP;
+
+ encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt;
+ base = round_down(eofs, vcnt << amortizedshift);
+ in = m->kaddr + base;
+
+ i = (eofs - base) >> amortizedshift;
+
+ lo = decode_compactedbits(lclusterbits, lomask,
+ in, encodebits * i, &type);
+ m->type = type;
+ if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD) {
+ m->clusterofs = 1 << lclusterbits;
+ if (i + 1 != vcnt) {
+ m->delta[0] = lo;
+ return 0;
+ }
+ /*
+ * the last lcluster in the pack is special: its lo field
+ * stores delta[1] rather than delta[0]. Hence, derive
+ * delta[0] indirectly from the previous lcluster.
+ */
+ lo = decode_compactedbits(lclusterbits, lomask,
+ in, encodebits * (i - 1), &type);
+ if (type != Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD)
+ lo = 0;
+ m->delta[0] = lo + 1;
+ return 0;
+ }
+ m->clusterofs = lo;
+ m->delta[0] = 0;
+ /* figure out the blkaddr (pblk) for HEAD lclusters */
+ nblk = 1;
+ while (i > 0) {
+ --i;
+ lo = decode_compactedbits(lclusterbits, lomask,
+ in, encodebits * i, &type);
+ if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD)
+ i -= lo;
+
+ if (i >= 0)
+ ++nblk;
+ }
+ in += (vcnt << amortizedshift) - sizeof(__le32);
+ m->pblk = le32_to_cpu(*(__le32 *)in) + nblk;
+ return 0;
+}
+
+static int compacted_load_cluster_from_disk(struct z_erofs_maprecorder *m,
+ unsigned long lcn)
+{
+ struct inode *const inode = m->inode;
+ struct erofs_vnode *const vi = EROFS_V(inode);
+ const unsigned int lclusterbits = vi->z_logical_clusterbits;
+ const erofs_off_t ebase = ALIGN(iloc(EROFS_I_SB(inode), vi->nid) +
+ vi->inode_isize + vi->xattr_isize, 8) +
+ sizeof(struct z_erofs_map_header);
+ const unsigned int totalidx = DIV_ROUND_UP(inode->i_size, EROFS_BLKSIZ);
+ unsigned int compacted_4b_initial, compacted_2b;
+ unsigned int amortizedshift;
+ erofs_off_t pos;
+ int err;
+
+ if (lclusterbits != 12)
+ return -EOPNOTSUPP;
+
+ if (lcn >= totalidx)
+ return -EINVAL;
+
+ m->lcn = lcn;
+ /* used to align to the 32-byte (compacted_2b) boundary */
+ compacted_4b_initial = (32 - ebase % 32) / 4;
+ if (compacted_4b_initial == 32 / 4)
+ compacted_4b_initial = 0;
+
+ if (vi->z_advise & Z_EROFS_ADVISE_COMPACTED_2B)
+ compacted_2b = rounddown(totalidx - compacted_4b_initial, 16);
+ else
+ compacted_2b = 0;
+
+ pos = ebase;
+ if (lcn < compacted_4b_initial) {
+ amortizedshift = 2;
+ goto out;
+ }
+ pos += compacted_4b_initial * 4;
+ lcn -= compacted_4b_initial;
+
+ if (lcn < compacted_2b) {
+ amortizedshift = 1;
+ goto out;
+ }
+ pos += compacted_2b * 2;
+ lcn -= compacted_2b;
+ amortizedshift = 2;
+out:
+ pos += lcn * (1 << amortizedshift);
+ err = z_erofs_reload_indexes(m, erofs_blknr(pos));
+ if (err)
+ return err;
+ return unpack_compacted_index(m, amortizedshift, erofs_blkoff(pos));
+}
+
+static int vle_load_cluster_from_disk(struct z_erofs_maprecorder *m,
+ unsigned int lcn)
+{
+ const unsigned int datamode = EROFS_V(m->inode)->datamode;
+
+ if (datamode == EROFS_INODE_FLAT_COMPRESSION_LEGACY)
+ return vle_legacy_load_cluster_from_disk(m, lcn);
+
+ if (datamode == EROFS_INODE_FLAT_COMPRESSION)
+ return compacted_load_cluster_from_disk(m, lcn);
+
+ return -EINVAL;
+}
+
+static int vle_extent_lookback(struct z_erofs_maprecorder *m,
+ unsigned int lookback_distance)
+{
+ struct erofs_vnode *const vi = EROFS_V(m->inode);
+ struct erofs_map_blocks *const map = m->map;
+ const unsigned int lclusterbits = vi->z_logical_clusterbits;
+ unsigned long lcn = m->lcn;
+ int err;
+
+ if (lcn < lookback_distance) {
+ errln("bogus lookback distance @ nid %llu", vi->nid);
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+ }
+
+ /* load extent head logical cluster if needed */
+ lcn -= lookback_distance;
+ err = vle_load_cluster_from_disk(m, lcn);
+ if (err)
+ return err;
+
+ switch (m->type) {
+ case Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD:
+ return vle_extent_lookback(m, m->delta[0]);
+ case Z_EROFS_VLE_CLUSTER_TYPE_PLAIN:
+ map->m_flags &= ~EROFS_MAP_ZIPPED;
+ /* fallthrough */
+ case Z_EROFS_VLE_CLUSTER_TYPE_HEAD:
+ map->m_la = (lcn << lclusterbits) | m->clusterofs;
+ break;
+ default:
+ errln("unknown type %u at lcn %lu of nid %llu",
+ m->type, lcn, vi->nid);
+ DBG_BUGON(1);
+ return -EOPNOTSUPP;
+ }
+ return 0;
+}
+
+int z_erofs_map_blocks_iter(struct inode *inode,
+ struct erofs_map_blocks *map,
+ int flags)
+{
+ struct erofs_vnode *const vi = EROFS_V(inode);
+ struct z_erofs_maprecorder m = {
+ .inode = inode,
+ .map = map,
+ };
+ int err = 0;
+ unsigned int lclusterbits, endoff;
+ unsigned long long ofs, end;
+
+ trace_z_erofs_map_blocks_iter_enter(inode, map, flags);
+
+ /* when trying to read beyond EOF, leave it unmapped */
+ if (unlikely(map->m_la >= inode->i_size)) {
+ map->m_llen = map->m_la + 1 - inode->i_size;
+ map->m_la = inode->i_size;
+ map->m_flags = 0;
+ goto out;
+ }
+
+ err = fill_inode_lazy(inode);
+ if (err)
+ goto out;
+
+ lclusterbits = vi->z_logical_clusterbits;
+ ofs = map->m_la;
+ m.lcn = ofs >> lclusterbits;
+ endoff = ofs & ((1 << lclusterbits) - 1);
+
+ err = vle_load_cluster_from_disk(&m, m.lcn);
+ if (err)
+ goto unmap_out;
+
+ map->m_flags = EROFS_MAP_ZIPPED; /* by default, compressed */
+ end = (m.lcn + 1ULL) << lclusterbits;
+
+ switch (m.type) {
+ case Z_EROFS_VLE_CLUSTER_TYPE_PLAIN:
+ if (endoff >= m.clusterofs)
+ map->m_flags &= ~EROFS_MAP_ZIPPED;
+ /* fallthrough */
+ case Z_EROFS_VLE_CLUSTER_TYPE_HEAD:
+ if (endoff >= m.clusterofs) {
+ map->m_la = (m.lcn << lclusterbits) | m.clusterofs;
+ break;
+ }
+ /* m.lcn should be >= 1 if endoff < m.clusterofs */
+ if (unlikely(!m.lcn)) {
+ errln("invalid logical cluster 0 at nid %llu",
+ vi->nid);
+ err = -EFSCORRUPTED;
+ goto unmap_out;
+ }
+ end = (m.lcn << lclusterbits) | m.clusterofs;
+ map->m_flags |= EROFS_MAP_FULL_MAPPED;
+ m.delta[0] = 1;
+ /* fallthrough */
+ case Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD:
+ /* get the corresponding first chunk */
+ err = vle_extent_lookback(&m, m.delta[0]);
+ if (unlikely(err))
+ goto unmap_out;
+ break;
+ default:
+ errln("unknown type %u at offset %llu of nid %llu",
+ m.type, ofs, vi->nid);
+ err = -EOPNOTSUPP;
+ goto unmap_out;
+ }
+
+ map->m_llen = end - map->m_la;
+ map->m_plen = 1 << lclusterbits;
+ map->m_pa = blknr_to_addr(m.pblk);
+ map->m_flags |= EROFS_MAP_MAPPED;
+
+unmap_out:
+ if (m.kaddr)
+ kunmap_atomic(m.kaddr);
+
+out:
+ debugln("%s, m_la %llu m_pa %llu m_llen %llu m_plen %llu m_flags 0%o",
+ __func__, map->m_la, map->m_pa,
+ map->m_llen, map->m_plen, map->m_flags);
+
+ trace_z_erofs_map_blocks_iter_exit(inode, map, flags, err);
+
+ /* aggressively BUG_ON iff CONFIG_EROFS_FS_DEBUG is on */
+ DBG_BUGON(err < 0 && err != -ENOMEM);
+ return err;
+}
+
diff --git a/include/trace/events/erofs.h b/include/trace/events/erofs.h
index 0c5847c54b60..bfb2da9c4eee 100644
--- a/include/trace/events/erofs.h
+++ b/include/trace/events/erofs.h
@@ -20,7 +20,8 @@

#define show_mflags(flags) __print_flags(flags, "", \
{ EROFS_MAP_MAPPED, "M" }, \
- { EROFS_MAP_META, "I" })
+ { EROFS_MAP_META, "I" }, \
+ { EROFS_MAP_ZIPPED, "Z" })

TRACE_EVENT(erofs_lookup,

@@ -172,6 +173,13 @@ DEFINE_EVENT(erofs__map_blocks_enter, erofs_map_blocks_flatmode_enter,
TP_ARGS(inode, map, flags)
);

+DEFINE_EVENT(erofs__map_blocks_enter, z_erofs_map_blocks_iter_enter,
+ TP_PROTO(struct inode *inode, struct erofs_map_blocks *map,
+ unsigned int flags),
+
+ TP_ARGS(inode, map, flags)
+);
+
DECLARE_EVENT_CLASS(erofs__map_blocks_exit,
TP_PROTO(struct inode *inode, struct erofs_map_blocks *map,
unsigned int flags, int ret),
@@ -217,6 +225,13 @@ DEFINE_EVENT(erofs__map_blocks_exit, erofs_map_blocks_flatmode_exit,
TP_ARGS(inode, map, flags, ret)
);

+DEFINE_EVENT(erofs__map_blocks_exit, z_erofs_map_blocks_iter_exit,
+ TP_PROTO(struct inode *inode, struct erofs_map_blocks *map,
+ unsigned int flags, int ret),
+
+ TP_ARGS(inode, map, flags, ret)
+);
+
TRACE_EVENT(erofs_destroy_inode,
TP_PROTO(struct inode *inode),

--
2.17.1

2019-08-15 10:27:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8

On Thu, Aug 15, 2019 at 12:41:31PM +0800, Gao Xiang wrote:
> [I strip the previous cover letter, the old one can be found in v6:
> https://lore.kernel.org/r/[email protected]/]
>
> We'd like to submit a formal moving patch applied to staging tree
> for 5.4, before that we'd like to hear if there are some ACKs,
> suggestions or NAKs, objections of EROFS. Therefore, we can improve
> it in this round or rethink about the whole thing.
>
> As related materials mentioned [1] [2], the goal of EROFS is to
> save extra storage space with guaranteed end-to-end performance
> for read-only files, which has better performance over exist Linux
> compression filesystems based on fixed-sized output compression
> and inplace decompression. It even has better performance in
> a large compression ratio range compared with generic uncompressed
> filesystems with proper CPU-storage combinations. And we think this
> direction is correct and a dedicated kernel team is continuously /
> actively working on improving it, enough testers and beta / end
> users using it.
>
> EROFS has been applied to almost all in-service HUAWEI smartphones
> (Yes, the number is still increasing by time) and it seems like
> a success. It can be used in more wider scenarios. We think it's
> useful for Linux / Android OS community and it's the time moving
> out of staging.
>
> In order to get started, latest stable mkfs.erofs is available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b dev
>
> with README in the repository.
>
> We are still tuning sequential read performance for ultra-fast
> speed NVME SSDs like Samsung 970PRO, but at least now you can
> try on your PC with some data with proper compression ratio,
> the latest Linux kernel, USB stick for convenience sake and
> a not very old-fashioned CPU. There are also benchmarks available
> in the above materials mentioned.
>
> EROFS is a self-contained filesystem driver. Although there are
> still some TODOs to be more generic, we will actively keep on
> developping / tuning EROFS with the evolution of Linux kernel
> as the other in-kernel filesystems.
>
> As I mentioned before in LSF/MM 2019, in the future, we'd like
> to generalize the decompression engine into a library for other
> fses to use after the whole system is mature like fscrypt.
> However, such metadata should be designed respectively for
> each fs, and synchronous metadata read cost will be larger
> than EROFS because of those ondisk limitation. Therefore EROFS
> is still a better choice for read-only scenarios.
>
> EROFS is now ready for reviewing and moving, and the code is
> already cleaned up as shiny floors... Please kindly take some
> precious time, share your comments about EROFS and let us know
> your opinion about this. It's really important for us since
> generally speaking, we like to use Linux _in-tree_ stuffs rather
> than lack of supported out-of-tree / orphan stuffs as well.

I know everyone is busy, but given the length this has been in staging,
and the constant good progress toward cleaning it all up that has been
happening, I want to get this moved out of staging soon.

So, unless there are any objections, I'll take this patchset in a week
into my staging tree to move the filesystem into the "real" part of the
kernel.

thanks,

greg k-h

2019-08-15 16:41:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8

On Thu, Aug 15, 2019 at 2:06 AM Greg Kroah-Hartman
<[email protected]> wrote:
>
> I know everyone is busy, but given the length this has been in staging,
> and the constant good progress toward cleaning it all up that has been
> happening, I want to get this moved out of staging soon.

Since it doesn't touch anything outside of its own filesystem, I have
no real objections. We've never had huge problems with odd
filesystems.

I read through the patches to look for syntactic stuff (ie very much
*not* looking at actual code working or not), and had only one
comment. It's not critical, but it would be nice to do as part of (or
before) the "get it out of staging".

Linus

2019-08-15 19:10:51

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8

Hi Linus,

On Thu, Aug 15, 2019 at 09:18:12AM -0700, Linus Torvalds wrote:
> On Thu, Aug 15, 2019 at 2:06 AM Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > I know everyone is busy, but given the length this has been in staging,
> > and the constant good progress toward cleaning it all up that has been
> > happening, I want to get this moved out of staging soon.
>
> Since it doesn't touch anything outside of its own filesystem, I have
> no real objections. We've never had huge problems with odd
> filesystems.
>
> I read through the patches to look for syntactic stuff (ie very much
> *not* looking at actual code working or not), and had only one
> comment. It's not critical, but it would be nice to do as part of (or
> before) the "get it out of staging".

Thanks for your kind reply!

OK, I will submit a patch later to address your comment and
a pending formal moving patch with a suggestion by Stephen earlier
for Greg as well.

Thanks,
Gao Xiang

>
> Linus

2019-08-22 15:38:50

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8

Hi Greg,

On Thu, Aug 15, 2019 at 11:06:03AM +0200, Greg Kroah-Hartman wrote:
> On Thu, Aug 15, 2019 at 12:41:31PM +0800, Gao Xiang wrote:
> > [I strip the previous cover letter, the old one can be found in v6:
> > https://lore.kernel.org/r/[email protected]/]
> >
> > We'd like to submit a formal moving patch applied to staging tree
> > for 5.4, before that we'd like to hear if there are some ACKs,
> > suggestions or NAKs, objections of EROFS. Therefore, we can improve
> > it in this round or rethink about the whole thing.
> >
> > As related materials mentioned [1] [2], the goal of EROFS is to
> > save extra storage space with guaranteed end-to-end performance
> > for read-only files, which has better performance over exist Linux
> > compression filesystems based on fixed-sized output compression
> > and inplace decompression. It even has better performance in
> > a large compression ratio range compared with generic uncompressed
> > filesystems with proper CPU-storage combinations. And we think this
> > direction is correct and a dedicated kernel team is continuously /
> > actively working on improving it, enough testers and beta / end
> > users using it.
> >
> > EROFS has been applied to almost all in-service HUAWEI smartphones
> > (Yes, the number is still increasing by time) and it seems like
> > a success. It can be used in more wider scenarios. We think it's
> > useful for Linux / Android OS community and it's the time moving
> > out of staging.
> >
> > In order to get started, latest stable mkfs.erofs is available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b dev
> >
> > with README in the repository.
> >
> > We are still tuning sequential read performance for ultra-fast
> > speed NVME SSDs like Samsung 970PRO, but at least now you can
> > try on your PC with some data with proper compression ratio,
> > the latest Linux kernel, USB stick for convenience sake and
> > a not very old-fashioned CPU. There are also benchmarks available
> > in the above materials mentioned.
> >
> > EROFS is a self-contained filesystem driver. Although there are
> > still some TODOs to be more generic, we will actively keep on
> > developping / tuning EROFS with the evolution of Linux kernel
> > as the other in-kernel filesystems.
> >
> > As I mentioned before in LSF/MM 2019, in the future, we'd like
> > to generalize the decompression engine into a library for other
> > fses to use after the whole system is mature like fscrypt.
> > However, such metadata should be designed respectively for
> > each fs, and synchronous metadata read cost will be larger
> > than EROFS because of those ondisk limitation. Therefore EROFS
> > is still a better choice for read-only scenarios.
> >
> > EROFS is now ready for reviewing and moving, and the code is
> > already cleaned up as shiny floors... Please kindly take some
> > precious time, share your comments about EROFS and let us know
> > your opinion about this. It's really important for us since
> > generally speaking, we like to use Linux _in-tree_ stuffs rather
> > than lack of supported out-of-tree / orphan stuffs as well.
>
> I know everyone is busy, but given the length this has been in staging,
> and the constant good progress toward cleaning it all up that has been
> happening, I want to get this moved out of staging soon.
>
> So, unless there are any objections, I'll take this patchset in a week
> into my staging tree to move the filesystem into the "real" part of the
> kernel.

It seems that the time has passed; as a brief conclusion, we didn't
get any "real" objection in the previous week.

During these days, we have enhanced our robustness against corrupted
images with our first fuzzer based on mkfs.erofs (since it's a RO fs,
it generates reproducible images). Although the originally intended
use case of EROFS is on top of dm-verity for Android, we still want
to gain wider use, so we quickly built a fuzzer and addressed the
issues it found (yes, we will develop other independent fuzzing tools
as well.)

And thanks to all the people for the useful suggestions these days; we
think these wonderful fses (ext4/xfs/btrfs/...) have awesomely rich
tooling, which is also our next area to address, especially debugging
tools.

As a newborn community, we only have a few paid people working on
this, but we are doing our best on EROFS. Please kindly give us some
time to grow up (I have personally spent all my spare/working time on
EROFS since its start) and to apply EROFS to wider use, as we did
successfully for many, many HUAWEI smartphones...

As Greg said before [1], we have already proven the advantages of the
EROFS solution; the next step is to develop it more actively... And we
would also like to generalize the decompression engine into a library
for other general fses to use (we're very happy to share our efforts);
it seems interesting to other fs folks as well [2].

I have sent several patchsets since July 4, 2019 (v1-v8), Cc'ing most
of the fs people from the very beginning, and got responses and
suggestions from many (Ted, Pavel, Eric, Stephen, Amir, David, Jan,
Richard, Linus...). So could you kindly consider the following moving
patch [3] and let us join and contribute to the "real" part of the
Linux community more actively? Again, we have a steady stream of work
on EROFS, and will do our best on it. Thank you all very much!

Sorry about my English...

[1] https://lore.kernel.org/r/[email protected]/
[2] https://lore.kernel.org/r/[email protected]/
[3] https://lore.kernel.org/r/[email protected]/

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2019-08-29 17:13:08

by miaoxie (A)

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8



on 2019/8/15 at 12:41, Gao Xiang wrote:
> [I strip the previous cover letter, the old one can be found in v6:
> https://lore.kernel.org/r/[email protected]/]
>
> We'd like to submit a formal moving patch applied to staging tree
> for 5.4, before that we'd like to hear if there are some ACKs,
> suggestions or NAKs, objections of EROFS. Therefore, we can improve
> it in this round or rethink about the whole thing.

ACK, since it is stable enough and doesn't affect VFS or other filesystems.
Once merged, it could attract more users, we can get more feedback, and
they can help us further improve it.

Thanks
Miao

>
> As related materials mentioned [1] [2], the goal of EROFS is to
> save extra storage space with guaranteed end-to-end performance
> for read-only files, which has better performance over exist Linux
> compression filesystems based on fixed-sized output compression
> and inplace decompression. It even has better performance in
> a large compression ratio range compared with generic uncompressed
> filesystems with proper CPU-storage combinations. And we think this
> direction is correct and a dedicated kernel team is continuously /
> actively working on improving it, enough testers and beta / end
> users using it.
>
> EROFS has been applied to almost all in-service HUAWEI smartphones
> (Yes, the number is still increasing by time) and it seems like
> a success. It can be used in more wider scenarios. We think it's
> useful for Linux / Android OS community and it's the time moving
> out of staging.
>
> In order to get started, latest stable mkfs.erofs is available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b dev
>
> with README in the repository.
>
> We are still tuning sequential read performance for ultra-fast
> speed NVME SSDs like Samsung 970PRO, but at least now you can
> try on your PC with some data with proper compression ratio,
> the latest Linux kernel, USB stick for convenience sake and
> a not very old-fashioned CPU. There are also benchmarks available
> in the above materials mentioned.
>
> EROFS is a self-contained filesystem driver. Although there are
> still some TODOs to be more generic, we will actively keep on
> developping / tuning EROFS with the evolution of Linux kernel
> as the other in-kernel filesystems.
>
> As I mentioned before in LSF/MM 2019, in the future, we'd like
> to generalize the decompression engine into a library for other
> fses to use after the whole system is mature like fscrypt.
> However, such metadata should be designed respectively for
> each fs, and synchronous metadata read cost will be larger
> than EROFS because of those ondisk limitation. Therefore EROFS
> is still a better choice for read-only scenarios.
>
> EROFS is now ready for reviewing and moving, and the code is
> already cleaned up as shiny floors... Please kindly take some
> precious time, share your comments about EROFS and let us know
> your opinion about this. It's really important for us since
> generally speaking, we like to use Linux _in-tree_ stuffs rather
> than lack of supported out-of-tree / orphan stuffs as well.
>
> Thank you in advance,
> Gao Xiang
>
> [1] https://kccncosschn19eng.sched.com/event/Nru2/erofs-an-introduction-and-our-smartphone-practice-xiang-gao-huawei
> [2] https://www.usenix.org/conference/atc19/presentation/gao
>
> Changelog from v7:
> o keep up with the latest staging tree in addition to
> the latest staging patch:
> https://lore.kernel.org/r/[email protected]/
> - use EUCLEAN for fs corruption cases suggested by Pavel;
> - turn EIO into EOPNOTSUPP for unsupported on-disk format;
> - fix all misused ENOTSUPP into EOPNOTSUPP pointed out by Chao;
> o update cover letter
>
> It can also be found in git at tag "erofs_2019-08-15" (will be shown later) at:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/
>
> and the latest fs code is available at:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/tree/fs/erofs?h=erofs-outofstaging
>
> Changelog from v6:
> o keep up with the latest staging patchset
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> in order to fix the following cases:
> - inline erofs_inode_is_data_compressed() in erofs_fs.h;
> - remove incomplete cleancache;
> - remove all BUG_ON in EROFS.
> o Removing the file names from the comments at the top of the
> files, as suggested by Stephen, will be applied in the real
> moving patch later.
>
> Changelog from v5:
> o keep up with "[PATCH v2] staging: erofs: updates according to erofs-outofstaging v4"
> https://lore.kernel.org/lkml/[email protected]/
> which mainly addresses review comments from Chao:
> - keep the macro EROFS_IO_MAX_RETRIES_NOFAIL in internal.h;
> - kill a redundant NULL check in "__stagingpage_alloc";
> - add some descriptions about "use_vmap" to the documentation;
> - rearrange erofs_vmap of "staging: erofs: kill CONFIG_EROFS_FS_USE_VM_MAP_RAM";
>
> o all changes have been merged into the staging tree and are currently in staging-testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git/log/?h=staging-testing
>
> Changelog from v4:
> o rebase on Linux 5.3-rc1;
>
> o keep up with "staging: erofs: updates according to erofs-outofstaging v4"
> in order to get the main code bit-for-bit identical with the staging tree:
> https://lore.kernel.org/lkml/[email protected]/
>
> Changelog from v3:
> o use GPL-2.0-only for SPDX-License-Identifier suggested by Stephen;
>
> o kill all kconfig cache strategies and turn them into the mount
> option "cache_strategy={disable|readahead|readaround}" as
> suggested by Ted. As a first step, cached pages remain usable
> after the cache is disabled by remounting; these pages will fall
> out of the cache over time, which can be refined in a later
> version if needed. The related documentation is updated as well
> (a usage sketch follows this changelog block);
>
> o turn on CONFIG_EROFS_FS_SECURITY by default, as suggested by David;
>
> o kill CONFIG_EROFS_FS_IO_MAX_RETRIES and fold it into the code;
> turn EROFS_FS_USE_VM_MAP_RAM into a module parameter ("use_vmap")
> as suggested by David.
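>
> For reference, a hypothetical mount(2) call selecting one of the
> new cache strategies could look like the sketch below (the helper
> name is made up; only the "erofs" type and the "cache_strategy"
> option string come from this series):
>
>     #include <sys/mount.h>
>
>     static int mount_erofs(const char *dev, const char *mnt)
>     {
>             /* EROFS images are read-only by design */
>             return mount(dev, mnt, "erofs", MS_RDONLY,
>                          "cache_strategy=readaround");
>     }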
>
> Changelog from v2:
> o kill sbi->dev_name and clean up all failure handling in
> fill_super(), as suggested by Al.
> Note that the initialization of managed_cache is now moved
> after s_root is assigned, since it's preferable to iput()
> in .put_super() and all inodes should be evicted before
> the end of generic_shutdown_super(sb);
>
> o fold in the following staging patches (and thanks):
> staging: erofs:converting all 'unsigned' to 'unsigned int'
> staging: erofs: Remove function erofs_kill_sb()
> - However, it was reverted since erofs_kill_sb() is reused...
> staging: erofs: avoid opened loop codes
> staging: erofs: support bmap
>
> o move EROFS_SUPER_MAGIC_V1 from linux/fs/erofs/erofs_fs.h to
> include/uapi/linux/magic.h for userspace utilities.
>
> Changelog from v1:
> o resend the whole filesystem as one patchset, as suggested by Greg;
> o clean up the code further, especially the decompression frontend.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: David Sterba <[email protected]>
> Cc: Amir Goldstein <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Darrick J . Wong <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Jaegeuk Kim <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Richard Weinberger <[email protected]>
> Cc: Chao Yu <[email protected]>
> Cc: Miao Xie <[email protected]>
> Cc: Li Guifu <[email protected]>
> Cc: Fang Wei <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
>
>
> Gao Xiang (24):
> erofs: add on-disk layout
> erofs: add erofs in-memory stuffs
> erofs: add super block operations
> erofs: add raw address_space operations
> erofs: add inode operations
> erofs: support special inode
> erofs: add directory operations
> erofs: add namei functions
> erofs: support tracepoint
> erofs: update Kconfig and Makefile
> erofs: introduce xattr & posixacl support
> erofs: introduce tagged pointer
> erofs: add compression indexes support
> erofs: introduce superblock registration
> erofs: introduce erofs shrinker
> erofs: introduce workstation for decompression
> erofs: introduce per-CPU buffers implementation
> erofs: introduce pagevec for decompression subsystem
> erofs: add erofs_allocpage()
> erofs: introduce generic decompression backend
> erofs: introduce LZ4 decompression inplace
> erofs: introduce the decompression frontend
> erofs: introduce cached decompression
> erofs: add document
>
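> Regarding "erofs: introduce LZ4 decompression inplace" above, the
> idea can be sketched in userspace as follows (illustrative only;
> the margin is in the spirit of LZ4's documented in-place margin,
> and the real kernel code differs): the compressed bytes are
> preloaded at the tail of the destination buffer and decompression
> writes from the head, so no separate bounce buffer is needed since
> LZ4 consumes its input strictly forward:
>
>     #include <lz4.h>
>
>     /* modeled on LZ4_DECOMPRESS_INPLACE_MARGIN() */
>     #define INPLACE_MARGIN(srcsize) (((srcsize) >> 8) + 32)
>
>     /*
>      * 'buf' must hold dstsize + INPLACE_MARGIN(srcsize) bytes,
>      * with the srcsize compressed bytes already loaded at its
>      * tail; the decompressed data lands at the front of 'buf'.
>      */
>     static int decompress_inplace(char *buf, int srcsize, int dstsize)
>     {
>             const char *src = buf + dstsize
>                               + INPLACE_MARGIN(srcsize) - srcsize;
>
>             return LZ4_decompress_safe(src, buf, srcsize, dstsize);
>     }
>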
> Documentation/filesystems/erofs.txt | 225 +++++
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/erofs/Kconfig | 98 ++
> fs/erofs/Makefile | 11 +
> fs/erofs/compress.h | 62 ++
> fs/erofs/data.c | 425 ++++++++
> fs/erofs/decompressor.c | 360 +++++++
> fs/erofs/dir.c | 148 +++
> fs/erofs/erofs_fs.h | 316 ++++++
> fs/erofs/inode.c | 333 +++++++
> fs/erofs/internal.h | 555 +++++++++++
> fs/erofs/namei.c | 253 +++++
> fs/erofs/super.c | 666 +++++++++++++
> fs/erofs/tagptr.h | 110 +++
> fs/erofs/utils.c | 335 +++++++
> fs/erofs/xattr.c | 705 ++++++++++++++
> fs/erofs/xattr.h | 94 ++
> fs/erofs/zdata.c | 1405 +++++++++++++++++++++++++++
> fs/erofs/zdata.h | 195 ++++
> fs/erofs/zmap.c | 463 +++++++++
> fs/erofs/zpvec.h | 159 +++
> include/trace/events/erofs.h | 256 +++++
> include/uapi/linux/magic.h | 1 +
> 24 files changed, 7177 insertions(+)
> create mode 100644 Documentation/filesystems/erofs.txt
> create mode 100644 fs/erofs/Kconfig
> create mode 100644 fs/erofs/Makefile
> create mode 100644 fs/erofs/compress.h
> create mode 100644 fs/erofs/data.c
> create mode 100644 fs/erofs/decompressor.c
> create mode 100644 fs/erofs/dir.c
> create mode 100644 fs/erofs/erofs_fs.h
> create mode 100644 fs/erofs/inode.c
> create mode 100644 fs/erofs/internal.h
> create mode 100644 fs/erofs/namei.c
> create mode 100644 fs/erofs/super.c
> create mode 100644 fs/erofs/tagptr.h
> create mode 100644 fs/erofs/utils.c
> create mode 100644 fs/erofs/xattr.c
> create mode 100644 fs/erofs/xattr.h
> create mode 100644 fs/erofs/zdata.c
> create mode 100644 fs/erofs/zdata.h
> create mode 100644 fs/erofs/zmap.c
> create mode 100644 fs/erofs/zpvec.h
> create mode 100644 include/trace/events/erofs.h
>

2019-08-30 07:50:11

by Chao Yu

[permalink] [raw]
Subject: Re: [PATCH v8 00/24] erofs: promote erofs from staging v8

On 2019/8/15 12:41, Gao Xiang wrote:
> [...]
>
> EROFS is now ready for review and moving, and the code has already
> been cleaned up as shiny as floors... Please kindly take some
> precious time to share your comments about EROFS and let us know
> your opinion. It's really important for us since, generally
> speaking, we prefer Linux _in-tree_ code over poorly supported
> out-of-tree / orphan code.

EROFS brings its unique fixed-sized output compression and in-place
decompression framework into the ecosystem of compression filesystems. I
think it will enrich the diversity of compression filesystems and bring
healthy competition there.

I do believe this is the right time to promote erofs to the fs/ directory and
let it become a formal member of the filesystem clubhouse.

Acked-by: Chao Yu <[email protected]>

Thanks

> [...]