LinuxLists.cc - [PATCH v2 00/10] erofs: add big pcluster compression support

2021-04-01 03:33:55

Subject: [PATCH v2 00/10] erofs: add big pcluster compression support

Hi folks,

This is the formal version of EROFS big pcluster support, which means
EROFS can compress data into more than 1 fs block after this patchset.

{l,p}cluster are EROFS-specific concepts, standing for `logical cluster'
and `physical cluster' respectively. A logical cluster is the basic unit
of compress indexes in the file logical mapping, e.g. compress indexes
could be built per 2 blocks rather than per 1 block (currently only
1-block lclusters are supported). A physical cluster is a container of
physical compressed blocks holding compressed data, and its size is a
multiple of lclustersize.

Different from previous thoughts, which had fixed-sized pclusterblks
recorded in the on-disk compress index header, our on-disk design allows
variable-sized pclusterblks now. The main reasons are:
- user data varies in compression ratio locally, so a fixed-sized
clustersize approach wastes space and causes extra read
amplification for high CR cases;

- inplace decompression needs zero padding to guarantee its safe margin,
but we don't want to pad more than 1 fs block for big pcluster;

- end users can now customize the pcluster size according to data type
since various pcluster sizes can exist in a file, for example, using
different pcluster sizes for executable code and one-shot data. Such a
design should be more flexible than many other public compression fses.
(Btw, each file in EROFS can have at most 2 algorithms at the same time
by using HEAD1/2, which will be formally added with LZMA support.)

In brief, EROFS can now compress from variable-sized input to
variable-sized pcluster blocks, as illustrated below:

|<-_lcluster_->|________________________|<-_lcluster_->|
|____._________|_________ .. ___________|_______.______|
      .                                        .
       .                                      .
        .__________________________________.
        |______________| .. |______________|
        |<-           pcluster           ->|

The next step is how to record the compressed block count in
lclusters. In compress indexes, there are 2 concepts called HEAD and
NONHEAD lclusters. The difference is that a HEAD lcluster starts a new
pcluster in the lcluster, but a NONHEAD lcluster does not. It's easy to
see that a big pcluster has at least 2 blocks, and thus at least 2
lclusters as well.

Therefore, let the delta0 (distance to its HEAD lcluster) of the first
NONHEAD compress index store the compressed block count with a special
flag, as a new kind of compress index called CBLKCNT. It's also easy to
know its delta0 is constantly 1, as illustrated below:
 ________________________________________________________
|_HEAD_|_CBLKCNT_|_NONHEAD_|_..._|_NONHEAD_|_HEAD | HEAD |
|<------ a pcluster with CBLKCNT --------->|<-- -->|
                                               ^ a pcluster with 1

If another HEAD follows a HEAD lcluster, there is no room to record
CBLKCNT, but it's easy to know the size of pcluster will be 1.

More implementation details about this and compact indexes are in the
commit message.
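
As a rough illustration of the CBLKCNT rule above, here is a minimal
userspace sketch (not the kernel code). The flag value is the one defined
in patch 05/10; `struct d0_info', decode_delta0() and pcluster_bytes()
are hypothetical names made up for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value taken from patch 05/10 of this series. */
#define Z_EROFS_VLE_DI_D0_CBLKCNT	(1 << 11)

/* Hypothetical helper type: result of decoding one NONHEAD delta0 field. */
struct d0_info {
	bool is_cblkcnt;	/* true if this index is a CBLKCNT index */
	uint16_t delta0;	/* distance back to the HEAD lcluster */
	uint16_t pclusterlcs;	/* pcluster size in lclusters, 0 if unknown */
};

static struct d0_info decode_delta0(uint16_t raw)
{
	struct d0_info r = { false, raw, 0 };

	if (raw & Z_EROFS_VLE_DI_D0_CBLKCNT) {
		r.is_cblkcnt = true;
		r.pclusterlcs = raw & ~Z_EROFS_VLE_DI_D0_CBLKCNT;
		r.delta0 = 1;	/* CBLKCNT always directly follows its HEAD */
	}
	return r;
}

/* Compressed length in bytes; a HEAD without CBLKCNT means 1 lcluster. */
static uint32_t pcluster_bytes(uint16_t pclusterlcs, unsigned int lclusterbits)
{
	return (uint32_t)(pclusterlcs ? pclusterlcs : 1) << lclusterbits;
}
```

With 4KiB lclusters (lclusterbits == 12), a CBLKCNT index carrying 4 would
give a 16KiB pcluster, while a HEAD directly followed by another HEAD
stays at 4KiB.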

On the runtime performance side, the current EROFS test results are:
 ________________________________________________________________
|  file system  |   size    | seq read | rand read | rand9m read |
|_______________|___________|_ MiB/s __|__ MiB/s __|___ MiB/s ___|
|___erofs_4k____|_556879872_|_ 781.4 __|__ 55.3 ___|___ 25.3 ____|
|___erofs_16k___|_452509696_|_ 864.8 __|_ 123.2 ___|___ 20.8 ____|
|___erofs_32k___|_415223808_|_ 899.8 __|_ 105.8 _*_|___ 16.8 ____|
|___erofs_64k___|_393814016_|_ 906.6 __|__ 66.6 _*_|___ 11.8 ____|
|__squashfs_8k__|_556191744_|_  64.9 __|__ 19.3 ___|____ 9.1 ____|
|__squashfs_16k_|_502661120_|_  98.9 __|__ 38.0 ___|____ 9.8 ____|
|__squashfs_32k_|_458784768_|_ 115.4 __|__ 71.6 _*_|___ 10.0 ____|
|_squashfs_128k_|_398204928_|_ 257.2 __|_ 253.8 _*_|___ 10.9 ____|
|____ext4_4k____|____()_____|_ 786.6 __|__ 28.6 ___|___ 27.8 ____|

* Squashfs grabs more page cache to keep all decompressed data with
grab_cache_page_nowait() than the normal requested readahead (see
squashfs_copy_cache and squashfs_readpage_block).
In principle, EROFS can also cache all such decompressed data
if necessary, yet it's low priority for now and has little use
(rand9m is actually a better rand read workload, since the amount
of I/O is 9m rather than full-sized 1000m).

More details are in
https://lore.kernel.org/r/[email protected]

Also it's easy to see that EROFS is not a fixed pcluster design, so users
can choose different optimization strategies according to data type at
mkfs time. And there is still room to optimize runtime performance for
big pcluster even further.

Finally, it passes ro_fsstress and can also successfully boot buildroot
& Android system with android-mainline repo.

current mkfs repo for big pcluster:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental-bigpcluster-compact

Thanks for your time on reading this!

Thanks,
Gao Xiang

changes since v1:
- add a missing vunmap in erofs_pcpubuf_exit();
- refine comments and commit messages.

(btw, I'll apply this patchset for -next first for further integration
test, which will be aimed at 5.13-rc1.)

Gao Xiang (10):
erofs: reserve physical_clusterbits[]
erofs: introduce multipage per-CPU buffers
erofs: introduce physical cluster slab pools
erofs: fix up inplace I/O pointer for big pcluster
erofs: add big physical cluster definition
erofs: adjust per-CPU buffers according to max_pclusterblks
erofs: support parsing big pcluster compress indexes
erofs: support parsing big pcluster compact indexes
erofs: support decompress big pcluster for lz4 backend
erofs: enable big pcluster feature

fs/erofs/Kconfig | 14 ---
fs/erofs/Makefile | 2 +-
fs/erofs/decompressor.c | 216 +++++++++++++++++++++++++---------------
fs/erofs/erofs_fs.h | 31 ++++--
fs/erofs/internal.h | 31 ++----
fs/erofs/pcpubuf.c | 134 +++++++++++++++++++++++++
fs/erofs/super.c | 1 +
fs/erofs/utils.c | 12 ---
fs/erofs/zdata.c | 193 ++++++++++++++++++++++-------------
fs/erofs/zdata.h | 14 +--
fs/erofs/zmap.c | 155 ++++++++++++++++++++++------
11 files changed, 560 insertions(+), 243 deletions(-)
create mode 100644 fs/erofs/pcpubuf.c

--
2.20.1

2021-04-01 03:33:56

by Gao Xiang

Subject: [PATCH v2 05/10] erofs: add big physical cluster definition

From: Gao Xiang <[email protected]>

Big pcluster indicates that the size of compressed data for each
physical cluster is no longer fixed at the block size, but can be more
than 1 block (more accurately, 1 logical cluster).

When the big pcluster feature is enabled for head0/1, delta0 of the 1st
non-head lcluster index will keep the block count of this pcluster in
lcluster size instead of 1. If the pcluster has no non-head lcluster
index, its compressed size should be 1 lcluster.

Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since
it depends on COMPR_CFGS and will be released together.

Signed-off-by: Gao Xiang <[email protected]>
---
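
A small userspace sketch of the on-disk layout change in this patch:
the lz4 config stays at 14 bytes because max_pclusterblks is carved out
of the reserved area. uint16_t stands in for __le16 here (so endianness
is ignored), __attribute__((packed)) is the GCC/Clang spelling assumed
in place of the kernel's __packed, and has_big_pcluster_head1() is a
hypothetical helper added for illustration only:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the on-disk struct; uint16_t replaces __le16. */
struct z_erofs_lz4_cfgs {
	uint16_t max_distance;
	uint16_t max_pclusterblks;	/* new in this patch */
	uint8_t reserved[10];		/* shrunk from 12 to make room */
} __attribute__((packed));

/* The record must still be 14 bytes (+ 2-byte length field = 16). */
_Static_assert(sizeof(struct z_erofs_lz4_cfgs) == 14,
	       "lz4 cfgs must stay 14 bytes");
_Static_assert(offsetof(struct z_erofs_lz4_cfgs, max_pclusterblks) == 2,
	       "max_pclusterblks follows max_distance");

/* Per-file advise bits as defined by this patch. */
#define Z_EROFS_ADVISE_COMPACTED_2B	0x0001
#define Z_EROFS_ADVISE_BIG_PCLUSTER_1	0x0002
#define Z_EROFS_ADVISE_BIG_PCLUSTER_2	0x0004

/* Hypothetical helper: is big pcluster enabled for HEAD1 of this file? */
static int has_big_pcluster_head1(uint16_t advise)
{
	return !!(advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1);
}
```
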
fs/erofs/erofs_fs.h | 19 +++++++++++++++----
fs/erofs/internal.h | 1 +
2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index 76777673eb63..ecc3a0ea0bc4 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -19,6 +19,7 @@
*/
#define EROFS_FEATURE_INCOMPAT_LZ4_0PADDING 0x00000001
#define EROFS_FEATURE_INCOMPAT_COMPR_CFGS 0x00000002
+#define EROFS_FEATURE_INCOMPAT_BIG_PCLUSTER 0x00000002
#define EROFS_ALL_FEATURE_INCOMPAT EROFS_FEATURE_INCOMPAT_LZ4_0PADDING

#define EROFS_SB_EXTSLOT_SIZE 16
@@ -214,17 +215,20 @@ enum {
/* 14 bytes (+ length field = 16 bytes) */
struct z_erofs_lz4_cfgs {
__le16 max_distance;
- u8 reserved[12];
+ __le16 max_pclusterblks;
+ u8 reserved[10];
} __packed;

/*
* bit 0 : COMPACTED_2B indexes (0 - off; 1 - on)
* e.g. for 4k logical cluster size, 4B if compacted 2B is off;
* (4B) + 2B + (4B) if compacted 2B is on.
+ * bit 1 : HEAD1 big pcluster (0 - off; 1 - on)
+ * bit 2 : HEAD2 big pcluster (0 - off; 1 - on)
*/
-#define Z_EROFS_ADVISE_COMPACTED_2B_BIT 0
-
-#define Z_EROFS_ADVISE_COMPACTED_2B (1 << Z_EROFS_ADVISE_COMPACTED_2B_BIT)
+#define Z_EROFS_ADVISE_COMPACTED_2B 0x0001
+#define Z_EROFS_ADVISE_BIG_PCLUSTER_1 0x0002
+#define Z_EROFS_ADVISE_BIG_PCLUSTER_2 0x0004

struct z_erofs_map_header {
__le32 h_reserved1;
@@ -279,6 +283,13 @@ enum {
#define Z_EROFS_VLE_DI_CLUSTER_TYPE_BITS 2
#define Z_EROFS_VLE_DI_CLUSTER_TYPE_BIT 0

+/*
+ * D0_CBLKCNT will be marked _only_ at the 1st non-head lcluster to store the
+ * compressed block count of a compressed extent (in logical clusters, aka.
+ * block count of a pcluster).
+ */
+#define Z_EROFS_VLE_DI_D0_CBLKCNT (1 << 11)
+
struct z_erofs_vle_decompressed_index {
__le16 di_advise;
/* where to decompress in the head cluster */
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 06c294929069..c4b3938a7e56 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -230,6 +230,7 @@ static inline bool erofs_sb_has_##name(struct erofs_sb_info *sbi) \

EROFS_FEATURE_FUNCS(lz4_0padding, incompat, INCOMPAT_LZ4_0PADDING)
EROFS_FEATURE_FUNCS(compr_cfgs, incompat, INCOMPAT_COMPR_CFGS)
+EROFS_FEATURE_FUNCS(big_pcluster, incompat, INCOMPAT_BIG_PCLUSTER)
EROFS_FEATURE_FUNCS(sb_chksum, compat, COMPAT_SB_CHKSUM)

/* atomic flag definitions */
--
2.20.1

2021-04-01 03:33:57

by Gao Xiang

Subject: [PATCH v2 04/10] erofs: fix up inplace I/O pointer for big pcluster

From: Gao Xiang <[email protected]>

When picking up inplace I/O pages, they should be traversed in reverse
order, in line with the traversal order of file-backed online pages.
Also, index should be updated together when preloading compressed pages.

Previously, only page-sized pclustersize was supported, so this was not
a problem at all. Also rename `compressedpages' to `icpage_ptr' to
reflect its functionality.

Signed-off-by: Gao Xiang <[email protected]>
---
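
The reverse pick-up order described above can be sketched in plain C as
follows. This is a single-threaded illustration only: the kernel uses
cmpxchg() on pcl->compressed_pages[], whereas this sketch uses a plain
NULL check, and `struct collector', NR_SLOTS and collector_begin() are
hypothetical names:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical pclusterpages */

struct collector {
	void *slots[NR_SLOTS];	/* stands in for pcl->compressed_pages[] */
	void **icpage_ptr;	/* next candidate slot is icpage_ptr - 1 */
};

static void collector_begin(struct collector *c)
{
	/* start one past the last slot, since pages arrive in reverse order */
	c->icpage_ptr = c->slots + NR_SLOTS;
}

/* Single-threaded stand-in for the cmpxchg() loop in the patch. */
static bool try_inplace_io(struct collector *c, void *page)
{
	while (c->icpage_ptr > c->slots) {
		--c->icpage_ptr;
		if (*c->icpage_ptr == NULL) {
			*c->icpage_ptr = page;
			return true;
		}
	}
	return false;	/* every remaining slot was already taken */
}
```

Slots are therefore claimed from the last compressed page towards the
first, and slots already occupied (e.g. by cached pages) are skipped.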
fs/erofs/zdata.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 7f572086b4e3..03f106ead8d2 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -204,7 +204,8 @@ struct z_erofs_collector {

struct z_erofs_pcluster *pcl, *tailpcl;
struct z_erofs_collection *cl;
- struct page **compressedpages;
+ /* a pointer used to pick up inplace I/O pages */
+ struct page **icpage_ptr;
z_erofs_next_pcluster_t owned_head;

enum z_erofs_collectmode mode;
@@ -238,17 +239,19 @@ static void preload_compressed_pages(struct z_erofs_collector *clt,
enum z_erofs_cache_alloctype type,
struct list_head *pagepool)
{
- const struct z_erofs_pcluster *pcl = clt->pcl;
- struct page **pages = clt->compressedpages;
- pgoff_t index = pcl->obj.index + (pages - pcl->compressed_pages);
+ struct z_erofs_pcluster *pcl = clt->pcl;
bool standalone = true;
gfp_t gfp = (mapping_gfp_mask(mc) & ~__GFP_DIRECT_RECLAIM) |
__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
+ struct page **pages;
+ pgoff_t index;

if (clt->mode < COLLECT_PRIMARY_FOLLOWED)
return;

- for (; pages < pcl->compressed_pages + pcl->pclusterpages; ++pages) {
+ pages = pcl->compressed_pages;
+ index = pcl->obj.index;
+ for (; index < pcl->obj.index + pcl->pclusterpages; ++index, ++pages) {
struct page *page;
compressed_page_t t;
struct page *newpage = NULL;
@@ -360,16 +363,14 @@ int erofs_try_to_free_cached_page(struct address_space *mapping,
}

/* page_type must be Z_EROFS_PAGE_TYPE_EXCLUSIVE */
-static inline bool z_erofs_try_inplace_io(struct z_erofs_collector *clt,
- struct page *page)
+static bool z_erofs_try_inplace_io(struct z_erofs_collector *clt,
+ struct page *page)
{
struct z_erofs_pcluster *const pcl = clt->pcl;

- while (clt->compressedpages <
- pcl->compressed_pages + pcl->pclusterpages) {
- if (!cmpxchg(clt->compressedpages++, NULL, page))
+ while (clt->icpage_ptr > pcl->compressed_pages)
+ if (!cmpxchg(--clt->icpage_ptr, NULL, page))
return true;
- }
return false;
}

@@ -576,9 +577,8 @@ static int z_erofs_collector_begin(struct z_erofs_collector *clt,
z_erofs_pagevec_ctor_init(&clt->vector, Z_EROFS_NR_INLINE_PAGEVECS,
clt->cl->pagevec, clt->cl->vcnt);

- clt->compressedpages = clt->pcl->compressed_pages;
- if (clt->mode <= COLLECT_PRIMARY) /* cannot do in-place I/O */
- clt->compressedpages += clt->pcl->pclusterpages;
+ /* since file-backed online pages are traversed in reverse order */
+ clt->icpage_ptr = clt->pcl->compressed_pages + clt->pcl->pclusterpages;
return 0;
}

--
2.20.1

2021-04-01 03:35:24

by Gao Xiang

Subject: [PATCH v2 01/10] erofs: reserve physical_clusterbits[]

From: Gao Xiang <[email protected]>

Formal big pcluster design is actually more powerful / flexible than
the previous thought, whose pclustersize was fixed as power-of-2 blocks,
which was obviously inefficient and space-wasting. Instead, pclustersize
can now be set independently for each pcluster, so various pcluster
sizes can also be used together in one file if mkfs wants (for example,
according to data type and/or compression ratio).

Let's get rid of previous physical_clusterbits[] setting (also notice
that corresponding on-disk fields are still 0 for now). Therefore,
head1/2 can be used for at most 2 different algorithms in one file and
again pclustersize is now independent of these.

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/erofs_fs.h | 4 +---
fs/erofs/internal.h | 1 -
fs/erofs/zdata.c | 3 +--
fs/erofs/zmap.c | 15 ---------------
4 files changed, 2 insertions(+), 21 deletions(-)

diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index 17bc0b5f117d..626b7d3e9ab7 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -233,9 +233,7 @@ struct z_erofs_map_header {
__u8 h_algorithmtype;
/*
* bit 0-2 : logical cluster bits - 12, e.g. 0 for 4096;
- * bit 3-4 : (physical - logical) cluster bits of head 1:
- * For example, if logical clustersize = 4096, 1 for 8192.
- * bit 5-7 : (physical - logical) cluster bits of head 2.
+ * bit 3-7 : reserved.
*/
__u8 h_clusterbits;
};
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 60063bbbb91a..05b02f99324c 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -266,7 +266,6 @@ struct erofs_inode {
unsigned short z_advise;
unsigned char z_algorithmtype[2];
unsigned char z_logical_clusterbits;
- unsigned char z_physical_clusterbits[2];
};
#endif /* CONFIG_EROFS_FS_ZIP */
};
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index cd9b76216925..eabfd8873e12 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -430,8 +430,7 @@ static int z_erofs_register_collection(struct z_erofs_collector *clt,
else
pcl->algorithmformat = Z_EROFS_COMPRESSION_SHIFTED;

- pcl->clusterbits = EROFS_I(inode)->z_physical_clusterbits[0];
- pcl->clusterbits -= PAGE_SHIFT;
+ pcl->clusterbits = 0;

/* new pclusters should be claimed as type 1, primary and followed */
pcl->next = clt->owned_head;
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 14d2de35110c..bd7e10c2fdd3 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -17,11 +17,8 @@ int z_erofs_fill_inode(struct inode *inode)
vi->z_algorithmtype[0] = 0;
vi->z_algorithmtype[1] = 0;
vi->z_logical_clusterbits = LOG_BLOCK_SIZE;
- vi->z_physical_clusterbits[0] = vi->z_logical_clusterbits;
- vi->z_physical_clusterbits[1] = vi->z_logical_clusterbits;
set_bit(EROFS_I_Z_INITED_BIT, &vi->flags);
}
-
inode->i_mapping->a_ops = &z_erofs_aops;
return 0;
}
@@ -77,18 +74,6 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
}

vi->z_logical_clusterbits = LOG_BLOCK_SIZE + (h->h_clusterbits & 7);
- vi->z_physical_clusterbits[0] = vi->z_logical_clusterbits +
- ((h->h_clusterbits >> 3) & 3);
-
- if (vi->z_physical_clusterbits[0] != LOG_BLOCK_SIZE) {
- erofs_err(sb, "unsupported physical clusterbits %u for nid %llu, please upgrade kernel",
- vi->z_physical_clusterbits[0], vi->nid);
- err = -EOPNOTSUPP;
- goto unmap_done;
- }
-
- vi->z_physical_clusterbits[1] = vi->z_logical_clusterbits +
- ((h->h_clusterbits >> 5) & 7);
/* paired with smp_mb() at the beginning of the function */
smp_mb();
set_bit(EROFS_I_Z_INITED_BIT, &vi->flags);
--
2.20.1

2021-04-01 03:35:26

by Gao Xiang

Subject: [PATCH v2 06/10] erofs: adjust per-CPU buffers according to max_pclusterblks

From: Gao Xiang <[email protected]>

Adjust per-CPU buffers on demand since the big pcluster definition is
available. Also, bail out on unsupported pcluster sizes according to
Z_EROFS_PCLUSTER_MAX_SIZE.

Signed-off-by: Gao Xiang <[email protected]>
---
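
A minimal sketch of the limit check this patch adds, for readers who
want to try the arithmetic outside the kernel. The two constants below
are assumed values for illustration (4KiB blocks, 1MiB maximum pcluster
size); they are not taken from this mail. lz4_pclusterblks_or_err() is
a hypothetical name:

```c
#include <stdint.h>

/* Assumed values for illustration; the real definitions live in fs/erofs. */
#define EROFS_BLKSIZ			4096u
#define Z_EROFS_PCLUSTER_MAX_SIZE	(1024u * 1024u)

/*
 * Returns how many blocks the per-CPU buffers should cover,
 * or -1 where the kernel code would return -EINVAL.
 */
static int lz4_pclusterblks_or_err(int has_cfgs,
				   uint16_t ondisk_max_pclusterblks)
{
	if (!has_cfgs)
		return 1;	/* legacy images: one block per pcluster */
	if (ondisk_max_pclusterblks >
	    Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ)
		return -1;
	return ondisk_max_pclusterblks;
}
```

Under these assumed constants the largest accepted value is 256 blocks.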
fs/erofs/decompressor.c | 16 ++++++++++++----
fs/erofs/internal.h | 2 ++
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
index fb4838c0f0df..5d9f9dbd3681 100644
--- a/fs/erofs/decompressor.c
+++ b/fs/erofs/decompressor.c
@@ -32,6 +32,7 @@ int z_erofs_load_lz4_config(struct super_block *sb,
struct erofs_super_block *dsb,
struct z_erofs_lz4_cfgs *lz4, int size)
{
+ struct erofs_sb_info *sbi = EROFS_SB(sb);
u16 distance;

if (lz4) {
@@ -40,16 +41,23 @@ int z_erofs_load_lz4_config(struct super_block *sb,
return -EINVAL;
}
distance = le16_to_cpu(lz4->max_distance);
+
+ sbi->lz4.max_pclusterblks = le16_to_cpu(lz4->max_pclusterblks);
+ if (sbi->lz4.max_pclusterblks >
+ Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ) {
+ erofs_err(sb, "too large lz4 pcluster blocks %u",
+ sbi->lz4.max_pclusterblks);
+ return -EINVAL;
+ }
} else {
distance = le16_to_cpu(dsb->u1.lz4_max_distance);
+ sbi->lz4.max_pclusterblks = 1;
}

- EROFS_SB(sb)->lz4.max_distance_pages = distance ?
+ sbi->lz4.max_distance_pages = distance ?
DIV_ROUND_UP(distance, PAGE_SIZE) + 1 :
LZ4_MAX_DISTANCE_PAGES;
-
- /* TODO: use max pclusterblks after bigpcluster is enabled */
- return erofs_pcpubuf_growsize(1);
+ return erofs_pcpubuf_growsize(sbi->lz4.max_pclusterblks);
}

static int z_erofs_lz4_prepare_destpages(struct z_erofs_decompress_req *rq,
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index c4b3938a7e56..f1305af50f67 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -63,6 +63,8 @@ struct erofs_fs_context {
struct erofs_sb_lz4_info {
/* # of pages needed for EROFS lz4 rolling decompression */
u16 max_distance_pages;
+ /* maximum possible blocks for pclusters in the filesystem */
+ u16 max_pclusterblks;
};

struct erofs_sb_info {
--
2.20.1

2021-04-01 03:35:27

by Gao Xiang

Subject: [PATCH v2 07/10] erofs: support parsing big pcluster compress indexes

From: Gao Xiang <[email protected]>

When the INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress
indexes will also have the same on-disk header as compact indexes to
keep per-file configurations instead of leaving it zeroed.

If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each
pcluster in this file by parsing the 1st non-head lcluster.

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/zmap.c | 79 +++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 73 insertions(+), 6 deletions(-)

diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index bd7e10c2fdd3..d34ff810cc15 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -11,8 +11,10 @@
int z_erofs_fill_inode(struct inode *inode)
{
struct erofs_inode *const vi = EROFS_I(inode);
+ struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);

- if (vi->datalayout == EROFS_INODE_FLAT_COMPRESSION_LEGACY) {
+ if (!erofs_sb_has_big_pcluster(sbi) &&
+ vi->datalayout == EROFS_INODE_FLAT_COMPRESSION_LEGACY) {
vi->z_advise = 0;
vi->z_algorithmtype[0] = 0;
vi->z_algorithmtype[1] = 0;
@@ -49,7 +51,8 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
if (test_bit(EROFS_I_Z_INITED_BIT, &vi->flags))
goto out_unlock;

- DBG_BUGON(vi->datalayout == EROFS_INODE_FLAT_COMPRESSION_LEGACY);
+ DBG_BUGON(!erofs_sb_has_big_pcluster(EROFS_SB(sb)) &&
+ vi->datalayout == EROFS_INODE_FLAT_COMPRESSION_LEGACY);

pos = ALIGN(iloc(EROFS_SB(sb), vi->nid) + vi->inode_isize +
vi->xattr_isize, 8);
@@ -96,7 +99,7 @@ struct z_erofs_maprecorder {
u8 type;
u16 clusterofs;
u16 delta[2];
- erofs_blk_t pblk;
+ erofs_blk_t pblk, compressedlcs;
};

static int z_erofs_reload_indexes(struct z_erofs_maprecorder *m,
@@ -159,6 +162,15 @@ static int legacy_load_cluster_from_disk(struct z_erofs_maprecorder *m,
case Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD:
m->clusterofs = 1 << vi->z_logical_clusterbits;
m->delta[0] = le16_to_cpu(di->di_u.delta[0]);
+ if (m->delta[0] & Z_EROFS_VLE_DI_D0_CBLKCNT) {
+ if (!(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1)) {
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+ }
+ m->compressedlcs = m->delta[0] &
+ ~Z_EROFS_VLE_DI_D0_CBLKCNT;
+ m->delta[0] = 1;
+ }
m->delta[1] = le16_to_cpu(di->di_u.delta[1]);
break;
case Z_EROFS_VLE_CLUSTER_TYPE_PLAIN:
@@ -366,6 +378,58 @@ static int z_erofs_extent_lookback(struct z_erofs_maprecorder *m,
return 0;
}

+static int z_erofs_get_extent_compressedlen(struct z_erofs_maprecorder *m,
+ unsigned int initial_lcn)
+{
+ struct erofs_inode *const vi = EROFS_I(m->inode);
+ struct erofs_map_blocks *const map = m->map;
+ const unsigned int lclusterbits = vi->z_logical_clusterbits;
+ unsigned long lcn;
+ int err;
+
+ DBG_BUGON(m->type != Z_EROFS_VLE_CLUSTER_TYPE_PLAIN &&
+ m->type != Z_EROFS_VLE_CLUSTER_TYPE_HEAD);
+ if (!(map->m_flags & EROFS_MAP_ZIPPED) ||
+ !(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1)) {
+ map->m_plen = 1 << lclusterbits;
+ return 0;
+ }
+
+ lcn = m->lcn + 1;
+ if (m->compressedlcs)
+ goto out;
+ if (lcn == initial_lcn)
+ goto err_bonus_cblkcnt;
+
+ err = z_erofs_load_cluster_from_disk(m, lcn);
+ if (err)
+ return err;
+
+ switch (m->type) {
+ case Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD:
+ if (m->delta[0] != 1)
+ goto err_bonus_cblkcnt;
+ if (m->compressedlcs)
+ break;
+ fallthrough;
+ default:
+ erofs_err(m->inode->i_sb,
+ "cannot found CBLKCNT @ lcn %lu of nid %llu",
+ lcn, vi->nid);
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+ }
+out:
+ map->m_plen = m->compressedlcs << lclusterbits;
+ return 0;
+err_bonus_cblkcnt:
+ erofs_err(m->inode->i_sb,
+ "bogus CBLKCNT @ lcn %lu of nid %llu",
+ lcn, vi->nid);
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+}
+
int z_erofs_map_blocks_iter(struct inode *inode,
struct erofs_map_blocks *map,
int flags)
@@ -377,6 +441,7 @@ int z_erofs_map_blocks_iter(struct inode *inode,
};
int err = 0;
unsigned int lclusterbits, endoff;
+ unsigned long initial_lcn;
unsigned long long ofs, end;

trace_z_erofs_map_blocks_iter_enter(inode, map, flags);
@@ -395,10 +460,10 @@ int z_erofs_map_blocks_iter(struct inode *inode,

lclusterbits = vi->z_logical_clusterbits;
ofs = map->m_la;
- m.lcn = ofs >> lclusterbits;
+ initial_lcn = ofs >> lclusterbits;
endoff = ofs & ((1 << lclusterbits) - 1);

- err = z_erofs_load_cluster_from_disk(&m, m.lcn);
+ err = z_erofs_load_cluster_from_disk(&m, initial_lcn);
if (err)
goto unmap_out;

@@ -442,10 +507,12 @@ int z_erofs_map_blocks_iter(struct inode *inode,
}

map->m_llen = end - map->m_la;
- map->m_plen = 1 << lclusterbits;
map->m_pa = blknr_to_addr(m.pblk);
map->m_flags |= EROFS_MAP_MAPPED;

+ err = z_erofs_get_extent_compressedlen(&m, initial_lcn);
+ if (err)
+ goto out;
unmap_out:
if (m.kaddr)
kunmap_atomic(m.kaddr);
--
2.20.1

2021-04-01 03:38:13

by Gao Xiang

Subject: [PATCH v2 10/10] erofs: enable big pcluster feature

From: Gao Xiang <[email protected]>

Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are
all settled properly.

Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/erofs_fs.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index ecc3a0ea0bc4..8739d3adf51f 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -20,7 +20,10 @@
#define EROFS_FEATURE_INCOMPAT_LZ4_0PADDING 0x00000001
#define EROFS_FEATURE_INCOMPAT_COMPR_CFGS 0x00000002
#define EROFS_FEATURE_INCOMPAT_BIG_PCLUSTER 0x00000002
-#define EROFS_ALL_FEATURE_INCOMPAT EROFS_FEATURE_INCOMPAT_LZ4_0PADDING
+#define EROFS_ALL_FEATURE_INCOMPAT \
+ (EROFS_FEATURE_INCOMPAT_LZ4_0PADDING | \
+ EROFS_FEATURE_INCOMPAT_COMPR_CFGS | \
+ EROFS_FEATURE_INCOMPAT_BIG_PCLUSTER)

#define EROFS_SB_EXTSLOT_SIZE 16

--
2.20.1

2021-04-01 03:38:25

by Gao Xiang

Subject: [PATCH v2 08/10] erofs: support parsing big pcluster compact indexes

From: Gao Xiang <[email protected]>

Different from non-compact indexes, several lclusters are packed
in compact form at once and a unique base blkaddr is stored for
each pack, so each lcluster index takes less space on average
(e.g. 2 bytes for COMPACT_2B). Btw, that is also why the BIG_PCLUSTER
switch should be consistent for compact head0/1.

Prior to big pcluster, the size of all pclusters is 1 lcluster.
Therefore, when a new HEAD lcluster was scanned, blkaddr would be
bumped by 1 lcluster. However, that way doesn't work anymore for
big pcluster since we actually don't know the compressed size of
pclusters in advance (before reading CBLKCNT).

So, instead, let blkaddr of each pack be the first pcluster blkaddr
with a valid CBLKCNT, in detail,

1) if CBLKCNT starts at the pack, this first valid pcluster is
itself, e.g.
 _____________________________________________________________
|_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
^ = blkaddr base          ^ += CBLKCNT0           ^ += CBLKCNT1

2) if CBLKCNT doesn't start at the pack, the first valid pcluster
is the next pcluster, e.g.
 _________________________________________________________
| NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
               ^ = blkaddr base        ^ += CBLKCNT0
                                              ^ += 1

When a CBLKCNT is found, blkaddr will be increased by CBLKCNT
lclusters; if a new HEAD is found immediately instead, bump blkaddr by 1
(see the pictures above).

Also note that if CBLKCNT is at the end of the pack, instead of storing
delta1 (distance to the next HEAD lcluster) as normal NONHEADs do,
it still stores the compressed block count (delta0), since delta1
can be calculated indirectly but the block count can't.

Adjust decoding logic to fit big pcluster compact indexes as well.

Signed-off-by: Gao Xiang <[email protected]>
---
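
The blkaddr rule in the commit message above can be sketched as a
simple forward walk over one pack. This is only an illustration of the
rule, not the kernel algorithm (unpack_compacted_index() scans backwards
from the requested lcluster and works on bit-packed entries). The types
`enum lc_type' and `struct lc_entry' and the function
pack_head_blkaddrs() are hypothetical:

```c
#include <stdint.h>

/* Flag value taken from patch 05/10 of this series. */
#define Z_EROFS_VLE_DI_D0_CBLKCNT	(1 << 11)

enum lc_type { LC_HEAD, LC_NONHEAD };	/* hypothetical, simplified types */

struct lc_entry {
	enum lc_type type;
	uint16_t lo;	/* NONHEAD: delta0, possibly carrying CBLKCNT */
};

/*
 * Fill blk[i] with the start block of every HEAD in the pack;
 * slots of NONHEAD entries are left untouched.
 */
static void pack_head_blkaddrs(const struct lc_entry *e, int n,
			       uint32_t base, uint32_t *blk)
{
	uint32_t cur = base;
	int head_open = 0;	/* previous HEAD has not seen a CBLKCNT yet */
	int i;

	for (i = 0; i < n; ++i) {
		if (e[i].type == LC_HEAD) {
			if (head_open)
				cur += 1;	/* HEAD right after HEAD */
			blk[i] = cur;
			head_open = 1;
		} else if (e[i].lo & Z_EROFS_VLE_DI_D0_CBLKCNT) {
			cur += e[i].lo & ~Z_EROFS_VLE_DI_D0_CBLKCNT;
			head_open = 0;
		}
	}
}
```

Note that the sketch does not validate its input: a HEAD followed by a
plain NONHEAD without CBLKCNT is rejected as corrupted by the real code
when big pcluster is enabled, but is silently tolerated here.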
fs/erofs/zmap.c | 63 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index d34ff810cc15..545cd5989e6a 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -77,6 +77,13 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
}

vi->z_logical_clusterbits = LOG_BLOCK_SIZE + (h->h_clusterbits & 7);
+ if (vi->datalayout == EROFS_INODE_FLAT_COMPRESSION &&
+ !(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1) ^
+ !(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
+ erofs_err(sb, "big pcluster head1/2 of compact indexes should be consistent for nid %llu",
+ vi->nid);
+ return -EFSCORRUPTED;
+ }
/* paired with smp_mb() at the beginning of the function */
smp_mb();
set_bit(EROFS_I_Z_INITED_BIT, &vi->flags);
@@ -207,6 +214,7 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
unsigned int vcnt, base, lo, encodebits, nblk;
int i;
u8 *in, type;
+ bool big_pcluster;

if (1 << amortizedshift == 4)
vcnt = 2;
@@ -215,6 +223,7 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
else
return -EOPNOTSUPP;

+ big_pcluster = vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1;
encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt;
base = round_down(eofs, vcnt << amortizedshift);
in = m->kaddr + base;
@@ -226,7 +235,15 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
m->type = type;
if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD) {
m->clusterofs = 1 << lclusterbits;
- if (i + 1 != vcnt) {
+ if (lo & Z_EROFS_VLE_DI_D0_CBLKCNT) {
+ if (!big_pcluster) {
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+ }
+ m->compressedlcs = lo & ~Z_EROFS_VLE_DI_D0_CBLKCNT;
+ m->delta[0] = 1;
+ return 0;
+ } else if (i + 1 != (int)vcnt) {
m->delta[0] = lo;
return 0;
}
@@ -239,22 +256,48 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
in, encodebits * (i - 1), &type);
if (type != Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD)
lo = 0;
+ else if (lo & Z_EROFS_VLE_DI_D0_CBLKCNT)
+ lo = 1;
m->delta[0] = lo + 1;
return 0;
}
m->clusterofs = lo;
m->delta[0] = 0;
/* figout out blkaddr (pblk) for HEAD lclusters */
- nblk = 1;
- while (i > 0) {
- --i;
- lo = decode_compactedbits(lclusterbits, lomask,
- in, encodebits * i, &type);
- if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD)
- i -= lo;
-
- if (i >= 0)
+ if (!big_pcluster) {
+ nblk = 1;
+ while (i > 0) {
+ --i;
+ lo = decode_compactedbits(lclusterbits, lomask,
+ in, encodebits * i, &type);
+ if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD)
+ i -= lo;
+
+ if (i >= 0)
+ ++nblk;
+ }
+ } else {
+ nblk = 0;
+ while (i > 0) {
+ --i;
+ lo = decode_compactedbits(lclusterbits, lomask,
+ in, encodebits * i, &type);
+ if (type == Z_EROFS_VLE_CLUSTER_TYPE_NONHEAD) {
+ if (lo & Z_EROFS_VLE_DI_D0_CBLKCNT) {
+ --i;
+ nblk += lo & ~Z_EROFS_VLE_DI_D0_CBLKCNT;
+ continue;
+ }
+ /* bigpcluster shouldn't have plain d0 == 1 */
+ if (lo <= 1) {
+ DBG_BUGON(1);
+ return -EFSCORRUPTED;
+ }
+ i -= lo - 2;
+ continue;
+ }
++nblk;
+ }
}
in += (vcnt << amortizedshift) - sizeof(__le32);
m->pblk = le32_to_cpu(*(__le32 *)in) + nblk;
--
2.20.1

2021-04-01 03:38:25

by Gao Xiang

Subject: [PATCH v2 09/10] erofs: support decompress big pcluster for lz4 backend

From: Gao Xiang <[email protected]>

Prior to big pcluster, there was only one compressed page so it was
easy to map this. However, when big pcluster is enabled, more work
needs to be done to handle multiple compressed pages. In detail,

- (maptype 0) if there is only one compressed page + no need
to copy inplace I/O, just map it directly as we did before;

- (maptype 1) if there are more compressed pages + no need to
copy inplace I/O, vmap such compressed pages instead;

- (maptype 2) if inplace I/O needs to be copied, use per-CPU
buffers for decompression then.

Another thing is how to detect whether inplace decompression is feasible
or not (it's still quite easy for non big pclusters). Apart from the
inplace margin calculation, the inplace I/O page reusing order also
needs to be considered for each compressed page. Currently, if the
compressed page is the xth page, it shouldn't be reused as [0 ...
nrpages_out - nrpages_in + x], otherwise a full copy will be triggered.

Although there are some extra optimization ideas for this, I'd like
to make big pcluster work correctly first and obviously it can be
further optimized later since it has nothing to do with the on-disk
format at all.

Signed-off-by: Gao Xiang <[email protected]>
---
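
The three map types and the reuse-order check described above can be
sketched in userspace C as below. This is a simplified illustration:
it leaves out the 0padding and LZ4 inplace margin conditions, assumes
nrpages_out >= nrpages_in, and uses the exclusive upper bound of the
loop in the diff (j < nrpages_out - nrpages_in + i). choose_maptype(),
inplace_order_ok() and the MAPTYPE_* names are hypothetical:

```c
#include <stdbool.h>

enum { MAPTYPE_DIRECT = 0, MAPTYPE_VMAP = 1, MAPTYPE_PCPUBUF = 2 };

/* needs_copy: inplace I/O pages could be overwritten before being read. */
static int choose_maptype(unsigned int nrpages_in, bool needs_copy)
{
	if (needs_copy)
		return MAPTYPE_PCPUBUF;	/* copy into per-CPU buffers */
	return nrpages_in <= 1 ? MAPTYPE_DIRECT : MAPTYPE_VMAP;
}

/*
 * Reuse-order check: compressed page i must not appear among output
 * pages [0, nrpages_out - nrpages_in + i), otherwise decompression
 * could overwrite it before it is consumed.
 * Assumes nrpages_out >= nrpages_in.
 */
static bool inplace_order_ok(void *const *in, unsigned int nrpages_in,
			     void *const *out, unsigned int nrpages_out)
{
	unsigned int i, j;

	for (i = 0; i < nrpages_in; ++i)
		for (j = 0; j < nrpages_out - nrpages_in + i; ++j)
			if (out[j] == in[i])
				return false;
	return true;
}
```

For instance, with 4 output pages and 2 compressed pages, reusing the
last two output pages in the same order passes the check, while
swapping them forces the copy path (maptype 2).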
fs/erofs/decompressor.c | 202 ++++++++++++++++++++++++----------------
1 file changed, 122 insertions(+), 80 deletions(-)

diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
index 5d9f9dbd3681..c7b1d3fe8184 100644
--- a/fs/erofs/decompressor.c
+++ b/fs/erofs/decompressor.c
@@ -116,44 +116,87 @@ static int z_erofs_lz4_prepare_destpages(struct z_erofs_decompress_req *rq,
return kaddr ? 1 : 0;
}

-static void *generic_copy_inplace_data(struct z_erofs_decompress_req *rq,
- u8 *src, unsigned int pageofs_in)
+static void *z_erofs_handle_inplace_io(struct z_erofs_decompress_req *rq,
+ void *inpage, unsigned int *inputmargin, int *maptype,
+ bool support_0padding)
{
- /*
- * if in-place decompression is ongoing, those decompressed
- * pages should be copied in order to avoid being overlapped.
- */
- struct page **in = rq->in;
- u8 *const tmp = erofs_get_pcpubuf(1);
- u8 *tmpp = tmp;
- unsigned int inlen = rq->inputsize - pageofs_in;
- unsigned int count = min_t(uint, inlen, PAGE_SIZE - pageofs_in);
-
- while (tmpp < tmp + inlen) {
- if (!src)
- src = kmap_atomic(*in);
- memcpy(tmpp, src + pageofs_in, count);
- kunmap_atomic(src);
- src = NULL;
- tmpp += count;
- pageofs_in = 0;
- count = PAGE_SIZE;
+ unsigned int nrpages_in, nrpages_out;
+ unsigned int ofull, oend, inputsize, total, i, j;
+ struct page **in;
+ void *src, *tmp;
+
+ inputsize = rq->inputsize;
+ nrpages_in = PAGE_ALIGN(inputsize) >> PAGE_SHIFT;
+ oend = rq->pageofs_out + rq->outputsize;
+ ofull = PAGE_ALIGN(oend);
+ nrpages_out = ofull >> PAGE_SHIFT;
+
+ if (rq->inplace_io) {
+ if (rq->partial_decoding || !support_0padding ||
+ ofull - oend < LZ4_DECOMPRESS_INPLACE_MARGIN(inputsize))
+ goto docopy;
+
+ for (i = 0; i < nrpages_in; ++i) {
+ DBG_BUGON(rq->in[i] == NULL);
+ for (j = 0; j < nrpages_out - nrpages_in + i; ++j)
+ if (rq->out[j] == rq->in[i])
+ goto docopy;
+ }
+ }
+
+ if (nrpages_in <= 1) {
+ *maptype = 0;
+ return inpage;
+ }
+ kunmap_atomic(inpage);
+ might_sleep();
+ while (1) {
+ src = vm_map_ram(rq->in, nrpages_in, -1);
+ /* retry two more times (totally 3 times) */
+ if (src || ++i >= 3)
+ break;
+ vm_unmap_aliases();
+ }
+ *maptype = 1;
+ return src;
+docopy:
+ /* Or copy compressed data which can be overlapped to per-CPU buffer */
+ in = rq->in;
+ src = erofs_get_pcpubuf(nrpages_in);
+ if (!src) {
+ DBG_BUGON(1);
+ return ERR_PTR(-EFAULT);
+ }
+
+ tmp = src;
+ total = rq->inputsize;
+ while (total) {
+ unsigned int page_copycnt =
+ min_t(unsigned int, total, PAGE_SIZE - *inputmargin);
+
+ if (!inpage)
+ inpage = kmap_atomic(*in);
+ memcpy(tmp, inpage + *inputmargin, page_copycnt);
+ kunmap_atomic(inpage);
+ inpage = NULL;
+ tmp += page_copycnt;
+ total -= page_copycnt;
++in;
+ *inputmargin = 0;
}
- return tmp;
+ *maptype = 2;
+ return src;
}

static int z_erofs_lz4_decompress(struct z_erofs_decompress_req *rq, u8 *out)
{
- unsigned int inputmargin, inlen;
- u8 *src;
- bool copied, support_0padding;
- int ret;
+ unsigned int inputmargin;
+ u8 *headpage, *src;
+ bool support_0padding;
+ int ret, maptype;

- if (rq->inputsize > PAGE_SIZE)
- return -EOPNOTSUPP;
-
- src = kmap_atomic(*rq->in);
+ DBG_BUGON(*rq->in == NULL);
+ headpage = kmap_atomic(*rq->in);
inputmargin = 0;
support_0padding = false;

@@ -161,50 +204,39 @@ static int z_erofs_lz4_decompress(struct z_erofs_decompress_req *rq, u8 *out)
if (erofs_sb_has_lz4_0padding(EROFS_SB(rq->sb))) {
support_0padding = true;

- while (!src[inputmargin & ~PAGE_MASK])
+ while (!headpage[inputmargin & ~PAGE_MASK])
if (!(++inputmargin & ~PAGE_MASK))
break;

if (inputmargin >= rq->inputsize) {
- kunmap_atomic(src);
+ kunmap_atomic(headpage);
return -EIO;
}
}

- copied = false;
- inlen = rq->inputsize - inputmargin;
- if (rq->inplace_io) {
- const uint oend = (rq->pageofs_out +
- rq->outputsize) & ~PAGE_MASK;
- const uint nr = PAGE_ALIGN(rq->pageofs_out +
- rq->outputsize) >> PAGE_SHIFT;
-
- if (rq->partial_decoding || !support_0padding ||
- rq->out[nr - 1] != rq->in[0] ||
- rq->inputsize - oend <
- LZ4_DECOMPRESS_INPLACE_MARGIN(inlen)) {
- src = generic_copy_inplace_data(rq, src, inputmargin);
- inputmargin = 0;
- copied = true;
- }
+ rq->inputsize -= inputmargin;
+ src = z_erofs_handle_inplace_io(rq, headpage, &inputmargin, &maptype,
+ support_0padding);
+ if (IS_ERR(src)) {
+ kunmap_atomic(headpage);
+ return PTR_ERR(src);
}

/* legacy format could compress extra data in a pcluster. */
if (rq->partial_decoding || !support_0padding)
ret = LZ4_decompress_safe_partial(src + inputmargin, out,
- inlen, rq->outputsize,
- rq->outputsize);
+ rq->inputsize, rq->outputsize, rq->outputsize);
else
ret = LZ4_decompress_safe(src + inputmargin, out,
- inlen, rq->outputsize);
+ rq->inputsize, rq->outputsize);

if (ret != rq->outputsize) {
erofs_err(rq->sb, "failed to decompress %d in[%u, %u] out[%u]",
- ret, inlen, inputmargin, rq->outputsize);
+ ret, rq->inputsize, inputmargin, rq->outputsize);

WARN_ON(1);
print_hex_dump(KERN_DEBUG, "[ in]: ", DUMP_PREFIX_OFFSET,
- 16, 1, src + inputmargin, inlen, true);
+ 16, 1, src + inputmargin, rq->inputsize, true);
print_hex_dump(KERN_DEBUG, "[out]: ", DUMP_PREFIX_OFFSET,
16, 1, out, rq->outputsize, true);

@@ -213,10 +245,16 @@ static int z_erofs_lz4_decompress(struct z_erofs_decompress_req *rq, u8 *out)
ret = -EIO;
}

- if (copied)
- erofs_put_pcpubuf(src);
- else
+ if (maptype == 0) {
kunmap_atomic(src);
+ } else if (maptype == 1) {
+ vm_unmap_ram(src, PAGE_ALIGN(rq->inputsize) >> PAGE_SHIFT);
+ } else if (maptype == 2) {
+ erofs_put_pcpubuf(src);
+ } else {
+ DBG_BUGON(1);
+ return -EFAULT;
+ }
return ret;
}

@@ -268,33 +306,37 @@ static int z_erofs_decompress_generic(struct z_erofs_decompress_req *rq,
void *dst;
int ret, i;

- if (nrpages_out == 1 && !rq->inplace_io) {
- DBG_BUGON(!*rq->out);
- dst = kmap_atomic(*rq->out);
- dst_maptype = 0;
- goto dstmap_out;
- }
+ /* two optimized fast paths only for non bigpcluster cases yet */
+ if (rq->inputsize <= PAGE_SIZE) {
+ if (nrpages_out == 1 && !rq->inplace_io) {
+ DBG_BUGON(!*rq->out);
+ dst = kmap_atomic(*rq->out);
+ dst_maptype = 0;
+ goto dstmap_out;
+ }

- /*
- * For the case of small output size (especially much less
- * than PAGE_SIZE), memcpy the decompressed data rather than
- * compressed data is preferred.
- */
- if (rq->outputsize <= PAGE_SIZE * 7 / 8) {
- dst = erofs_get_pcpubuf(1);
- if (IS_ERR(dst))
- return PTR_ERR(dst);
-
- rq->inplace_io = false;
- ret = alg->decompress(rq, dst);
- if (!ret)
- copy_from_pcpubuf(rq->out, dst, rq->pageofs_out,
- rq->outputsize);
-
- erofs_put_pcpubuf(dst);
- return ret;
+ /*
+ * For the case of small output size (especially much less
+ * than PAGE_SIZE), memcpy the decompressed data rather than
+ * compressed data is preferred.
+ */
+ if (rq->outputsize <= PAGE_SIZE * 7 / 8) {
+ dst = erofs_get_pcpubuf(1);
+ if (IS_ERR(dst))
+ return PTR_ERR(dst);
+
+ rq->inplace_io = false;
+ ret = alg->decompress(rq, dst);
+ if (!ret)
+ copy_from_pcpubuf(rq->out, dst, rq->pageofs_out,
+ rq->outputsize);
+
+ erofs_put_pcpubuf(dst);
+ return ret;
+ }
}

+ /* general decoding path which can be used for all cases */
ret = alg->prepare_destpages(rq, pagepool);
if (ret < 0) {
return ret;
--
2.20.1
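
The overlap rule that z_erofs_handle_inplace_io() applies in the patch above can be tried out in userspace. What follows is only a minimal sketch under stated assumptions: the `sketch_*` names and the 4 KiB page size are hypothetical, pages are modelled as plain pointers, and the caller is assumed to have already established that in-place I/O is in use and that the output spans at least as many pages as the input. The margin formula is the one lz4.h defines for LZ4_DECOMPRESS_INPLACE_MARGIN().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
/* same formula as LZ4_DECOMPRESS_INPLACE_MARGIN() in lz4.h */
#define SKETCH_INPLACE_MARGIN(srcsize)	(((srcsize) >> 8) + 32)

static unsigned int sketch_page_align(unsigned int x)
{
	return (x + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: returns true when the compressed pages have to be
 * copied aside before decompression (the "docopy" case in the patch),
 * false when decompressing in place is considered safe.
 * Assumes in-place I/O is in use and nrpages_out >= nrpages_in.
 */
static bool sketch_needs_copy(void *const *in, void *const *out,
			      unsigned int inputsize,
			      unsigned int pageofs_out,
			      unsigned int outputsize,
			      bool partial_decoding, bool support_0padding)
{
	unsigned int nrpages_in = sketch_page_align(inputsize) / SKETCH_PAGE_SIZE;
	unsigned int oend = pageofs_out + outputsize;
	unsigned int ofull = sketch_page_align(oend);
	unsigned int nrpages_out = ofull / SKETCH_PAGE_SIZE;
	unsigned int i, j;

	/* not enough slack after the output end: in-place is unsafe */
	if (partial_decoding || !support_0padding ||
	    ofull - oend < SKETCH_INPLACE_MARGIN(inputsize))
		return true;

	/*
	 * The i-th compressed page must not be reused as any output page
	 * in [0, nrpages_out - nrpages_in + i), otherwise it would be
	 * overwritten before it has been consumed.
	 */
	for (i = 0; i < nrpages_in; ++i)
		for (j = 0; j < nrpages_out - nrpages_in + i; ++j)
			if (out[j] == in[i])
				return true;
	return false;
}
```

With four output pages and two compressed pages, reusing the last two output pages as input passes the check, while reusing an earlier page forces the copy path.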

2021-04-03 03:18:54

by Gao Xiang

Subject: Re: [PATCH v2 00/10] erofs: add big pcluster compression support

On Thu, Apr 01, 2021 at 11:29:44AM +0800, Gao Xiang wrote:
> Hi folks,
>
> This is the formal version of EROFS big pcluster support, which means
> EROFS can compress data into more than 1 fs block after this patchset.
>
> {l,p}cluster are EROFS-specific concepts, standing for `logical cluster'
> and `physical cluster' correspondingly. Logical cluster is the basic unit
> of compress indexes in file logical mapping, e.g. it can build compress
> indexes in 2 blocks rather than 1 block (currently only 1 block lcluster
> is supported). Physical cluster is a container of physical compressed
> blocks which contains compressed data, the size of which is the multiple
> of lclustersize.
>
> Different from previous thoughts, which had fixed-sized pclusterblks
> recorded in the on-disk compress index header, our on-disk design allows
> variable-sized pclusterblks now. The main reasons are
> - user data varies in compression ratio locally, so fixed-sized
> clustersize approach is space-wasting and causes extra read
> amplification for high CR cases;
>
> - inplace decompression needs zero padding to guarantee its safe margin,
> but we don't want to pad more than 1 fs block for big pcluster;
>
> - end users can now customize the pcluster size according to data type
> since various pclustersize can exist in a file, for example, using
> different pcluster sizes for executable code and one-shot data. Such a
> design should be more flexible than many other public compression fses
> (Btw, each file in EROFS can have maximum 2 algorithms at the same time
> by using HEAD1/2, which will be formally added with LZMA support.)
>
> In brief, EROFS can now compress from variable-sized input to
> variable-sized pcluster blocks, as illustrated below:
>
> |<-_lcluster_->|________________________|<-_lcluster_->|
> |____._________|_________ .. ___________|_______.______|
> . .
> . .
> .__________________________________.
> |______________| .. |______________|
> |<- pcluster ->|
>
> The next step would be how to record the compressed block count in
> lclusters. In compress indexes, there are 2 concepts called HEAD and
> NONHEAD lclusters. The difference is that HEAD lcluster starts a new
> pcluster in the lcluster, but a NONHEAD lcluster does not. It's easy to
> see that a big pcluster has at least 2 compressed blocks, thus at least
> 2 lclusters as well.
>
> Therefore, let the delta0 (distance to its HEAD lcluster) of the first
> NONHEAD compress index store the compressed block count with a special
> flag, as a new kind of compress index called CBLKCNT. Its real delta0 is
> known to be constantly 1, so nothing is lost, as illustrated below:
> ________________________________________________________
> |_HEAD_|_CBLKCNT_|_NONHEAD_|_..._|_NONHEAD_|_HEAD | HEAD |
> |<------ a pcluster with CBLKCNT --------->|<-- -->|
> ^ a pcluster with 1
>
> If another HEAD follows a HEAD lcluster, there is no room to record
> CBLKCNT, but it's easy to know the size of pcluster will be 1.
>
> More implementation details about this and compact indexes are in the
> commit message.
>
> On the runtime performance side, the current EROFS test results are:
>  ________________________________________________________________
> |  file system  |   size    | seq read | rand read | rand9m read |
> |_______________|___________|_ MiB/s __|__ MiB/s __|___ MiB/s ___|
> |___erofs_4k____|_556879872_|_ 781.4 __|__ 55.3 ___|___ 25.3 ____|
> |___erofs_16k___|_452509696_|_ 864.8 __|_ 123.2 ___|___ 20.8 ____|
> |___erofs_32k___|_415223808_|_ 899.8 __|_ 105.8 _*_|___ 16.8 ____|
> |___erofs_64k___|_393814016_|_ 906.6 __|__ 66.6 _*_|___ 11.8 ____|
> |__squashfs_8k__|_556191744_|__ 64.9 __|__ 19.3 ___|____ 9.1 ____|
> |__squashfs_16k_|_502661120_|__ 98.9 __|__ 38.0 ___|____ 9.8 ____|
> |__squashfs_32k_|_458784768_|_ 115.4 __|__ 71.6 _*_|___ 10.0 ____|
> |_squashfs_128k_|_398204928_|_ 257.2 __|_ 253.8 _*_|___ 10.9 ____|
> |____ext4_4k____|____()_____|_ 786.6 __|__ 28.6 ___|___ 27.8 ____|
>
>
> * Squashfs grabs more page cache to keep all decompressed data with
> grab_cache_page_nowait() than the normal requested readahead (see
> squashfs_copy_cache and squashfs_readpage_block).
> In principle, EROFS can also cache all such decompressed data
> if necessary, yet it's low priority for now and has little use
> (rand9m is actually a better rand read workload, since the amount
> of I/O is 9m rather than full-sized 1000m).
>
> More details are in
> https://lore.kernel.org/r/[email protected]
>
> Also, since EROFS is not a fixed pcluster design, users can apply
> several optimized strategies according to data type at mkfs time.
> And there is still room to optimize runtime performance for big pcluster
> even further.
>
> Finally, it passes ro_fsstress and can also successfully boot buildroot
> & Android system with android-mainline repo.
>
> current mkfs repo for big pcluster:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental-bigpcluster-compact
>
> Thanks for your time on reading this!
>
> Thanks,
> Gao Xiang
>
> changes since v1:
> - add a missing vunmap in erofs_pcpubuf_exit();
> - refine comments and commit messages.
>
> (btw, I'll apply this patchset for -next first for further integration
> test, which will be aimed to 5.13-rc1.)
>

As a quick update, I've applied the following changes to the next version
(some minor fixes):

diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
index c7b1d3fe8184..fe46a9c34923 100644
--- a/fs/erofs/decompressor.c
+++ b/fs/erofs/decompressor.c
@@ -43,11 +43,15 @@ int z_erofs_load_lz4_config(struct super_block *sb,
distance = le16_to_cpu(lz4->max_distance);

sbi->lz4.max_pclusterblks = le16_to_cpu(lz4->max_pclusterblks);
- if (sbi->lz4.max_pclusterblks >
- Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ) {
- erofs_err(sb, "too large lz4 pcluster blocks %u",
+ if (!sbi->lz4.max_pclusterblks) {
+ sbi->lz4.max_pclusterblks = 1; /* reserved case */
+ } else if (sbi->lz4.max_pclusterblks >
+ Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ) {
+ erofs_err(sb, "too large lz4 pclusterblks %u",
sbi->lz4.max_pclusterblks);
return -EINVAL;
+ } else if (sbi->lz4.max_pclusterblks >= 2) {
+ erofs_info(sb, "EXPERIMENTAL big pcluster feature in use. Use at your own risk!");
}
} else {
distance = le16_to_cpu(dsb->u1.lz4_max_distance);
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index 545cd5989e6a..6fc8e7fdaef8 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -77,12 +77,21 @@ static int z_erofs_fill_inode_lazy(struct inode *inode)
}

vi->z_logical_clusterbits = LOG_BLOCK_SIZE + (h->h_clusterbits & 7);
+ if (!erofs_sb_has_big_pcluster(EROFS_SB(sb)) &&
+ vi->z_advise & (Z_EROFS_ADVISE_BIG_PCLUSTER_1 |
+ Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
+ erofs_err(sb, "per-inode big pcluster without sb feature for nid %llu",
+ vi->nid);
+ err = -EFSCORRUPTED;
+ goto unmap_done;
+ }
if (vi->datalayout == EROFS_INODE_FLAT_COMPRESSION &&
!(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_1) ^
!(vi->z_advise & Z_EROFS_ADVISE_BIG_PCLUSTER_2)) {
erofs_err(sb, "big pcluster head1/2 of compact indexes should be consistent for nid %llu",
vi->nid);
- return -EFSCORRUPTED;
+ err = -EFSCORRUPTED;
+ goto unmap_done;
}
/* paired with smp_mb() at the beginning of the function */
smp_mb();
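
The two checks touched by the diff above can be summarised in a standalone sketch. It only mirrors the logic visible in the hunks: a zero max_pclusterblks is normalised to 1, an oversized value is rejected, per-inode big pcluster bits are refused when the superblock feature is absent, and for compact indexes the HEAD1 and HEAD2 bits must agree. The `SKETCH_*` limit and bit values are hypothetical placeholders, not the kernel's constants.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical stand-in for Z_EROFS_PCLUSTER_MAX_SIZE / EROFS_BLKSIZ */
#define SKETCH_MAX_PCLUSTERBLKS		256u
/* hypothetical bit values for the two per-inode advise flags */
#define SKETCH_ADVISE_BIG_PCLUSTER_1	0x0002u
#define SKETCH_ADVISE_BIG_PCLUSTER_2	0x0004u

/*
 * Returns the in-memory max_pclusterblks (always >= 1), or 0 when the
 * on-disk value is too large and the mount should fail.
 */
static unsigned int sketch_normalize_max_pclusterblks(unsigned int ondisk)
{
	if (!ondisk)
		return 1;	/* reserved case: treated as 1 block */
	if (ondisk > SKETCH_MAX_PCLUSTERBLKS)
		return 0;
	return ondisk;
}

/* Returns false when the inode's advise bits should be treated as corrupted. */
static bool sketch_advise_ok(unsigned int advise, bool sb_big_pcluster,
			     bool compact_indexes)
{
	const unsigned int both = SKETCH_ADVISE_BIG_PCLUSTER_1 |
				  SKETCH_ADVISE_BIG_PCLUSTER_2;

	/* per-inode big pcluster bits need the superblock feature */
	if (!sb_big_pcluster && (advise & both))
		return false;

	/* compact indexes: HEAD1 and HEAD2 must be both set or both clear */
	if (compact_indexes &&
	    (!(advise & SKETCH_ADVISE_BIG_PCLUSTER_1) ^
	     !(advise & SKETCH_ADVISE_BIG_PCLUSTER_2)))
		return false;

	return true;
}
```

Note that in the actual diff both failure paths now go through the `unmap_done` label instead of returning directly; the sketch leaves that cleanup out.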

2021-04-03 03:54:57

by Chao Yu

Subject: Re: [PATCH v2 00/10] erofs: add big pcluster compression support

On 2021/4/1 11:29, Gao Xiang wrote:
> Hi folks,
>
> This is the formal version of EROFS big pcluster support, which means
> EROFS can compress data into more than 1 fs block after this patchset.
>
> {l,p}cluster are EROFS-specific concepts, standing for `logical cluster'
> and `physical cluster' correspondingly. Logical cluster is the basic unit
> of compress indexes in file logical mapping, e.g. it can build compress
> indexes in 2 blocks rather than 1 block (currently only 1 block lcluster
> is supported). Physical cluster is a container of physical compressed
> blocks which contains compressed data, the size of which is the multiple
> of lclustersize.
>
> Different from previous thoughts, which had fixed-sized pclusterblks
> recorded in the on-disk compress index header, our on-disk design allows
> variable-sized pclusterblks now. The main reasons are
> - user data varies in compression ratio locally, so fixed-sized
> clustersize approach is space-wasting and causes extra read
> amplification for high CR cases;
>
> - inplace decompression needs zero padding to guarantee its safe margin,
> but we don't want to pad more than 1 fs block for big pcluster;
>
> - end users can now customize the pcluster size according to data type
> since various pclustersize can exist in a file, for example, using
> different pcluster sizes for executable code and one-shot data. Such a
> design should be more flexible than many other public compression fses
> (Btw, each file in EROFS can have maximum 2 algorithms at the same time
> by using HEAD1/2, which will be formally added with LZMA support.)
>
> In brief, EROFS can now compress from variable-sized input to
> variable-sized pcluster blocks, as illustrated below:
>
> |<-_lcluster_->|________________________|<-_lcluster_->|
> |____._________|_________ .. ___________|_______.______|
> . .
> . .
> .__________________________________.
> |______________| .. |______________|
> |<- pcluster ->|
>
> The next step would be how to record the compressed block count in
> lclusters. In compress indexes, there are 2 concepts called HEAD and
> NONHEAD lclusters. The difference is that HEAD lcluster starts a new
> pcluster in the lcluster, but a NONHEAD lcluster does not. It's easy to
> see that a big pcluster has at least 2 compressed blocks, thus at least
> 2 lclusters as well.
>
> Therefore, let the delta0 (distance to its HEAD lcluster) of the first
> NONHEAD compress index store the compressed block count with a special
> flag, as a new kind of compress index called CBLKCNT. Its real delta0 is
> known to be constantly 1, so nothing is lost, as illustrated below:
> ________________________________________________________
> |_HEAD_|_CBLKCNT_|_NONHEAD_|_..._|_NONHEAD_|_HEAD | HEAD |
> |<------ a pcluster with CBLKCNT --------->|<-- -->|
> ^ a pcluster with 1
>
> If another HEAD follows a HEAD lcluster, there is no room to record
> CBLKCNT, but it's easy to know the size of pcluster will be 1.
>
> More implementation details about this and compact indexes are in the
> commit message.
>
> On the runtime performance side, the current EROFS test results are:
>  ________________________________________________________________
> |  file system  |   size    | seq read | rand read | rand9m read |
> |_______________|___________|_ MiB/s __|__ MiB/s __|___ MiB/s ___|
> |___erofs_4k____|_556879872_|_ 781.4 __|__ 55.3 ___|___ 25.3 ____|
> |___erofs_16k___|_452509696_|_ 864.8 __|_ 123.2 ___|___ 20.8 ____|
> |___erofs_32k___|_415223808_|_ 899.8 __|_ 105.8 _*_|___ 16.8 ____|
> |___erofs_64k___|_393814016_|_ 906.6 __|__ 66.6 _*_|___ 11.8 ____|
> |__squashfs_8k__|_556191744_|__ 64.9 __|__ 19.3 ___|____ 9.1 ____|
> |__squashfs_16k_|_502661120_|__ 98.9 __|__ 38.0 ___|____ 9.8 ____|
> |__squashfs_32k_|_458784768_|_ 115.4 __|__ 71.6 _*_|___ 10.0 ____|
> |_squashfs_128k_|_398204928_|_ 257.2 __|_ 253.8 _*_|___ 10.9 ____|
> |____ext4_4k____|____()_____|_ 786.6 __|__ 28.6 ___|___ 27.8 ____|
>
>
> * Squashfs grabs more page cache to keep all decompressed data with
> grab_cache_page_nowait() than the normal requested readahead (see
> squashfs_copy_cache and squashfs_readpage_block).
> In principle, EROFS can also cache all such decompressed data
> if necessary, yet it's low priority for now and has little use
> (rand9m is actually a better rand read workload, since the amount
> of I/O is 9m rather than full-sized 1000m).
>
> More details are in
> https://lore.kernel.org/r/[email protected]
>
> Also, since EROFS is not a fixed pcluster design, users can apply
> several optimized strategies according to data type at mkfs time.
> And there is still room to optimize runtime performance for big pcluster
> even further.
>
> Finally, it passes ro_fsstress and can also successfully boot buildroot
> & Android system with android-mainline repo.
>
> current mkfs repo for big pcluster:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental-bigpcluster-compact
>
> Thanks for your time on reading this!

Nice job!

Acked-by: Chao Yu <[email protected]>

Thanks,

>
> Thanks,
> Gao Xiang
>
> changes since v1:
> - add a missing vunmap in erofs_pcpubuf_exit();
> - refine comments and commit messages.
>
> (btw, I'll apply this patchset for -next first for further integration
> test, which will be aimed to 5.13-rc1.)
>
> Gao Xiang (10):
> erofs: reserve physical_clusterbits[]
> erofs: introduce multipage per-CPU buffers
> erofs: introduce physical cluster slab pools
> erofs: fix up inplace I/O pointer for big pcluster
> erofs: add big physical cluster definition
> erofs: adjust per-CPU buffers according to max_pclusterblks
> erofs: support parsing big pcluster compress indexes
> erofs: support parsing big pcluster compact indexes
> erofs: support decompress big pcluster for lz4 backend
> erofs: enable big pcluster feature
>
> fs/erofs/Kconfig | 14 ---
> fs/erofs/Makefile | 2 +-
> fs/erofs/decompressor.c | 216 +++++++++++++++++++++++++---------------
> fs/erofs/erofs_fs.h | 31 ++++--
> fs/erofs/internal.h | 31 ++----
> fs/erofs/pcpubuf.c | 134 +++++++++++++++++++++++++
> fs/erofs/super.c | 1 +
> fs/erofs/utils.c | 12 ---
> fs/erofs/zdata.c | 193 ++++++++++++++++++++++-------------
> fs/erofs/zdata.h | 14 +--
> fs/erofs/zmap.c | 155 ++++++++++++++++++++++------
> 11 files changed, 560 insertions(+), 243 deletions(-)
> create mode 100644 fs/erofs/pcpubuf.c
>