2018-11-20 16:01:04

by Gao Xiang

Subject: [PATCH 00/10] staging: erofs: decompression stability enhancement

Hi,

This patchset mainly focuses on erofs decompression stability; most of
these issues were found in our internal beta test. Some patches that
are still being tested or reviewed aren't included in this patchset
and will be sent after careful testing.

Thanks,
Gao Xiang

Gao Xiang (10):
staging: erofs: fix `trace_erofs_readpage' position
staging: erofs: fix race when the managed cache is enabled
staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup
staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
staging: erofs: add a full barrier in erofs_workgroup_unfreeze
staging: erofs: fix the definition of DBG_BUGON
staging: erofs: separate into init_once / always
staging: erofs: locked before registering for all new workgroups
staging: erofs: decompress asynchronously if PG_readahead page at
first
staging: erofs: rename strange variable names in z_erofs_vle_frontend

drivers/staging/erofs/internal.h | 69 ++++++++++++--------
drivers/staging/erofs/unzip_vle.c | 78 ++++++++++++++++-------
drivers/staging/erofs/utils.c | 131 ++++++++++++++++++++++++++------------
3 files changed, 190 insertions(+), 88 deletions(-)

--
2.14.4



2018-11-20 14:21:44

by Gao Xiang

Subject: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position

`trace_erofs_readpage' should be placed in .readpage()
rather than in the internal `z_erofs_do_read_page': it is
meant to trace the .readpage() interface (a sync read)
itself, not the helper shared by both .readpage() and
.readpages().

Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")

Reviewed-by: Chen Gong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/unzip_vle.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 6a283f618f46..ede3383ac601 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -598,8 +598,6 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
unsigned int cur, end, spiltted, index;
int err = 0;

- trace_erofs_readpage(page, false);
-
/* register locked file pages as online pages in pack */
z_erofs_onlinepage_init(page);

@@ -1288,6 +1286,8 @@ static int z_erofs_vle_normalaccess_readpage(struct file *file,
int err;
LIST_HEAD(pagepool);

+ trace_erofs_readpage(page, false);
+
#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
f.cachedzone_la = (erofs_off_t)page->index << PAGE_SHIFT;
#endif
--
2.14.4


2018-11-20 14:22:09

by Gao Xiang

Subject: [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup

It's better to use atomic_cond_read_relaxed(), which on ARM64 is
implemented with hardware instructions that monitor a variable for
changes, instead of open-coded busy waiting.
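
For reference (not part of the patch), atomic_cond_read_relaxed(v, c)
expands to smp_cond_load_relaxed(&(v)->counter, (c)). The generic
fallback of that helper is roughly the same busy-wait loop this patch
removes, while arm64 overrides it with an LDXR/WFE-based wait so the
CPU can sleep until the monitored cacheline actually changes:

#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
	typeof(ptr) __PTR = (ptr);			\
	typeof(*ptr) VAL;				\
	for (;;) {					\
		VAL = READ_ONCE(*__PTR);		\
		if (cond_expr)				\
			break;				\
		cpu_relax();				\
	}						\
	VAL;						\
})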

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/internal.h | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 89dbd0888e53..eb80ba44d072 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -221,23 +221,29 @@ static inline void erofs_workgroup_unfreeze(
preempt_enable();
}

+#if defined(CONFIG_SMP)
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+ return atomic_cond_read_relaxed(&grp->refcount,
+ VAL != EROFS_LOCKED_MAGIC);
+}
+#else
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+ int v = atomic_read(&grp->refcount);
+
+ /* workgroup is never frozen on uniprocessor systems */
+ DBG_BUGON(v == EROFS_LOCKED_MAGIC);
+ return v;
+}
+#endif
+
static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
{
- const int locked = (int)EROFS_LOCKED_MAGIC;
int o;

repeat:
- o = atomic_read(&grp->refcount);
-
- /* spin if it is temporarily locked at the reclaim path */
- if (unlikely(o == locked)) {
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
- do
- cpu_relax();
- while (atomic_read(&grp->refcount) == locked);
-#endif
- goto repeat;
- }
+ o = erofs_wait_on_workgroup_freezed(grp);

if (unlikely(o <= 0))
return -1;
--
2.14.4


2018-11-20 14:22:19

by Gao Xiang

Subject: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

There are two minor issues in the current freeze interface:

1) The freeze interfaces are unrelated to CONFIG_DEBUG_SPINLOCK,
so fix the incorrect preprocessor conditions;

2) For SMP platforms, preemption should also be disabled before
doing atomic_cmpxchg() in case some high-priority task preempts
between atomic_cmpxchg() and preempt_disable() and then spins
on the locked refcount later.
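
To make issue 2) concrete, here is a hypothetical interleaving on the
old SMP path (task names are made up for illustration):

Task A (low priority)                  Task B (high priority)
atomic_cmpxchg() succeeds,
refcount = EROFS_LOCKED_MAGIC
  <preempted before preempt_disable()>
                                       erofs_workgroup_get() spins on
                                       refcount == EROFS_LOCKED_MAGIC
                                       (Task A may not run again -> stall)

Disabling preemption before the atomic_cmpxchg() closes that window.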

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/internal.h | 41 ++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index eb80ba44d072..2e0ef92c138b 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -194,40 +194,49 @@ struct erofs_workgroup {

#define EROFS_LOCKED_MAGIC (INT_MIN | 0xE0F510CCL)

-static inline bool erofs_workgroup_try_to_freeze(
- struct erofs_workgroup *grp, int v)
+#if defined(CONFIG_SMP)
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+ int val)
{
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
- if (v != atomic_cmpxchg(&grp->refcount,
- v, EROFS_LOCKED_MAGIC))
- return false;
preempt_disable();
-#else
- preempt_disable();
- if (atomic_read(&grp->refcount) != v) {
+ if (val != atomic_cmpxchg(&grp->refcount, val, EROFS_LOCKED_MAGIC)) {
preempt_enable();
return false;
}
-#endif
return true;
}

-static inline void erofs_workgroup_unfreeze(
- struct erofs_workgroup *grp, int v)
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+ int orig_val)
{
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
- atomic_set(&grp->refcount, v);
-#endif
+ atomic_set(&grp->refcount, orig_val);
preempt_enable();
}

-#if defined(CONFIG_SMP)
static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
{
return atomic_cond_read_relaxed(&grp->refcount,
VAL != EROFS_LOCKED_MAGIC);
}
#else
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+ int val)
+{
+ preempt_disable();
+ /* no need to spin on UP platforms, let's just disable preemption. */
+ if (val != atomic_read(&grp->refcount)) {
+ preempt_enable();
+ return false;
+ }
+ return true;
+}
+
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+ int orig_val)
+{
+ preempt_enable();
+}
+
static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
{
int v = atomic_read(&grp->refcount);
--
2.14.4


2018-11-20 14:22:50

by Gao Xiang

Subject: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

Just like other generic locks, insert a full memory barrier so that
all values modified during the frozen period are visible to other
observers before the workgroup is unfrozen.
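
A sketch of the write-side ordering this patch adds, with a
hypothetical field access as an example:

	grp->field = new_val;			/* stores made while frozen */
	smp_mb();				/* order them before ... */
	atomic_set(&grp->refcount, orig_val);	/* ... the unfreeze store */

The intent, per the discussion later in this thread, is that an
observer spinning on the refcount does not see it restored before the
values modified during the frozen period.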

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/internal.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 2e0ef92c138b..f77653d33633 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -209,6 +209,7 @@ static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
int orig_val)
{
+ smp_mb();
atomic_set(&grp->refcount, orig_val);
preempt_enable();
}
--
2.14.4


2018-11-20 14:23:03

by Gao Xiang

Subject: [PATCH 06/10] staging: erofs: fix the definition of DBG_BUGON

It's better not to actively BUG_ON the kernel; however, developers
need a way to locate issues as soon as possible.

DBG_BUGON is introduced for that: it only crashes the kernel when
EROFS_FS_DEBUG (an EROFS development feature) is on, which helps
developers find and solve bugs quickly.

Previously, DBG_BUGON was defined as ((void)0) if EROFS_FS_DEBUG was
off, but unused-variable warnings such as the following could occur:

drivers/staging/erofs/unzip_vle.c: In function ‘init_always’:
drivers/staging/erofs/unzip_vle.c:61:33: warning: unused variable ‘work’ [-Wunused-variable]
struct z_erofs_vle_work *const work =
^~~~

Fix it to #define DBG_BUGON(x) ((void)(x)) so the argument is still
evaluated without any runtime effect.
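
For context, both sides of the definition then look roughly like this
(the debug branch is unchanged by this patch):

#ifdef CONFIG_EROFS_FS_DEBUG
#define DBG_BUGON(x)	BUG_ON(x)
#else
#define DBG_BUGON(x)	((void)(x))	/* evaluates x, so a variable used
					 * only inside DBG_BUGON() is no
					 * longer reported as unused */
#endif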

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index f77653d33633..0aa2a41b9885 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -39,7 +39,7 @@
#define debugln(x, ...) ((void)0)

#define dbg_might_sleep() ((void)0)
-#define DBG_BUGON(...) ((void)0)
+#define DBG_BUGON(x) ((void)(x))
#endif

enum {
--
2.14.4


2018-11-20 14:24:22

by Gao Xiang

Subject: [PATCH 09/10] staging: erofs: decompress asynchronously if PG_readahead page at first

For the case of nr_to_read == lookahead_size, it is better to
decompress asynchronously as well since no page will be needed immediately.
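
The condition added in the hunk further below can be restated as
follows (illustrative only, not part of the patch):

	/*
	 * `head' stays NULL until the first page is queued, so seeing a
	 * PG_readahead mark that early means the whole request is pure
	 * readahead.
	 */
	if (!head && PageReadahead(page))
		sync = false;	/* == sync &= !(PageReadahead(page) && !head) */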

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/unzip_vle.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index a1376f3c6065..824d2c12c2f3 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -1344,8 +1344,8 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
{
struct inode *const inode = mapping->host;
struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
- const bool sync = __should_decompress_synchronously(sbi, nr_pages);

+ bool sync = __should_decompress_synchronously(sbi, nr_pages);
struct z_erofs_vle_frontend f = VLE_FRONTEND_INIT(inode);
gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
struct page *head = NULL;
@@ -1363,6 +1363,13 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
prefetchw(&page->flags);
list_del(&page->lru);

+ /*
+ * A pure asynchronous readahead is indicated if
+ * a PG_readahead marked page is hit first.
+ * Let's also do asynchronous decompression for this case.
+ */
+ sync &= !(PageReadahead(page) && !head);
+
if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
list_add(&page->lru, &pagepool);
continue;
--
2.14.4


2018-11-20 14:24:32

by Gao Xiang

Subject: [PATCH 10/10] staging: erofs: rename strange variable names in z_erofs_vle_frontend

Previously, two members called `initial' and `cachedzone_la' were used
to apply the caching policy (i.e. whether the workgroup is at either
end of the request). Those names are hard to understand, so rename
them to `backmost' and `headoffset'.

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/unzip_vle.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 824d2c12c2f3..1ef178e7ac39 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -586,10 +586,9 @@ struct z_erofs_vle_frontend {

z_erofs_vle_owned_workgrp_t owned_head;

- bool initial;
-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
- erofs_off_t cachedzone_la;
-#endif
+ /* used for applying cache strategy on the fly */
+ bool backmost;
+ erofs_off_t headoffset;
};

#define VLE_FRONTEND_INIT(__i) { \
@@ -600,7 +599,7 @@ struct z_erofs_vle_frontend {
}, \
.builder = VLE_WORK_BUILDER_INIT(), \
.owned_head = Z_EROFS_VLE_WORKGRP_TAIL, \
- .initial = true, }
+ .backmost = true, }

static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
struct page *page,
@@ -643,7 +642,7 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
debugln("%s: [out-of-range] pos %llu", __func__, offset + cur);

if (z_erofs_vle_work_iter_end(builder))
- fe->initial = false;
+ fe->backmost = false;

map->m_la = offset + cur;
map->m_llen = 0;
@@ -669,8 +668,8 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
erofs_blknr(map->m_pa),
grp->compressed_pages, erofs_blknr(map->m_plen),
/* compressed page caching selection strategy */
- fe->initial | (EROFS_FS_ZIP_CACHE_LVL >= 2 ?
- map->m_la < fe->cachedzone_la : 0));
+ fe->backmost | (EROFS_FS_ZIP_CACHE_LVL >= 2 ?
+ map->m_la < fe->headoffset : 0));

if (noio_outoforder && builder_is_followed(builder))
builder->role = Z_EROFS_VLE_WORK_PRIMARY;
@@ -1316,9 +1315,8 @@ static int z_erofs_vle_normalaccess_readpage(struct file *file,

trace_erofs_readpage(page, false);

-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
- f.cachedzone_la = (erofs_off_t)page->index << PAGE_SHIFT;
-#endif
+ f.headoffset = (erofs_off_t)page->index << PAGE_SHIFT;
+
err = z_erofs_do_read_page(&f, page, &pagepool);
(void)z_erofs_vle_work_iter_end(&f.builder);

@@ -1354,9 +1352,8 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
trace_erofs_readpages(mapping->host, lru_to_page(pages),
nr_pages, false);

-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
- f.cachedzone_la = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
-#endif
+ f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
+
for (; nr_pages; --nr_pages) {
struct page *page = lru_to_page(pages);

--
2.14.4


2018-11-20 15:01:05

by Gao Xiang

Subject: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

When the managed cache is enabled, the last reference count
of a workgroup must be held by its workstation.

Otherwise, it could lead to incorrect (un)freezes in
the reclaim path, which would be harmful.

A typical race is as follows:

Thread 1 (In the reclaim path)   Thread 2
workgroup_freeze(grp, 1)                                refcnt = 1
...
workgroup_unfreeze(grp, 1)                              refcnt = 1
                                 workgroup_get(grp)     refcnt = 2 (x)
                                 workgroup_put(grp)     refcnt = 1 (x)
                                 ...unexpected behaviors

* grp is detached but still in use, which violates the cache-managed
freeze constraint.
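
Restated against the reclaim path below: with the managed cache, a
refcount of 1 now means "only the workstation holds a reference", so
the reclaimer may freeze at 1 and, once the workgroup is detached,
consume that last reference itself. A condensed sketch of the code
this patch adds:

	if (!erofs_workgroup_try_to_freeze(grp, 1))
		return false;		/* someone else still holds a ref */
	/* ... unlink all cached pages, delete from workstn_tree ... */
	erofs_workgroup_unfreeze_final(grp);	/* unfreeze to 0 and free */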

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/internal.h | 1 +
drivers/staging/erofs/utils.c | 131 +++++++++++++++++++++++++++------------
2 files changed, 93 insertions(+), 39 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 57575c7f5635..89dbd0888e53 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
}

#define __erofs_workgroup_get(grp) atomic_inc(&(grp)->refcount)
+#define __erofs_workgroup_put(grp) atomic_dec(&(grp)->refcount)

extern int erofs_workgroup_put(struct erofs_workgroup *grp);

diff --git a/drivers/staging/erofs/utils.c b/drivers/staging/erofs/utils.c
index ea8a962e5c95..90de8d9195b7 100644
--- a/drivers/staging/erofs/utils.c
+++ b/drivers/staging/erofs/utils.c
@@ -83,12 +83,21 @@ int erofs_register_workgroup(struct super_block *sb,

grp = xa_tag_pointer(grp, tag);

- err = radix_tree_insert(&sbi->workstn_tree,
- grp->index, grp);
+ /*
+ * Bump up reference count before making this workgroup
+ * visible to other users in order to avoid potential UAF
+ * without being serialized by erofs_workstn_lock.
+ */
+ __erofs_workgroup_get(grp);

- if (!err) {
- __erofs_workgroup_get(grp);
- }
+ err = radix_tree_insert(&sbi->workstn_tree,
+ grp->index, grp);
+ if (unlikely(err))
+ /*
+ * it's safe to decrease since the workgroup isn't visible
+ * and refcount >= 2 (cannot be frozen).
+ */
+ __erofs_workgroup_put(grp);

erofs_workstn_unlock(sbi);
radix_tree_preload_end();
@@ -97,19 +106,91 @@ int erofs_register_workgroup(struct super_block *sb,

extern void erofs_workgroup_free_rcu(struct erofs_workgroup *grp);

+static void __erofs_workgroup_free(struct erofs_workgroup *grp)
+{
+ atomic_long_dec(&erofs_global_shrink_cnt);
+ erofs_workgroup_free_rcu(grp);
+}
+
int erofs_workgroup_put(struct erofs_workgroup *grp)
{
int count = atomic_dec_return(&grp->refcount);

if (count == 1)
atomic_long_inc(&erofs_global_shrink_cnt);
- else if (!count) {
- atomic_long_dec(&erofs_global_shrink_cnt);
- erofs_workgroup_free_rcu(grp);
- }
+ else if (!count)
+ __erofs_workgroup_free(grp);
return count;
}

+#ifdef EROFS_FS_HAS_MANAGED_CACHE
+/* for cache-managed case, customized reclaim paths exist */
+static void erofs_workgroup_unfreeze_final(struct erofs_workgroup *grp)
+{
+ erofs_workgroup_unfreeze(grp, 0);
+ __erofs_workgroup_free(grp);
+}
+
+bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
+ struct erofs_workgroup *grp,
+ bool cleanup)
+{
+ /*
+ * with the managed cache enabled, the refcount of a workgroup
+ * itself could be < 0 (frozen), so there is no guarantee
+ * that all refcounts are > 0.
+ */
+ if (!erofs_workgroup_try_to_freeze(grp, 1))
+ return false;
+
+ /*
+ * note that all cached pages should be unlinked
+ * before the workgroup is deleted from the radix tree.
+ * Otherwise some cached pages of an orphaned old workgroup
+ * could still be linked after the new one becomes available.
+ */
+ if (erofs_try_to_free_all_cached_pages(sbi, grp)) {
+ erofs_workgroup_unfreeze(grp, 1);
+ return false;
+ }
+
+ /* it is impossible to fail after we freeze the workgroup */
+ BUG_ON(xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
+ grp->index)) != grp);
+
+ /*
+ * if the managed cache is enabled, the last refcount
+ * belongs to the related workstation.
+ */
+ erofs_workgroup_unfreeze_final(grp);
+ return true;
+}
+
+#else
+/* for nocache case, no customized reclaim path at all */
+bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
+ struct erofs_workgroup *grp,
+ bool cleanup)
+{
+ int cnt = atomic_read(&grp->refcount);
+
+ DBG_BUGON(cnt <= 0);
+ DBG_BUGON(cleanup && cnt != 1);
+
+ if (cnt > 1)
+ return false;
+
+ if (xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
+ grp->index)) != grp)
+ return false;
+
+ /* (rarely) could be grabbed again when freeing */
+ erofs_workgroup_put(grp);
+ return true;
+}
+
+#endif
+
unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
unsigned long nr_shrink,
bool cleanup)
@@ -126,41 +207,13 @@ unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
batch, first_index, PAGEVEC_SIZE);

for (i = 0; i < found; ++i) {
- int cnt;
struct erofs_workgroup *grp = xa_untag_pointer(batch[i]);

first_index = grp->index + 1;

- cnt = atomic_read(&grp->refcount);
- BUG_ON(cnt <= 0);
-
- if (cleanup)
- BUG_ON(cnt != 1);
-
-#ifndef EROFS_FS_HAS_MANAGED_CACHE
- else if (cnt > 1)
-#else
- if (!erofs_workgroup_try_to_freeze(grp, 1))
-#endif
- continue;
-
- if (xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
- grp->index)) != grp) {
-#ifdef EROFS_FS_HAS_MANAGED_CACHE
-skip:
- erofs_workgroup_unfreeze(grp, 1);
-#endif
+ /* try to shrink each valid workgroup */
+ if (!erofs_try_to_release_workgroup(sbi, grp, cleanup))
continue;
- }
-
-#ifdef EROFS_FS_HAS_MANAGED_CACHE
- if (erofs_try_to_free_all_cached_pages(sbi, grp))
- goto skip;
-
- erofs_workgroup_unfreeze(grp, 1);
-#endif
- /* (rarely) grabbed again when freeing */
- erofs_workgroup_put(grp);

++freed;
if (unlikely(!--nr_shrink))
--
2.14.4


2018-11-20 16:01:05

by Gao Xiang

Subject: [PATCH 07/10] staging: erofs: separate into init_once / always

`z_erofs_vle_workgroup' is heavily allocated during decompression;
for example, it redundantly zeroes 32 bytes on 64-bit platforms even
though only Z_EROFS_VLE_INLINE_PAGEVECS + Z_EROFS_CLUSTER_MAX_PAGES
(4 by default) pages are stored in `z_erofs_vle_workgroup'.

As another example, `struct mutex' takes 72 bytes on our kirin
64-bit platforms; there is no need to zero it first and then
initialize it on every allocation.

Let's avoid filling the whole `z_erofs_vle_workgroup' with 0 at
allocation time, since most fields are reinitialized to meaningful
values later and the pagevec doesn't need to be initialized at all.
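
The split relies on slab constructor semantics: the ctor passed to
kmem_cache_create() runs when an object is first constructed in a
slab, not on every kmem_cache_alloc(), so objects come back from free
with their ctor-initialized state intact. A minimal sketch of the
pattern with hypothetical names:

	struct my_obj {
		struct mutex lock;
		int nr_pages;
	};

	static struct kmem_cache *cachep;

	static void my_init_once(void *ptr)	/* once per construction */
	{
		struct my_obj *o = ptr;

		mutex_init(&o->lock);	/* survives alloc/free cycles */
		o->nr_pages = 0;
	}

	cachep = kmem_cache_create("my_cache", sizeof(struct my_obj), 0,
				   SLAB_RECLAIM_ACCOUNT, my_init_once);
	o = kmem_cache_alloc(cachep, GFP_NOFS);	/* note: not zeroed */

The catch, which the DBG_BUGONs in init_always() guard against, is
that every object must be back in its ctor state by the time it is
freed.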

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/unzip_vle.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index ede3383ac601..4e5843e8ee35 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -43,12 +43,38 @@ static inline int init_unzip_workqueue(void)
return z_erofs_workqueue ? 0 : -ENOMEM;
}

+static void init_once(void *ptr)
+{
+ struct z_erofs_vle_workgroup *grp = ptr;
+ struct z_erofs_vle_work *const work =
+ z_erofs_vle_grab_primary_work(grp);
+ unsigned int i;
+
+ mutex_init(&work->lock);
+ work->nr_pages = 0;
+ work->vcnt = 0;
+ for (i = 0; i < Z_EROFS_CLUSTER_MAX_PAGES; ++i)
+ grp->compressed_pages[i] = NULL;
+}
+
+static void init_always(struct z_erofs_vle_workgroup *grp)
+{
+ struct z_erofs_vle_work *const work =
+ z_erofs_vle_grab_primary_work(grp);
+
+ atomic_set(&grp->obj.refcount, 1);
+ grp->flags = 0;
+
+ DBG_BUGON(work->nr_pages);
+ DBG_BUGON(work->vcnt);
+}
+
int __init z_erofs_init_zip_subsystem(void)
{
z_erofs_workgroup_cachep =
kmem_cache_create("erofs_compress",
Z_EROFS_WORKGROUP_SIZE, 0,
- SLAB_RECLAIM_ACCOUNT, NULL);
+ SLAB_RECLAIM_ACCOUNT, init_once);

if (z_erofs_workgroup_cachep) {
if (!init_unzip_workqueue())
@@ -370,10 +396,11 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
BUG_ON(grp);

/* no available workgroup, let's allocate one */
- grp = kmem_cache_zalloc(z_erofs_workgroup_cachep, GFP_NOFS);
+ grp = kmem_cache_alloc(z_erofs_workgroup_cachep, GFP_NOFS);
if (unlikely(!grp))
return ERR_PTR(-ENOMEM);

+ init_always(grp);
grp->obj.index = f->idx;
grp->llen = map->m_llen;

@@ -381,7 +408,6 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
(map->m_flags & EROFS_MAP_ZIPPED) ?
Z_EROFS_VLE_WORKGRP_FMT_LZ4 :
Z_EROFS_VLE_WORKGRP_FMT_PLAIN);
- atomic_set(&grp->obj.refcount, 1);

/* new workgrps have been claimed as type 1 */
WRITE_ONCE(grp->next, *f->owned_head);
@@ -394,8 +420,6 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
work = z_erofs_vle_grab_primary_work(grp);
work->pageofs = f->pageofs;

- mutex_init(&work->lock);
-
if (gnew) {
int err = erofs_register_workgroup(f->sb, &grp->obj, 0);

--
2.14.4


2018-11-20 16:01:05

by Gao Xiang

Subject: [PATCH 08/10] staging: erofs: locked before registering for all new workgroups

Let's make sure that whoever registers a workgroup also takes the
primary work lock first, for two reasons:
1) There's no need to introduce such a race window (and the consequent
overhead) between registering and locking; other tasks could break in
by chance, and the race brings no benefit at all;

2) It's better to hold the primary work lock when a workgroup is
registered in order to apply the managed cache policy; for example,
if some other task breaks in, decompression could fall back to the
in-place mode rather than use the cached mode.
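
This is the usual publish-after-lock pattern; a condensed sketch with
hypothetical names:

	obj = alloc_obj();
	mutex_lock(&obj->lock);	/* cannot contend: obj not yet visible */
	if (publish(obj)) {	/* e.g. insert into the lookup tree */
		mutex_unlock(&obj->lock);
		free_obj(obj);
		return ERR_PTR(-EAGAIN);
	}
	/* other tasks can now find obj but will block on obj->lock */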

Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
drivers/staging/erofs/unzip_vle.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 4e5843e8ee35..a1376f3c6065 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -420,18 +420,22 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
work = z_erofs_vle_grab_primary_work(grp);
work->pageofs = f->pageofs;

+ /* lock all primary followed works before visible to others */
+ if (unlikely(!mutex_trylock(&work->lock)))
+ /* for a new workgroup, try_lock *never* fails */
+ DBG_BUGON(1);
+
if (gnew) {
int err = erofs_register_workgroup(f->sb, &grp->obj, 0);

if (err) {
+ mutex_unlock(&work->lock);
kmem_cache_free(z_erofs_workgroup_cachep, grp);
return ERR_PTR(-EAGAIN);
}
}

*f->owned_head = *f->grp_ret = grp;
-
- mutex_lock(&work->lock);
return work;
}

--
2.14.4


2018-11-23 09:28:53

by Greg Kroah-Hartman

Subject: Re: [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup

On Tue, Nov 20, 2018 at 10:34:18PM +0800, Gao Xiang wrote:
> It's better to use atomic_cond_read_relaxed, which is implemented
> in hardware instructions to monitor a variable changes currently
> for ARM64, instead of open-coded busy waiting.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/internal.h | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 89dbd0888e53..eb80ba44d072 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -221,23 +221,29 @@ static inline void erofs_workgroup_unfreeze(
> preempt_enable();
> }
>
> +#if defined(CONFIG_SMP)
> +static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
> +{
> + return atomic_cond_read_relaxed(&grp->refcount,
> + VAL != EROFS_LOCKED_MAGIC);
> +}
> +#else
> +static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
> +{
> + int v = atomic_read(&grp->refcount);

Again, why not use the refcount api instead of doing this yourself?

thanks,

greg k-h

2018-11-23 09:39:48

by Gao Xiang

Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

Hi Greg,

On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> Please document this memory barrier. It does not make much sense to
> me...

Because we need to make sure the other observers see the latest values
modified in this locking period before the whole workgroup is
unfrozen. One way is to use a memory barrier and the other is to use
ACQUIRE and RELEASE semantics; we selected the first one.

Hmmm... ok, I will add a short comment to explain this, but I think it
is plain enough for a lock...

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2018-11-23 09:45:39

by Greg Kroah-Hartman

Subject: Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

On Thu, Nov 22, 2018 at 06:29:34PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> >> There are two minor issues in the current freeze interface:
> >>
> >> 1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK,
> >> therefore fix the incorrect conditions;
> >>
> >> 2) For SMP platforms, it should also disable preemption before
> >> doing atomic_cmpxchg in case that some high priority tasks
> >> preempt between atomic_cmpxchg and disable_preempt, then spin
> >> on the locked refcount later.
> >
> > spinning on a refcount implies that you are trying to do your own type
> > of locking. Why not use the in-kernel locking api instead? It will
> > always do better than trying to do your own logic as the developers
> > there know locking across all types of cpus better than filesystem
> > developers :)
>
> It is because refcount also plays a role as a spinlock on a specific value
> (== EROFS_LOCKED_MAGIC), no need to introduce such a value since the spin
> window is small.

As I said in another email, doing two things with one variable is odd as
those are two different types of functions.

greg k-h

2018-11-23 09:45:40

by Greg Kroah-Hartman

Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> > Any specific reason why you are not using the refcount.h api instead of
> > "doing it yourself" with atomic_inc/dec()?
> >
> > I'm not rejecting this, just curious.
>
> As I explained in the previous email,
> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
>
> we need such a function when the value is >= 0, it plays as a refcount,
> but when the refcount == EROFS_LOCKED_MAGIC (<0, but not 0 as refcount.h),
> and actually there is no need to introduce a seperate spinlock_t because
> we don't actually care about its performance (rarely locked). and
> the corresponding struct is too large for now, we need to decrease its size.

Why do you need to decrease the size? How many of these structures are
created?

And you will care about the performance when a lock is being held, as is
evident by your logic to try to fix those issues in this patch series.
Using a "real" lock will solve all of that and keep you from having to
implement it all "by hand".

thanks,

greg k-h

2018-11-23 09:59:17

by Gao Xiang

Subject: Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

Hi Greg,

On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> As I said in another email, doing two things with one variable is odd as
> those are two different types of functions.

I think lockref_put_or_lock() does a similar thing; it is heavily used in dcache.c. But:
1) 0 is special for us: we need to lock on a negative value (< 0) rather than on 0.
2) spinlock_t is too large for us because every compressed page carries this structure,
yet the locking rarely happens.

Thanks,
Gao Xiang

>
> greg k-h

2018-11-23 10:35:51

by Greg Kroah-Hartman

Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

On Tue, Nov 20, 2018 at 10:34:17PM +0800, Gao Xiang wrote:
> When the managed cache is enabled, the last reference count
> of a workgroup must be used for its workstation.
>
> Otherwise, it could lead to incorrect (un)freezes in
> the reclaim path, and it would be harmful.
>
> A typical race as follows:
>
> Thread 1 (In the reclaim path) Thread 2
> workgroup_freeze(grp, 1) refcnt = 1
> ...
> workgroup_unfreeze(grp, 1) refcnt = 1
> workgroup_get(grp) refcnt = 2 (x)
> workgroup_put(grp) refcnt = 1 (x)
> ...unexpected behaviors
>
> * grp is detached but still used, which violates cache-managed
> freeze constraint.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/internal.h | 1 +
> drivers/staging/erofs/utils.c | 131 +++++++++++++++++++++++++++------------
> 2 files changed, 93 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 57575c7f5635..89dbd0888e53 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
> }
>
> #define __erofs_workgroup_get(grp) atomic_inc(&(grp)->refcount)
> +#define __erofs_workgroup_put(grp) atomic_dec(&(grp)->refcount)

Any specific reason why you are not using the refcount.h api instead of
"doing it yourself" with atomic_inc/dec()?

I'm not rejecting this, just curious.

thanks,

greg k-h

2018-11-23 10:35:59

by Greg Kroah-Hartman

Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

On Tue, Nov 20, 2018 at 10:34:17PM +0800, Gao Xiang wrote:
> When the managed cache is enabled, the last reference count
> of a workgroup must be used for its workstation.
>
> Otherwise, it could lead to incorrect (un)freezes in
> the reclaim path, and it would be harmful.
>
> A typical race as follows:
>
> Thread 1 (In the reclaim path) Thread 2
> workgroup_freeze(grp, 1) refcnt = 1
> ...
> workgroup_unfreeze(grp, 1) refcnt = 1
> workgroup_get(grp) refcnt = 2 (x)
> workgroup_put(grp) refcnt = 1 (x)
> ...unexpected behaviors
>
> * grp is detached but still used, which violates cache-managed
> freeze constraint.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/internal.h | 1 +
> drivers/staging/erofs/utils.c | 131 +++++++++++++++++++++++++++------------
> 2 files changed, 93 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 57575c7f5635..89dbd0888e53 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
> }
>
> #define __erofs_workgroup_get(grp) atomic_inc(&(grp)->refcount)
> +#define __erofs_workgroup_put(grp) atomic_dec(&(grp)->refcount)
>
> extern int erofs_workgroup_put(struct erofs_workgroup *grp);
>
> diff --git a/drivers/staging/erofs/utils.c b/drivers/staging/erofs/utils.c
> index ea8a962e5c95..90de8d9195b7 100644
> --- a/drivers/staging/erofs/utils.c
> +++ b/drivers/staging/erofs/utils.c
> @@ -83,12 +83,21 @@ int erofs_register_workgroup(struct super_block *sb,
>
> grp = xa_tag_pointer(grp, tag);
>
> - err = radix_tree_insert(&sbi->workstn_tree,
> - grp->index, grp);
> + /*
> + * Bump up reference count before making this workgroup
> + * visible to other users in order to avoid potential UAF
> + * without serialized by erofs_workstn_lock.
> + */
> + __erofs_workgroup_get(grp);
>
> - if (!err) {
> - __erofs_workgroup_get(grp);
> - }
> + err = radix_tree_insert(&sbi->workstn_tree,
> + grp->index, grp);
> + if (unlikely(err))
> + /*
> + * it's safe to decrease since the workgroup isn't visible
> + * and refcount >= 2 (cannot be freezed).
> + */
> + __erofs_workgroup_put(grp);
>
> erofs_workstn_unlock(sbi);
> radix_tree_preload_end();
> @@ -97,19 +106,91 @@ int erofs_register_workgroup(struct super_block *sb,
>
> extern void erofs_workgroup_free_rcu(struct erofs_workgroup *grp);
>
> +static void __erofs_workgroup_free(struct erofs_workgroup *grp)
> +{
> + atomic_long_dec(&erofs_global_shrink_cnt);
> + erofs_workgroup_free_rcu(grp);
> +}
> +
> int erofs_workgroup_put(struct erofs_workgroup *grp)
> {
> int count = atomic_dec_return(&grp->refcount);
>
> if (count == 1)
> atomic_long_inc(&erofs_global_shrink_cnt);
> - else if (!count) {
> - atomic_long_dec(&erofs_global_shrink_cnt);
> - erofs_workgroup_free_rcu(grp);
> - }
> + else if (!count)
> + __erofs_workgroup_free(grp);
> return count;
> }
>
> +#ifdef EROFS_FS_HAS_MANAGED_CACHE
> +/* for cache-managed case, customized reclaim paths exist */
> +static void erofs_workgroup_unfreeze_final(struct erofs_workgroup *grp)
> +{
> + erofs_workgroup_unfreeze(grp, 0);
> + __erofs_workgroup_free(grp);
> +}
> +
> +bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
> + struct erofs_workgroup *grp,
> + bool cleanup)
> +{
> + /*
> + * for managed cache enabled, the refcount of workgroups
> + * themselves could be < 0 (freezed). So there is no guarantee
> + * that all refcount > 0 if managed cache is enabled.
> + */
> + if (!erofs_workgroup_try_to_freeze(grp, 1))
> + return false;
> +
> + /*
> + * note that all cached pages should be unlinked
> + * before delete it from the radix tree.
> + * Otherwise some cached pages of an orphan old workgroup
> + * could be still linked after the new one is available.
> + */
> + if (erofs_try_to_free_all_cached_pages(sbi, grp)) {
> + erofs_workgroup_unfreeze(grp, 1);
> + return false;
> + }
> +
> + /* it is impossible to fail after we freeze the workgroup */
> + BUG_ON(xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
> + grp->index)) != grp);
> +

It is not a good idea to crash the system. Please try to recover from
this, a BUG_ON() implies that the developer has no idea how to handle
this type of error so either it can never happen (which means that the
BUG_ON can be removed), or you need to fix the logic to handle this type
of issue when it does happen.

I'm not going to take this as-is, sorry.

greg k-h

2018-11-23 10:38:36

by Greg Kroah-Hartman

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

On Tue, Nov 20, 2018 at 10:34:22PM +0800, Gao Xiang wrote:
> `z_erofs_vle_workgroup' is heavily generated in the decompression,
> for example, it resets 32 bytes redundantly for 64-bit platforms
> even through Z_EROFS_VLE_INLINE_PAGEVECS + Z_EROFS_CLUSTER_MAX_PAGES,
> default 4, pages are stored in `z_erofs_vle_workgroup'.
>
> As an another example, `struct mutex' takes 72 bytes for our kirin
> 64-bit platforms, it's unnecessary to be reseted at first and
> be initialized each time.
>
> Let's avoid filling all `z_erofs_vle_workgroup' with 0 at first
> since most fields are reinitialized to meaningful values later,
> and pagevec is no need to initialized at all.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/unzip_vle.c | 34 +++++++++++++++++++++++++++++-----
> 1 file changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
> index ede3383ac601..4e5843e8ee35 100644
> --- a/drivers/staging/erofs/unzip_vle.c
> +++ b/drivers/staging/erofs/unzip_vle.c
> @@ -43,12 +43,38 @@ static inline int init_unzip_workqueue(void)
> return z_erofs_workqueue ? 0 : -ENOMEM;
> }
>
> +static void init_once(void *ptr)
> +{
> + struct z_erofs_vle_workgroup *grp = ptr;
> + struct z_erofs_vle_work *const work =
> + z_erofs_vle_grab_primary_work(grp);
> + unsigned int i;
> +
> + mutex_init(&work->lock);
> + work->nr_pages = 0;
> + work->vcnt = 0;
> + for (i = 0; i < Z_EROFS_CLUSTER_MAX_PAGES; ++i)
> + grp->compressed_pages[i] = NULL;
> +}
> +
> +static void init_always(struct z_erofs_vle_workgroup *grp)
> +{
> + struct z_erofs_vle_work *const work =
> + z_erofs_vle_grab_primary_work(grp);
> +
> + atomic_set(&grp->obj.refcount, 1);
> + grp->flags = 0;
> +
> + DBG_BUGON(work->nr_pages);
> + DBG_BUGON(work->vcnt);

How can these ever be triggered? I understand the need for debugging
code when you are writing code, but at this point it shouldn't be needed
anymore, right?

thanks,

greg k-h

2018-11-23 10:41:17

by Greg Kroah-Hartman

Subject: Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position

On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
> `trace_erofs_readpage' should be placed in .readpage()
> rather than in the internal `z_erofs_do_read_page'.

Why? What happens with the code today?

>
> Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")

Should this get into 4.20-final?

thanks,

greg k-h

2018-11-23 10:42:08

by Greg Kroah-Hartman

Subject: Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> There are two minor issues in the current freeze interface:
>
> 1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK,
> therefore fix the incorrect conditions;
>
> 2) For SMP platforms, it should also disable preemption before
> doing atomic_cmpxchg in case that some high priority tasks
> preempt between atomic_cmpxchg and disable_preempt, then spin
> on the locked refcount later.

spinning on a refcount implies that you are trying to do your own type
of locking. Why not use the in-kernel locking api instead? It will
always do better than trying to do your own logic as the developers
there know locking across all types of cpus better than filesystem
developers :)

thanks,

greg k-h

2018-11-23 10:42:15

by Greg Kroah-Hartman

Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

On Tue, Nov 20, 2018 at 10:34:20PM +0800, Gao Xiang wrote:
> Just like other generic locks, insert a full barrier
> in case of memory reorder.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/internal.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 2e0ef92c138b..f77653d33633 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -209,6 +209,7 @@ static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
> static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
> int orig_val)
> {
> + smp_mb();

Please document this memory barrier. It does not make much sense to
me...

thanks,

greg k-h

2018-11-23 10:55:37

by Greg Kroah-Hartman

Subject: Re: [PATCH 08/10] staging: erofs: locked before registering for all new workgroups

On Tue, Nov 20, 2018 at 10:34:23PM +0800, Gao Xiang wrote:
> Let's make sure that the one registering a workgroup will also
> take the primary work lock at first for two reasons:
> 1) There's no need to introduce such a race window (and consequently
> overhead) between registering and locking, other tasks could break
> in by chance, and the race seems unnecessary (no benefit at all);
>
> 2) It's better to take the primary work when a workgroup
> is registered to apply the cache managed policy, for example,
> if some other tasks break in, it could turn into the in-place
> decompression rather than use as the cached decompression.
>
> Reviewed-by: Chao Yu <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> drivers/staging/erofs/unzip_vle.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
> index 4e5843e8ee35..a1376f3c6065 100644
> --- a/drivers/staging/erofs/unzip_vle.c
> +++ b/drivers/staging/erofs/unzip_vle.c
> @@ -420,18 +420,22 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
> work = z_erofs_vle_grab_primary_work(grp);
> work->pageofs = f->pageofs;
>
> + /* lock all primary followed works before visible to others */
> + if (unlikely(!mutex_trylock(&work->lock)))
> + /* for a new workgroup, try_lock *never* fails */
> + DBG_BUGON(1);

Again, drop this, if it never fails, then there's no need for this. If
it can fail, then properly handle it.

And trylock can fail, so this needs to be fixed.

thanks,

greg k-h

2018-11-23 11:06:49

by Gao Xiang

Subject: Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

Hi Greg,

On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
>> There are two minor issues in the current freeze interface:
>>
>> 1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK,
>> therefore fix the incorrect conditions;
>>
>> 2) For SMP platforms, it should also disable preemption before
>> doing atomic_cmpxchg in case that some high priority tasks
>> preempt between atomic_cmpxchg and disable_preempt, then spin
>> on the locked refcount later.
>
> spinning on a refcount implies that you are trying to do your own type
> of locking. Why not use the in-kernel locking api instead? It will
> always do better than trying to do your own logic as the developers
> there know locking across all types of cpus better than filesystem
> developers :)

It is because the refcount also plays the role of a spinlock on one
specific value (== EROFS_LOCKED_MAGIC); there is no need to introduce
a separate lock since the spin window is small.

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h
>

2018-11-23 11:16:47

by Gao Xiang

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>> +
>> + DBG_BUGON(work->nr_pages);
>> + DBG_BUGON(work->vcnt);
> How can these ever be triggered? I understand the need for debugging
> code when you are writing code, but at this point it shouldn't be needed
> anymore, right?

I need to ensure that these fields are 0 when a new workgroup is created
(work->nr_pages and work->vcnt are normally 0 after the previous workgroup
is freed). But that is not obvious; it is only promised by the current logic.

In order not to introduce such an issue in the future, and in case there is
some potential race (work->nr_pages or work->vcnt != 0 when the previous
workgroup is freed), developers need to notice it as early as possible.

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2018-11-23 11:27:25

by Gao Xiang

Subject: Re: [PATCH 08/10] staging: erofs: locked before registering for all new workgroups

Hi Greg,

On 2018/11/22 18:24, Greg Kroah-Hartman wrote:
>> + /* lock all primary followed works before visible to others */
>> + if (unlikely(!mutex_trylock(&work->lock)))
>> + /* for a new workgroup, try_lock *never* fails */
>> + DBG_BUGON(1);
> Again, drop this, if it never fails, then there's no need for this. If
> it can fail, then properly handle it.
>
> And trylock can fail, so this needs to be fixed.

OK, I will drop this.

Thanks,
Gao Xiang

>
> thanks,

2018-11-23 11:41:51

by Gao Xiang

Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

Hi Greg,

On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> Any specific reason why you are not using the refcount.h api instead of
> "doing it yourself" with atomic_inc/dec()?
>
> I'm not rejecting this, just curious.

As I explained in the previous email,
Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

we need a value that acts as a refcount while it is >= 0, but as a lock
when it equals EROFS_LOCKED_MAGIC (< 0, unlike the 0 used by refcount.h).
There is actually no need to introduce a separate spinlock_t because we
don't really care about its locking performance (it is rarely locked),
and the corresponding struct is already too large; we need to decrease
its size.

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2018-11-23 11:46:29

by Gao Xiang

Subject: Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position

Hi Greg,

On 2018/11/22 18:19, Greg Kroah-Hartman wrote:
> On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
>> `trace_erofs_readpage' should be placed in .readpage()
>> rather than in the internal `z_erofs_do_read_page'.
> Why? What happens with the code today?
trace_erofs_readpage is used to trace the .readpage() interface (it represents
a sync read) rather than its internal implementation z_erofs_do_read_page
(which both .readpage() and .readpages() use). Chen Gong placed the tracepoint
in the wrong place by mistake, and we found it with our internal tests that
use this tracepoint.

>
>> Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")
> Should this get into 4.20-final?
I think so; it is not very important, but I think it should be fixed...

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2018-11-23 12:03:35

by Greg Kroah-Hartman

Subject: Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position

On Thu, Nov 22, 2018 at 06:49:53PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:19, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
> >> `trace_erofs_readpage' should be placed in .readpage()
> >> rather than in the internal `z_erofs_do_read_page'.
> > Why? What happens with the code today?
> trace_erofs_readpage is used to trace .readpage() interface (it represents sync read)
> hook rather than its internal implementation z_erofs_do_read_page (which both .readpage()
> and .readpages() uses it). Chen Gong places the tracepoint to a wrong place by mistake.
> And we found it by our internal test using this tracepoint.

Ok, you should put this in the changelog text when you redo this series.

thanks,

greg k-h

2018-11-23 12:04:14

by Greg Kroah-Hartman

Subject: Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

On Thu, Nov 22, 2018 at 06:29:34PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> >> There are two minor issues in the current freeze interface:
> >>
> >> 1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK,
> >> therefore fix the incorrect conditions;
> >>
> >> 2) For SMP platforms, it should also disable preemption before
> >> doing atomic_cmpxchg in case that some high priority tasks
> >> preempt between atomic_cmpxchg and disable_preempt, then spin
> >> on the locked refcount later.
> >
> > spinning on a refcount implies that you are trying to do your own type
> > of locking. Why not use the in-kernel locking api instead? It will
> > always do better than trying to do your own logic as the developers
> > there know locking across all types of cpus better than filesystem
> > developers :)
>
> It is because refcount also plays a role as a spinlock on a specific value
> (== EROFS_LOCKED_MAGIC), no need to introduce such a value since the spin
> window is small.

Do not try to overload a reference count as a spinlock, you will run
into problems and in the end have to implement everything that the core
locking code has done.

Your use of this is highly suspect, what happens when you use a "real"
spinlock and a "real" reference count instead? Those are two different
things from a logical point of view and you are mixing them together
which is odd.

thanks,

greg k-h

2018-11-23 12:04:15

by Greg Kroah-Hartman

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
> >> +
> >> + DBG_BUGON(work->nr_pages);
> >> + DBG_BUGON(work->vcnt);
> > How can these ever be triggered? I understand the need for debugging
> > code when you are writing code, but at this point it shouldn't be needed
> > anymore, right?
>
> I need to avoid some fields is not 0 when the new workgroup is created (because
> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
> But that is not obvious, it is promised by the current logic.

Then delete these lines if they can never happen :)

> In order to not introduce such a issue in the future, or there are some potential
> race (work->nr_pages and work->vcnt != 0 when the previous workgroup == 0), it need
> to be noticed to developpers as early as possible.

Then make it a real call, do not wrap it in odd macros that do not
really explain why it is "debugging only". Your code is "real" now,
make the logic real for all developers and users.

thanks,

greg k-h

2018-11-23 12:04:16

by Gao Xiang

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>>>> +
>>>> + DBG_BUGON(work->nr_pages);
>>>> + DBG_BUGON(work->vcnt);
>>> How can these ever be triggered? I understand the need for debugging
>>> code when you are writing code, but at this point it shouldn't be needed
>>> anymore, right?
>>
>> I need to avoid some fields is not 0 when the new workgroup is created (because
>> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
>> But that is not obvious, it is promised by the current logic.
>
> Then delete these lines if they can never happen :)

Otherwise I don't know how to observe such a race in our beta tests and
among community users.

If the kernel crashes, we can collect the whole kernel dump to observe the
memory and all registers; with only a warning, it is not easy to capture
the state as early as possible.

Thanks,
Gao Xiang

>
>> In order to not introduce such a issue in the future, or there are some potential
>> race (work->nr_pages and work->vcnt != 0 when the previous workgroup == 0), it need
>> to be noticed to developpers as early as possible.
>
> Then make it a real call, do not wrap it in odd macros that do not
> really explain why it is "debugging only". Your code is "real" now,
> make the logic real for all developers and users.
>
> thanks,
>
> greg k-h
>

2018-11-23 12:41:41

by Greg Kroah-Hartman

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

On Thu, Nov 22, 2018 at 07:11:08PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> > On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
> >>>> +
> >>>> + DBG_BUGON(work->nr_pages);
> >>>> + DBG_BUGON(work->vcnt);
> >>> How can these ever be triggered? I understand the need for debugging
> >>> code when you are writing code, but at this point it shouldn't be needed
> >>> anymore, right?
> >>
> >> I need to avoid some fields is not 0 when the new workgroup is created (because
> >> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
> >> But that is not obvious, it is promised by the current logic.
> >
> > Then delete these lines if they can never happen :)
>
> I don't know how to observe such a race in our beta test and community users.

	/*
	 * Let developers know something went really wrong with their
	 * initialization code
	 */
	if (!work->nr_pages) {
		pr_err("nr_pages == NULL!");
		WARN_ON(1);
	}
	if (!work->vcnt) {
		pr_err("vcnt == NULL!");
		WARN_ON(1);
	}

or something like that.

Don't make people rebuild your code with different options for
debugging. That will never work in the 'real world' when people start
using the code. You need to have things enabled for people all the
time, which is why we have dynamic debugging in the kernel now, and not
a zillion different "DRIVER_DEBUG" build options anymore.

> Because if the kernel is crashed, we could collect the whole kernel dump to observe the memory
> and all registers, if we only have some warning, it will be not easy to get the state as early as possible.

When the kernel crashes, getting a dump is hard on almost all hardware.
It is only rare systems that you can get a kernel dump.

thanks,

greg k-h

2018-11-23 13:10:56

by Gao Xiang

Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 07:11:08PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
>>> On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
>>>> Hi Greg,
>>>>
>>>> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>>>>>> +
>>>>>> + DBG_BUGON(work->nr_pages);
>>>>>> + DBG_BUGON(work->vcnt);
>>>>> How can these ever be triggered? I understand the need for debugging
>>>>> code when you are writing code, but at this point it shouldn't be needed
>>>>> anymore, right?
>>>>
>>>> I need to avoid some fields is not 0 when the new workgroup is created (because
>>>> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
>>>> But that is not obvious, it is promised by the current logic.
>>>
>>> Then delete these lines if they can never happen :)
>>
>> I don't know how to observe such a race in our beta test and community users.
>
> /*
> * Let developers know something went really wrong with their
> * initialization code
> */
> if (!work->nr_pages) {
> pr_err("nr_pages == NULL!");
> WARN_ON(1);
> }
> if (!work->vcnt) {
> pr_err("vcnt == NULL!");
> WARN_ON(1);
> }
>
> or something like that.
>
> Don't make people rebuild your code with different options for
> debugging. That will never work in the 'real world' when people start
> using the code. You need to have things enabled for people all the
> time, which is why we have dynamic debugging in the kernel now, and not
> a zillion different "DRIVER_DEBUG" build options anymore.
>
>> Because if the kernel is crashed, we could collect the whole kernel dump to observe the memory
>> and all registers, if we only have some warning, it will be not easy to get the state as early as possible.
>
> When the kernel crashes, geting a dump is hard on almost all hardware.
> It is only rare systems that you can get a kernel dump.

This piece of code is already used in our hisilicon beta tests; full
memory dumps, registers, and stacks can all be collected.

Apart from that, many f2fs bugs have been observed in this way and many
erofs bugs collected in that way... sigh...

Thanks,
Gao Xiang


>
> thanks,
>
> greg k-h
>

2018-11-23 14:04:13

by Gao Xiang

Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

Hi Greg,

On 2018/11/22 19:06, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
>>> Any specific reason why you are not using the refcount.h api instead of
>>> "doing it yourself" with atomic_inc/dec()?
>>>
>>> I'm not rejecting this, just curious.
>> As I explained in the previous email,
>> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
>>
>> we need such a function: when the value is >= 0, it acts as a refcount,
>> but when the refcount == EROFS_LOCKED_MAGIC (< 0, rather than 0 as in refcount.h),
>> the workgroup is locked. There is actually no need to introduce a separate spinlock_t
>> because we don't really care about its locking performance (it is rarely locked), and
>> the corresponding struct is too large for now, so we need to decrease its size.
> Why do you need to decrease the size? How many of these structures are
> created?

As I said in the previous email, every compressed page will have a managed structure
called erofs_workgroup, and it is heavily allocated, like page/inode/dentry, in erofs.

>
> And you will care about the performance when a lock is being held, as is
> evident by your logic to try to fix those issues in this patch series.
> Using a "real" lock will solve all of that and keep you from having to
> implement it all "by hand".

The function is much like lockref (aligned_u64 lock_count;), with the exception
explained in my previous email.
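
To make the trick concrete, the freeze side looks roughly like this (a
simplified sketch of the SMP path from this patchset, not the verbatim code):

static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
						 int v)
{
	preempt_disable();
	/*
	 * Take the "lock" by swapping the plain refcount value for the
	 * negative EROFS_LOCKED_MAGIC; getters spin in
	 * erofs_wait_on_workgroup_freezed() until the old value is
	 * restored by erofs_workgroup_unfreeze().
	 */
	if (atomic_cmpxchg(&grp->refcount, v, EROFS_LOCKED_MAGIC) != v) {
		preempt_enable();
		return false;
	}
	return true;
}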

Thanks,
Gao Xiang

>
> thanks,
>
> greg k-h

2018-11-23 14:07:00

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> Don't make people rebuild your code with different options for
> debugging. That will never work in the 'real world' when people start
> using the code. You need to have things enabled for people all the
> time, which is why we have dynamic debugging in the kernel now, and not
> a zillion different "DRIVER_DEBUG" build options anymore.

Actually, the current erofs handles beta users (in eng mode)
and commercial users differently.

CONFIG_EROFS_FS_DEBUG is enabled for all beta users; after an asserted
expression is found to be false, we can get the whole memory dump as early as possible.
But for commercial users, there are no such observation points, so as to guarantee
kernel stability and performance.

It has helped us to find several bugs, and I cannot find an alternative way
to capture the first scene of the accident...
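
For reference, this switch boils down to something like the following (a
sketch of the intent only; the real macro lives in
drivers/staging/erofs/internal.h and its exact definition is being reworked
in this very patchset):

/*
 * Hard assertion for eng/beta builds so a full crash dump is captured
 * at the first broken invariant; for commercial builds the condition
 * is still evaluated (no side effects are lost) but nothing fires.
 */
#ifdef CONFIG_EROFS_FS_DEBUG
#define DBG_BUGON(condition)	BUG_ON(condition)
#else
#define DBG_BUGON(condition)	((void)(condition))
#endif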

Thanks,
Gao Xiang

2018-11-23 14:22:25

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 20:00, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
>> Don't make people rebuild your code with different options for
>> debugging. That will never work in the 'real world' when people start
>> using the code. You need to have things enabled for people all the
>> time, which is why we have dynamic debugging in the kernel now, and not
>> a zillion different "DRIVER_DEBUG" build options anymore.
> Actually, current erofs handle differently for beta users (in eng mode)
> and commercial users.
>
> CONFIG_EROFS_FS_DEBUG is enable for all beta users, after an observed
> expression is false, we could get the whole memorydump as early as possible.
> But for commercial users, there are no such observing points to promise
> the kernel stability and performance.
>
> It has helped us to find several bug, and I cannot find some alternative way
> to get the the first scene of the accident...

I'm about to send v2 of this patchset... I need to get your opinion...

For the current erofs development state, I tend to keep such a
development switch because the code is still modified frequently;
many potential bugs can be avoided when this debugging mode is on,
and it will be dropped after erofs becomes more stable...

Thanks,
Gao Xiang

>
> Thanks,
> Gao Xiang

2018-11-23 15:49:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

On Thu, Nov 22, 2018 at 07:43:50PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 19:06, Greg Kroah-Hartman wrote:
> > On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> >>> Any specific reason why you are not using the refcount.h api instead of
> >>> "doing it yourself" with atomic_inc/dec()?
> >>>
> >>> I'm not rejecting this, just curious.
> >> As I explained in the previous email,
> >> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
> >>
> >> we need such a function: when the value is >= 0, it acts as a refcount,
> >> but when the refcount == EROFS_LOCKED_MAGIC (< 0, rather than 0 as in refcount.h),
> >> the workgroup is locked. There is actually no need to introduce a separate spinlock_t
> >> because we don't really care about its locking performance (it is rarely locked), and
> >> the corresponding struct is too large for now, so we need to decrease its size.
> > Why do you need to decrease the size? How many of these structures are
> > created?
>
> As I said in the previous email, every compressed page will have a managed structure
> called erofs_workgroup, and it is heavily allocated, like page/inode/dentry, in erofs.

Ugh, every page? Ok, nevermind, I take back my objections. You all are
crazy and need to do crazy things like this :)

greg k-h

2018-11-23 16:12:16

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

Hi Greg,

On 2018/11/22 20:26, Greg Kroah-Hartman wrote:
> Ugh, every page? Ok, nevermind, I take back my objections. You all are
> crazy and need to do crazy things like this :)

...Do you have some idea about this? ... I think it is fairly normal... :(

We have a large number of managed workgroups; the current use case is a 4k compressed size,
so each compressed page has a workgroup, but erofs also has reclaim paths to free them in low-memory cases.

But since the structure (z_erofs_vle_workgroup, erofs_workgroup) is critical,
I need to make it as small as possible.
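
For comparison, upstream lockref is roughly defined as below (a rough sketch
of include/linux/lockref.h); even that packed form costs an aligned u64 per
object, which is the kind of size budget I am trying to stay within:

struct lockref {
	union {
#if USE_CMPXCHG_LOCKREF
		/* lock and count updated together via cmpxchg */
		aligned_u64 lock_count;
#endif
		struct {
			spinlock_t lock;
			int count;
		};
	};
};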

Thanks,
Gao Xiang

2018-11-23 16:25:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

On Thu, Nov 22, 2018 at 09:01:31PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 20:00, Gao Xiang wrote:
> > Hi Greg,
> >
> > On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> >> Don't make people rebuild your code with different options for
> >> debugging. That will never work in the 'real world' when people start
> >> using the code. You need to have things enabled for people all the
> >> time, which is why we have dynamic debugging in the kernel now, and not
> >> a zillion different "DRIVER_DEBUG" build options anymore.
> > Actually, the current erofs handles beta users (in eng mode)
> > and commercial users differently.
> >
> > CONFIG_EROFS_FS_DEBUG is enabled for all beta users; after an asserted
> > expression is found to be false, we can get the whole memory dump as early as possible.
> > But for commercial users, there are no such observation points, so as to guarantee
> > kernel stability and performance.
> >
> > It has helped us to find several bugs, and I cannot find an alternative way
> > to capture the first scene of the accident...
>
> I'm about to send v2 of this patchset... I need to get your opinion...
>
> For the current erofs development state, I tend to keep such a
> development switch because the code is still modified frequently;
> many potential bugs can be avoided when this debugging mode is on,
> and it will be dropped after erofs becomes more stable...

You have two competing issues here. You need to have some sort of debug
build that allows you to get feedback from users, and you need to
provide a solid, reliable system for those not wanting to debug
anything.

If you use a kernel build option, then you can do what you are doing
now; just do not expect any user that has problems with the
filesystem to rebuild the code just to try to help debug it.
That is going to require run-time debugging to be enabled, as they will
not usually have built their own kernel/module at all.

For now, keep doing what you are doing, I just started to worry when I
saw the initial BUG_ON() as we had just talked about these at the
Maintainers summit a few weeks ago, and how we need to get rid of them
from the kernel as they are a crutch that we need to solve.

thanks, and sorry for the noise. Please resend this series again,

greg k-h

2018-11-23 16:40:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled

On Thu, Nov 22, 2018 at 08:41:15PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 20:26, Greg Kroah-Hartman wrote:
> > Ugh, every page? Ok, nevermind, I take back my objections. You all are
> > crazy and need to do crazy things like this :)
>
> ...Do you have some idea about this? ... I think it is fairly normal... :(
>
> We have a large number of managed workgroups; the current use case is a 4k compressed size,
> so each compressed page has a workgroup, but erofs also has reclaim paths to free them in low-memory cases.
>
> But since the structure (z_erofs_vle_workgroup, erofs_workgroup) is critical,
> I need to make it as small as possible.

I do not know "real" filesystems (i.e. ones that put stuff on disks),
only virtual ones (like sysfs/kernfs), so I do not know what to suggest
here, sorry. I thought this was a much higher-level structure; if this
is per-compressed-page, then you are more than welcome to optimize for
this. Good luck with your "fake" spinlocks though :)

thanks,

greg k-h

2018-11-23 18:14:20

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 07/10] staging: erofs: separate into init_once / always

Hi Greg,

On 2018/11/22 21:23, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 09:01:31PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 20:00, Gao Xiang wrote:
>>> Hi Greg,
>>>
>>> On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
>>>> Don't make people rebuild your code with different options for
>>>> debugging. That will never work in the 'real world' when people start
>>>> using the code. You need to have things enabled for people all the
>>>> time, which is why we have dynamic debugging in the kernel now, and not
>>>> a zillion different "DRIVER_DEBUG" build options anymore.
>>> Actually, the current erofs handles beta users (in eng mode)
>>> and commercial users differently.
>>>
>>> CONFIG_EROFS_FS_DEBUG is enabled for all beta users; after an asserted
>>> expression is found to be false, we can get the whole memory dump as early as possible.
>>> But for commercial users, there are no such observation points, so as to guarantee
>>> kernel stability and performance.
>>>
>>> It has helped us to find several bugs, and I cannot find an alternative way
>>> to capture the first scene of the accident...
>>
>> I'm about to send v2 of this patchset... I need to get your opinion...
>>
>> For the current erofs development state, I tend to keep such a
>> development switch because the code is still modified frequently;
>> many potential bugs can be avoided when this debugging mode is on,
>> and it will be dropped after erofs becomes more stable...
>
> You have two competing issues here. You need to have some sort of debug
> build that allows you to get feedback from users, and you need to
> provide a solid, reliable system for those not wanting to debug
> anything.

Yes, you are right :)

>
> If you use a kernel build option, then you can do what you are doing
> now; just do not expect any user that has problems with the
> filesystem to rebuild the code just to try to help debug it.
> That is going to require run-time debugging to be enabled, as they will
> not usually have built their own kernel/module at all.

Yes, actually the current Android system has 2 modes -- eng and commercial builds --
for all versions.

A large number of beta users receive the eng build, and many bugs were observed
from these users. That is the debugging benefit of what the Android system does now.

I do not expect real users to rebuild kernels for debugging; I think that after the
Android use case, erofs will have become solid enough. Commercial builds will have
enough error handling for the cases found by Android users and developers.

>
> For now, keep doing what you are doing, I just started to worry when I
> saw the initial BUG_ON() as we had just talked about these at the
> Maintainers summit a few weeks ago, and how we need to get rid of them
> from the kernel as they are a crutch that we need to solve.

Thanks for understanding...

Thanks,
Gao Xiang

>
> thanks, and sorry for the noise. Please resend this series again,
>
> greg k-h
>

2018-11-24 07:14:24

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
> Hi Greg,
>
> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> > Please document this memory barrier. It does not make much sense to
> > me...
>
> Because we need to make the other observers notice the latest values modified
> in this locking period before unfreezing the whole workgroup. One way is to use
> a memory barrier and the other way is to use ACQUIRE and RELEASE; we selected
> the first one.
>
> Hmmm...ok, I will add a simple message to explain this, but I think that is
> plain enough for a lock...

Sympathizing with Greg's request, let me add some specific suggestions:

1. It wouldn't hurt to indicate a pair of memory accesses which are
intended to be "ordered" by the memory barrier in question (yes,
this pair might not be unique, but you should be able to provide
an example).

2. Memory barriers always come matched by other memory barriers, or
dependencies (it really does not make sense to talk about a full
barrier "in isolation"): please also indicate (an instance of) a
matching barrier or the matching barriers.

3. How do the hardware threads communicate? In the acquire/release
pattern you mentioned above, the load-acquire *reads from* a/the
previous store-release; a memory access that follows the acquire
somehow communicates with a memory access preceding the release...

4. It is a good practice to include the above information within an
(inline) comment accompanying the added memory barrier (in fact,
IIRC, checkpatch.pl gives you a "memory barrier without comment"
warning when you omit to do so); not just in the commit message.

Hope this helps. Please let me know if something I wrote is unclear,

Andrea


>
> Thanks,
> Gao Xiang
>
> >
> > thanks,
> >
> > greg k-h

2018-11-24 07:55:13

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

Hi Andrea,

On 2018/11/23 2:50, Andrea Parri wrote:
> On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
>>> Please document this memory barrier. It does not make much sense to
>>> me...
>>
>> Because we need to make the other observers notice the latest values modified
>> in this locking period before unfreezing the whole workgroup. One way is to use
>> a memory barrier and the other way is to use ACQUIRE and RELEASE; we selected
>> the first one.
>>
>> Hmmm...ok, I will add a simple message to explain this, but I think that is
>> plain enough for a lock...
>
> Sympathizing with Greg's request, let me add some specific suggestions:
>
> 1. It wouldn't hurt to indicate a pair of memory accesses which are
> intended to be "ordered" by the memory barrier in question (yes,
> this pair might not be unique, but you should be able to provide
> an example).
>
> 2. Memory barriers always come matched by other memory barriers, or
> dependencies (it really does not make sense to talk about a full
> barrier "in isolation"): please also indicate (an instance of) a
> matching barrier or the matching barriers.
>
> 3. How do the hardware threads communicate? In the acquire/release
> pattern you mentioned above, the load-acquire *reads from* a/the
> > previous store-release; a memory access that follows the acquire
> > somehow communicates with a memory access preceding the release...
>
> 4. It is a good practice to include the above information within an
> (inline) comment accompanying the added memory barrier (in fact,
> IIRC, checkpatch.pl gives you a "memory barrier without comment"
> warning when you omit to do so); not just in the commit message.
>
> Hope this helps. Please let me know if something I wrote is unclear,

Thanks for taking the time for the detailed explanation; I think it is helpful for me. :)
And you are right, barriers should come in pairs, and I think I need to explain more:

static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
{
	int o;

repeat:
	o = erofs_wait_on_workgroup_freezed(grp);

	if (unlikely(o <= 0))
		return -1;

	if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))	<- (*)
		goto repeat;
	/* a successful atomic_cmpxchg() implies a full memory barrier here */

	*ocnt = o;
	return 0;
}

I think atomic_cmpxchg implies full memory barrier semantics when the value comparison (*) succeeds...

I don't know whether my understanding is correct; if I am wrong, please correct me. Or do
I need to add more detailed code comments to explain this in the code?

Thanks,
Gao Xiang


>
> Andrea
>
>
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> thanks,
>>>
>>> greg k-h

2018-11-24 08:21:10

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

On Fri, Nov 23, 2018 at 10:51:33AM +0800, Gao Xiang wrote:
> Hi Andrea,
>
> On 2018/11/23 2:50, Andrea Parri wrote:
> > On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> >>> Please document this memory barrier. It does not make much sense to
> >>> me...
> >>
> >> Because we need to make the other observers notice the latest values modified
> >> in this locking period before unfreezing the whole workgroup. One way is to use
> >> a memory barrier and the other way is to use ACQUIRE and RELEASE; we selected
> >> the first one.
> >>
> >> Hmmm...ok, I will add a simple message to explain this, but I think that is
> >> plain enough for a lock...
> >
> > Sympathizing with Greg's request, let me add some specific suggestions:
> >
> > 1. It wouldn't hurt to indicate a pair of memory accesses which are
> > intended to be "ordered" by the memory barrier in question (yes,
> > this pair might not be unique, but you should be able to provide
> > an example).
> >
> > 2. Memory barriers always come matched by other memory barriers, or
> > dependencies (it really does not make sense to talk about a full
> > barrier "in isolation"): please also indicate (an instance of) a
> > matching barrier or the matching barriers.
> >
> > 3. How do the hardware threads communicate? In the acquire/release
> > pattern you mentioned above, the load-acquire *reads from* a/the
> > previous store-release; a memory access that follows the acquire
> > somehow communicates with a memory access preceding the release...
> >
> > 4. It is a good practice to include the above information within an
> > (inline) comment accompanying the added memory barrier (in fact,
> > IIRC, checkpatch.pl gives you a "memory barrier without comment"
> > warning when you omit to do so); not just in the commit message.
> >
> > Hope this helps. Please let me know if something I wrote is unclear,
>
> Thanks for taking the time for the detailed explanation; I think it is helpful for me. :)
> And you are right, barriers should come in pairs, and I think I need to explain more:
>
> static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
> {
> 	int o;
>
> repeat:
> 	o = erofs_wait_on_workgroup_freezed(grp);
>
> 	if (unlikely(o <= 0))
> 		return -1;
>
> 	if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))	<- (*)
> 		goto repeat;
> 	/* a successful atomic_cmpxchg() implies a full memory barrier here */
>
> 	*ocnt = o;
> 	return 0;
> }
>
> I think atomic_cmpxchg implies full memory barrier semantics when the value comparison (*) succeeds...

Correct. This is informally documented in Documentation/atomic_t.txt
and formalized within tools/memory-model/.


>
> I don't know whether my understanding is correct; if I am wrong, please correct me. Or do
> I need to add more detailed code comments to explain this in the code?

Yes, please; please review the above points (including 1. and 3.) and
try to address them with inline comments. Maybe (if that matches the
*behavior*/guarantee you have in mind...) something like:

[in erofs_workgroup_unfreeze()]

/*
* Orders the store/load to/from [???] and the store to
* ->refcount performed by the atomic_set() below.
*
* Matches the atomic_cmpxchg() in erofs_workgroup_get().
*
* Guarantees that if a successful atomic_cmpxchg() reads
* the value stored by the atomic_set() then [???].
*/
smp_mb();
atomic_set(&grp->refcount, v);


[in erofs_workgroup_get()]

/*
* Orders the load from ->refcount performed by the
* atomic_cmpxchg() below and the store/load [???].
*
* See the comment for the smp_mb() in
* erofs_workgroup_unfreeze().
*/
if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))
goto repeat;

Thanks,
Andrea


>
> Thanks,
> Gao Xiang
>
>
> >
> > Andrea
> >
> >
> >>
> >> Thanks,
> >> Gao Xiang
> >>
> >>>
> >>> thanks,
> >>>
> >>> greg k-h

2018-11-24 08:22:21

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze

Hi Andrea,

On 2018/11/23 17:51, Andrea Parri wrote:
> Correct. This is informally documented in Documentation/atomic_t.txt
> and formalized within tools/memory-model/.
>
>
>> I don't know whether my understanding is correct; if I am wrong, please correct me. Or do
>> I need to add more detailed code comments to explain this in the code?
> Yes, please; please review the above points (including 1. and 3.) and
> try to address them with inline comments. Maybe (if that matches the
> *behavior*/guarantee you have in mind...) something like:
>
> [in erofs_workgroup_unfreeze()]
>
> /*
> * Orders the store/load to/from [???] and the store to
> * ->refcount performed by the atomic_set() below.
> *
> * Matches the atomic_cmpxchg() in erofs_workgroup_get().
> *
> * Guarantees that if a successful atomic_cmpxchg() reads
> * the value stored by the atomic_set() then [???].
> */
> smp_mb();
> atomic_set(&grp->refcount, v);
>
>
> [in erofs_workgroup_get()]
>
> /*
> * Orders the load from ->refcount performed by the
> * atomic_cmpxchg() below and the store/load [???].
> *
> * See the comment for the smp_mb() in
> * erofs_workgroup_unfreeze().
> */
> if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))
> goto repeat;
>

OK, I will add these comments in the next version of the patchset, which will be sent later.
Thanks for your suggestion. :)
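
Concretely, something like the following is what I have in mind for the
unfreeze side (a sketch only; the final comment wording may differ in v2):

static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
					    int v)
{
	/*
	 * Orders the stores made during the freezing period against
	 * the store to ->refcount performed by the atomic_set() below.
	 *
	 * Matches the full barrier implied by a successful
	 * atomic_cmpxchg() in erofs_workgroup_get().
	 *
	 * Guarantees that if a successful atomic_cmpxchg() reads the
	 * value stored by the atomic_set(), it also observes all stores
	 * made while the workgroup was frozen.
	 */
	smp_mb();
	atomic_set(&grp->refcount, v);
	preempt_enable();
}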

Thanks,
Gao Xiang

> Thanks,
> Andrea
>
>