2022-07-12 10:55:29

by Jan Kara

Subject: [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races

Hello!

I've noticed this series didn't get merged yet. I was waiting for more review
feedback from Ritesh but somehow that didn't happen. So this is the third
submission of the series fixing races in ext4 xattr block reuse, with the
few changes that have accumulated since v2. Ted, do you think you can add this
series to your tree so that we can merge it during the merge window? Thanks!

Changes since v1:
* Reworked the series to fix all corner cases and make the API less error-prone.

Changes since v2:
* Renamed mb_cache_entry_try_delete() to mb_cache_entry_delete_or_get()
* Added Tested-by tag from Ritesh

Honza

Previous versions:
Link: http://lore.kernel.org/r/[email protected] # v1
Link: http://lore.kernel.org/r/[email protected] # v2


2022-07-12 10:55:36

by Jan Kara

Subject: [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set()

Remove an unnecessary else (and thus an indentation level) from a code
block in ext4_xattr_block_set(). It will also make the following code
changes easier. No functional changes.

CC: [email protected]
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/xattr.c | 79 ++++++++++++++++++++++++-------------------------
1 file changed, 39 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 7fc40fb1e6b3..aadfae53d055 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1850,6 +1850,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
#define header(x) ((struct ext4_xattr_header *)(x))

if (s->base) {
+ int offset = (char *)s->here - bs->bh->b_data;
+
BUFFER_TRACE(bs->bh, "get_write_access");
error = ext4_journal_get_write_access(handle, sb, bs->bh,
EXT4_JTR_NONE);
@@ -1882,50 +1884,47 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (error)
goto cleanup;
goto inserted;
- } else {
- int offset = (char *)s->here - bs->bh->b_data;
+ }
+ unlock_buffer(bs->bh);
+ ea_bdebug(bs->bh, "cloning");
+ s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+ error = -ENOMEM;
+ if (s->base == NULL)
+ goto cleanup;
+ memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
+ s->first = ENTRY(header(s->base)+1);
+ header(s->base)->h_refcount = cpu_to_le32(1);
+ s->here = ENTRY(s->base + offset);
+ s->end = s->base + bs->bh->b_size;

- unlock_buffer(bs->bh);
- ea_bdebug(bs->bh, "cloning");
- s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
- error = -ENOMEM;
- if (s->base == NULL)
+ /*
+ * If existing entry points to an xattr inode, we need
+ * to prevent ext4_xattr_set_entry() from decrementing
+ * ref count on it because the reference belongs to the
+ * original block. In this case, make the entry look
+ * like it has an empty value.
+ */
+ if (!s->not_found && s->here->e_value_inum) {
+ ea_ino = le32_to_cpu(s->here->e_value_inum);
+ error = ext4_xattr_inode_iget(inode, ea_ino,
+ le32_to_cpu(s->here->e_hash),
+ &tmp_inode);
+ if (error)
goto cleanup;
- memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
- s->first = ENTRY(header(s->base)+1);
- header(s->base)->h_refcount = cpu_to_le32(1);
- s->here = ENTRY(s->base + offset);
- s->end = s->base + bs->bh->b_size;

- /*
- * If existing entry points to an xattr inode, we need
- * to prevent ext4_xattr_set_entry() from decrementing
- * ref count on it because the reference belongs to the
- * original block. In this case, make the entry look
- * like it has an empty value.
- */
- if (!s->not_found && s->here->e_value_inum) {
- ea_ino = le32_to_cpu(s->here->e_value_inum);
- error = ext4_xattr_inode_iget(inode, ea_ino,
- le32_to_cpu(s->here->e_hash),
- &tmp_inode);
- if (error)
- goto cleanup;
-
- if (!ext4_test_inode_state(tmp_inode,
- EXT4_STATE_LUSTRE_EA_INODE)) {
- /*
- * Defer quota free call for previous
- * inode until success is guaranteed.
- */
- old_ea_inode_quota = le32_to_cpu(
- s->here->e_value_size);
- }
- iput(tmp_inode);
-
- s->here->e_value_inum = 0;
- s->here->e_value_size = 0;
+ if (!ext4_test_inode_state(tmp_inode,
+ EXT4_STATE_LUSTRE_EA_INODE)) {
+ /*
+ * Defer quota free call for previous
+ * inode until success is guaranteed.
+ */
+ old_ea_inode_quota = le32_to_cpu(
+ s->here->e_value_size);
}
+ iput(tmp_inode);
+
+ s->here->e_value_inum = 0;
+ s->here->e_value_size = 0;
}
} else {
/* Allocate a buffer where we construct the new block. */
--
2.35.3

2022-07-12 10:55:40

by Jan Kara

Subject: [PATCH 03/10] ext4: Remove EA inode entry from mbcache on inode eviction

Currently we remove the EA inode from the mbcache as soon as its xattr
refcount drops to zero. However, there can be pending attempts to reuse
the inode and thus the refcount handling code has to handle the situation
when the refcount increases from zero anyway. So save some work and just
keep the EA inode in the mbcache until it is evicted. At that moment we
are sure any following iget() of the EA inode will fail anyway (or will
wait for the eviction to finish and load things from the disk again), so
removing the mbcache entry at that moment is fine and simplifies the code
a bit.

CC: [email protected]
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 2 ++
fs/ext4/xattr.c | 24 ++++++++----------------
fs/ext4/xattr.h | 1 +
3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3dce7d058985..7450ee734262 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -177,6 +177,8 @@ void ext4_evict_inode(struct inode *inode)

trace_ext4_evict_inode(inode);

+ if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
+ ext4_evict_ea_inode(inode);
if (inode->i_nlink) {
/*
* When journalling data dirty buffers are tracked only in the
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 042325349098..7fc40fb1e6b3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -436,6 +436,14 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
return err;
}

+/* Remove entry from mbcache when EA inode is getting evicted */
+void ext4_evict_ea_inode(struct inode *inode)
+{
+ if (EA_INODE_CACHE(inode))
+ mb_cache_entry_delete(EA_INODE_CACHE(inode),
+ ext4_xattr_inode_get_hash(inode), inode->i_ino);
+}
+
static int
ext4_xattr_inode_verify_hashes(struct inode *ea_inode,
struct ext4_xattr_entry *entry, void *buffer,
@@ -976,10 +984,8 @@ int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
int ref_change)
{
- struct mb_cache *ea_inode_cache = EA_INODE_CACHE(ea_inode);
struct ext4_iloc iloc;
s64 ref_count;
- u32 hash;
int ret;

inode_lock(ea_inode);
@@ -1002,14 +1008,6 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,

set_nlink(ea_inode, 1);
ext4_orphan_del(handle, ea_inode);
-
- if (ea_inode_cache) {
- hash = ext4_xattr_inode_get_hash(ea_inode);
- mb_cache_entry_create(ea_inode_cache,
- GFP_NOFS, hash,
- ea_inode->i_ino,
- true /* reusable */);
- }
}
} else {
WARN_ONCE(ref_count < 0, "EA inode %lu ref_count=%lld",
@@ -1022,12 +1020,6 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,

clear_nlink(ea_inode);
ext4_orphan_add(handle, ea_inode);
-
- if (ea_inode_cache) {
- hash = ext4_xattr_inode_get_hash(ea_inode);
- mb_cache_entry_delete(ea_inode_cache, hash,
- ea_inode->i_ino);
- }
}
}

diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 77efb9a627ad..060b7a2f485c 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -178,6 +178,7 @@ extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);

extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);
+extern void ext4_evict_ea_inode(struct inode *inode);

extern const struct xattr_handler *ext4_xattr_handlers[];

--
2.35.3

2022-07-12 10:55:46

by Jan Kara

Subject: [PATCH 09/10] mbcache: Remove mb_cache_entry_delete()

Nobody uses mb_cache_entry_delete() anymore. Remove it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/mbcache.c | 37 -------------------------------------
include/linux/mbcache.h | 1 -
2 files changed, 38 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 2010bc80a3f2..d1ebb5df2856 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -230,43 +230,6 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
}
EXPORT_SYMBOL(mb_cache_entry_get);

-/* mb_cache_entry_delete - try to remove a cache entry
- * @cache - cache we work with
- * @key - key
- * @value - value
- *
- * Remove entry from cache @cache with key @key and value @value.
- */
-void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
-{
- struct hlist_bl_node *node;
- struct hlist_bl_head *head;
- struct mb_cache_entry *entry;
-
- head = mb_cache_entry_head(cache, key);
- hlist_bl_lock(head);
- hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
- if (entry->e_key == key && entry->e_value == value) {
- /* We keep hash list reference to keep entry alive */
- hlist_bl_del_init(&entry->e_hash_list);
- hlist_bl_unlock(head);
- spin_lock(&cache->c_list_lock);
- if (!list_empty(&entry->e_list)) {
- list_del_init(&entry->e_list);
- if (!WARN_ONCE(cache->c_entry_count == 0,
- "mbcache: attempt to decrement c_entry_count past zero"))
- cache->c_entry_count--;
- atomic_dec(&entry->e_refcnt);
- }
- spin_unlock(&cache->c_list_lock);
- mb_cache_entry_put(cache, entry);
- return;
- }
- }
- hlist_bl_unlock(head);
-}
-EXPORT_SYMBOL(mb_cache_entry_delete);
-
/* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
* @cache - cache we work with
* @key - key
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 8eca7f25c432..452b579856d4 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -47,7 +47,6 @@ static inline int mb_cache_entry_put(struct mb_cache *cache,

struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
u32 key, u64 value);
-void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
u64 value);
struct mb_cache_entry *mb_cache_entry_find_first(struct mb_cache *cache,
--
2.35.3

2022-07-12 10:56:17

by Jan Kara

Subject: [PATCH 02/10] mbcache: Add functions to delete entry if unused

Add a function mb_cache_entry_delete_or_get() to delete an mbcache entry
if it is unused, and also add a function to wait for an entry to become
unused - mb_cache_entry_wait_unused(). We do not share code between the
two deleting functions as one of them will go away soon.

CC: [email protected]
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <[email protected]>
---
fs/mbcache.c | 66 +++++++++++++++++++++++++++++++++++++++--
include/linux/mbcache.h | 10 ++++++-
2 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index cfc28129fb6f..2010bc80a3f2 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -11,7 +11,7 @@
/*
* Mbcache is a simple key-value store. Keys need not be unique, however
* key-value pairs are expected to be unique (we use this fact in
- * mb_cache_entry_delete()).
+ * mb_cache_entry_delete_or_get()).
*
* Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
* Ext4 also uses it for deduplication of xattr values stored in inodes.
@@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
}
EXPORT_SYMBOL(__mb_cache_entry_free);

+/*
+ * mb_cache_entry_wait_unused - wait to be the last user of the entry
+ *
+ * @entry - entry to work on
+ *
+ * Wait to be the last user of the entry.
+ */
+void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
+{
+ wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
+}
+EXPORT_SYMBOL(mb_cache_entry_wait_unused);
+
static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
struct mb_cache_entry *entry,
u32 key)
@@ -217,7 +230,7 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
}
EXPORT_SYMBOL(mb_cache_entry_get);

-/* mb_cache_entry_delete - remove a cache entry
+/* mb_cache_entry_delete - try to remove a cache entry
* @cache - cache we work with
* @key - key
* @value - value
@@ -254,6 +267,55 @@ void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
}
EXPORT_SYMBOL(mb_cache_entry_delete);

+/* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
+ * @cache - cache we work with
+ * @key - key
+ * @value - value
+ *
+ * Remove entry from cache @cache with key @key and value @value. The removal
+ * happens only if the entry is unused. The function returns NULL in case the
+ * entry was successfully removed or there's no entry in cache. Otherwise the
+ * function grabs reference of the entry that we failed to delete because it
+ * still has users and return it.
+ */
+struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
+ u32 key, u64 value)
+{
+ struct hlist_bl_node *node;
+ struct hlist_bl_head *head;
+ struct mb_cache_entry *entry;
+
+ head = mb_cache_entry_head(cache, key);
+ hlist_bl_lock(head);
+ hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
+ if (entry->e_key == key && entry->e_value == value) {
+ if (atomic_read(&entry->e_refcnt) > 2) {
+ atomic_inc(&entry->e_refcnt);
+ hlist_bl_unlock(head);
+ return entry;
+ }
+ /* We keep hash list reference to keep entry alive */
+ hlist_bl_del_init(&entry->e_hash_list);
+ hlist_bl_unlock(head);
+ spin_lock(&cache->c_list_lock);
+ if (!list_empty(&entry->e_list)) {
+ list_del_init(&entry->e_list);
+ if (!WARN_ONCE(cache->c_entry_count == 0,
+ "mbcache: attempt to decrement c_entry_count past zero"))
+ cache->c_entry_count--;
+ atomic_dec(&entry->e_refcnt);
+ }
+ spin_unlock(&cache->c_list_lock);
+ mb_cache_entry_put(cache, entry);
+ return NULL;
+ }
+ }
+ hlist_bl_unlock(head);
+
+ return NULL;
+}
+EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
+
/* mb_cache_entry_touch - cache entry got used
* @cache - cache the entry belongs to
* @entry - entry that got used
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 20f1e3ff6013..8eca7f25c432 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -30,15 +30,23 @@ void mb_cache_destroy(struct mb_cache *cache);
int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
u64 value, bool reusable);
void __mb_cache_entry_free(struct mb_cache_entry *entry);
+void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
static inline int mb_cache_entry_put(struct mb_cache *cache,
struct mb_cache_entry *entry)
{
- if (!atomic_dec_and_test(&entry->e_refcnt))
+ unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
+
+ if (cnt > 0) {
+ if (cnt <= 3)
+ wake_up_var(&entry->e_refcnt);
return 0;
+ }
__mb_cache_entry_free(entry);
return 1;
}

+struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
+ u32 key, u64 value);
void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
u64 value);
--
2.35.3
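
The two helpers added above are meant to be used together; the later patches
in this series use exactly this pattern (for example ext4_evict_ea_inode() in
patch 05/10). A minimal sketch of that caller pattern, with the locking of the
real callers omitted and the function name invented for illustration:

static void drop_entry_when_unused(struct mb_cache *cache, u32 hash, u64 value)
{
        struct mb_cache_entry *oe;

        /*
         * With this patch an entry starts with a refcount of 2 (hash
         * list + LRU list), so anything above 2 means another user.
         * mb_cache_entry_delete_or_get() returns NULL once the entry is
         * gone (or was unused and has been removed); otherwise it
         * returns the entry with an extra reference so we can wait.
         */
        while ((oe = mb_cache_entry_delete_or_get(cache, hash, value))) {
                /* sleep until hash + LRU + our reference are the only ones */
                mb_cache_entry_wait_unused(oe);
                /* drop our reference and retry the delete */
                mb_cache_entry_put(cache, oe);
        }
}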

2022-07-12 10:56:18

by Jan Kara

Subject: [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set()

Replace one else in ext2_xattr_set() with a goto. This makes the
following code changes simpler to follow. No functional changes.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/xattr.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 9885294993ef..37ce495eb279 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -517,7 +517,8 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
/* Here we know that we can set the new attribute. */

if (header) {
- /* assert(header == HDR(bh)); */
+ int offset;
+
lock_buffer(bh);
if (header->h_refcount == cpu_to_le32(1)) {
__u32 hash = le32_to_cpu(header->h_hash);
@@ -531,22 +532,20 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
bh->b_blocknr);

/* keep the buffer locked while modifying it. */
- } else {
- int offset;
-
- unlock_buffer(bh);
- ea_bdebug(bh, "cloning");
- header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
- error = -ENOMEM;
- if (header == NULL)
- goto cleanup;
- header->h_refcount = cpu_to_le32(1);
-
- offset = (char *)here - bh->b_data;
- here = ENTRY((char *)header + offset);
- offset = (char *)last - bh->b_data;
- last = ENTRY((char *)header + offset);
+ goto update_block;
}
+ unlock_buffer(bh);
+ ea_bdebug(bh, "cloning");
+ header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
+ error = -ENOMEM;
+ if (header == NULL)
+ goto cleanup;
+ header->h_refcount = cpu_to_le32(1);
+
+ offset = (char *)here - bh->b_data;
+ here = ENTRY((char *)header + offset);
+ offset = (char *)last - bh->b_data;
+ last = ENTRY((char *)header + offset);
} else {
/* Allocate a buffer where we construct the new block. */
header = kzalloc(sb->s_blocksize, GFP_KERNEL);
@@ -559,6 +558,7 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
last = here = ENTRY(header+1);
}

+update_block:
/* Iff we are modifying the block in-place, bh is locked here. */

if (not_found) {
--
2.35.3

2022-07-12 10:56:19

by Jan Kara

Subject: [PATCH 06/10] ext2: Factor out freeing of xattr block reference

Freeing of an xattr block reference is open coded in two places. Factor
it out into a separate function and use it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
1 file changed, 38 insertions(+), 52 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 841fa6d9d744..9885294993ef 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
return error;
}

+static void ext2_xattr_release_block(struct inode *inode,
+ struct buffer_head *bh)
+{
+ struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
+
+ lock_buffer(bh);
+ if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
+ __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
+
+ /*
+ * This must happen under buffer lock for
+ * ext2_xattr_set2() to reliably detect freed block
+ */
+ mb_cache_entry_delete(ea_block_cache, hash,
+ bh->b_blocknr);
+ /* Free the old block. */
+ ea_bdebug(bh, "freeing");
+ ext2_free_blocks(inode, bh->b_blocknr, 1);
+ /* We let our caller release bh, so we
+ * need to duplicate the buffer before. */
+ get_bh(bh);
+ bforget(bh);
+ unlock_buffer(bh);
+ } else {
+ /* Decrement the refcount only. */
+ le32_add_cpu(&HDR(bh)->h_refcount, -1);
+ dquot_free_block(inode, 1);
+ mark_buffer_dirty(bh);
+ unlock_buffer(bh);
+ ea_bdebug(bh, "refcount now=%d",
+ le32_to_cpu(HDR(bh)->h_refcount));
+ if (IS_SYNC(inode))
+ sync_dirty_buffer(bh);
+ }
+}
+
/*
* Second half of ext2_xattr_set(): Update the file system.
*/
@@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
* If there was an old block and we are no longer using it,
* release the old block.
*/
- lock_buffer(old_bh);
- if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
- __u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
-
- /*
- * This must happen under buffer lock for
- * ext2_xattr_set2() to reliably detect freed block
- */
- mb_cache_entry_delete(ea_block_cache, hash,
- old_bh->b_blocknr);
- /* Free the old block. */
- ea_bdebug(old_bh, "freeing");
- ext2_free_blocks(inode, old_bh->b_blocknr, 1);
- mark_inode_dirty(inode);
- /* We let our caller release old_bh, so we
- * need to duplicate the buffer before. */
- get_bh(old_bh);
- bforget(old_bh);
- } else {
- /* Decrement the refcount only. */
- le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
- dquot_free_block_nodirty(inode, 1);
- mark_inode_dirty(inode);
- mark_buffer_dirty(old_bh);
- ea_bdebug(old_bh, "refcount now=%d",
- le32_to_cpu(HDR(old_bh)->h_refcount));
- }
- unlock_buffer(old_bh);
+ ext2_xattr_release_block(inode, old_bh);
}

cleanup:
@@ -828,30 +837,7 @@ ext2_xattr_delete_inode(struct inode *inode)
EXT2_I(inode)->i_file_acl);
goto cleanup;
}
- lock_buffer(bh);
- if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
- __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
-
- /*
- * This must happen under buffer lock for ext2_xattr_set2() to
- * reliably detect freed block
- */
- mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
- bh->b_blocknr);
- ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
- get_bh(bh);
- bforget(bh);
- unlock_buffer(bh);
- } else {
- le32_add_cpu(&HDR(bh)->h_refcount, -1);
- ea_bdebug(bh, "refcount now=%d",
- le32_to_cpu(HDR(bh)->h_refcount));
- unlock_buffer(bh);
- mark_buffer_dirty(bh);
- if (IS_SYNC(inode))
- sync_dirty_buffer(bh);
- dquot_free_block_nodirty(inode, 1);
- }
+ ext2_xattr_release_block(inode, bh);
EXT2_I(inode)->i_file_acl = 0;

cleanup:
--
2.35.3

2022-07-12 10:56:40

by Jan Kara

Subject: [PATCH 08/10] ext2: Avoid deleting xattr block that is being reused

Currently, when we decide to reuse an xattr block, we detect the case
where the last reference to the xattr block is being dropped at the same
time and cancel the reuse attempt. Convert ext2 to a new scheme: as soon
as a matching mbcache entry is found, we wait with dropping the last
xattr block reference until the mbcache entry reference is dropped
(meaning either the xattr block reference was increased or we decided not
to reuse the block).

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/xattr.c | 58 ++++++++++++++++++++++++-------------------------
1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 37ce495eb279..641abfa4b718 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -522,17 +522,18 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
lock_buffer(bh);
if (header->h_refcount == cpu_to_le32(1)) {
__u32 hash = le32_to_cpu(header->h_hash);
+ struct mb_cache_entry *oe;

- ea_bdebug(bh, "modifying in-place");
+ oe = mb_cache_entry_delete_or_get(EA_BLOCK_CACHE(inode),
+ hash, bh->b_blocknr);
+ if (!oe) {
+ ea_bdebug(bh, "modifying in-place");
+ goto update_block;
+ }
/*
- * This must happen under buffer lock for
- * ext2_xattr_set2() to reliably detect modified block
+ * Someone is trying to reuse the block, leave it alone
*/
- mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
- bh->b_blocknr);
-
- /* keep the buffer locked while modifying it. */
- goto update_block;
+ mb_cache_entry_put(EA_BLOCK_CACHE(inode), oe);
}
unlock_buffer(bh);
ea_bdebug(bh, "cloning");
@@ -656,16 +657,29 @@ static void ext2_xattr_release_block(struct inode *inode,
{
struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);

+retry_ref:
lock_buffer(bh);
if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
__u32 hash = le32_to_cpu(HDR(bh)->h_hash);
+ struct mb_cache_entry *oe;

/*
- * This must happen under buffer lock for
- * ext2_xattr_set2() to reliably detect freed block
+ * This must happen under buffer lock to properly
+ * serialize with ext2_xattr_set() reusing the block.
*/
- mb_cache_entry_delete(ea_block_cache, hash,
- bh->b_blocknr);
+ oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
+ bh->b_blocknr);
+ if (oe) {
+ /*
+ * Someone is trying to reuse the block. Wait
+ * and retry.
+ */
+ unlock_buffer(bh);
+ mb_cache_entry_wait_unused(oe);
+ mb_cache_entry_put(ea_block_cache, oe);
+ goto retry_ref;
+ }
+
/* Free the old block. */
ea_bdebug(bh, "freeing");
ext2_free_blocks(inode, bh->b_blocknr, 1);
@@ -929,7 +943,7 @@ ext2_xattr_cache_find(struct inode *inode, struct ext2_xattr_header *header)
if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
-again:
+
ce = mb_cache_entry_find_first(ea_block_cache, hash);
while (ce) {
struct buffer_head *bh;
@@ -941,22 +955,8 @@ ext2_xattr_cache_find(struct inode *inode, struct ext2_xattr_header *header)
inode->i_ino, (unsigned long) ce->e_value);
} else {
lock_buffer(bh);
- /*
- * We have to be careful about races with freeing or
- * rehashing of xattr block. Once we hold buffer lock
- * xattr block's state is stable so we can check
- * whether the block got freed / rehashed or not.
- * Since we unhash mbcache entry under buffer lock when
- * freeing / rehashing xattr block, checking whether
- * entry is still hashed is reliable.
- */
- if (hlist_bl_unhashed(&ce->e_hash_list)) {
- mb_cache_entry_put(ea_block_cache, ce);
- unlock_buffer(bh);
- brelse(bh);
- goto again;
- } else if (le32_to_cpu(HDR(bh)->h_refcount) >
- EXT2_XATTR_REFCOUNT_MAX) {
+ if (le32_to_cpu(HDR(bh)->h_refcount) >
+ EXT2_XATTR_REFCOUNT_MAX) {
ea_idebug(inode, "block %ld refcount %d>%d",
(unsigned long) ce->e_value,
le32_to_cpu(HDR(bh)->h_refcount),
--
2.35.3
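
One detail of the release path above that is easy to miss: the buffer lock is
dropped before sleeping in mb_cache_entry_wait_unused(), and the whole check is
redone from scratch afterwards, so the wait cannot deadlock against a reuser
that has already pinned the mbcache entry and still needs the buffer lock to
bump the block's refcount. A simplified sketch of just that retry pattern (the
function name is invented; the real code goes on to free the block or drop its
on-disk refcount):

static void wait_for_xattr_block_unused(struct mb_cache *cache,
                                        struct buffer_head *bh)
{
        struct mb_cache_entry *oe;

retry:
        lock_buffer(bh);
        if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
                oe = mb_cache_entry_delete_or_get(cache,
                                le32_to_cpu(HDR(bh)->h_hash), bh->b_blocknr);
                if (oe) {
                        /* someone is trying to reuse the block: back off */
                        unlock_buffer(bh);
                        mb_cache_entry_wait_unused(oe);
                        mb_cache_entry_put(cache, oe);
                        goto retry;
                }
                /* here the real code frees the block */
        }
        unlock_buffer(bh);
}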

2022-07-12 10:57:05

by Jan Kara

Subject: [PATCH 01/10] mbcache: Don't reclaim used entries

Do not reclaim entries that are currently used by somebody from the
shrinker. Firstly, these entries are likely useful. Secondly, we will
need to keep such entries to protect a pending increment of the xattr
block refcount.

CC: [email protected]
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <[email protected]>
---
fs/mbcache.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 97c54d3a2227..cfc28129fb6f 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
while (nr_to_scan-- && !list_empty(&cache->c_list)) {
entry = list_first_entry(&cache->c_list,
struct mb_cache_entry, e_list);
- if (entry->e_referenced) {
+ if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
entry->e_referenced = 0;
list_move_tail(&entry->e_list, &cache->c_list);
continue;
@@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
spin_unlock(&cache->c_list_lock);
head = mb_cache_entry_head(cache, entry->e_key);
hlist_bl_lock(head);
+ /* Now a reliable check if the entry didn't get used... */
+ if (atomic_read(&entry->e_refcnt) > 2) {
+ hlist_bl_unlock(head);
+ spin_lock(&cache->c_list_lock);
+ list_add_tail(&entry->e_list, &cache->c_list);
+ cache->c_entry_count++;
+ continue;
+ }
if (!hlist_bl_unhashed(&entry->e_hash_list)) {
hlist_bl_del_init(&entry->e_hash_list);
atomic_dec(&entry->e_refcnt);
--
2.35.3

2022-07-12 10:57:06

by Jan Kara

Subject: [PATCH 05/10] ext4: Fix race when reusing xattr blocks

When ext4_xattr_block_set() decides to remove xattr block the following
race can happen:

CPU1                                          CPU2
ext4_xattr_block_set()                        ext4_xattr_release_block()
  new_bh = ext4_xattr_block_cache_find()

                                                lock_buffer(bh);
                                                ref = le32_to_cpu(BHDR(bh)->h_refcount);
                                                if (ref == 1) {
                                                        ...
                                                        mb_cache_entry_delete();
                                                        unlock_buffer(bh);
                                                        ext4_free_blocks();
                                                          ...
                                                        ext4_forget(..., bh, ...);
                                                        jbd2_journal_revoke(..., bh);

  ext4_journal_get_write_access(..., new_bh, ...)
    do_get_write_access()
      jbd2_journal_cancel_revoke(..., new_bh);

Later the code in ext4_xattr_block_set() finds out the block got freed
and cancels reuse of the block, but the revoke stays canceled, so in case
of block reuse and journal replay the filesystem can get corrupted. If
the race works out slightly differently, we can also hit assertions in
the jbd2 code.

Fix the problem by making sure that once a matching mbcache entry is
found, the code dropping the last xattr block reference (or trying to
modify the xattr block in place) waits until the mbcache entry reference
is dropped. This way the code trying to reuse the xattr block is
protected from someone trying to drop the last reference to the xattr
block.

Reported-and-tested-by: Ritesh Harjani <[email protected]>
CC: [email protected]
Fixes: 82939d7999df ("ext4: convert to mbcache2")
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/xattr.c | 67 +++++++++++++++++++++++++++++++++----------------
1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index aadfae53d055..3a0928c8720e 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -439,9 +439,16 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
/* Remove entry from mbcache when EA inode is getting evicted */
void ext4_evict_ea_inode(struct inode *inode)
{
- if (EA_INODE_CACHE(inode))
- mb_cache_entry_delete(EA_INODE_CACHE(inode),
- ext4_xattr_inode_get_hash(inode), inode->i_ino);
+ struct mb_cache_entry *oe;
+
+ if (!EA_INODE_CACHE(inode))
+ return;
+ /* Wait for entry to get unused so that we can remove it */
+ while ((oe = mb_cache_entry_delete_or_get(EA_INODE_CACHE(inode),
+ ext4_xattr_inode_get_hash(inode), inode->i_ino))) {
+ mb_cache_entry_wait_unused(oe);
+ mb_cache_entry_put(EA_INODE_CACHE(inode), oe);
+ }
}

static int
@@ -1229,6 +1236,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
if (error)
goto out;

+retry_ref:
lock_buffer(bh);
hash = le32_to_cpu(BHDR(bh)->h_hash);
ref = le32_to_cpu(BHDR(bh)->h_refcount);
@@ -1238,9 +1246,18 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
* This must happen under buffer lock for
* ext4_xattr_block_set() to reliably detect freed block
*/
- if (ea_block_cache)
- mb_cache_entry_delete(ea_block_cache, hash,
- bh->b_blocknr);
+ if (ea_block_cache) {
+ struct mb_cache_entry *oe;
+
+ oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
+ bh->b_blocknr);
+ if (oe) {
+ unlock_buffer(bh);
+ mb_cache_entry_wait_unused(oe);
+ mb_cache_entry_put(ea_block_cache, oe);
+ goto retry_ref;
+ }
+ }
get_bh(bh);
unlock_buffer(bh);

@@ -1867,9 +1884,20 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
* ext4_xattr_block_set() to reliably detect modified
* block
*/
- if (ea_block_cache)
- mb_cache_entry_delete(ea_block_cache, hash,
- bs->bh->b_blocknr);
+ if (ea_block_cache) {
+ struct mb_cache_entry *oe;
+
+ oe = mb_cache_entry_delete_or_get(ea_block_cache,
+ hash, bs->bh->b_blocknr);
+ if (oe) {
+ /*
+ * Xattr block is getting reused. Leave
+ * it alone.
+ */
+ mb_cache_entry_put(ea_block_cache, oe);
+ goto clone_block;
+ }
+ }
ea_bdebug(bs->bh, "modifying in-place");
error = ext4_xattr_set_entry(i, s, handle, inode,
true /* is_block */);
@@ -1885,6 +1913,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
goto cleanup;
goto inserted;
}
+clone_block:
unlock_buffer(bs->bh);
ea_bdebug(bs->bh, "cloning");
s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
@@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
lock_buffer(new_bh);
/*
* We have to be careful about races with
- * freeing, rehashing or adding references to
- * xattr block. Once we hold buffer lock xattr
- * block's state is stable so we can check
- * whether the block got freed / rehashed or
- * not. Since we unhash mbcache entry under
- * buffer lock when freeing / rehashing xattr
- * block, checking whether entry is still
- * hashed is reliable. Same rules hold for
- * e_reusable handling.
+ * adding references to xattr block. Once we
+ * hold buffer lock xattr block's state is
+ * stable so we can check the additional
+ * reference fits.
*/
- if (hlist_bl_unhashed(&ce->e_hash_list) ||
- !ce->e_reusable) {
+ ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
+ if (ref > EXT4_XATTR_REFCOUNT_MAX) {
/*
* Undo everything and check mbcache
* again.
@@ -2017,9 +2041,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
new_bh = NULL;
goto inserted;
}
- ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
- if (ref >= EXT4_XATTR_REFCOUNT_MAX)
+ if (ref == EXT4_XATTR_REFCOUNT_MAX)
ce->e_reusable = 0;
ea_bdebug(new_bh, "reusing; refcount now=%d",
ref);
--
2.35.3
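
For reference, the core of the new modify-in-place decision above can be
condensed as follows (a sketch only: the helper name is invented and
journalling, error handling and the ea_block_cache == NULL case are left out).
The point is that the mbcache entry is removed only while nobody holds a
reference to it, and since this happens under the buffer lock it cannot race
with a reuser that has already pinned the entry:

/* caller holds lock_buffer(bh), as ext4_xattr_block_set() does */
static bool can_modify_block_in_place(struct mb_cache *ea_block_cache,
                                      struct buffer_head *bh, u32 hash)
{
        struct mb_cache_entry *oe;

        if (BHDR(bh)->h_refcount != cpu_to_le32(1))
                return false;           /* block is shared: must clone */

        oe = mb_cache_entry_delete_or_get(ea_block_cache, hash, bh->b_blocknr);
        if (!oe)
                return true;            /* no pending reuser: modify in place */

        /* a reuser pinned the entry: leave the block alone and clone it */
        mb_cache_entry_put(ea_block_cache, oe);
        return false;
}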

2022-07-12 10:58:11

by Jan Kara

Subject: [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing

Use the fact that entries with an elevated refcount are not removed from
the hash and just move removal of the entry from the hash to entry
freeing time. When doing this we also change the generic code to hold
one reference to the cache entry, not two of them, which makes the code
somewhat more obvious.

Signed-off-by: Jan Kara <[email protected]>
---
fs/mbcache.c | 108 +++++++++++++++-------------------------
include/linux/mbcache.h | 24 ++++++---
2 files changed, 55 insertions(+), 77 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index d1ebb5df2856..96f1d49d30a5 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
return -ENOMEM;

INIT_LIST_HEAD(&entry->e_list);
- /* One ref for hash, one ref returned */
+ /* Initial hash reference */
atomic_set(&entry->e_refcnt, 1);
entry->e_key = key;
entry->e_value = value;
@@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
}
}
hlist_bl_add_head(&entry->e_hash_list, head);
- hlist_bl_unlock(head);
-
+ /*
+ * Add entry to LRU list before it can be found by
+ * mb_cache_entry_delete() to avoid races
+ */
spin_lock(&cache->c_list_lock);
list_add_tail(&entry->e_list, &cache->c_list);
- /* Grab ref for LRU list */
- atomic_inc(&entry->e_refcnt);
cache->c_entry_count++;
spin_unlock(&cache->c_list_lock);
+ hlist_bl_unlock(head);

return 0;
}
EXPORT_SYMBOL(mb_cache_entry_create);

-void __mb_cache_entry_free(struct mb_cache_entry *entry)
+void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
{
+ struct hlist_bl_head *head;
+
+ head = mb_cache_entry_head(cache, entry->e_key);
+ hlist_bl_lock(head);
+ hlist_bl_del(&entry->e_hash_list);
+ hlist_bl_unlock(head);
kmem_cache_free(mb_entry_cache, entry);
}
EXPORT_SYMBOL(__mb_cache_entry_free);
@@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
*/
void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
{
- wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
+ wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
}
EXPORT_SYMBOL(mb_cache_entry_wait_unused);

@@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
while (node) {
entry = hlist_bl_entry(node, struct mb_cache_entry,
e_hash_list);
- if (entry->e_key == key && entry->e_reusable) {
- atomic_inc(&entry->e_refcnt);
+ if (entry->e_key == key && entry->e_reusable &&
+ atomic_inc_not_zero(&entry->e_refcnt))
goto out;
- }
node = node->next;
}
entry = NULL;
@@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
head = mb_cache_entry_head(cache, key);
hlist_bl_lock(head);
hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
- if (entry->e_key == key && entry->e_value == value) {
- atomic_inc(&entry->e_refcnt);
+ if (entry->e_key == key && entry->e_value == value &&
+ atomic_inc_not_zero(&entry->e_refcnt))
goto out;
- }
}
entry = NULL;
out:
@@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
u32 key, u64 value)
{
- struct hlist_bl_node *node;
- struct hlist_bl_head *head;
struct mb_cache_entry *entry;

- head = mb_cache_entry_head(cache, key);
- hlist_bl_lock(head);
- hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
- if (entry->e_key == key && entry->e_value == value) {
- if (atomic_read(&entry->e_refcnt) > 2) {
- atomic_inc(&entry->e_refcnt);
- hlist_bl_unlock(head);
- return entry;
- }
- /* We keep hash list reference to keep entry alive */
- hlist_bl_del_init(&entry->e_hash_list);
- hlist_bl_unlock(head);
- spin_lock(&cache->c_list_lock);
- if (!list_empty(&entry->e_list)) {
- list_del_init(&entry->e_list);
- if (!WARN_ONCE(cache->c_entry_count == 0,
- "mbcache: attempt to decrement c_entry_count past zero"))
- cache->c_entry_count--;
- atomic_dec(&entry->e_refcnt);
- }
- spin_unlock(&cache->c_list_lock);
- mb_cache_entry_put(cache, entry);
- return NULL;
- }
- }
- hlist_bl_unlock(head);
+ entry = mb_cache_entry_get(cache, key, value);
+ if (!entry)
+ return NULL;
+
+ /*
+ * Drop the ref we got from mb_cache_entry_get() and the initial hash
+ * ref if we are the last user
+ */
+ if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
+ return entry;

+ spin_lock(&cache->c_list_lock);
+ if (!list_empty(&entry->e_list))
+ list_del_init(&entry->e_list);
+ cache->c_entry_count--;
+ spin_unlock(&cache->c_list_lock);
+ __mb_cache_entry_free(cache, entry);
return NULL;
}
EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
@@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
unsigned long nr_to_scan)
{
struct mb_cache_entry *entry;
- struct hlist_bl_head *head;
unsigned long shrunk = 0;

spin_lock(&cache->c_list_lock);
while (nr_to_scan-- && !list_empty(&cache->c_list)) {
entry = list_first_entry(&cache->c_list,
struct mb_cache_entry, e_list);
- if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
+ /* Drop initial hash reference if there is no user */
+ if (entry->e_referenced ||
+ atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
entry->e_referenced = 0;
list_move_tail(&entry->e_list, &cache->c_list);
continue;
}
list_del_init(&entry->e_list);
cache->c_entry_count--;
- /*
- * We keep LRU list reference so that entry doesn't go away
- * from under us.
- */
spin_unlock(&cache->c_list_lock);
- head = mb_cache_entry_head(cache, entry->e_key);
- hlist_bl_lock(head);
- /* Now a reliable check if the entry didn't get used... */
- if (atomic_read(&entry->e_refcnt) > 2) {
- hlist_bl_unlock(head);
- spin_lock(&cache->c_list_lock);
- list_add_tail(&entry->e_list, &cache->c_list);
- cache->c_entry_count++;
- continue;
- }
- if (!hlist_bl_unhashed(&entry->e_hash_list)) {
- hlist_bl_del_init(&entry->e_hash_list);
- atomic_dec(&entry->e_refcnt);
- }
- hlist_bl_unlock(head);
- if (mb_cache_entry_put(cache, entry))
- shrunk++;
+ __mb_cache_entry_free(cache, entry);
+ shrunk++;
cond_resched();
spin_lock(&cache->c_list_lock);
}
@@ -433,11 +408,6 @@ void mb_cache_destroy(struct mb_cache *cache)
* point.
*/
list_for_each_entry_safe(entry, next, &cache->c_list, e_list) {
- if (!hlist_bl_unhashed(&entry->e_hash_list)) {
- hlist_bl_del_init(&entry->e_hash_list);
- atomic_dec(&entry->e_refcnt);
- } else
- WARN_ON(1);
list_del(&entry->e_list);
WARN_ON(atomic_read(&entry->e_refcnt) != 1);
mb_cache_entry_put(cache, entry);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 452b579856d4..2da63fd7b98f 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -13,8 +13,16 @@ struct mb_cache;
struct mb_cache_entry {
/* List of entries in cache - protected by cache->c_list_lock */
struct list_head e_list;
- /* Hash table list - protected by hash chain bitlock */
+ /*
+ * Hash table list - protected by hash chain bitlock. The entry is
+ * guaranteed to be hashed while e_refcnt > 0.
+ */
struct hlist_bl_node e_hash_list;
+ /*
+ * Entry refcount. Once it reaches zero, entry is unhashed and freed.
+ * While refcount > 0, the entry is guaranteed to stay in the hash and
+ * e.g. mb_cache_entry_try_delete() will fail.
+ */
atomic_t e_refcnt;
/* Key in hash - stable during lifetime of the entry */
u32 e_key;
@@ -29,20 +37,20 @@ void mb_cache_destroy(struct mb_cache *cache);

int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
u64 value, bool reusable);
-void __mb_cache_entry_free(struct mb_cache_entry *entry);
+void __mb_cache_entry_free(struct mb_cache *cache,
+ struct mb_cache_entry *entry);
void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
-static inline int mb_cache_entry_put(struct mb_cache *cache,
- struct mb_cache_entry *entry)
+static inline void mb_cache_entry_put(struct mb_cache *cache,
+ struct mb_cache_entry *entry)
{
unsigned int cnt = atomic_dec_return(&entry->e_refcnt);

if (cnt > 0) {
- if (cnt <= 3)
+ if (cnt <= 2)
wake_up_var(&entry->e_refcnt);
- return 0;
+ return;
}
- __mb_cache_entry_free(entry);
- return 1;
+ __mb_cache_entry_free(cache, entry);
}

struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
--
2.35.3
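
The interesting trick above is that dropping the caller's lookup reference
together with the hash reference is folded into a single
atomic_cmpxchg(&e_refcnt, 2, 0), so the "last user deletes the entry" case
needs no extra hash-lock dance. A toy user-space model of just that idiom
(C11 atomics, invented names, single-threaded for clarity; the real
mb_cache_entry_delete_or_get() keeps its reference and returns the entry in
the busy case so the caller can wait on it):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct entry { atomic_int refcnt; };        /* 1 == only the hash reference */

static bool entry_get(struct entry *e)      /* like atomic_inc_not_zero() */
{
        int c = atomic_load(&e->refcnt);

        while (c != 0)
                if (atomic_compare_exchange_weak(&e->refcnt, &c, c + 1))
                        return true;
        return false;                       /* entry is already dying */
}

static bool entry_delete_if_unused(struct entry *e)
{
        int expected = 2;

        if (!entry_get(e))
                return true;                /* already gone */
        /* drop our ref and the hash ref at once, iff nobody else holds one */
        if (atomic_compare_exchange_strong(&e->refcnt, &expected, 0))
                return true;                /* caller would unhash and free */
        atomic_fetch_sub(&e->refcnt, 1);    /* busy: just drop our ref */
        return false;
}

int main(void)
{
        struct entry e = { .refcnt = 1 };   /* created: hash reference only */

        entry_get(&e);                                      /* a user appears */
        printf("delete while busy: %d\n", entry_delete_if_unused(&e)); /* 0 */
        atomic_fetch_sub(&e.refcnt, 1);                     /* user drops ref */
        printf("delete while idle: %d\n", entry_delete_if_unused(&e)); /* 1 */
        return 0;
}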

2022-07-12 12:48:49

by Ritesh Harjani

Subject: Re: [PATCH 0/10 v3] ext4: Fix possible fs corruption due to xattr races

On 22/07/12 12:54PM, Jan Kara wrote:
> Hello!
>
> I've noticed this series didn't get merged yet. I was waiting for more review
> feedback from Ritesh but somehow that didn't happen. So this is the third

Hello Jan,

I had reviewed this series up to 05/10, which were the patches meant as
stable fixes too. But I didn't add any Reviewed-by tags because I didn't
find any obvious problems (also, my familiarity with the mbcache and
revoke code paths is not as good).

But fair point, I did want to continue reviewing the later patches in the
series too. I will complete those before our next call (btw, I forgot to
check on these in the last call actually).

But this doesn't have to delay picking up this patch series for merging
any further. Please feel free to pick it up, if required.

Thanks again for your help!!
-ritesh

> submission of the series fixing the races of ext4 xattr block reuse with the
> few changes that have accumulated since v2. Ted, do you think you can add this
> series to your tree so that we can merge it during the merge window? Thanks!
>
> Changes since v1:
> * Reworked the series to fix all corner cases and make API less errorprone.
>
> Changes since v2:
> * Renamed mb_cache_entry_try_delete() to mb_cache_entry_delete_or_get()
> * Added Tested-by tag from Ritesh
>
> Honza
>
> Previous versions:
> Link: http://lore.kernel.org/r/[email protected] # v1
> Link: http://lore.kernel.org/r/[email protected] # v2

2022-07-14 11:56:53

by Ritesh Harjani

Subject: Re: [PATCH 01/10] mbcache: Don't reclaim used entries

On 22/07/12 12:54PM, Jan Kara wrote:
> Do not reclaim entries that are currently used by somebody from a
> shrinker. Firstly, these entries are likely useful. Secondly, we will
> need to keep such entries to protect pending increment of xattr block
> refcount.
>
> CC: [email protected]
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/mbcache.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index 97c54d3a2227..cfc28129fb6f 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> entry = list_first_entry(&cache->c_list,
> struct mb_cache_entry, e_list);
> - if (entry->e_referenced) {
> + if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> entry->e_referenced = 0;
> list_move_tail(&entry->e_list, &cache->c_list);
> continue;
> @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> spin_unlock(&cache->c_list_lock);
> head = mb_cache_entry_head(cache, entry->e_key);
> hlist_bl_lock(head);
> + /* Now a reliable check if the entry didn't get used... */
> + if (atomic_read(&entry->e_refcnt) > 2) {

Taking a look at this patchset again: I think if we move this "if"
condition checking the refcnt up, i.e. before we delete the entry from
c_list, then we can avoid removing the entry -> checking its refcnt under
the lock -> adding it back if the refcnt is elevated.

Thoughts?

-ritesh

> + hlist_bl_unlock(head);
> + spin_lock(&cache->c_list_lock);
> + list_add_tail(&entry->e_list, &cache->c_list);
> + cache->c_entry_count++;
> + continue;
> + }
> if (!hlist_bl_unhashed(&entry->e_hash_list)) {
> hlist_bl_del_init(&entry->e_hash_list);
> atomic_dec(&entry->e_refcnt);
> --
> 2.35.3
>

2022-07-14 12:18:35

by Ritesh Harjani

Subject: Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused

On 22/07/12 12:54PM, Jan Kara wrote:
> Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
> it is unused and also add a function to wait for entry to become unused
> - mb_cache_entry_wait_unused(). We do not share code between the two
> deleting function as one of them will go away soon.
>
> CC: [email protected]
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/mbcache.c | 66 +++++++++++++++++++++++++++++++++++++++--
> include/linux/mbcache.h | 10 ++++++-
> 2 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index cfc28129fb6f..2010bc80a3f2 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -11,7 +11,7 @@
> /*
> * Mbcache is a simple key-value store. Keys need not be unique, however
> * key-value pairs are expected to be unique (we use this fact in
> - * mb_cache_entry_delete()).
> + * mb_cache_entry_delete_or_get()).
> *
> * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
> * Ext4 also uses it for deduplication of xattr values stored in inodes.
> @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
> }
> EXPORT_SYMBOL(__mb_cache_entry_free);
>
> +/*
> + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> + *
> + * @entry - entry to work on
> + *
> + * Wait to be the last user of the entry.
> + */
> +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> +{
> + wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);

It's not very intuitive why we check for refcnt <= 3.
A small note at the top of this function might be helpful.
IIUC, it is because by default, when anyone creates an entry, we start with
a refcnt of 2 (in mb_cache_entry_create()).
- Now when the user of the entry wants to delete it, it will try and call
mb_cache_entry_delete_or_get(). If during this function call it sees that the
refcnt is elevated above 2, that means there is another user of this entry
currently active and hence we should wait before we remove this entry from the
cache. So it will take an extra refcnt and return.
- Then this caller will call mb_cache_entry_wait_unused() to wait for the
refcnt to drop to <= 3, so that the entry can be deleted.

Quick question -
So is the design now that ext4_evict_ea_inode() will wait indefinitely
until the other user of this mb_cache entry releases the reference, right?
And that will not happen until
- either the shrinker removes this entry from the cache, during which we
check whether the refcnt is <= 3 and then trigger a wakeup event,
- or the user removes/deletes the xattr entry.
Is the above understanding correct?

-ritesh


> +}
> +EXPORT_SYMBOL(mb_cache_entry_wait_unused);
> +
> static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
> struct mb_cache_entry *entry,
> u32 key)
> @@ -217,7 +230,7 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
> }
> EXPORT_SYMBOL(mb_cache_entry_get);
>
> -/* mb_cache_entry_delete - remove a cache entry
> +/* mb_cache_entry_delete - try to remove a cache entry
> * @cache - cache we work with
> * @key - key
> * @value - value
> @@ -254,6 +267,55 @@ void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value)
> }
> EXPORT_SYMBOL(mb_cache_entry_delete);
>
> +/* mb_cache_entry_delete_or_get - remove a cache entry if it has no users
> + * @cache - cache we work with
> + * @key - key
> + * @value - value
> + *
> + * Remove entry from cache @cache with key @key and value @value. The removal
> + * happens only if the entry is unused. The function returns NULL in case the
> + * entry was successfully removed or there's no entry in cache. Otherwise the
> + * function grabs reference of the entry that we failed to delete because it
> + * still has users and return it.
> + */
> +struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> + u32 key, u64 value)
> +{
> + struct hlist_bl_node *node;
> + struct hlist_bl_head *head;
> + struct mb_cache_entry *entry;
> +
> + head = mb_cache_entry_head(cache, key);
> + hlist_bl_lock(head);
> + hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> + if (entry->e_key == key && entry->e_value == value) {
> + if (atomic_read(&entry->e_refcnt) > 2) {
> + atomic_inc(&entry->e_refcnt);
> + hlist_bl_unlock(head);
> + return entry;
> + }
> + /* We keep hash list reference to keep entry alive */
> + hlist_bl_del_init(&entry->e_hash_list);
> + hlist_bl_unlock(head);
> + spin_lock(&cache->c_list_lock);
> + if (!list_empty(&entry->e_list)) {
> + list_del_init(&entry->e_list);
> + if (!WARN_ONCE(cache->c_entry_count == 0,
> + "mbcache: attempt to decrement c_entry_count past zero"))
> + cache->c_entry_count--;
> + atomic_dec(&entry->e_refcnt);
> + }
> + spin_unlock(&cache->c_list_lock);
> + mb_cache_entry_put(cache, entry);
> + return NULL;
> + }
> + }
> + hlist_bl_unlock(head);
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> +
> /* mb_cache_entry_touch - cache entry got used
> * @cache - cache the entry belongs to
> * @entry - entry that got used
> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
> index 20f1e3ff6013..8eca7f25c432 100644
> --- a/include/linux/mbcache.h
> +++ b/include/linux/mbcache.h
> @@ -30,15 +30,23 @@ void mb_cache_destroy(struct mb_cache *cache);
> int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> u64 value, bool reusable);
> void __mb_cache_entry_free(struct mb_cache_entry *entry);
> +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
> static inline int mb_cache_entry_put(struct mb_cache *cache,
> struct mb_cache_entry *entry)
> {
> - if (!atomic_dec_and_test(&entry->e_refcnt))
> + unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
> +
> + if (cnt > 0) {
> + if (cnt <= 3)
> + wake_up_var(&entry->e_refcnt);
> return 0;
> + }
> __mb_cache_entry_free(entry);
> return 1;
> }
>
> +struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> + u32 key, u64 value);
> void mb_cache_entry_delete(struct mb_cache *cache, u32 key, u64 value);
> struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
> u64 value);
> --
> 2.35.3
>
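
To spell out the refcount arithmetic behind the question above (for the scheme
as of this patch; patch 10/10 later changes the numbers), a worked trace rather
than code:

/*
 *  create:                 e_refcnt = 2   (hash list ref + LRU list ref)
 *  a reuser finds it:      e_refcnt = 3   (an external user now exists)
 *  delete_or_get():        sees refcnt > 2, takes its own reference
 *                          -> e_refcnt = 4 and the entry is returned
 *  wait_unused():          sleeps until e_refcnt <= 3, i.e. until only
 *                          hash + LRU + the waiter itself remain
 *  reuser puts its ref:    e_refcnt = 4 -> 3, mb_cache_entry_put()
 *                          calls wake_up_var() because cnt <= 3
 *  the waiter then puts its reference and retries delete_or_get(),
 *  which now sees refcnt == 2 and removes the entry
 */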

2022-07-14 12:21:00

by Ritesh Harjani

Subject: Re: [PATCH 04/10] ext4: Unindent codeblock in ext4_xattr_block_set()

On 22/07/12 12:54PM, Jan Kara wrote:
> Remove unnecessary else (and thus indentation level) from a code block
> in ext4_xattr_block_set(). It will also make following code changes
> easier. No functional changes.

The patch looks good to me. Just a note: while applying this patch on the
ext4-dev tree, I found a minor conflict with the patch below.

ext4: use kmemdup() to replace kmalloc + memcpy

Replace kmalloc + memcpy with kmemdup()

-ritesh

>
> CC: [email protected]
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/xattr.c | 79 ++++++++++++++++++++++++-------------------------
> 1 file changed, 39 insertions(+), 40 deletions(-)
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 7fc40fb1e6b3..aadfae53d055 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1850,6 +1850,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> #define header(x) ((struct ext4_xattr_header *)(x))
>
> if (s->base) {
> + int offset = (char *)s->here - bs->bh->b_data;
> +
> BUFFER_TRACE(bs->bh, "get_write_access");
> error = ext4_journal_get_write_access(handle, sb, bs->bh,
> EXT4_JTR_NONE);
> @@ -1882,50 +1884,47 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> if (error)
> goto cleanup;
> goto inserted;
> - } else {
> - int offset = (char *)s->here - bs->bh->b_data;
> + }
> + unlock_buffer(bs->bh);
> + ea_bdebug(bs->bh, "cloning");
> + s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> + error = -ENOMEM;
> + if (s->base == NULL)
> + goto cleanup;
> + memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
> + s->first = ENTRY(header(s->base)+1);
> + header(s->base)->h_refcount = cpu_to_le32(1);
> + s->here = ENTRY(s->base + offset);
> + s->end = s->base + bs->bh->b_size;
>
> - unlock_buffer(bs->bh);
> - ea_bdebug(bs->bh, "cloning");
> - s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> - error = -ENOMEM;
> - if (s->base == NULL)
> + /*
> + * If existing entry points to an xattr inode, we need
> + * to prevent ext4_xattr_set_entry() from decrementing
> + * ref count on it because the reference belongs to the
> + * original block. In this case, make the entry look
> + * like it has an empty value.
> + */
> + if (!s->not_found && s->here->e_value_inum) {
> + ea_ino = le32_to_cpu(s->here->e_value_inum);
> + error = ext4_xattr_inode_iget(inode, ea_ino,
> + le32_to_cpu(s->here->e_hash),
> + &tmp_inode);
> + if (error)
> goto cleanup;
> - memcpy(s->base, BHDR(bs->bh), bs->bh->b_size);
> - s->first = ENTRY(header(s->base)+1);
> - header(s->base)->h_refcount = cpu_to_le32(1);
> - s->here = ENTRY(s->base + offset);
> - s->end = s->base + bs->bh->b_size;
>
> - /*
> - * If existing entry points to an xattr inode, we need
> - * to prevent ext4_xattr_set_entry() from decrementing
> - * ref count on it because the reference belongs to the
> - * original block. In this case, make the entry look
> - * like it has an empty value.
> - */
> - if (!s->not_found && s->here->e_value_inum) {
> - ea_ino = le32_to_cpu(s->here->e_value_inum);
> - error = ext4_xattr_inode_iget(inode, ea_ino,
> - le32_to_cpu(s->here->e_hash),
> - &tmp_inode);
> - if (error)
> - goto cleanup;
> -
> - if (!ext4_test_inode_state(tmp_inode,
> - EXT4_STATE_LUSTRE_EA_INODE)) {
> - /*
> - * Defer quota free call for previous
> - * inode until success is guaranteed.
> - */
> - old_ea_inode_quota = le32_to_cpu(
> - s->here->e_value_size);
> - }
> - iput(tmp_inode);
> -
> - s->here->e_value_inum = 0;
> - s->here->e_value_size = 0;
> + if (!ext4_test_inode_state(tmp_inode,
> + EXT4_STATE_LUSTRE_EA_INODE)) {
> + /*
> + * Defer quota free call for previous
> + * inode until success is guaranteed.
> + */
> + old_ea_inode_quota = le32_to_cpu(
> + s->here->e_value_size);
> }
> + iput(tmp_inode);
> +
> + s->here->e_value_inum = 0;
> + s->here->e_value_size = 0;
> }
> } else {
> /* Allocate a buffer where we construct the new block. */
> --
> 2.35.3
>

2022-07-14 12:30:11

by Ritesh Harjani

Subject: Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks

On 22/07/12 12:54PM, Jan Kara wrote:
> When ext4_xattr_block_set() decides to remove xattr block the following
> race can happen:
>
> CPU1                                          CPU2
> ext4_xattr_block_set()                        ext4_xattr_release_block()
>   new_bh = ext4_xattr_block_cache_find()
>
>                                                 lock_buffer(bh);
>                                                 ref = le32_to_cpu(BHDR(bh)->h_refcount);
>                                                 if (ref == 1) {
>                                                         ...
>                                                         mb_cache_entry_delete();
>                                                         unlock_buffer(bh);
>                                                         ext4_free_blocks();
>                                                           ...
>                                                         ext4_forget(..., bh, ...);
>                                                         jbd2_journal_revoke(..., bh);
>
>   ext4_journal_get_write_access(..., new_bh, ...)
>     do_get_write_access()
>       jbd2_journal_cancel_revoke(..., new_bh);
>
> Later the code in ext4_xattr_block_set() finds out the block got freed
> and cancels reusal of the block but the revoke stays canceled and so in
> case of block reuse and journal replay the filesystem can get corrupted.
> If the race works out slightly differently, we can also hit assertions
> in the jbd2 code.
>
> Fix the problem by making sure that once matching mbcache entry is
> found, code dropping the last xattr block reference (or trying to modify
> xattr block in place) waits until the mbcache entry reference is
> dropped. This way code trying to reuse xattr block is protected from
> someone trying to drop the last reference to xattr block.
>
> Reported-and-tested-by: Ritesh Harjani <[email protected]>
> CC: [email protected]
> Fixes: 82939d7999df ("ext4: convert to mbcache2")
> Signed-off-by: Jan Kara <[email protected]>

Thanks Jan,
Just a note - I retested the patches only up to here (the ones marked for
stable) with stress-ng --xattr 16, and I didn't find any issues so far for
ext2, ext3, and ext4 with default mkfs options.

Also, I re-ran the full v3 patch series with the same test case on all 3
filesystems, and I didn't find any failures there either.

-ritesh




> ---
> fs/ext4/xattr.c | 67 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 45 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index aadfae53d055..3a0928c8720e 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -439,9 +439,16 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> /* Remove entry from mbcache when EA inode is getting evicted */
> void ext4_evict_ea_inode(struct inode *inode)
> {
> - if (EA_INODE_CACHE(inode))
> - mb_cache_entry_delete(EA_INODE_CACHE(inode),
> - ext4_xattr_inode_get_hash(inode), inode->i_ino);
> + struct mb_cache_entry *oe;
> +
> + if (!EA_INODE_CACHE(inode))
> + return;
> + /* Wait for entry to get unused so that we can remove it */
> + while ((oe = mb_cache_entry_delete_or_get(EA_INODE_CACHE(inode),
> + ext4_xattr_inode_get_hash(inode), inode->i_ino))) {
> + mb_cache_entry_wait_unused(oe);
> + mb_cache_entry_put(EA_INODE_CACHE(inode), oe);
> + }
> }
>
> static int
> @@ -1229,6 +1236,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
> if (error)
> goto out;
>
> +retry_ref:
> lock_buffer(bh);
> hash = le32_to_cpu(BHDR(bh)->h_hash);
> ref = le32_to_cpu(BHDR(bh)->h_refcount);
> @@ -1238,9 +1246,18 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
> * This must happen under buffer lock for
> * ext4_xattr_block_set() to reliably detect freed block
> */
> - if (ea_block_cache)
> - mb_cache_entry_delete(ea_block_cache, hash,
> - bh->b_blocknr);
> + if (ea_block_cache) {
> + struct mb_cache_entry *oe;
> +
> + oe = mb_cache_entry_delete_or_get(ea_block_cache, hash,
> + bh->b_blocknr);
> + if (oe) {
> + unlock_buffer(bh);
> + mb_cache_entry_wait_unused(oe);
> + mb_cache_entry_put(ea_block_cache, oe);
> + goto retry_ref;
> + }
> + }
> get_bh(bh);
> unlock_buffer(bh);
>
> @@ -1867,9 +1884,20 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> * ext4_xattr_block_set() to reliably detect modified
> * block
> */
> - if (ea_block_cache)
> - mb_cache_entry_delete(ea_block_cache, hash,
> - bs->bh->b_blocknr);
> + if (ea_block_cache) {
> + struct mb_cache_entry *oe;
> +
> + oe = mb_cache_entry_delete_or_get(ea_block_cache,
> + hash, bs->bh->b_blocknr);
> + if (oe) {
> + /*
> + * Xattr block is getting reused. Leave
> + * it alone.
> + */
> + mb_cache_entry_put(ea_block_cache, oe);
> + goto clone_block;
> + }
> + }
> ea_bdebug(bs->bh, "modifying in-place");
> error = ext4_xattr_set_entry(i, s, handle, inode,
> true /* is_block */);
> @@ -1885,6 +1913,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> goto cleanup;
> goto inserted;
> }
> +clone_block:
> unlock_buffer(bs->bh);
> ea_bdebug(bs->bh, "cloning");
> s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> @@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> lock_buffer(new_bh);
> /*
> * We have to be careful about races with
> - * freeing, rehashing or adding references to
> - * xattr block. Once we hold buffer lock xattr
> - * block's state is stable so we can check
> - * whether the block got freed / rehashed or
> - * not. Since we unhash mbcache entry under
> - * buffer lock when freeing / rehashing xattr
> - * block, checking whether entry is still
> - * hashed is reliable. Same rules hold for
> - * e_reusable handling.
> + * adding references to xattr block. Once we
> + * hold buffer lock xattr block's state is
> + * stable so we can check the additional
> + * reference fits.
> */
> - if (hlist_bl_unhashed(&ce->e_hash_list) ||
> - !ce->e_reusable) {
> + ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> + if (ref > EXT4_XATTR_REFCOUNT_MAX) {
> /*
> * Undo everything and check mbcache
> * again.
> @@ -2017,9 +2041,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> new_bh = NULL;
> goto inserted;
> }
> - ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
> - if (ref >= EXT4_XATTR_REFCOUNT_MAX)
> + if (ref == EXT4_XATTR_REFCOUNT_MAX)
> ce->e_reusable = 0;
> ea_bdebug(new_bh, "reusing; refcount now=%d",
> ref);
> --
> 2.35.3
>

2022-07-14 12:37:51

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 06/10] ext2: Factor out freeing of xattr block reference

On 22/07/12 12:54PM, Jan Kara wrote:
> Freeing of xattr block reference is open-coded in two places. Factor it out
> into a separate function and use it.

Looked into the refactoring logic. The patch looks good to me.
Small queries below -

>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
> 1 file changed, 38 insertions(+), 52 deletions(-)
>
> diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> index 841fa6d9d744..9885294993ef 100644
> --- a/fs/ext2/xattr.c
> +++ b/fs/ext2/xattr.c
> @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> return error;
> }
>
> +static void ext2_xattr_release_block(struct inode *inode,
> + struct buffer_head *bh)
> +{
> + struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> +
> + lock_buffer(bh);
> + if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> + __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> +
> + /*
> + * This must happen under buffer lock for
> + * ext2_xattr_set2() to reliably detect freed block
> + */
> + mb_cache_entry_delete(ea_block_cache, hash,
> + bh->b_blocknr);
> + /* Free the old block. */
> + ea_bdebug(bh, "freeing");
> + ext2_free_blocks(inode, bh->b_blocknr, 1);
> + /* We let our caller release bh, so we
> + * need to duplicate the buffer before. */
> + get_bh(bh);
> + bforget(bh);
> + unlock_buffer(bh);
> + } else {
> + /* Decrement the refcount only. */
> + le32_add_cpu(&HDR(bh)->h_refcount, -1);
> + dquot_free_block(inode, 1);
> + mark_buffer_dirty(bh);
> + unlock_buffer(bh);
> + ea_bdebug(bh, "refcount now=%d",
> + le32_to_cpu(HDR(bh)->h_refcount));
> + if (IS_SYNC(inode))
> + sync_dirty_buffer(bh);
> + }
> +}
> +
> /*
> * Second half of ext2_xattr_set(): Update the file system.
> */
> @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
> * If there was an old block and we are no longer using it,
> * release the old block.
> */
> - lock_buffer(old_bh);
> - if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> - __u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> -
> - /*
> - * This must happen under buffer lock for
> - * ext2_xattr_set2() to reliably detect freed block
> - */
> - mb_cache_entry_delete(ea_block_cache, hash,
> - old_bh->b_blocknr);
> - /* Free the old block. */
> - ea_bdebug(old_bh, "freeing");
> - ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> - mark_inode_dirty(inode);

^^^ this is not needed because ext2_free_blocks() will take care of it.
Hence you have dropped this in ext2_xattr_release_block()

> - /* We let our caller release old_bh, so we
> - * need to duplicate the buffer before. */
> - get_bh(old_bh);
> - bforget(old_bh);
> - } else {
> - /* Decrement the refcount only. */
> - le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> - dquot_free_block_nodirty(inode, 1);
> - mark_inode_dirty(inode);

Quick qn -> Don't we need mark_inode_dirty() here?

> - mark_buffer_dirty(old_bh);
> - ea_bdebug(old_bh, "refcount now=%d",
> - le32_to_cpu(HDR(old_bh)->h_refcount));
> - }
> - unlock_buffer(old_bh);
> + ext2_xattr_release_block(inode, old_bh);
> }
>
> cleanup:
> @@ -828,30 +837,7 @@ ext2_xattr_delete_inode(struct inode *inode)
> EXT2_I(inode)->i_file_acl);
> goto cleanup;
> }
> - lock_buffer(bh);
> - if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> - __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> -
> - /*
> - * This must happen under buffer lock for ext2_xattr_set2() to
> - * reliably detect freed block
> - */
> - mb_cache_entry_delete(EA_BLOCK_CACHE(inode), hash,
> - bh->b_blocknr);
> - ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
> - get_bh(bh);
> - bforget(bh);
> - unlock_buffer(bh);
> - } else {
> - le32_add_cpu(&HDR(bh)->h_refcount, -1);
> - ea_bdebug(bh, "refcount now=%d",
> - le32_to_cpu(HDR(bh)->h_refcount));
> - unlock_buffer(bh);
> - mark_buffer_dirty(bh);
> - if (IS_SYNC(inode))
> - sync_dirty_buffer(bh);
> - dquot_free_block_nodirty(inode, 1);
> - }
> + ext2_xattr_release_block(inode, bh);
> EXT2_I(inode)->i_file_acl = 0;
>
> cleanup:
> --
> 2.35.3
>

2022-07-14 12:46:29

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 07/10] ext2: Unindent codeblock in ext2_xattr_set()

On 22/07/12 12:54PM, Jan Kara wrote:
> Replace one else in ext2_xattr_set() with a goto. This makes following
> code changes simpler to follow. No functional changes.

Straightforward refactoring. Looks good to me.

Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext2/xattr.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> index 9885294993ef..37ce495eb279 100644
> --- a/fs/ext2/xattr.c
> +++ b/fs/ext2/xattr.c
> @@ -517,7 +517,8 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> /* Here we know that we can set the new attribute. */
>
> if (header) {
> - /* assert(header == HDR(bh)); */
> + int offset;
> +
> lock_buffer(bh);
> if (header->h_refcount == cpu_to_le32(1)) {
> __u32 hash = le32_to_cpu(header->h_hash);
> @@ -531,22 +532,20 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> bh->b_blocknr);
>
> /* keep the buffer locked while modifying it. */
> - } else {
> - int offset;
> -
> - unlock_buffer(bh);
> - ea_bdebug(bh, "cloning");
> - header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
> - error = -ENOMEM;
> - if (header == NULL)
> - goto cleanup;
> - header->h_refcount = cpu_to_le32(1);
> -
> - offset = (char *)here - bh->b_data;
> - here = ENTRY((char *)header + offset);
> - offset = (char *)last - bh->b_data;
> - last = ENTRY((char *)header + offset);
> + goto update_block;
> }
> + unlock_buffer(bh);
> + ea_bdebug(bh, "cloning");
> + header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
> + error = -ENOMEM;
> + if (header == NULL)
> + goto cleanup;
> + header->h_refcount = cpu_to_le32(1);
> +
> + offset = (char *)here - bh->b_data;
> + here = ENTRY((char *)header + offset);
> + offset = (char *)last - bh->b_data;
> + last = ENTRY((char *)header + offset);
> } else {
> /* Allocate a buffer where we construct the new block. */
> header = kzalloc(sb->s_blocksize, GFP_KERNEL);
> @@ -559,6 +558,7 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> last = here = ENTRY(header+1);
> }
>
> +update_block:
> /* Iff we are modifying the block in-place, bh is locked here. */
>
> if (not_found) {
> --
> 2.35.3
>

2022-07-14 13:13:38

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing

On 22/07/12 12:54PM, Jan Kara wrote:
> Use the fact that entries with elevated refcount are not removed from

The elevated refcnt means >= 2?

> the hash and just move removal of the entry from the hash to the entry
> freeing time. When doing this we also change the generic code to hold
> one reference to the cache entry, not two of them, which makes code
> somewhat more obvious.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/mbcache.c | 108 +++++++++++++++-------------------------
> include/linux/mbcache.h | 24 ++++++---
> 2 files changed, 55 insertions(+), 77 deletions(-)
>
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index d1ebb5df2856..96f1d49d30a5 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> return -ENOMEM;
>
> INIT_LIST_HEAD(&entry->e_list);
> - /* One ref for hash, one ref returned */
> + /* Initial hash reference */
> atomic_set(&entry->e_refcnt, 1);
> entry->e_key = key;
> entry->e_value = value;
> @@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> }
> }
> hlist_bl_add_head(&entry->e_hash_list, head);
> - hlist_bl_unlock(head);
> -
> + /*
> + * Add entry to LRU list before it can be found by
> + * mb_cache_entry_delete() to avoid races
> + */

No reference to mb_cache_entry_delete() now. It is
mb_cache_entry_delete_or_get()

> spin_lock(&cache->c_list_lock);
> list_add_tail(&entry->e_list, &cache->c_list);
> - /* Grab ref for LRU list */
> - atomic_inc(&entry->e_refcnt);
> cache->c_entry_count++;
> spin_unlock(&cache->c_list_lock);
> + hlist_bl_unlock(head);
>
> return 0;
> }
> EXPORT_SYMBOL(mb_cache_entry_create);
>
> -void __mb_cache_entry_free(struct mb_cache_entry *entry)
> +void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
> {
> + struct hlist_bl_head *head;
> +
> + head = mb_cache_entry_head(cache, entry->e_key);
> + hlist_bl_lock(head);
> + hlist_bl_del(&entry->e_hash_list);
> + hlist_bl_unlock(head);
> kmem_cache_free(mb_entry_cache, entry);
> }
> EXPORT_SYMBOL(__mb_cache_entry_free);
> @@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
> */
> void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> {
> - wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> + wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
> }
> EXPORT_SYMBOL(mb_cache_entry_wait_unused);
>
> @@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
> while (node) {
> entry = hlist_bl_entry(node, struct mb_cache_entry,
> e_hash_list);
> - if (entry->e_key == key && entry->e_reusable) {
> - atomic_inc(&entry->e_refcnt);
> + if (entry->e_key == key && entry->e_reusable &&
> + atomic_inc_not_zero(&entry->e_refcnt))
> goto out;
> - }
> node = node->next;
> }
> entry = NULL;
> @@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
> head = mb_cache_entry_head(cache, key);
> hlist_bl_lock(head);
> hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> - if (entry->e_key == key && entry->e_value == value) {
> - atomic_inc(&entry->e_refcnt);
> + if (entry->e_key == key && entry->e_value == value &&
> + atomic_inc_not_zero(&entry->e_refcnt))
> goto out;
> - }
> }
> entry = NULL;
> out:
> @@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
> struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> u32 key, u64 value)
> {
> - struct hlist_bl_node *node;
> - struct hlist_bl_head *head;
> struct mb_cache_entry *entry;
>
> - head = mb_cache_entry_head(cache, key);
> - hlist_bl_lock(head);
> - hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> - if (entry->e_key == key && entry->e_value == value) {
> - if (atomic_read(&entry->e_refcnt) > 2) {
> - atomic_inc(&entry->e_refcnt);
> - hlist_bl_unlock(head);
> - return entry;
> - }
> - /* We keep hash list reference to keep entry alive */
> - hlist_bl_del_init(&entry->e_hash_list);
> - hlist_bl_unlock(head);
> - spin_lock(&cache->c_list_lock);
> - if (!list_empty(&entry->e_list)) {
> - list_del_init(&entry->e_list);
> - if (!WARN_ONCE(cache->c_entry_count == 0,
> - "mbcache: attempt to decrement c_entry_count past zero"))
> - cache->c_entry_count--;
> - atomic_dec(&entry->e_refcnt);
> - }
> - spin_unlock(&cache->c_list_lock);
> - mb_cache_entry_put(cache, entry);
> - return NULL;
> - }
> - }
> - hlist_bl_unlock(head);
> + entry = mb_cache_entry_get(cache, key, value);
> + if (!entry)
> + return NULL;
> +
> + /*
> + * Drop the ref we got from mb_cache_entry_get() and the initial hash
> + * ref if we are the last user
> + */
> + if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
> + return entry;
>
> + spin_lock(&cache->c_list_lock);
> + if (!list_empty(&entry->e_list))
> + list_del_init(&entry->e_list);
> + cache->c_entry_count--;
> + spin_unlock(&cache->c_list_lock);
> + __mb_cache_entry_free(cache, entry);
> return NULL;
> }
> EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> @@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> unsigned long nr_to_scan)
> {
> struct mb_cache_entry *entry;
> - struct hlist_bl_head *head;
> unsigned long shrunk = 0;
>
> spin_lock(&cache->c_list_lock);
> while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> entry = list_first_entry(&cache->c_list,
> struct mb_cache_entry, e_list);
> - if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> + /* Drop initial hash reference if there is no user */
> + if (entry->e_referenced ||
> + atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {

So here, if the refcnt of an entry is 1, that means it is still in use, right?
So the shrinker will do the atomic_cmpxchg and bring it to 0, and then
delete the entry from the cache?
This will only happen for an entry with just 1 reference count.

Is that the correct understanding?

-ritesh


> entry->e_referenced = 0;
> list_move_tail(&entry->e_list, &cache->c_list);
> continue;
> }
> list_del_init(&entry->e_list);
> cache->c_entry_count--;
> - /*
> - * We keep LRU list reference so that entry doesn't go away
> - * from under us.
> - */
> spin_unlock(&cache->c_list_lock);
> - head = mb_cache_entry_head(cache, entry->e_key);
> - hlist_bl_lock(head);
> - /* Now a reliable check if the entry didn't get used... */
> - if (atomic_read(&entry->e_refcnt) > 2) {
> - hlist_bl_unlock(head);
> - spin_lock(&cache->c_list_lock);
> - list_add_tail(&entry->e_list, &cache->c_list);
> - cache->c_entry_count++;
> - continue;
> - }
> - if (!hlist_bl_unhashed(&entry->e_hash_list)) {
> - hlist_bl_del_init(&entry->e_hash_list);
> - atomic_dec(&entry->e_refcnt);
> - }
> - hlist_bl_unlock(head);
> - if (mb_cache_entry_put(cache, entry))
> - shrunk++;
> + __mb_cache_entry_free(cache, entry);
> + shrunk++;
> cond_resched();
> spin_lock(&cache->c_list_lock);
> }
> @@ -433,11 +408,6 @@ void mb_cache_destroy(struct mb_cache *cache)
> * point.
> */
> list_for_each_entry_safe(entry, next, &cache->c_list, e_list) {
> - if (!hlist_bl_unhashed(&entry->e_hash_list)) {
> - hlist_bl_del_init(&entry->e_hash_list);
> - atomic_dec(&entry->e_refcnt);
> - } else
> - WARN_ON(1);
> list_del(&entry->e_list);
> WARN_ON(atomic_read(&entry->e_refcnt) != 1);
> mb_cache_entry_put(cache, entry);
> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
> index 452b579856d4..2da63fd7b98f 100644
> --- a/include/linux/mbcache.h
> +++ b/include/linux/mbcache.h
> @@ -13,8 +13,16 @@ struct mb_cache;
> struct mb_cache_entry {
> /* List of entries in cache - protected by cache->c_list_lock */
> struct list_head e_list;
> - /* Hash table list - protected by hash chain bitlock */
> + /*
> + * Hash table list - protected by hash chain bitlock. The entry is
> + * guaranteed to be hashed while e_refcnt > 0.
> + */
> struct hlist_bl_node e_hash_list;
> + /*
> + * Entry refcount. Once it reaches zero, entry is unhashed and freed.
> + * While refcount > 0, the entry is guaranteed to stay in the hash and
> + * e.g. mb_cache_entry_try_delete() will fail.
> + */
> atomic_t e_refcnt;
> /* Key in hash - stable during lifetime of the entry */
> u32 e_key;
> @@ -29,20 +37,20 @@ void mb_cache_destroy(struct mb_cache *cache);
>
> int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> u64 value, bool reusable);
> -void __mb_cache_entry_free(struct mb_cache_entry *entry);
> +void __mb_cache_entry_free(struct mb_cache *cache,
> + struct mb_cache_entry *entry);
> void mb_cache_entry_wait_unused(struct mb_cache_entry *entry);
> -static inline int mb_cache_entry_put(struct mb_cache *cache,
> - struct mb_cache_entry *entry)
> +static inline void mb_cache_entry_put(struct mb_cache *cache,
> + struct mb_cache_entry *entry)
> {
> unsigned int cnt = atomic_dec_return(&entry->e_refcnt);
>
> if (cnt > 0) {
> - if (cnt <= 3)
> + if (cnt <= 2)
> wake_up_var(&entry->e_refcnt);
> - return 0;
> + return;
> }
> - __mb_cache_entry_free(entry);
> - return 1;
> + __mb_cache_entry_free(cache, entry);
> }
>
> struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> --
> 2.35.3
>

2022-07-14 14:41:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/10] mbcache: Don't reclaim used entries

On Thu 14-07-22 17:17:02, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Do not reclaim entries that are currently used by somebody from a
> > shrinker. Firstly, these entries are likely useful. Secondly, we will
> > need to keep such entries to protect pending increment of xattr block
> > refcount.
> >
> > CC: [email protected]
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/mbcache.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index 97c54d3a2227..cfc28129fb6f 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> > entry = list_first_entry(&cache->c_list,
> > struct mb_cache_entry, e_list);
> > - if (entry->e_referenced) {
> > + if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> > entry->e_referenced = 0;
> > list_move_tail(&entry->e_list, &cache->c_list);
> > continue;
> > @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > spin_unlock(&cache->c_list_lock);
> > head = mb_cache_entry_head(cache, entry->e_key);
> > hlist_bl_lock(head);
> > + /* Now a reliable check if the entry didn't get used... */
> > + if (atomic_read(&entry->e_refcnt) > 2) {
>
> On taking a look at this patchset again, I think that if we move this "if"
> condition checking the refcnt above, i.e. before we delete the entry from
> c_list, then we can avoid:
> removing the entry -> checking its refcnt under lock -> adding it back
> if the refcnt is elevated.
>
> Thoughts?

Well, but synchronization would get more complicated because we don't want
to acquire hlist_bl_lock() under c_list_lock (technically we could at this
point in the series but it would make life harder for the last patch in the
series). And we need c_list_lock to remove entry from the LRU list. It
could be all done but I don't think what you suggest is really that simpler
and this code will go away later in the patchset anyway...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-07-14 14:56:50

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 01/10] mbcache: Don't reclaim used entries

On 22/07/14 04:36PM, Jan Kara wrote:
> On Thu 14-07-22 17:17:02, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Do not reclaim entries that are currently used by somebody from a
> > > shrinker. Firstly, these entries are likely useful. Secondly, we will
> > > need to keep such entries to protect pending increment of xattr block
> > > refcount.
> > >
> > > CC: [email protected]
> > > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > > Signed-off-by: Jan Kara <[email protected]>
> > > ---
> > > fs/mbcache.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > > index 97c54d3a2227..cfc28129fb6f 100644
> > > --- a/fs/mbcache.c
> > > +++ b/fs/mbcache.c
> > > @@ -288,7 +288,7 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > > while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> > > entry = list_first_entry(&cache->c_list,
> > > struct mb_cache_entry, e_list);
> > > - if (entry->e_referenced) {
> > > + if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> > > entry->e_referenced = 0;
> > > list_move_tail(&entry->e_list, &cache->c_list);
> > > continue;
> > > @@ -302,6 +302,14 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > > spin_unlock(&cache->c_list_lock);
> > > head = mb_cache_entry_head(cache, entry->e_key);
> > > hlist_bl_lock(head);
> > > + /* Now a reliable check if the entry didn't get used... */
> > > + if (atomic_read(&entry->e_refcnt) > 2) {
> >
> > On taking a look at this patchset again, I think that if we move this "if"
> > condition checking the refcnt above, i.e. before we delete the entry from
> > c_list, then we can avoid:
> > removing the entry -> checking its refcnt under lock -> adding it back
> > if the refcnt is elevated.
> >
> > Thoughts?
>
> Well, but synchronization would get more complicated because we don't want
> to acquire hlist_bl_lock() under c_list_lock (technically we could at this
Ok, yes. I tried implementing it and it becomes lock()/unlock() mess.

> point in the series but it would make life harder for the last patch in the
> series). And we need c_list_lock to remove entry from the LRU list. It
> could be all done but I don't think what you suggest is really that simpler
> and this code will go away later in the patchset anyway...

I agree. Thanks for re-checking it.

-ritesh

2022-07-14 14:56:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused

On Thu 14-07-22 17:45:32, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
> > it is unused and also add a function to wait for entry to become unused
> > - mb_cache_entry_wait_unused(). We do not share code between the two
> > deleting function as one of them will go away soon.
> >
> > CC: [email protected]
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/mbcache.c | 66 +++++++++++++++++++++++++++++++++++++++--
> > include/linux/mbcache.h | 10 ++++++-
> > 2 files changed, 73 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index cfc28129fb6f..2010bc80a3f2 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -11,7 +11,7 @@
> > /*
> > * Mbcache is a simple key-value store. Keys need not be unique, however
> > * key-value pairs are expected to be unique (we use this fact in
> > - * mb_cache_entry_delete()).
> > + * mb_cache_entry_delete_or_get()).
> > *
> > * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
> > * Ext4 also uses it for deduplication of xattr values stored in inodes.
> > @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
> > }
> > EXPORT_SYMBOL(__mb_cache_entry_free);
> >
> > +/*
> > + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> > + *
> > + * @entry - entry to work on
> > + *
> > + * Wait to be the last user of the entry.
> > + */
> > +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> > +{
> > + wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
>
> It's not very intuitive why we check for refcnt <= 3.
> A small note at the top of this function might be helpful.
> IIUC, it is because by default when anyone creates an entry we start with
> a refcnt of 2 (in mb_cache_entry_create).
> - Now when the user of the entry wants to delete this, it will try and call
> mb_cache_entry_delete_or_get(). If during this function call it sees that the
> refcnt is elevated more than 2, that means there is another user of this entry
> currently active and hence we should wait before we remove this entry from the
> cache. So it will take an extra refcnt and return.
> - So then this caller will call mb_cache_entry_wait_unused() for the refcnt to
> be <= 3, so that the entry can be deleted.

Correct. I will add a comment as you suggest.

> Quick qn -
> So now is the design like, ext4_evict_ea_inode() will be waiting indefinitely
> until the other user of this mb_cache entry releases the reference right?

Correct. Similarly for ext4_xattr_release_block().

> And that will not happen until,
> - either the shrinker removes this entry from the cache during which we are
> checking if the refcnt <= 3, then we call a wakeup event

No, shrinker will not touch these entries with active users anymore.

> - Or the user removes/deletes the xattr entry

No. We hold a reference to the mbcache entry only while we are trying to reuse
it. So ext4_xattr_block_cache_find() and ext4_xattr_inode_cache_find() will
look up a potential mbcache entry that may have the same contents and get a
reference to it. Then we do comparisons verifying whether the contents really
match; if yes, we increment the on-disk inode/block refcount. Then we drop the
mbcache entry reference, which unblocks waiters in mb_cache_entry_wait_unused().
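
Schematically, the two sides interact roughly like this (a rough sketch only,
ignoring buffer locking, journaling and error handling; 'cache', 'hash',
'value', 'oe' and 'ce' are placeholders, and mb_cache_entry_get() stands in
for the find_first/find_next iteration):

	/* freeing / in-place modification side */
	while ((oe = mb_cache_entry_delete_or_get(cache, hash, value))) {
		mb_cache_entry_wait_unused(oe);  /* sleep until we are the last user */
		mb_cache_entry_put(cache, oe);   /* drop temporary ref and retry */
	}
	/* entry is gone, nobody can be reusing the block anymore */

	/* reuse side */
	ce = mb_cache_entry_get(cache, hash, value);  /* pins the entry */
	if (ce) {
		/* compare contents; on a match bump the on-disk refcount
		 * of the shared xattr block / EA inode */
		mb_cache_entry_put(cache, ce);   /* unblocks wait_unused() */
	}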

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-07-14 14:59:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 06/10] ext2: Factor out freeing of xattr block reference

On Thu 14-07-22 18:07:14, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Freeing of xattr block reference is open-coded in two places. Factor it out
> > into a separate function and use it.
>
> Looked into the refactoring logic. The patch looks good to me.
> Small queries below -
>
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
> > 1 file changed, 38 insertions(+), 52 deletions(-)
> >
> > diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> > index 841fa6d9d744..9885294993ef 100644
> > --- a/fs/ext2/xattr.c
> > +++ b/fs/ext2/xattr.c
> > @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> > return error;
> > }
> >
> > +static void ext2_xattr_release_block(struct inode *inode,
> > + struct buffer_head *bh)
> > +{
> > + struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> > +
> > + lock_buffer(bh);
> > + if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> > + __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> > +
> > + /*
> > + * This must happen under buffer lock for
> > + * ext2_xattr_set2() to reliably detect freed block
> > + */
> > + mb_cache_entry_delete(ea_block_cache, hash,
> > + bh->b_blocknr);
> > + /* Free the old block. */
> > + ea_bdebug(bh, "freeing");
> > + ext2_free_blocks(inode, bh->b_blocknr, 1);
> > + /* We let our caller release bh, so we
> > + * need to duplicate the buffer before. */
> > + get_bh(bh);
> > + bforget(bh);
> > + unlock_buffer(bh);
> > + } else {
> > + /* Decrement the refcount only. */
> > + le32_add_cpu(&HDR(bh)->h_refcount, -1);
> > + dquot_free_block(inode, 1);
> > + mark_buffer_dirty(bh);
> > + unlock_buffer(bh);
> > + ea_bdebug(bh, "refcount now=%d",
> > + le32_to_cpu(HDR(bh)->h_refcount));
> > + if (IS_SYNC(inode))
> > + sync_dirty_buffer(bh);
> > + }
> > +}
> > +
> > /*
> > * Second half of ext2_xattr_set(): Update the file system.
> > */
> > @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
> > * If there was an old block and we are no longer using it,
> > * release the old block.
> > */
> > - lock_buffer(old_bh);
> > - if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> > - __u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> > -
> > - /*
> > - * This must happen under buffer lock for
> > - * ext2_xattr_set2() to reliably detect freed block
> > - */
> > - mb_cache_entry_delete(ea_block_cache, hash,
> > - old_bh->b_blocknr);
> > - /* Free the old block. */
> > - ea_bdebug(old_bh, "freeing");
> > - ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> > - mark_inode_dirty(inode);
>
> ^^^ this is not needed because ext2_free_blocks() will take care of it.
> Hence you have dropped this in ext2_xattr_release_block()

Correct. ext2_free_blocks() always dirties the inode (unless there is
metadata inconsistency found in which case we don't really care).

> > - /* We let our caller release old_bh, so we
> > - * need to duplicate the buffer before. */
> > - get_bh(old_bh);
> > - bforget(old_bh);
> > - } else {
> > - /* Decrement the refcount only. */
> > - le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> > - dquot_free_block_nodirty(inode, 1);
> > - mark_inode_dirty(inode);
>
> Quick qn -> Don't we need mark_inode_dirty() here?

Notice that I've changed dquot_free_block_nodirty() to dquot_free_block()
because quota info update is the only reason why we need to dirty the inode
so why not let quota code handle it...
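
Put concretely (just shorthand for the change, not a quote of the quota code;
that dquot_free_block() dirties the inode itself is taken from the explanation
above):

	/* before: free the quota without dirtying, then dirty by hand */
	dquot_free_block_nodirty(inode, 1);
	mark_inode_dirty(inode);

	/* after: let the quota code dirty the inode for us */
	dquot_free_block(inode, 1);
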

>
> > - mark_buffer_dirty(old_bh);
> > - ea_bdebug(old_bh, "refcount now=%d",
> > - le32_to_cpu(HDR(old_bh)->h_refcount));
> > - }
> > - unlock_buffer(old_bh);
> > + ext2_xattr_release_block(inode, old_bh);
> > }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-07-14 15:00:50

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 02/10] mbcache: Add functions to delete entry if unused

On 22/07/14 04:49PM, Jan Kara wrote:
> On Thu 14-07-22 17:45:32, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Add function mb_cache_entry_delete_or_get() to delete mbcache entry if
> > > it is unused and also add a function to wait for entry to become unused
> > > - mb_cache_entry_wait_unused(). We do not share code between the two
> > > deleting function as one of them will go away soon.
> > >
> > > CC: [email protected]
> > > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > > Signed-off-by: Jan Kara <[email protected]>
> > > ---
> > > fs/mbcache.c | 66 +++++++++++++++++++++++++++++++++++++++--
> > > include/linux/mbcache.h | 10 ++++++-
> > > 2 files changed, 73 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > > index cfc28129fb6f..2010bc80a3f2 100644
> > > --- a/fs/mbcache.c
> > > +++ b/fs/mbcache.c
> > > @@ -11,7 +11,7 @@
> > > /*
> > > * Mbcache is a simple key-value store. Keys need not be unique, however
> > > * key-value pairs are expected to be unique (we use this fact in
> > > - * mb_cache_entry_delete()).
> > > + * mb_cache_entry_delete_or_get()).
> > > *
> > > * Ext2 and ext4 use this cache for deduplication of extended attribute blocks.
> > > * Ext4 also uses it for deduplication of xattr values stored in inodes.
> > > @@ -125,6 +125,19 @@ void __mb_cache_entry_free(struct mb_cache_entry *entry)
> > > }
> > > EXPORT_SYMBOL(__mb_cache_entry_free);
> > >
> > > +/*
> > > + * mb_cache_entry_wait_unused - wait to be the last user of the entry
> > > + *
> > > + * @entry - entry to work on
> > > + *
> > > + * Wait to be the last user of the entry.
> > > + */
> > > +void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> > > +{
> > > + wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> >
> > It's not very intuitive why we check for refcnt <= 3.
> > A small note at the top of this function might be helpful.
> > IIUC, it is because by default when anyone creates an entry we start with
> > a refcnt of 2 (in mb_cache_entry_create).
> > - Now when the user of the entry wants to delete this, it will try and call
> > mb_cache_entry_delete_or_get(). If during this function call it sees that the
> > refcnt is elevated more than 2, that means there is another user of this entry
> > currently active and hence we should wait before we remove this entry from the
> > cache. So it will take an extra refcnt and return.
> > - So then this caller will call mb_cache_entry_wait_unused() for the refcnt to
> > be <= 3, so that the entry can be deleted.
>
> Correct. I will add a comment as you suggest.
>
> > Quick qn -
> > So now is the design like, ext4_evict_ea_inode() will be waiting indefinitely
> > until the other user of this mb_cache entry releases the reference right?
>
> Correct. Similarly for ext4_xattr_release_block().
>
> > And that will not happen until,
> > - either the shrinker removes this entry from the cache during which we are
> > checking if the refcnt <= 3, then we call a wakeup event
>
> No, shrinker will not touch these entries with active users anymore.
>
> > - Or the user removes/deletes the xattr entry
>
> No. We hold a reference to the mbcache entry only while we are trying to reuse
> it. So ext4_xattr_block_cache_find() and ext4_xattr_inode_cache_find() will
> look up a potential mbcache entry that may have the same contents and get a
> reference to it. Then we do comparisons verifying whether the contents really
> match; if yes, we increment the on-disk inode/block refcount. Then we drop the
> mbcache entry reference, which unblocks waiters in mb_cache_entry_wait_unused().
>

ohk, yes. This is where I was a bit confused.
Thanks for explaining it. This makes more sense. I did go through the mbcache
implementation, but I was missing the info on how the callers are using it.

-ritesh

> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-07-14 15:07:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 10/10] mbcache: Automatically delete entries from cache on freeing

On Thu 14-07-22 18:39:51, Ritesh Harjani wrote:
> On 22/07/12 12:54PM, Jan Kara wrote:
> > Use the fact that entries with elevated refcount are not removed from
>
> The elevated refcnt means >= 2?

Well, it means the refcount when there is a real user of the mbcache entry,
so 3 before this patch, 2 after this patch...

> > the hash and just move removal of the entry from the hash to the entry
> > freeing time. When doing this we also change the generic code to hold
> > one reference to the cache entry, not two of them, which makes code
> > somewhat more obvious.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/mbcache.c | 108 +++++++++++++++-------------------------
> > include/linux/mbcache.h | 24 ++++++---
> > 2 files changed, 55 insertions(+), 77 deletions(-)
> >
> > diff --git a/fs/mbcache.c b/fs/mbcache.c
> > index d1ebb5df2856..96f1d49d30a5 100644
> > --- a/fs/mbcache.c
> > +++ b/fs/mbcache.c
> > @@ -90,7 +90,7 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> > return -ENOMEM;
> >
> > INIT_LIST_HEAD(&entry->e_list);
> > - /* One ref for hash, one ref returned */
> > + /* Initial hash reference */
> > atomic_set(&entry->e_refcnt, 1);
> > entry->e_key = key;
> > entry->e_value = value;
> > @@ -106,21 +106,28 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> > }
> > }
> > hlist_bl_add_head(&entry->e_hash_list, head);
> > - hlist_bl_unlock(head);
> > -
> > + /*
> > + * Add entry to LRU list before it can be found by
> > + * mb_cache_entry_delete() to avoid races
> > + */
>
> No reference to mb_cache_entry_delete() now. It is
> mb_cache_entry_delete_or_get()

Thanks, will fix.

> > spin_lock(&cache->c_list_lock);
> > list_add_tail(&entry->e_list, &cache->c_list);
> > - /* Grab ref for LRU list */
> > - atomic_inc(&entry->e_refcnt);
> > cache->c_entry_count++;
> > spin_unlock(&cache->c_list_lock);
> > + hlist_bl_unlock(head);
> >
> > return 0;
> > }
> > EXPORT_SYMBOL(mb_cache_entry_create);
> >
> > -void __mb_cache_entry_free(struct mb_cache_entry *entry)
> > +void __mb_cache_entry_free(struct mb_cache *cache, struct mb_cache_entry *entry)
> > {
> > + struct hlist_bl_head *head;
> > +
> > + head = mb_cache_entry_head(cache, entry->e_key);
> > + hlist_bl_lock(head);
> > + hlist_bl_del(&entry->e_hash_list);
> > + hlist_bl_unlock(head);
> > kmem_cache_free(mb_entry_cache, entry);
> > }
> > EXPORT_SYMBOL(__mb_cache_entry_free);
> > @@ -134,7 +141,7 @@ EXPORT_SYMBOL(__mb_cache_entry_free);
> > */
> > void mb_cache_entry_wait_unused(struct mb_cache_entry *entry)
> > {
> > - wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 3);
> > + wait_var_event(&entry->e_refcnt, atomic_read(&entry->e_refcnt) <= 2);
> > }
> > EXPORT_SYMBOL(mb_cache_entry_wait_unused);
> >
> > @@ -155,10 +162,9 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
> > while (node) {
> > entry = hlist_bl_entry(node, struct mb_cache_entry,
> > e_hash_list);
> > - if (entry->e_key == key && entry->e_reusable) {
> > - atomic_inc(&entry->e_refcnt);
> > + if (entry->e_key == key && entry->e_reusable &&
> > + atomic_inc_not_zero(&entry->e_refcnt))
> > goto out;
> > - }
> > node = node->next;
> > }
> > entry = NULL;
> > @@ -218,10 +224,9 @@ struct mb_cache_entry *mb_cache_entry_get(struct mb_cache *cache, u32 key,
> > head = mb_cache_entry_head(cache, key);
> > hlist_bl_lock(head);
> > hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> > - if (entry->e_key == key && entry->e_value == value) {
> > - atomic_inc(&entry->e_refcnt);
> > + if (entry->e_key == key && entry->e_value == value &&
> > + atomic_inc_not_zero(&entry->e_refcnt))
> > goto out;
> > - }
> > }
> > entry = NULL;
> > out:
> > @@ -244,37 +249,25 @@ EXPORT_SYMBOL(mb_cache_entry_get);
> > struct mb_cache_entry *mb_cache_entry_delete_or_get(struct mb_cache *cache,
> > u32 key, u64 value)
> > {
> > - struct hlist_bl_node *node;
> > - struct hlist_bl_head *head;
> > struct mb_cache_entry *entry;
> >
> > - head = mb_cache_entry_head(cache, key);
> > - hlist_bl_lock(head);
> > - hlist_bl_for_each_entry(entry, node, head, e_hash_list) {
> > - if (entry->e_key == key && entry->e_value == value) {
> > - if (atomic_read(&entry->e_refcnt) > 2) {
> > - atomic_inc(&entry->e_refcnt);
> > - hlist_bl_unlock(head);
> > - return entry;
> > - }
> > - /* We keep hash list reference to keep entry alive */
> > - hlist_bl_del_init(&entry->e_hash_list);
> > - hlist_bl_unlock(head);
> > - spin_lock(&cache->c_list_lock);
> > - if (!list_empty(&entry->e_list)) {
> > - list_del_init(&entry->e_list);
> > - if (!WARN_ONCE(cache->c_entry_count == 0,
> > - "mbcache: attempt to decrement c_entry_count past zero"))
> > - cache->c_entry_count--;
> > - atomic_dec(&entry->e_refcnt);
> > - }
> > - spin_unlock(&cache->c_list_lock);
> > - mb_cache_entry_put(cache, entry);
> > - return NULL;
> > - }
> > - }
> > - hlist_bl_unlock(head);
> > + entry = mb_cache_entry_get(cache, key, value);
> > + if (!entry)
> > + return NULL;
> > +
> > + /*
> > + * Drop the ref we got from mb_cache_entry_get() and the initial hash
> > + * ref if we are the last user
> > + */
> > + if (atomic_cmpxchg(&entry->e_refcnt, 2, 0) != 2)
> > + return entry;
> >
> > + spin_lock(&cache->c_list_lock);
> > + if (!list_empty(&entry->e_list))
> > + list_del_init(&entry->e_list);
> > + cache->c_entry_count--;
> > + spin_unlock(&cache->c_list_lock);
> > + __mb_cache_entry_free(cache, entry);
> > return NULL;
> > }
> > EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> > @@ -306,42 +299,24 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> > unsigned long nr_to_scan)
> > {
> > struct mb_cache_entry *entry;
> > - struct hlist_bl_head *head;
> > unsigned long shrunk = 0;
> >
> > spin_lock(&cache->c_list_lock);
> > while (nr_to_scan-- && !list_empty(&cache->c_list)) {
> > entry = list_first_entry(&cache->c_list,
> > struct mb_cache_entry, e_list);
> > - if (entry->e_referenced || atomic_read(&entry->e_refcnt) > 2) {
> > + /* Drop initial hash reference if there is no user */
> > + if (entry->e_referenced ||
> > + atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
>
> So here, if the refcnt of an entry is 1, that means it is still in use, right?

No. One reference is held by the hashtable/LRU itself. So 1 means entry is
free.

> So the shrinker will do the atomic_cmpxchg and bring it to 0, and then
> delete the entry from the cache?
> This will only happen for an entry with just 1 reference count.
>
> Is that the correct understanding?

Correct. Basically the atomic 1 -> 0 transition makes sure we are not
racing with anybody else doing the 1 -> 2 transition. And once reference
gets to 0, we make sure no new references can be created.
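
If you want to play with the two transitions outside the kernel, they can be
modelled with C11 atomics roughly as below. This is a simplified userspace
sketch, not the mbcache code; the struct and function names are made up:

	#include <stdatomic.h>
	#include <stdbool.h>

	struct entry {
		atomic_int refcnt;  /* 1 == only the cache's own reference */
	};

	/* lookup side: pin the entry only while it is still alive,
	 * analogous to atomic_inc_not_zero() in the patch */
	static bool entry_get(struct entry *e)
	{
		int old = atomic_load(&e->refcnt);

		while (old > 0) {
			/* on failure the CAS reloads 'old' and we retry */
			if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
				return true;  /* we now hold a reference */
		}
		return false;  /* refcnt already hit 0, entry is being freed */
	}

	/* shrinker side: claim the entry for freeing only if nobody else
	 * holds a reference, analogous to atomic_cmpxchg(&refcnt, 1, 0) */
	static bool entry_try_free(struct entry *e)
	{
		int expected = 1;

		return atomic_compare_exchange_strong(&e->refcnt, &expected, 0);
	}

Once either side wins, the other necessarily fails: entry_get() cannot succeed
after the 1 -> 0 transition, and entry_try_free() cannot succeed after a
1 -> 2 transition.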

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-07-14 16:23:41

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH 06/10] ext2: Factor out freeing of xattr block reference

On 22/07/14 04:55PM, Jan Kara wrote:
> On Thu 14-07-22 18:07:14, Ritesh Harjani wrote:
> > On 22/07/12 12:54PM, Jan Kara wrote:
> > > Freeing of xattr block reference is open-coded in two places. Factor it out
> > > into a separate function and use it.
> >
> > Looked into the refactoring logic. The patch looks good to me.
> > Small queries below -
> >
> > >
> > > Signed-off-by: Jan Kara <[email protected]>
> > > ---
> > > fs/ext2/xattr.c | 90 +++++++++++++++++++++----------------------------
> > > 1 file changed, 38 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> > > index 841fa6d9d744..9885294993ef 100644
> > > --- a/fs/ext2/xattr.c
> > > +++ b/fs/ext2/xattr.c
> > > @@ -651,6 +651,42 @@ ext2_xattr_set(struct inode *inode, int name_index, const char *name,
> > > return error;
> > > }
> > >
> > > +static void ext2_xattr_release_block(struct inode *inode,
> > > + struct buffer_head *bh)
> > > +{
> > > + struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
> > > +
> > > + lock_buffer(bh);
> > > + if (HDR(bh)->h_refcount == cpu_to_le32(1)) {
> > > + __u32 hash = le32_to_cpu(HDR(bh)->h_hash);
> > > +
> > > + /*
> > > + * This must happen under buffer lock for
> > > + * ext2_xattr_set2() to reliably detect freed block
> > > + */
> > > + mb_cache_entry_delete(ea_block_cache, hash,
> > > + bh->b_blocknr);
> > > + /* Free the old block. */
> > > + ea_bdebug(bh, "freeing");
> > > + ext2_free_blocks(inode, bh->b_blocknr, 1);
> > > + /* We let our caller release bh, so we
> > > + * need to duplicate the buffer before. */
> > > + get_bh(bh);
> > > + bforget(bh);
> > > + unlock_buffer(bh);
> > > + } else {
> > > + /* Decrement the refcount only. */
> > > + le32_add_cpu(&HDR(bh)->h_refcount, -1);
> > > + dquot_free_block(inode, 1);
> > > + mark_buffer_dirty(bh);
> > > + unlock_buffer(bh);
> > > + ea_bdebug(bh, "refcount now=%d",
> > > + le32_to_cpu(HDR(bh)->h_refcount));
> > > + if (IS_SYNC(inode))
> > > + sync_dirty_buffer(bh);
> > > + }
> > > +}
> > > +
> > > /*
> > > * Second half of ext2_xattr_set(): Update the file system.
> > > */
> > > @@ -747,34 +783,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
> > > * If there was an old block and we are no longer using it,
> > > * release the old block.
> > > */
> > > - lock_buffer(old_bh);
> > > - if (HDR(old_bh)->h_refcount == cpu_to_le32(1)) {
> > > - __u32 hash = le32_to_cpu(HDR(old_bh)->h_hash);
> > > -
> > > - /*
> > > - * This must happen under buffer lock for
> > > - * ext2_xattr_set2() to reliably detect freed block
> > > - */
> > > - mb_cache_entry_delete(ea_block_cache, hash,
> > > - old_bh->b_blocknr);
> > > - /* Free the old block. */
> > > - ea_bdebug(old_bh, "freeing");
> > > - ext2_free_blocks(inode, old_bh->b_blocknr, 1);
> > > - mark_inode_dirty(inode);
> >
> > ^^^ this is not needed because ext2_free_blocks() will take care of it.
> > Hence you have dropped this in ext2_xattr_release_block()
>
> Correct. ext2_free_blocks() always dirties the inode (unless there is
> metadata inconsistency found in which case we don't really care).
>
> > > - /* We let our caller release old_bh, so we
> > > - * need to duplicate the buffer before. */
> > > - get_bh(old_bh);
> > > - bforget(old_bh);
> > > - } else {
> > > - /* Decrement the refcount only. */
> > > - le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
> > > - dquot_free_block_nodirty(inode, 1);
> > > - mark_inode_dirty(inode);
> >
> > Quick qn -> Don't we need mark_inode_dirty() here?
>
> Notice that I've changed dquot_free_block_nodirty() to dquot_free_block()
> because quota info update is the only reason why we need to dirty the inode
> so why not let quota code handle it...
>

Ok, yes. Missed it. Thanks for pointing it out.

-ritesh

2022-07-22 14:01:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 01/10] mbcache: Don't reclaim used entries

On Tue, 12 Jul 2022 12:54:20 +0200, Jan Kara wrote:
> Do not reclaim entries that are currently used by somebody from a
> shrinker. Firstly, these entries are likely useful. Secondly, we will
> need to keep such entries to protect pending increment of xattr block
> refcount.
>

Applied, thanks! (Some slight adjustments were needed to resolve a
merge conflict.)

[01/10] mbcache: Don't reclaim used entries
commit: ee595bcf21a86af4cff673000e2728d61c7c0e7b
[02/10] mbcache: Add functions to delete entry if unused
commit: ad3923aa44185f5f65e17764fe5c30501c6dfd22
[03/10] ext4: Remove EA inode entry from mbcache on inode eviction
commit: 428dc374a6cb6c0cbbf6fe8984b667ef78dc7d75
[04/10] ext4: Unindent codeblock in ext4_xattr_block_set()
commit: d52086dcf26a6284b08b5544210a7475b4837d52
[05/10] ext4: Fix race when reusing xattr blocks
commit: 132991ed28822cfb4be41ac72195f00fc0baf3c8
[06/10] ext2: Factor out freeing of xattr block reference
commit: c30e78a5f165244985aa346bdd460d459094470e
[07/10] ext2: Unindent codeblock in ext2_xattr_set()
commit: 0e85fb030d13e427deca44a95aabb2475614f8d2
[08/10] ext2: Avoid deleting xattr block that is being reused
commit: 44ce98e77ab4583b17ff4f501c2076eec3b759d7
[09/10] mbcache: Remove mb_cache_entry_delete()
commit: c3671ffa0919f2d433576c99c4e211cd367afda0
[10/10] mbcache: Automatically delete entries from cache on freeing
commit: b51539a7d04fb7d05b28ab9387364ccde88b6b6d

Best regards,
--
Theodore Ts'o <[email protected]>

2022-07-25 15:28:05

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 05/10] ext4: Fix race when reusing xattr blocks

On Sat 16-07-22 11:00:46, Zhihao Cheng wrote:
> On 2022/7/12 18:54, Jan Kara wrote:
> Hi Jan, one little question confuses me:
> > When ext4_xattr_block_set() decides to remove xattr block the following
> > race can happen:
> >
> > CPU1 CPU2
> > ext4_xattr_block_set() ext4_xattr_release_block()
> > new_bh = ext4_xattr_block_cache_find()
> >
> > lock_buffer(bh);
> > ref = le32_to_cpu(BHDR(bh)->h_refcount);
> > if (ref == 1) {
> > ...
> > mb_cache_entry_delete();
> > unlock_buffer(bh);
> > ext4_free_blocks();
> > ...
> > ext4_forget(..., bh, ...);
> > jbd2_journal_revoke(..., bh);
> >
> > ext4_journal_get_write_access(..., new_bh, ...)
> > do_get_write_access()
> > jbd2_journal_cancel_revoke(..., new_bh);
> >
> > Later the code in ext4_xattr_block_set() finds out the block got freed
> > and cancels reusal of the block but the revoke stays canceled and so in
> > case of block reuse and journal replay the filesystem can get corrupted.
> > If the race works out slightly differently, we can also hit assertions
> > in the jbd2 code.
> >
> > Fix the problem by making sure that once matching mbcache entry is
> > found, code dropping the last xattr block reference (or trying to modify
> > xattr block in place) waits until the mbcache entry reference is
> > dropped. This way code trying to reuse xattr block is protected from
> > someone trying to drop the last reference to xattr block.
> >
> > Reported-and-tested-by: Ritesh Harjani <[email protected]>
> > CC: [email protected]
> > Fixes: 82939d7999df ("ext4: convert to mbcache2")
> > Signed-off-by: Jan Kara <[email protected]>

...

> > @@ -1991,18 +2020,13 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> > lock_buffer(new_bh);
> > /*
> > * We have to be careful about races with
> > - * freeing, rehashing or adding references to
> > - * xattr block. Once we hold buffer lock xattr
> > - * block's state is stable so we can check
> > - * whether the block got freed / rehashed or
> > - * not. Since we unhash mbcache entry under
> > - * buffer lock when freeing / rehashing xattr
> > - * block, checking whether entry is still
> > - * hashed is reliable. Same rules hold for
> > - * e_reusable handling.
> > + * adding references to xattr block. Once we
> > + * hold buffer lock xattr block's state is
> > + * stable so we can check the additional
> > + * reference fits.
> > */
> > - if (hlist_bl_unhashed(&ce->e_hash_list) ||
> > - !ce->e_reusable) {
> > + ref = le32_to_cpu(BHDR(new_bh)->h_refcount) + 1;
> > + if (ref > EXT4_XATTR_REFCOUNT_MAX) {
>
> So far, we have mb_cache_entry_delete_or_get() and
> mb_cache_entry_wait_unused(), so used cache entry cannot be concurrently
> removed. Removing check 'hlist_bl_unhashed(&ce->e_hash_list)' is okay.
>
> What's the effect of changing the other two checks, 'ref >=
> EXT4_XATTR_REFCOUNT_MAX' and '!ce->e_reusable'? Is it to make the code more
> obvious, e.g. to point out the condition 'ref <= EXT4_XATTR_REFCOUNT_MAX'
> rather than 'ce->e_reusable', since we have already checked 'ce->e_reusable'
> in ext4_xattr_block_cache_find() before?

Well, ce->e_reusable is set if and only if BHDR(new_bh)->h_refcount <
EXT4_XATTR_REFCOUNT_MAX. So checking whether the refcount is small enough
is all that is needed and we don't need the ce->e_reusable check here.
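
Stated as the invariant the new check relies on (informal shorthand, not code
from the patch):

	/* maintained by the ext4 callers for block cache entries */
	ce->e_reusable == (le32_to_cpu(BHDR(new_bh)->h_refcount) < EXT4_XATTR_REFCOUNT_MAX)

so once we verify under the buffer lock that one more reference still fits,
a separate e_reusable test adds nothing.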

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR