2023-05-09 17:05:51

by Kent Overstreet

Subject: [PATCH 00/32] bcachefs - a new COW filesystem

I'm submitting the bcachefs filesystem for review and inclusion.

Included in this patch series are all the non fs/bcachefs/ patches. The
entire tree, based on v6.3, may be found at:

http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream

----------------------------------------------------------------

bcachefs overview, status:

Features:
- too many to list

Known bugs:
- too many to list

Status:
- Snapshots have been declared stable; one serious bug report is still
outstanding to look into, but most users report it working well.

These are RW btrfs-style snapshots, but with far better scalability
and no scalability issues with sparse snapshots due to key level
versioning.

- Erasure coding is getting really close; we hope to have it ready for
users to beat on by this summer. This is a novel RAID/erasure
coding design with no write hole, and no fragmentation of writes
(e.g. RAIDZ).

- Tons of scalability work has been finished over the past year; users
are running it on 100 TB filesystems without complaint, and we're
waiting for the first 1 PB user. The next thing to address re:
scalability is fsck/recovery memory usage.

- Test infrastructure! This is a major project milestone; check out our
test dashboard at
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

Other project notes:

irc://irc.oftc.net/bcache is where most activity happens; I'm always
there, and most code review happens there - I find the conversational
format more productive.

------------------------------------------------

patches in this series:

Christopher James Halse Rogers (1):
stacktrace: Export stack_trace_save_tsk

Daniel Hill (1):
lib: add mean and variance module.

Dave Chinner (3):
vfs: factor out inode hash head calculation
hlist-bl: add hlist_bl_fake()
vfs: inode cache conversion to hash-bl

Kent Overstreet (27):
Compiler Attributes: add __flatten
locking/lockdep: lock_class_is_held()
locking/lockdep: lockdep_set_no_check_recursion()
locking: SIX locks (shared/intent/exclusive)
MAINTAINERS: Add entry for six locks
sched: Add task_struct->faults_disabled_mapping
mm: Bring back vmalloc_exec
fs: factor out d_mark_tmpfile()
block: Add some exports for bcachefs
block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
block: Bring back zero_fill_bio_iter
block: Rework bio_for_each_segment_all()
block: Rework bio_for_each_folio_all()
block: Don't block on s_umount from __invalidate_super()
bcache: move closures to lib/
MAINTAINERS: Add entry for closures
closures: closure_wait_event()
closures: closure_nr_remaining()
closures: Add a missing include
iov_iter: copy_folio_from_iter_atomic()
MAINTAINERS: Add entry for generic-radix-tree
lib/generic-radix-tree.c: Don't overflow in peek()
lib/generic-radix-tree.c: Add a missing include
lib/generic-radix-tree.c: Add peek_prev()
lib/string_helpers: string_get_size() now returns characters wrote
lib: Export errname
MAINTAINERS: Add entry for bcachefs

MAINTAINERS | 39 +
block/bdev.c | 2 +-
block/bio.c | 57 +-
block/blk-core.c | 1 +
block/blk-map.c | 38 +-
block/blk.h | 1 -
block/bounce.c | 12 +-
drivers/md/bcache/Kconfig | 10 +-
drivers/md/bcache/Makefile | 4 +-
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/btree.c | 8 +-
drivers/md/bcache/super.c | 1 -
drivers/md/bcache/util.h | 3 +-
drivers/md/dm-crypt.c | 10 +-
drivers/md/raid1.c | 4 +-
fs/btrfs/disk-io.c | 4 +-
fs/btrfs/extent_io.c | 50 +-
fs/btrfs/raid56.c | 14 +-
fs/crypto/bio.c | 9 +-
fs/dcache.c | 12 +-
fs/erofs/zdata.c | 4 +-
fs/ext4/page-io.c | 8 +-
fs/ext4/readpage.c | 4 +-
fs/f2fs/data.c | 20 +-
fs/gfs2/lops.c | 10 +-
fs/gfs2/meta_io.c | 8 +-
fs/inode.c | 218 +++--
fs/iomap/buffered-io.c | 14 +-
fs/mpage.c | 4 +-
fs/squashfs/block.c | 48 +-
fs/squashfs/lz4_wrapper.c | 17 +-
fs/squashfs/lzo_wrapper.c | 17 +-
fs/squashfs/xz_wrapper.c | 19 +-
fs/squashfs/zlib_wrapper.c | 18 +-
fs/squashfs/zstd_wrapper.c | 19 +-
fs/super.c | 40 +-
fs/verity/verify.c | 9 +-
include/linux/bio.h | 132 +--
include/linux/blkdev.h | 1 +
include/linux/bvec.h | 70 +-
.../md/bcache => include/linux}/closure.h | 46 +-
include/linux/compiler_attributes.h | 5 +
include/linux/dcache.h | 1 +
include/linux/fs.h | 10 +-
include/linux/generic-radix-tree.h | 68 +-
include/linux/list_bl.h | 22 +
include/linux/lockdep.h | 10 +
include/linux/lockdep_types.h | 2 +-
include/linux/mean_and_variance.h | 219 +++++
include/linux/sched.h | 1 +
include/linux/six.h | 210 +++++
include/linux/string_helpers.h | 4 +-
include/linux/uio.h | 2 +
include/linux/vmalloc.h | 1 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 3 +
kernel/locking/Makefile | 1 +
kernel/locking/lockdep.c | 46 ++
kernel/locking/six.c | 779 ++++++++++++++++++
kernel/module/main.c | 4 +-
kernel/stacktrace.c | 2 +
lib/Kconfig | 3 +
lib/Kconfig.debug | 18 +
lib/Makefile | 2 +
{drivers/md/bcache => lib}/closure.c | 36 +-
lib/errname.c | 1 +
lib/generic-radix-tree.c | 76 +-
lib/iov_iter.c | 53 +-
lib/math/Kconfig | 3 +
lib/math/Makefile | 2 +
lib/math/mean_and_variance.c | 136 +++
lib/math/mean_and_variance_test.c | 155 ++++
lib/string_helpers.c | 8 +-
mm/nommu.c | 18 +
mm/vmalloc.c | 21 +
75 files changed, 2485 insertions(+), 445 deletions(-)
rename {drivers/md/bcache => include/linux}/closure.h (93%)
create mode 100644 include/linux/mean_and_variance.h
create mode 100644 include/linux/six.h
create mode 100644 kernel/locking/six.c
rename {drivers/md/bcache => lib}/closure.c (88%)
create mode 100644 lib/math/mean_and_variance.c
create mode 100644 lib/math/mean_and_variance_test.c

--
2.40.1


2023-05-09 17:06:53

by Kent Overstreet

Subject: [PATCH 21/32] hlist-bl: add hlist_bl_fake()

From: Dave Chinner <[email protected]>

In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.
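
For illustration, a minimal sketch of the pattern these helpers enable
(the function names below are hypothetical; only the hlist_bl helpers
themselves are real):

#include <linux/list_bl.h>

static void myobj_init_unhashed(struct hlist_bl_node *n)
{
	INIT_HLIST_BL_NODE(n);
	/* make the node its own predecessor, so it looks hashed */
	hlist_bl_add_fake(n);
}

static void myobj_unhash(struct hlist_bl_head *head, struct hlist_bl_node *n)
{
	bool fake = hlist_bl_fake(n);

	/* fake nodes were never on a real list, so no list lock needed */
	if (!fake)
		hlist_bl_lock(head);
	hlist_bl_del_init(n);
	if (!fake)
		hlist_bl_unlock(head);
}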

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/list_bl.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index ae1b541446..8ee2bf5af1 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -143,6 +143,28 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
}
}

+/**
+ * hlist_bl_add_fake - create a fake list consisting of a single headless node
+ * @n: Node to make a fake list out of
+ *
+ * This makes @n appear to be its own predecessor on a headless hlist.
+ * The point of this is to allow things like hlist_bl_del() to work correctly
+ * in cases where there is no list.
+ */
+static inline void hlist_bl_add_fake(struct hlist_bl_node *n)
+{
+ n->pprev = &n->next;
+}
+
+/**
+ * hlist_bl_fake - Is this node a fake hlist_bl?
+ * @n: Node to check for being a self-referential fake hlist.
+ */
+static inline bool hlist_bl_fake(struct hlist_bl_node *n)
+{
+ return n->pprev == &n->next;
+}
+
static inline void hlist_bl_lock(struct hlist_bl_head *b)
{
bit_spin_lock(0, (unsigned long *)b);
--
2.40.1

2023-05-09 17:06:53

by Kent Overstreet

Subject: [PATCH 12/32] block: Rework bio_for_each_segment_all()

This patch reworks bio_for_each_segment_all() to be more in line with
how the other bio iterators work:

- bio_iter_all_peek() now returns a synthesized bio_vec; we no longer
stash one in the iterator and pass out a pointer to it. This makes it
clearer what is a constructed value vs. a reference to something
pre-existing, and it will also help with cleaning up and
consolidating code with bio_for_each_folio_all() (see the sketch below).

- We now provide bio_for_each_segment_all_continue() for squashfs;
this makes its code clearer.
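
A sketch of the new calling convention (the two functions below are
hypothetical users, modelled on the conversions in this patch):

#include <linux/bio.h>
#include <linux/mm.h>

/* Completion handler: the loop variable is now a struct bio_vec by
 * value - a synthesized, single-page bvec - not a pointer into the bio. */
static void example_end_io(struct bio *bio)
{
	struct bvec_iter_all iter;
	struct bio_vec bv;

	bio_for_each_segment_all(bv, bio, iter)
		put_page(bv.bv_page);

	bio_put(bio);
}

/* Picking up partway through a bio, as the squashfs conversions do: */
static void example_skip_then_walk(struct bio *bio, unsigned offset)
{
	struct bvec_iter_all iter;
	struct bio_vec bv;

	bvec_iter_all_init(&iter);
	bio_iter_all_advance(bio, &iter, offset);

	bio_for_each_segment_all_continue(bv, bio, iter) {
		/* process bv.bv_page / bv.bv_offset / bv.bv_len */
	}
}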

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Ming Lei <[email protected]>
Cc: Phillip Lougher <[email protected]>
---
block/bio.c | 38 ++++++++++++------------
block/blk-map.c | 38 ++++++++++++------------
block/bounce.c | 12 ++++----
drivers/md/bcache/btree.c | 8 ++---
drivers/md/dm-crypt.c | 10 +++----
drivers/md/raid1.c | 4 +--
fs/btrfs/disk-io.c | 4 +--
fs/btrfs/extent_io.c | 50 +++++++++++++++----------------
fs/btrfs/raid56.c | 14 ++++-----
fs/erofs/zdata.c | 4 +--
fs/ext4/page-io.c | 8 ++---
fs/ext4/readpage.c | 4 +--
fs/f2fs/data.c | 20 ++++++-------
fs/gfs2/lops.c | 10 +++----
fs/gfs2/meta_io.c | 8 ++---
fs/mpage.c | 4 +--
fs/squashfs/block.c | 48 +++++++++++++++++-------------
fs/squashfs/lz4_wrapper.c | 17 ++++++-----
fs/squashfs/lzo_wrapper.c | 17 ++++++-----
fs/squashfs/xz_wrapper.c | 19 ++++++------
fs/squashfs/zlib_wrapper.c | 18 ++++++-----
fs/squashfs/zstd_wrapper.c | 19 ++++++------
include/linux/bio.h | 34 ++++++++++++++++-----
include/linux/bvec.h | 61 ++++++++++++++++++++++----------------
24 files changed, 256 insertions(+), 213 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 70b5c987bc..f2845d4e47 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1163,13 +1163,13 @@ EXPORT_SYMBOL(bio_add_folio);

void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
- struct bvec_iter_all iter_all;
- struct bio_vec *bvec;
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all) {
- if (mark_dirty && !PageCompound(bvec->bv_page))
- set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ bio_for_each_segment_all(bvec, bio, iter) {
+ if (mark_dirty && !PageCompound(bvec.bv_page))
+ set_page_dirty_lock(bvec.bv_page);
+ put_page(bvec.bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1436,11 +1436,11 @@ EXPORT_SYMBOL(bio_copy_data);

void bio_free_pages(struct bio *bio)
{
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all)
- __free_page(bvec->bv_page);
+ bio_for_each_segment_all(bvec, bio, iter)
+ __free_page(bvec.bv_page);
}
EXPORT_SYMBOL(bio_free_pages);

@@ -1475,12 +1475,12 @@ EXPORT_SYMBOL(bio_free_pages);
*/
void bio_set_pages_dirty(struct bio *bio)
{
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all) {
- if (!PageCompound(bvec->bv_page))
- set_page_dirty_lock(bvec->bv_page);
+ bio_for_each_segment_all(bvec, bio, iter) {
+ if (!PageCompound(bvec.bv_page))
+ set_page_dirty_lock(bvec.bv_page);
}
}
EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
@@ -1524,12 +1524,12 @@ static void bio_dirty_fn(struct work_struct *work)

void bio_check_pages_dirty(struct bio *bio)
{
- struct bio_vec *bvec;
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;
unsigned long flags;
- struct bvec_iter_all iter_all;

- bio_for_each_segment_all(bvec, bio, iter_all) {
- if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+ bio_for_each_segment_all(bvec, bio, iter) {
+ if (!PageDirty(bvec.bv_page) && !PageCompound(bvec.bv_page))
goto defer;
}

diff --git a/block/blk-map.c b/block/blk-map.c
index 9137d16cec..5774a9e467 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -46,21 +46,21 @@ static struct bio_map_data *bio_alloc_map_data(struct iov_iter *data,
*/
static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
{
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all bv_iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all) {
+ bio_for_each_segment_all(bvec, bio, bv_iter) {
ssize_t ret;

- ret = copy_page_from_iter(bvec->bv_page,
- bvec->bv_offset,
- bvec->bv_len,
+ ret = copy_page_from_iter(bvec.bv_page,
+ bvec.bv_offset,
+ bvec.bv_len,
iter);

if (!iov_iter_count(iter))
break;

- if (ret < bvec->bv_len)
+ if (ret < bvec.bv_len)
return -EFAULT;
}

@@ -77,21 +77,21 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
*/
static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
{
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all bv_iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all) {
+ bio_for_each_segment_all(bvec, bio, bv_iter) {
ssize_t ret;

- ret = copy_page_to_iter(bvec->bv_page,
- bvec->bv_offset,
- bvec->bv_len,
+ ret = copy_page_to_iter(bvec.bv_page,
+ bvec.bv_offset,
+ bvec.bv_len,
&iter);

if (!iov_iter_count(&iter))
break;

- if (ret < bvec->bv_len)
+ if (ret < bvec.bv_len)
return -EFAULT;
}

@@ -442,12 +442,12 @@ static void bio_copy_kern_endio(struct bio *bio)
static void bio_copy_kern_endio_read(struct bio *bio)
{
char *p = bio->bi_private;
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;

- bio_for_each_segment_all(bvec, bio, iter_all) {
- memcpy_from_bvec(p, bvec);
- p += bvec->bv_len;
+ bio_for_each_segment_all(bvec, bio, iter) {
+ memcpy_from_bvec(p, &bvec);
+ p += bvec.bv_len;
}

bio_copy_kern_endio(bio);
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9..e701832d76 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -102,18 +102,18 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
static void bounce_end_io(struct bio *bio)
{
struct bio *bio_orig = bio->bi_private;
- struct bio_vec *bvec, orig_vec;
+ struct bio_vec bvec, orig_vec;
struct bvec_iter orig_iter = bio_orig->bi_iter;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;

/*
* free up bounce indirect pages used
*/
- bio_for_each_segment_all(bvec, bio, iter_all) {
+ bio_for_each_segment_all(bvec, bio, iter) {
orig_vec = bio_iter_iovec(bio_orig, orig_iter);
- if (bvec->bv_page != orig_vec.bv_page) {
- dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
- mempool_free(bvec->bv_page, &page_pool);
+ if (bvec.bv_page != orig_vec.bv_page) {
+ dec_zone_page_state(bvec.bv_page, NR_BOUNCE);
+ mempool_free(bvec.bv_page, &page_pool);
}
bio_advance_iter(bio_orig, &orig_iter, orig_vec.bv_len);
}
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 147c493a98..98ce12b239 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -373,12 +373,12 @@ static void do_btree_node_write(struct btree *b)
bset_sector_offset(&b->keys, i));

if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
- struct bio_vec *bv;
+ struct bio_vec bv;
void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;

- bio_for_each_segment_all(bv, b->bio, iter_all) {
- memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
+ bio_for_each_segment_all(bv, b->bio, iter) {
+ memcpy(page_address(bv.bv_page), addr, PAGE_SIZE);
addr += PAGE_SIZE;
}

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 3ba53dc3cc..166bb4fdb4 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1713,12 +1713,12 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)

static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
{
- struct bio_vec *bv;
- struct bvec_iter_all iter_all;
+ struct bvec_iter_all iter;
+ struct bio_vec bv;

- bio_for_each_segment_all(bv, clone, iter_all) {
- BUG_ON(!bv->bv_page);
- mempool_free(bv->bv_page, &cc->page_pool);
+ bio_for_each_segment_all(bv, clone, iter) {
+ BUG_ON(!bv.bv_page);
+ mempool_free(bv.bv_page, &cc->page_pool);
}
}

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 68a9e2d998..4f58cae37e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2188,7 +2188,7 @@ static void process_checks(struct r1bio *r1_bio)
blk_status_t status = sbio->bi_status;
struct page **ppages = get_resync_pages(pbio)->pages;
struct page **spages = get_resync_pages(sbio)->pages;
- struct bio_vec *bi;
+ struct bio_vec bi;
int page_len[RESYNC_PAGES] = { 0 };
struct bvec_iter_all iter_all;

@@ -2198,7 +2198,7 @@ static void process_checks(struct r1bio *r1_bio)
sbio->bi_status = 0;

bio_for_each_segment_all(bi, sbio, iter_all)
- page_len[j++] = bi->bv_len;
+ page_len[j++] = bi.bv_len;

if (!status) {
for (j = vcnt; j-- ; ) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9e1596bb20..92b3396c15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3804,12 +3804,12 @@ ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
static void btrfs_end_super_write(struct bio *bio)
{
struct btrfs_device *device = bio->bi_private;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;
struct page *page;

bio_for_each_segment_all(bvec, bio, iter_all) {
- page = bvec->bv_page;
+ page = bvec.bv_page;

if (bio->bi_status) {
btrfs_warn_rl_in_rcu(device->fs_info,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 40300e8e5f..5796c99ea1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -581,34 +581,34 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
{
struct bio *bio = &bbio->bio;
int error = blk_status_to_errno(bio->bi_status);
- struct bio_vec *bvec;
+ struct bio_vec bvec;
u64 start;
u64 end;
struct bvec_iter_all iter_all;

ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
const u32 sectorsize = fs_info->sectorsize;

/* Our read/write should always be sector aligned. */
- if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+ if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
btrfs_err(fs_info,
"partial page write in btrfs with offset %u and length %u",
- bvec->bv_offset, bvec->bv_len);
- else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+ bvec.bv_offset, bvec.bv_len);
+ else if (!IS_ALIGNED(bvec.bv_len, sectorsize))
btrfs_info(fs_info,
"incomplete page write with offset %u and length %u",
- bvec->bv_offset, bvec->bv_len);
+ bvec.bv_offset, bvec.bv_len);

- start = page_offset(page) + bvec->bv_offset;
- end = start + bvec->bv_len - 1;
+ start = page_offset(page) + bvec.bv_offset;
+ end = start + bvec.bv_len - 1;

end_extent_writepage(page, error, start, end);

- btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
+ btrfs_page_clear_writeback(fs_info, page, start, bvec.bv_len);
}

bio_put(bio);
@@ -736,7 +736,7 @@ static struct extent_buffer *find_extent_buffer_readpage(
static void end_bio_extent_readpage(struct btrfs_bio *bbio)
{
struct bio *bio = &bbio->bio;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct processed_extent processed = { 0 };
/*
* The offset to the beginning of a bio, since one bio can never be
@@ -749,7 +749,7 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
bool uptodate = !bio->bi_status;
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
const u32 sectorsize = fs_info->sectorsize;
@@ -769,19 +769,19 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
* for unaligned offsets, and an error if they don't add up to
* a full sector.
*/
- if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+ if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
btrfs_err(fs_info,
"partial page read in btrfs with offset %u and length %u",
- bvec->bv_offset, bvec->bv_len);
- else if (!IS_ALIGNED(bvec->bv_offset + bvec->bv_len,
+ bvec.bv_offset, bvec.bv_len);
+ else if (!IS_ALIGNED(bvec.bv_offset + bvec.bv_len,
sectorsize))
btrfs_info(fs_info,
"incomplete page read with offset %u and length %u",
- bvec->bv_offset, bvec->bv_len);
+ bvec.bv_offset, bvec.bv_len);

- start = page_offset(page) + bvec->bv_offset;
- end = start + bvec->bv_len - 1;
- len = bvec->bv_len;
+ start = page_offset(page) + bvec.bv_offset;
+ end = start + bvec.bv_len - 1;
+ len = bvec.bv_len;

mirror = bbio->mirror_num;
if (uptodate && !is_data_inode(inode) &&
@@ -1993,7 +1993,7 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
{
struct bio *bio = &bbio->bio;
struct btrfs_fs_info *fs_info;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
@@ -2001,12 +2001,12 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)

ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
- u64 bvec_start = page_offset(page) + bvec->bv_offset;
- u64 bvec_end = bvec_start + bvec->bv_len - 1;
+ struct page *page = bvec.bv_page;
+ u64 bvec_start = page_offset(page) + bvec.bv_offset;
+ u64 bvec_end = bvec_start + bvec.bv_len - 1;
u64 cur_bytenr = bvec_start;

- ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+ ASSERT(IS_ALIGNED(bvec.bv_len, fs_info->nodesize));

/* Iterate through all extent buffers in the range */
while (cur_bytenr <= bvec_end) {
@@ -2050,14 +2050,14 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
static void end_bio_extent_buffer_writepage(struct btrfs_bio *bbio)
{
struct bio *bio = &bbio->bio;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct extent_buffer *eb;
int done;
struct bvec_iter_all iter_all;

ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;

eb = (struct extent_buffer *)page->private;
BUG_ON(!eb);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 642828c1b2..39d8101541 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1388,7 +1388,7 @@ static struct sector_ptr *find_stripe_sector(struct btrfs_raid_bio *rbio,
static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
{
const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -1397,9 +1397,9 @@ static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
struct sector_ptr *sector;
int pgoff;

- for (pgoff = bvec->bv_offset; pgoff - bvec->bv_offset < bvec->bv_len;
+ for (pgoff = bvec.bv_offset; pgoff - bvec.bv_offset < bvec.bv_len;
pgoff += sectorsize) {
- sector = find_stripe_sector(rbio, bvec->bv_page, pgoff);
+ sector = find_stripe_sector(rbio, bvec.bv_page, pgoff);
ASSERT(sector);
if (sector)
sector->uptodate = 1;
@@ -1453,7 +1453,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
{
struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
int total_sector_nr = get_bio_sector_nr(rbio, bio);
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

/* No data csum for the whole stripe, no need to verify. */
@@ -1467,8 +1467,8 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
bio_for_each_segment_all(bvec, bio, iter_all) {
int bv_offset;

- for (bv_offset = bvec->bv_offset;
- bv_offset < bvec->bv_offset + bvec->bv_len;
+ for (bv_offset = bvec.bv_offset;
+ bv_offset < bvec.bv_offset + bvec.bv_len;
bv_offset += fs_info->sectorsize, total_sector_nr++) {
u8 csum_buf[BTRFS_CSUM_SIZE];
u8 *expected_csum = rbio->csum_buf +
@@ -1479,7 +1479,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
if (!test_bit(total_sector_nr, rbio->csum_bitmap))
continue;

- ret = btrfs_check_sector_csum(fs_info, bvec->bv_page,
+ ret = btrfs_check_sector_csum(fs_info, bvec.bv_page,
bv_offset, csum_buf, expected_csum);
if (ret < 0)
set_bit(total_sector_nr, rbio->error_bitmap);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index f1708c77a9..1fd0f01d11 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1651,11 +1651,11 @@ static void z_erofs_decompressqueue_endio(struct bio *bio)
{
struct z_erofs_decompressqueue *q = bio->bi_private;
blk_status_t err = bio->bi_status;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;

DBG_BUGON(PageUptodate(page));
DBG_BUGON(z_erofs_page_is_invalidated(page));
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1e4db96a04..81a1cc4518 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -99,15 +99,15 @@ static void buffer_io_error(struct buffer_head *bh)

static void ext4_finish_bio(struct bio *bio)
{
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;
struct page *bounce_page = NULL;
struct buffer_head *bh, *head;
- unsigned bio_start = bvec->bv_offset;
- unsigned bio_end = bio_start + bvec->bv_len;
+ unsigned bio_start = bvec.bv_offset;
+ unsigned bio_end = bio_start + bvec.bv_len;
unsigned under_io = 0;
unsigned long flags;

diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c61dc8a7c0..ce42b3d5c9 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -69,11 +69,11 @@ struct bio_post_read_ctx {
static void __read_end_io(struct bio *bio)
{
struct page *page;
- struct bio_vec *bv;
+ struct bio_vec bv;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bv, bio, iter_all) {
- page = bv->bv_page;
+ page = bv.bv_page;

if (bio->bi_status)
ClearPageUptodate(page);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 06b552a0ab..e44bd8586f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -139,12 +139,12 @@ struct bio_post_read_ctx {
*/
static void f2fs_finish_read_bio(struct bio *bio, bool in_task)
{
- struct bio_vec *bv;
+ struct bio_vec bv;
struct bvec_iter_all iter_all;
struct bio_post_read_ctx *ctx = bio->bi_private;

bio_for_each_segment_all(bv, bio, iter_all) {
- struct page *page = bv->bv_page;
+ struct page *page = bv.bv_page;

if (f2fs_is_compressed_page(page)) {
if (ctx && !ctx->decompression_attempted)
@@ -189,11 +189,11 @@ static void f2fs_verify_bio(struct work_struct *work)
* as those were handled separately by f2fs_end_read_compressed_page().
*/
if (may_have_compressed_pages) {
- struct bio_vec *bv;
+ struct bio_vec bv;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bv, bio, iter_all) {
- struct page *page = bv->bv_page;
+ struct page *page = bv.bv_page;

if (!f2fs_is_compressed_page(page) &&
!fsverity_verify_page(page)) {
@@ -241,13 +241,13 @@ static void f2fs_verify_and_finish_bio(struct bio *bio, bool in_task)
static void f2fs_handle_step_decompress(struct bio_post_read_ctx *ctx,
bool in_task)
{
- struct bio_vec *bv;
+ struct bio_vec bv;
struct bvec_iter_all iter_all;
bool all_compressed = true;
block_t blkaddr = ctx->fs_blkaddr;

bio_for_each_segment_all(bv, ctx->bio, iter_all) {
- struct page *page = bv->bv_page;
+ struct page *page = bv.bv_page;

if (f2fs_is_compressed_page(page))
f2fs_end_read_compressed_page(page, false, blkaddr,
@@ -327,7 +327,7 @@ static void f2fs_read_end_io(struct bio *bio)
static void f2fs_write_end_io(struct bio *bio)
{
struct f2fs_sb_info *sbi;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

iostat_update_and_unbind_ctx(bio);
@@ -337,7 +337,7 @@ static void f2fs_write_end_io(struct bio *bio)
bio->bi_status = BLK_STS_IOERR;

bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;
enum count_type type = WB_DATA_TYPE(page);

if (page_private_dummy(page)) {
@@ -583,7 +583,7 @@ static void __submit_merged_bio(struct f2fs_bio_info *io)
static bool __has_merged_page(struct bio *bio, struct inode *inode,
struct page *page, nid_t ino)
{
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

if (!bio)
@@ -593,7 +593,7 @@ static bool __has_merged_page(struct bio *bio, struct inode *inode,
return true;

bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *target = bvec->bv_page;
+ struct page *target = bvec.bv_page;

if (fscrypt_is_bounce_page(target)) {
target = fscrypt_pagecache_page(target);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 1902413d5d..7f62fe8eb7 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -202,7 +202,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
static void gfs2_end_log_write(struct bio *bio)
{
struct gfs2_sbd *sdp = bio->bi_private;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct page *page;
struct bvec_iter_all iter_all;

@@ -217,9 +217,9 @@ static void gfs2_end_log_write(struct bio *bio)
}

bio_for_each_segment_all(bvec, bio, iter_all) {
- page = bvec->bv_page;
+ page = bvec.bv_page;
if (page_has_buffers(page))
- gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
+ gfs2_end_log_write_bh(sdp, &bvec, bio->bi_status);
else
mempool_free(page, gfs2_page_pool);
}
@@ -395,11 +395,11 @@ static void gfs2_log_write_page(struct gfs2_sbd *sdp, struct page *page)
static void gfs2_end_log_read(struct bio *bio)
{
struct page *page;
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bvec, bio, iter_all) {
- page = bvec->bv_page;
+ page = bvec.bv_page;
if (bio->bi_status) {
int err = blk_status_to_errno(bio->bi_status);

diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 924361fa51..832572784e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -193,15 +193,15 @@ struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno)

static void gfs2_meta_read_endio(struct bio *bio)
{
- struct bio_vec *bvec;
+ struct bio_vec bvec;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bvec, bio, iter_all) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec.bv_page;
struct buffer_head *bh = page_buffers(page);
- unsigned int len = bvec->bv_len;
+ unsigned int len = bvec.bv_len;

- while (bh_offset(bh) < bvec->bv_offset)
+ while (bh_offset(bh) < bvec.bv_offset)
bh = bh->b_this_page;
do {
struct buffer_head *next = bh->b_this_page;
diff --git a/fs/mpage.c b/fs/mpage.c
index 22b9de5ddd..49505456ba 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -45,11 +45,11 @@
*/
static void mpage_end_io(struct bio *bio)
{
- struct bio_vec *bv;
+ struct bio_vec bv;
struct bvec_iter_all iter_all;

bio_for_each_segment_all(bv, bio, iter_all) {
- struct page *page = bv->bv_page;
+ struct page *page = bv.bv_page;
page_endio(page, bio_op(bio),
blk_status_to_errno(bio->bi_status));
}
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index bed3bb8b27..83e8b44518 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -35,30 +35,33 @@ static int copy_bio_to_actor(struct bio *bio,
int offset, int req_length)
{
void *actor_addr;
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;
int copied_bytes = 0;
int actor_offset = 0;
+ int bytes_to_copy;

squashfs_actor_nobuff(actor);
actor_addr = squashfs_first_page(actor);

- if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all)))
- return 0;
+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);

- while (copied_bytes < req_length) {
- int bytes_to_copy = min_t(int, bvec->bv_len - offset,
+ while (copied_bytes < req_length &&
+ iter.idx < bio->bi_vcnt) {
+ bvec = bio_iter_all_peek(bio, &iter);
+
+ bytes_to_copy = min_t(int, bvec.bv_len,
PAGE_SIZE - actor_offset);

bytes_to_copy = min_t(int, bytes_to_copy,
req_length - copied_bytes);
if (!IS_ERR(actor_addr))
- memcpy(actor_addr + actor_offset, bvec_virt(bvec) +
- offset, bytes_to_copy);
+ memcpy(actor_addr + actor_offset, bvec_virt(&bvec),
+ bytes_to_copy);

actor_offset += bytes_to_copy;
copied_bytes += bytes_to_copy;
- offset += bytes_to_copy;

if (actor_offset >= PAGE_SIZE) {
actor_addr = squashfs_next_page(actor);
@@ -66,11 +69,8 @@ static int copy_bio_to_actor(struct bio *bio,
break;
actor_offset = 0;
}
- if (offset >= bvec->bv_len) {
- if (!bio_next_segment(bio, &iter_all))
- break;
- offset = 0;
- }
+
+ bio_iter_all_advance(bio, &iter, bytes_to_copy);
}
squashfs_finish_page(actor);
return copied_bytes;
@@ -159,8 +159,10 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
* Metadata block.
*/
const u8 *data;
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;
+
+ bvec_iter_all_init(&iter);

if (index + 2 > msblk->bytes_used) {
res = -EIO;
@@ -170,21 +172,25 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
if (res)
goto out;

- if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+ bvec = bio_iter_all_peek(bio, &iter);
+
+ if (WARN_ON_ONCE(!bvec.bv_len)) {
res = -EIO;
goto out_free_bio;
}
/* Extract the length of the metadata block */
- data = bvec_virt(bvec);
+ data = bvec_virt(&bvec);
length = data[offset];
- if (offset < bvec->bv_len - 1) {
+ if (offset < bvec.bv_len - 1) {
length |= data[offset + 1] << 8;
} else {
- if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+ bio_iter_all_advance(bio, &iter, bvec.bv_len);
+
+ if (WARN_ON_ONCE(!bvec.bv_len)) {
res = -EIO;
goto out_free_bio;
}
- data = bvec_virt(bvec);
+ data = bvec_virt(&bvec);
length |= data[0] << 8;
}
bio_free_pages(bio);
diff --git a/fs/squashfs/lz4_wrapper.c b/fs/squashfs/lz4_wrapper.c
index 49797729f1..bd0dd787d2 100644
--- a/fs/squashfs/lz4_wrapper.c
+++ b/fs/squashfs/lz4_wrapper.c
@@ -92,20 +92,23 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
struct bio *bio, int offset, int length,
struct squashfs_page_actor *output)
{
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;
struct squashfs_lz4 *stream = strm;
void *buff = stream->input, *data;
int bytes = length, res;

- while (bio_next_segment(bio, &iter_all)) {
- int avail = min(bytes, ((int)bvec->bv_len) - offset);
+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);

- data = bvec_virt(bvec);
- memcpy(buff, data + offset, avail);
+ bio_for_each_segment_all_continue(bvec, bio, iter) {
+ unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+ memcpy(buff, bvec_virt(&bvec), avail);
buff += avail;
bytes -= avail;
- offset = 0;
+ if (!bytes)
+ break;
}

res = LZ4_decompress_safe(stream->input, stream->output,
diff --git a/fs/squashfs/lzo_wrapper.c b/fs/squashfs/lzo_wrapper.c
index d216aeefa8..bccfcfa12e 100644
--- a/fs/squashfs/lzo_wrapper.c
+++ b/fs/squashfs/lzo_wrapper.c
@@ -66,21 +66,24 @@ static int lzo_uncompress(struct squashfs_sb_info *msblk, void *strm,
struct bio *bio, int offset, int length,
struct squashfs_page_actor *output)
{
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
+ struct bio_vec bvec;
struct squashfs_lzo *stream = strm;
void *buff = stream->input, *data;
int bytes = length, res;
size_t out_len = output->length;

- while (bio_next_segment(bio, &iter_all)) {
- int avail = min(bytes, ((int)bvec->bv_len) - offset);
+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);

- data = bvec_virt(bvec);
- memcpy(buff, data + offset, avail);
+ bio_for_each_segment_all_continue(bvec, bio, iter) {
+ unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+ memcpy(buff, bvec_virt(&bvec), avail);
buff += avail;
bytes -= avail;
- offset = 0;
+ if (!bytes)
+ break;
}

res = lzo1x_decompress_safe(stream->input, (size_t)length,
diff --git a/fs/squashfs/xz_wrapper.c b/fs/squashfs/xz_wrapper.c
index 6c49481a2f..6cf0e11e3b 100644
--- a/fs/squashfs/xz_wrapper.c
+++ b/fs/squashfs/xz_wrapper.c
@@ -120,8 +120,7 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
struct bio *bio, int offset, int length,
struct squashfs_page_actor *output)
{
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
int total = 0, error = 0;
struct squashfs_xz *stream = strm;

@@ -136,26 +135,28 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
goto finish;
}

+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);
+
for (;;) {
enum xz_ret xz_err;

if (stream->buf.in_pos == stream->buf.in_size) {
- const void *data;
- int avail;
+ struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+ unsigned avail = min_t(unsigned, length, bvec.bv_len);

- if (!bio_next_segment(bio, &iter_all)) {
+ if (iter.idx >= bio->bi_vcnt) {
/* XZ_STREAM_END must be reached. */
error = -EIO;
break;
}

- avail = min(length, ((int)bvec->bv_len) - offset);
- data = bvec_virt(bvec);
length -= avail;
- stream->buf.in = data + offset;
+ stream->buf.in = bvec_virt(&bvec);
stream->buf.in_size = avail;
stream->buf.in_pos = 0;
- offset = 0;
+
+ bio_iter_all_advance(bio, &iter, avail);
}

if (stream->buf.out_pos == stream->buf.out_size) {
diff --git a/fs/squashfs/zlib_wrapper.c b/fs/squashfs/zlib_wrapper.c
index cbb7afe7bc..981ca5e410 100644
--- a/fs/squashfs/zlib_wrapper.c
+++ b/fs/squashfs/zlib_wrapper.c
@@ -53,8 +53,7 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
struct bio *bio, int offset, int length,
struct squashfs_page_actor *output)
{
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;
int zlib_init = 0, error = 0;
z_stream *stream = strm;

@@ -67,25 +66,28 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
goto finish;
}

+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);
+
for (;;) {
int zlib_err;

if (stream->avail_in == 0) {
- const void *data;
+ struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
int avail;

- if (!bio_next_segment(bio, &iter_all)) {
+ if (iter.idx >= bio->bi_vcnt) {
/* Z_STREAM_END must be reached. */
error = -EIO;
break;
}

- avail = min(length, ((int)bvec->bv_len) - offset);
- data = bvec_virt(bvec);
+ avail = min_t(unsigned, length, bvec.bv_len);
length -= avail;
- stream->next_in = data + offset;
+ stream->next_in = bvec_virt(&bvec);
stream->avail_in = avail;
- offset = 0;
+
+ bio_iter_all_advance(bio, &iter, avail);
}

if (stream->avail_out == 0) {
diff --git a/fs/squashfs/zstd_wrapper.c b/fs/squashfs/zstd_wrapper.c
index 0e407c4d8b..658e5d462a 100644
--- a/fs/squashfs/zstd_wrapper.c
+++ b/fs/squashfs/zstd_wrapper.c
@@ -68,8 +68,7 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
int error = 0;
zstd_in_buffer in_buf = { NULL, 0, 0 };
zstd_out_buffer out_buf = { NULL, 0, 0 };
- struct bvec_iter_all iter_all = {};
- struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+ struct bvec_iter_all iter;

stream = zstd_init_dstream(wksp->window_size, wksp->mem, wksp->mem_size);

@@ -85,25 +84,27 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
goto finish;
}

+ bvec_iter_all_init(&iter);
+ bio_iter_all_advance(bio, &iter, offset);
+
for (;;) {
size_t zstd_err;

if (in_buf.pos == in_buf.size) {
- const void *data;
- int avail;
+ struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+ unsigned avail = min_t(unsigned, length, bvec.bv_len);

- if (!bio_next_segment(bio, &iter_all)) {
+ if (iter.idx >= bio->bi_vcnt) {
error = -EIO;
break;
}

- avail = min(length, ((int)bvec->bv_len) - offset);
- data = bvec_virt(bvec);
length -= avail;
- in_buf.src = data + offset;
+ in_buf.src = bvec_virt(&bvec);
in_buf.size = avail;
in_buf.pos = 0;
- offset = 0;
+
+ bio_iter_all_advance(bio, &iter, avail);
}

if (out_buf.pos == out_buf.size) {
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3536f28c05..f86c7190c3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -78,22 +78,40 @@ static inline void *bio_data(struct bio *bio)
return NULL;
}

-static inline bool bio_next_segment(const struct bio *bio,
- struct bvec_iter_all *iter)
+static inline struct bio_vec bio_iter_all_peek(const struct bio *bio,
+ struct bvec_iter_all *iter)
{
- if (iter->idx >= bio->bi_vcnt)
- return false;
+ if (WARN_ON(iter->idx >= bio->bi_vcnt))
+ return (struct bio_vec) { NULL };

- bvec_advance(&bio->bi_io_vec[iter->idx], iter);
- return true;
+ return bvec_iter_all_peek(bio->bi_io_vec, iter);
+}
+
+static inline void bio_iter_all_advance(const struct bio *bio,
+ struct bvec_iter_all *iter,
+ unsigned bytes)
+{
+ bvec_iter_all_advance(bio->bi_io_vec, iter, bytes);
+
+ WARN_ON(iter->idx > bio->bi_vcnt ||
+ (iter->idx == bio->bi_vcnt && iter->done));
}

+#define bio_for_each_segment_all_continue(bvl, bio, iter) \
+ for (; \
+ iter.idx < bio->bi_vcnt && \
+ ((bvl = bio_iter_all_peek(bio, &iter)), true); \
+ bio_iter_all_advance((bio), &iter, bvl.bv_len))
+
/*
* drivers should _never_ use the all version - the bio may have been split
* before it got to the driver and the driver won't own all of it
*/
-#define bio_for_each_segment_all(bvl, bio, iter) \
- for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_segment_all(bvl, bio, iter) \
+ for (bvec_iter_all_init(&iter); \
+ iter.idx < (bio)->bi_vcnt && \
+ ((bvl = bio_iter_all_peek((bio), &iter)), true); \
+ bio_iter_all_advance((bio), &iter, bvl.bv_len))

static inline void bio_advance_iter(const struct bio *bio,
struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 555aae5448..635fb54143 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -85,12 +85,6 @@ struct bvec_iter {
current bvec */
} __packed;

-struct bvec_iter_all {
- struct bio_vec bv;
- int idx;
- unsigned done;
-};
-
/*
* various member access, note that bio_data should of course not be used
* on highmem page vectors
@@ -184,7 +178,10 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))

-/* for iterating one bio from start to end */
+/*
+ * bvec_iter_all: for advancing over a bio as it was originally created, but
+ * with the usual bio_for_each_segment interface - nonstandard, do not use:
+ */
#define BVEC_ITER_ALL_INIT (struct bvec_iter) \
{ \
.bi_sector = 0, \
@@ -193,33 +190,45 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
.bi_bvec_done = 0, \
}

-static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all)
+/*
+ * bvec_iter_all: for advancing over individual pages in a bio, as it was when
+ * it was first created:
+ */
+struct bvec_iter_all {
+ int idx;
+ unsigned done;
+};
+
+static inline void bvec_iter_all_init(struct bvec_iter_all *iter_all)
{
iter_all->done = 0;
iter_all->idx = 0;
+}

- return &iter_all->bv;
+static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
+ struct bvec_iter_all *iter)
+{
+ struct bio_vec bv = bvec[iter->idx];
+
+ bv.bv_offset += iter->done;
+ bv.bv_len -= iter->done;
+
+ bv.bv_page += bv.bv_offset >> PAGE_SHIFT;
+ bv.bv_offset &= ~PAGE_MASK;
+ bv.bv_len = min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
+
+ return bv;
}

-static inline void bvec_advance(const struct bio_vec *bvec,
- struct bvec_iter_all *iter_all)
+static inline void bvec_iter_all_advance(const struct bio_vec *bvec,
+ struct bvec_iter_all *iter,
+ unsigned bytes)
{
- struct bio_vec *bv = &iter_all->bv;
-
- if (iter_all->done) {
- bv->bv_page++;
- bv->bv_offset = 0;
- } else {
- bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
- bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;
- }
- bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
- bvec->bv_len - iter_all->done);
- iter_all->done += bv->bv_len;
+ iter->done += bytes;

- if (iter_all->done == bvec->bv_len) {
- iter_all->idx++;
- iter_all->done = 0;
+ while (iter->done && iter->done >= bvec[iter->idx].bv_len) {
+ iter->done -= bvec[iter->idx].bv_len;
+ iter->idx++;
}
}

--
2.40.1

2023-05-09 17:07:13

by Kent Overstreet

Subject: [PATCH 15/32] bcache: move closures to lib/

From: Kent Overstreet <[email protected]>

Prep work for bcachefs: being a fork of bcache, it also uses closures.
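
For context, a minimal sketch of what a closure user looks like - a
refcount plus a continuation that runs when the count hits zero. The
struct and function names here are made up; the API is what this patch
moves to lib/:

#include <linux/closure.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_op {
	struct closure cl;
};

static void my_op_done(struct closure *cl)
{
	struct my_op *op = container_of(cl, struct my_op, cl);

	kfree(op);
}

static void my_op_start(struct my_op *op)
{
	closure_init(&op->cl, NULL);
	/*
	 * Kick off async work here; each in-flight piece takes a ref with
	 * closure_get() and drops it with closure_put() when it completes.
	 */
	continue_at(&op->cl, my_op_done, system_wq);
}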

Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
---
drivers/md/bcache/Kconfig | 10 +-----
drivers/md/bcache/Makefile | 4 +--
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/super.c | 1 -
drivers/md/bcache/util.h | 3 +-
.../md/bcache => include/linux}/closure.h | 17 +++++----
lib/Kconfig | 3 ++
lib/Kconfig.debug | 9 +++++
lib/Makefile | 2 ++
{drivers/md/bcache => lib}/closure.c | 35 +++++++++----------
10 files changed, 43 insertions(+), 43 deletions(-)
rename {drivers/md/bcache => include/linux}/closure.h (97%)
rename {drivers/md/bcache => lib}/closure.c (88%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 529c9d04e9..b2d10063d3 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -4,6 +4,7 @@ config BCACHE
tristate "Block device as cache"
select BLOCK_HOLDER_DEPRECATED if SYSFS
select CRC64
+ select CLOSURES
help
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
@@ -19,15 +20,6 @@ config BCACHE_DEBUG
Enables extra debugging tools, allows expensive runtime checks to be
turned on.

-config BCACHE_CLOSURES_DEBUG
- bool "Debug closures"
- depends on BCACHE
- select DEBUG_FS
- help
- Keeps all active closures in a linked list and provides a debugfs
- interface to list them, which makes it possible to see asynchronous
- operations that get stuck.
-
config BCACHE_ASYNC_REGISTRATION
bool "Asynchronous device registration"
depends on BCACHE
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676..054e8a33a7 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,6 +2,6 @@

obj-$(CONFIG_BCACHE) += bcache.o

-bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\
- io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
+bcache-y := alloc.o bset.o btree.o debug.o extents.o io.o\
+ journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
util.o writeback.o features.o
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index aebb7ef10e..c8b4914ad8 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -179,6 +179,7 @@
#define pr_fmt(fmt) "bcache: %s() " fmt, __func__

#include <linux/bio.h>
+#include <linux/closure.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
@@ -192,7 +193,6 @@
#include "bcache_ondisk.h"
#include "bset.h"
#include "util.h"
-#include "closure.h"

struct bucket {
atomic_t pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ba3909bb6b..31b68a1b87 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2912,7 +2912,6 @@ static int __init bcache_init(void)
goto err;

bch_debug_init();
- closure_debug_init();

bcache_is_reboot = false;

diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 6f3cb7c921..f61ab1bada 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
#define _BCACHE_UTIL_H

#include <linux/blkdev.h>
+#include <linux/closure.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
#include <linux/workqueue.h>
#include <linux/crc64.h>

-#include "closure.h"
-
struct closure;

#ifdef CONFIG_BCACHE_DEBUG
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index c88cdc4ae4..0ec9e7bc8d 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {

atomic_t remaining;

-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
#define CLOSURE_MAGIC_DEAD 0xc054dead
#define CLOSURE_MAGIC_ALIVE 0xc054a11e

@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
__closure_sync(cl);
}

-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES

-void closure_debug_init(void);
void closure_debug_create(struct closure *cl);
void closure_debug_destroy(struct closure *cl);

#else

-static inline void closure_debug_init(void) {}
static inline void closure_debug_create(struct closure *cl) {}
static inline void closure_debug_destroy(struct closure *cl) {}

@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}

static inline void closure_set_ip(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->ip = _THIS_IP_;
#endif
}

static inline void closure_set_ret_ip(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->ip = _RET_IP_;
#endif
}

static inline void closure_set_waiting(struct closure *cl, unsigned long f)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->waiting_on = f;
#endif
}
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
*/
BUILD_BUG_ON(offsetof(struct closure, fn)
!= offsetof(struct work_struct, func));
+
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
BUG_ON(!queue_work(wq, &cl->work));
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
*/
static inline void closure_get(struct closure *cl)
{
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
BUG_ON((atomic_inc_return(&cl->remaining) &
CLOSURE_REMAINING_MASK) <= 1);
#else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
*/
static inline void closure_init(struct closure *cl, struct closure *parent)
{
- memset(cl, 0, sizeof(struct closure));
+ cl->fn = NULL;
cl->parent = parent;
if (parent)
closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index ce2abffb9e..1aa1c15a83 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -504,6 +504,9 @@ config ASSOCIATIVE_ARRAY

for more information.

+config CLOSURES
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 39d1d93164..3dba7a9aff 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
This is a relatively cheap check but if you care about maximum
performance, say N.

+config DEBUG_CLOSURES
+ bool "Debug closures (bcache async widgits)"
+ depends on CLOSURES
+ select DEBUG_FS
+ help
+ Keeps all active closures in a linked list and provides a debugfs
+ interface to list them, which makes it possible to see asynchronous
+ operations that get stuck.
+
config BUG_ON_DATA_CORRUPTION
bool "Trigger a BUG when data corruption is detected"
select DEBUG_LIST
diff --git a/lib/Makefile b/lib/Makefile
index baf2821f7a..fd13ca6e0e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -245,6 +245,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o

obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o

+obj-$(CONFIG_CLOSURES) += closure.o
+
obj-$(CONFIG_DQL) += dynamic_queue_limits.o

obj-$(CONFIG_GLOB) += glob.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 88%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index d8d9394a6b..b38ded00b9 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
* Copyright 2012 Google, Inc.
*/

+#include <linux/closure.h>
#include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/seq_file.h>
#include <linux/sched/debug.h>

-#include "closure.h"
-
static inline void closure_put_after_sub(struct closure *cl, int flags)
{
int r = flags & CLOSURE_REMAINING_MASK;
@@ -45,6 +44,7 @@ void closure_sub(struct closure *cl, int v)
{
closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
}
+EXPORT_SYMBOL(closure_sub);

/*
* closure_put - decrement a closure's refcount
@@ -53,6 +53,7 @@ void closure_put(struct closure *cl)
{
closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
}
+EXPORT_SYMBOL(closure_put);

/*
* closure_wake_up - wake up all closures on a wait list, without memory barrier
@@ -74,6 +75,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
closure_sub(cl, CLOSURE_WAITING + 1);
}
}
+EXPORT_SYMBOL(__closure_wake_up);

/**
* closure_wait - add a closure to a waitlist
@@ -93,6 +95,7 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)

return true;
}
+EXPORT_SYMBOL(closure_wait);

struct closure_syncer {
struct task_struct *task;
@@ -127,8 +130,9 @@ void __sched __closure_sync(struct closure *cl)

__set_current_state(TASK_RUNNING);
}
+EXPORT_SYMBOL(__closure_sync);

-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES

static LIST_HEAD(closure_list);
static DEFINE_SPINLOCK(closure_list_lock);
@@ -144,6 +148,7 @@ void closure_debug_create(struct closure *cl)
list_add(&cl->all, &closure_list);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
+EXPORT_SYMBOL(closure_debug_create);

void closure_debug_destroy(struct closure *cl)
{
@@ -156,8 +161,7 @@ void closure_debug_destroy(struct closure *cl)
list_del(&cl->all);
spin_unlock_irqrestore(&closure_list_lock, flags);
}
-
-static struct dentry *closure_debug;
+EXPORT_SYMBOL(closure_debug_destroy);

static int debug_show(struct seq_file *f, void *data)
{
@@ -181,7 +185,7 @@ static int debug_show(struct seq_file *f, void *data)
seq_printf(f, " W %pS\n",
(void *) cl->waiting_on);

- seq_printf(f, "\n");
+ seq_puts(f, "\n");
}

spin_unlock_irq(&closure_list_lock);
@@ -190,18 +194,11 @@ static int debug_show(struct seq_file *f, void *data)

DEFINE_SHOW_ATTRIBUTE(debug);

-void __init closure_debug_init(void)
+static int __init closure_debug_init(void)
{
- if (!IS_ERR_OR_NULL(bcache_debug))
- /*
- * it is unnecessary to check return value of
- * debugfs_create_file(), we should not care
- * about this.
- */
- closure_debug = debugfs_create_file(
- "closures", 0400, bcache_debug, NULL, &debug_fops);
+ debugfs_create_file("closures", 0400, NULL, NULL, &debug_fops);
+ return 0;
}
-#endif
+late_initcall(closure_debug_init)

-MODULE_AUTHOR("Kent Overstreet <[email protected]>");
-MODULE_LICENSE("GPL");
+#endif
--
2.40.1

2023-05-09 17:07:48

by Kent Overstreet

Subject: [PATCH 02/32] locking/lockdep: lock_class_is_held()

From: Kent Overstreet <[email protected]>

This patch adds lock_class_is_held(), which can be used to assert that a
particular type of lock is not held.
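
A usage sketch (the key, the lock registration and the function are
hypothetical; lock_class_is_held() itself is what this patch adds):

#include <linux/lockdep.h>
#include <linux/bug.h>

/* key shared by all locks of this subsystem, e.g. registered with
 * lockdep_set_class(&lock, &my_lock_key) at init time */
static struct lock_class_key my_lock_key;

static void my_reclaim_path(void)
{
	/* assert no lock of this class is held before we block */
	WARN_ON_ONCE(lock_class_is_held(&my_lock_key));

	/* ... may sleep / take locks of this class in another order ... */
}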

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
---
include/linux/lockdep.h | 4 ++++
kernel/locking/lockdep.c | 20 ++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1023f349af..e858c288c7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -339,6 +339,8 @@ extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);
#define lockdep_repin_lock(l,c) lock_repin_lock(&(l)->dep_map, (c))
#define lockdep_unpin_lock(l,c) lock_unpin_lock(&(l)->dep_map, (c))

+int lock_class_is_held(struct lock_class_key *key);
+
#else /* !CONFIG_LOCKDEP */

static inline void lockdep_init_task(struct task_struct *task)
@@ -427,6 +429,8 @@ extern int lockdep_is_held(const void *);
#define lockdep_repin_lock(l, c) do { (void)(l); (void)(c); } while (0)
#define lockdep_unpin_lock(l, c) do { (void)(l); (void)(c); } while (0)

+static inline int lock_class_is_held(struct lock_class_key *key) { return 0; }
+
#endif /* !LOCKDEP */

enum xhlock_context_t {
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 50d4863974..e631464070 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6487,6 +6487,26 @@ void debug_check_no_locks_held(void)
}
EXPORT_SYMBOL_GPL(debug_check_no_locks_held);

+#ifdef CONFIG_LOCKDEP
+int lock_class_is_held(struct lock_class_key *key)
+{
+ struct task_struct *curr = current;
+ struct held_lock *hlock;
+
+ if (unlikely(!debug_locks))
+ return 0;
+
+ for (hlock = curr->held_locks;
+ hlock < curr->held_locks + curr->lockdep_depth;
+ hlock++)
+ if (hlock->instance->key == key)
+ return 1;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(lock_class_is_held);
+#endif
+
#ifdef __KERNEL__
void debug_show_all_locks(void)
{
--
2.40.1

2023-05-09 17:07:49

by Kent Overstreet

Subject: [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek()

From: Kent Overstreet <[email protected]>

When we started spreading new inode numbers throughout most of the
64-bit inode space, that triggered some corner case bugs, in particular
integer overflows in the radix tree iterator code. Oops.
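
The failure mode in isolation, as a sketch - this helper is purely
illustrative and mirrors what __genradix_iter_advance() does after
this patch:

#include <linux/limits.h>
#include <linux/types.h>

static inline void advance_saturating(size_t *offset, size_t obj_size)
{
	if (*offset + obj_size < *offset) {
		/* the next step would wrap around; saturate instead,
		 * so the iteration terminates rather than looping */
		*offset = SIZE_MAX;
		return;
	}

	*offset += obj_size;
}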

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/generic-radix-tree.h | 6 ++++++
lib/generic-radix-tree.c | 17 ++++++++++++++---
2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 107613f7d7..63080822dc 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -184,6 +184,12 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
static inline void __genradix_iter_advance(struct genradix_iter *iter,
size_t obj_size)
{
+ if (iter->offset + obj_size < iter->offset) {
+ iter->offset = SIZE_MAX;
+ iter->pos = SIZE_MAX;
+ return;
+ }
+
iter->offset += obj_size;

if (!is_power_of_2(obj_size) &&
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index f25eb111c0..7dfa88282b 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -166,6 +166,10 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
struct genradix_root *r;
struct genradix_node *n;
unsigned level, i;
+
+ if (iter->offset == SIZE_MAX)
+ return NULL;
+
restart:
r = READ_ONCE(radix->root);
if (!r)
@@ -184,10 +188,17 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
(GENRADIX_ARY - 1);

while (!n->children[i]) {
+ size_t objs_per_ptr = genradix_depth_size(level);
+
+ if (iter->offset + objs_per_ptr < iter->offset) {
+ iter->offset = SIZE_MAX;
+ iter->pos = SIZE_MAX;
+ return NULL;
+ }
+
i++;
- iter->offset = round_down(iter->offset +
- genradix_depth_size(level),
- genradix_depth_size(level));
+ iter->offset = round_down(iter->offset + objs_per_ptr,
+ objs_per_ptr);
iter->pos = (iter->offset >> PAGE_SHIFT) *
objs_per_page;
if (i == GENRADIX_ARY)
--
2.40.1

2023-05-09 17:07:56

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

From: Kent Overstreet <[email protected]>

This is used by bcachefs to fix a page cache coherency issue with
O_DIRECT writes.

Also relevant: mapping->invalidate_lock, see below.

O_DIRECT writes (and other filesystem operations that modify file data
while bypassing the page cache) need to shoot down ranges of the page
cache - and additionally, need locking to prevent those pages from
being pulled back in.

But O_DIRECT writes invoke the page fault handler (via get_user_pages),
and the page fault handler will need to take that same lock - this is a
classic recursive deadlock if userspace has mmaped the file they're DIO
writing to and uses those pages for the buffer to write from, and it's a
lock ordering deadlock in general.

Thus we need a way to signal from the dio code to the page fault handler
when we are already holding the pagecache add lock on an address space -
this patch just adds a member to task_struct for this purpose. For now
only bcachefs is implementing this locking, though it may be moved out
of bcachefs and made available to other filesystems in the future.

---------------------------------

The closest current VFS equivalent is mapping->invalidate_lock, which
comes from XFS. However, it's not used by direct IO. Instead, direct IO
paths shoot down the page cache twice - before starting the IO and at
the end, and they're still technically racy w.r.t. page cache coherency.

This is a more complete approach: in the future we might consider
replacing mapping->invalidate_lock with the bcachefs code.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: [email protected]
---
include/linux/sched.h | 1 +
init/init_task.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b..f2a56f64f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -869,6 +869,7 @@ struct task_struct {

struct mm_struct *mm;
struct mm_struct *active_mm;
+ struct address_space *faults_disabled_mapping;

int exit_state;
int exit_code;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe..f703116e05 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -85,6 +85,7 @@ struct task_struct init_task
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
+ .faults_disabled_mapping = NULL,
.restart_block = {
.fn = do_no_restart_syscall,
},
--
2.40.1
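
A rough sketch of the protocol the new field enables - the helper names
and the SIGBUS fallback are illustrative only, not bcachefs's actual
code:

/* dio write side, after taking the pagecache add lock: */
current->faults_disabled_mapping = mapping;
ret = do_the_write(req);                /* get_user_pages() may fault */
current->faults_disabled_mapping = NULL;

/* fault side: */
static vm_fault_t my_fs_fault(struct vm_fault *vmf)
{
        struct address_space *mapping = vmf->vma->vm_file->f_mapping;

        if (current->faults_disabled_mapping == mapping) {
                /*
                 * Recursing from our own dio write to this mapping:
                 * taking the pagecache add lock again would deadlock,
                 * so take a fallback path instead (illustrative).
                 */
                return VM_FAULT_SIGBUS;
        }
        return filemap_fault(vmf);
}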

2023-05-09 17:08:03

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 05/32] MAINTAINERS: Add entry for six locks

SIX locks are a new locking primitive, shared/intent/exclusive,
currently used by bcachefs but available for other uses. Mark them as
maintained.

Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c6545eb541..3fc37de3d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19166,6 +19166,14 @@ S: Maintained
W: http://www.winischhofer.at/linuxsisusbvga.shtml
F: drivers/usb/misc/sisusbvga/

+SIX LOCKS
+M: Kent Overstreet <[email protected]>
+L: [email protected]
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: include/linux/six.h
+F: kernel/locking/six.c
+
SL28 CPLD MFD DRIVER
M: Michael Walle <[email protected]>
S: Maintained
--
2.40.1

2023-05-09 17:08:03

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 18/32] closures: closure_nr_remaining()

Factor out a new helper, which returns the number of events outstanding.

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/closure.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 36b4a83f9b..722a586bb2 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -172,6 +172,11 @@ void __closure_wake_up(struct closure_waitlist *list);
bool closure_wait(struct closure_waitlist *list, struct closure *cl);
void __closure_sync(struct closure *cl);

+static inline unsigned closure_nr_remaining(struct closure *cl)
+{
+ return atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK;
+}
+
/**
* closure_sync - sleep until a closure a closure has nothing left to wait on
*
@@ -180,7 +185,7 @@ void __closure_sync(struct closure *cl);
*/
static inline void closure_sync(struct closure *cl)
{
- if ((atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK) != 1)
+ if (closure_nr_remaining(cl) != 1)
__closure_sync(cl);
}

--
2.40.1

2023-05-09 17:08:24

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters written

From: Kent Overstreet <[email protected]>

printbuf now needs to know the number of characters that would have been
written if the buffer was too small, like snprintf(); this changes
string_get_size() to return the return value of snprintf().

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/string_helpers.h | 4 ++--
lib/string_helpers.c | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index fae6beaaa2..44148f8feb 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -23,8 +23,8 @@ enum string_size_units {
STRING_UNITS_2, /* use binary powers of 2^10 */
};

-void string_get_size(u64 size, u64 blk_size, enum string_size_units units,
- char *buf, int len);
+int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+ char *buf, int len);

int parse_int_array_user(const char __user *from, size_t count, int **array);

diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e0..ca36ceba0e 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -32,8 +32,8 @@
* at least 9 bytes and will always be zero terminated.
*
*/
-void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
- char *buf, int len)
+int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+ char *buf, int len)
{
static const char *const units_10[] = {
"B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
@@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];

- snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ return snprintf(buf, len, "%u%s %s", (u32)size,
+ tmp, unit);
}
EXPORT_SYMBOL(string_get_size);

--
2.40.1
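
With the new return value a caller can detect truncation exactly as
with snprintf(); a short sketch (the buffer size and the nr_sectors
variable are arbitrary):

char buf[16];
int n = string_get_size(nr_sectors, 512, STRING_UNITS_2,
                        buf, sizeof(buf));

if (n >= (int) sizeof(buf))
        pr_warn("size string truncated, %d characters needed\n", n);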

2023-05-09 17:08:28

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 20/32] vfs: factor out inode hash head calculation

From: Dave Chinner <[email protected]>

In preparation for changing the inode hash table implementation.

Signed-off-by: Dave Chinner <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
---
fs/inode.c | 44 +++++++++++++++++++++++++-------------------
1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f13..41a10bcda1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,22 @@ static unsigned int i_hash_shift __read_mostly;
static struct hlist_head *inode_hashtable __read_mostly;
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);

+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+ unsigned long tmp;
+
+ tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+ L1_CACHE_BYTES;
+ tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
+ return tmp & i_hash_mask;
+}
+
+static inline struct hlist_head *i_hash_head(struct super_block *sb,
+ unsigned int hashval)
+{
+ return inode_hashtable + hash(sb, hashval);
+}
+
/*
* Empty aops. Can be used for the cases where the user does not
* define any of the address_space operations.
@@ -506,16 +522,6 @@ static inline void inode_sb_list_del(struct inode *inode)
}
}

-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
- unsigned long tmp;
-
- tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
- L1_CACHE_BYTES;
- tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
- return tmp & i_hash_mask;
-}
-
/**
* __insert_inode_hash - hash an inode
* @inode: unhashed inode
@@ -1163,7 +1169,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
struct inode *old;

again:
@@ -1267,7 +1273,7 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;
again:
spin_lock(&inode_hash_lock);
@@ -1335,7 +1341,7 @@ EXPORT_SYMBOL(iget_locked);
*/
static int test_inode_iunique(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *b = inode_hashtable + hash(sb, ino);
+ struct hlist_head *b = i_hash_head(sb, ino);
struct inode *inode;

hlist_for_each_entry_rcu(inode, b, i_hash) {
@@ -1422,7 +1428,7 @@ EXPORT_SYMBOL(igrab);
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode;

spin_lock(&inode_hash_lock);
@@ -1477,7 +1483,7 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;
again:
spin_lock(&inode_hash_lock);
@@ -1526,7 +1532,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
void *),
void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode, *ret_inode = NULL;
int mval;

@@ -1571,7 +1577,7 @@ EXPORT_SYMBOL(find_inode_nowait);
struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct hlist_head *head = i_hash_head(sb, hashval);
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1609,7 +1615,7 @@ EXPORT_SYMBOL(find_inode_rcu);
struct inode *find_inode_by_ino_rcu(struct super_block *sb,
unsigned long ino)
{
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = i_hash_head(sb, ino);
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1629,7 +1635,7 @@ int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = i_hash_head(sb, ino);

while (1) {
struct inode *old = NULL;
--
2.40.1

2023-05-09 17:08:35

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 09/32] block: Add some exports for bcachefs

From: Kent Overstreet <[email protected]>

- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
- blk_status_to_str() - error messages
- bio_add_folio() - this should definitely be exported for everyone,
it's the modern version of bio_add_page()

Signed-off-by: Kent Overstreet <[email protected]>
Cc: [email protected]
Cc: Jens Axboe <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
block/bio.c | 3 +++
block/blk-core.c | 1 +
block/blk.h | 1 -
include/linux/blkdev.h | 1 +
4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index fd11614bba..1e75840d17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1159,6 +1159,7 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
return false;
return bio_add_page(bio, &folio->page, len, off) > 0;
}
+EXPORT_SYMBOL(bio_add_folio);

void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
@@ -1480,6 +1481,7 @@ void bio_set_pages_dirty(struct bio *bio)
set_page_dirty_lock(bvec->bv_page);
}
}
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);

/*
* bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
@@ -1539,6 +1541,7 @@ void bio_check_pages_dirty(struct bio *bio)
spin_unlock_irqrestore(&bio_dirty_lock, flags);
schedule_work(&bio_dirty_work);
}
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);

static inline bool bio_remaining_done(struct bio *bio)
{
diff --git a/block/blk-core.c b/block/blk-core.c
index 42926e6cb8..f19bcc684b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -205,6 +205,7 @@ const char *blk_status_to_str(blk_status_t status)
return "<null>";
return blk_errors[idx].name;
}
+EXPORT_SYMBOL_GPL(blk_status_to_str);

/**
* blk_sync_queue - cancel any pending callbacks on a queue
diff --git a/block/blk.h b/block/blk.h
index cc4e8873df..cc04dc73e9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -259,7 +259,6 @@ static inline void blk_integrity_del(struct gendisk *disk)

unsigned long blk_rq_timeout(unsigned long timeout);
void blk_add_timer(struct request *req);
-const char *blk_status_to_str(blk_status_t status);

bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 941304f174..7cac183112 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,6 +867,7 @@ extern const char *blk_op_str(enum req_op op);

int blk_status_to_errno(blk_status_t status);
blk_status_t errno_to_blk_status(int errno);
+const char *blk_status_to_str(blk_status_t status);

/* only poll the hardware once, don't continue until a completion was found */
#define BLK_POLL_ONESHOT (1 << 0)
--
2.40.1
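
As an illustration of the bio_add_folio() export - the surrounding
writeback loop, bdev and GFP flags below are assumed, not taken from
this patch:

if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
        /* bio is full: submit it and retry with a fresh one */
        submit_bio(bio);
        bio = bio_alloc(bdev, BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOFS);
        bio_add_folio(bio, folio, folio_size(folio), 0);
}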

2023-05-09 17:08:42

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 30/32] lib: Export errname

The bcachefs module now wants this and it seems sensible.

Signed-off-by: Christopher James Halse Rogers <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
lib/errname.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/lib/errname.c b/lib/errname.c
index 67739b174a..dd1b998552 100644
--- a/lib/errname.c
+++ b/lib/errname.c
@@ -228,3 +228,4 @@ const char *errname(int err)

return err > 0 ? name + 1 : name;
}
+EXPORT_SYMBOL(errname);
--
2.40.1
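
Typical module use is building readable error messages; a sketch (the
message text is illustrative, and errname() may return NULL for errnos
it doesn't know about, hence the fallback):

pr_err("journal write error: %s\n",
       errname(ret) ?: "(unknown error)");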

2023-05-09 17:08:48

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 22/32] vfs: inode cache conversion to hash-bl

From: Dave Chinner <[email protected]>

Because scalability of the global inode_hash_lock really, really
sucks.

32-way concurrent create on a couple of different filesystems
before:

- 52.13% 0.04% [kernel] [k] ext4_create
- 52.09% ext4_create
- 41.03% __ext4_new_inode
- 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
- do_raw_spin_lock
- 24.97% __pv_queued_spin_lock_slowpath

- 72.33% 0.02% [kernel] [k] do_filp_open
- 72.31% do_filp_open
- 72.28% path_openat
- 57.03% bch2_create
- 56.46% __bch2_create
- 40.43% inode_insert5
- 36.07% _raw_spin_lock
- do_raw_spin_lock
35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode

Convert the inode hash table to a RCU-aware hash-bl table just like
the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because the hash value that is used to hash the inode is
generated from the inode itself - filesystems can provide this
themselves so we have to either store the hash or the head pointer
in the inode to be able to find the right list head for removal...

Same workload after:

Signed-off-by: Dave Chinner <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
---
fs/inode.c | 200 ++++++++++++++++++++++++++++-----------------
include/linux/fs.h | 9 +-
2 files changed, 132 insertions(+), 77 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 41a10bcda1..d446b054ec 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -57,8 +57,7 @@

static unsigned int i_hash_mask __read_mostly;
static unsigned int i_hash_shift __read_mostly;
-static struct hlist_head *inode_hashtable __read_mostly;
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
+static struct hlist_bl_head *inode_hashtable __read_mostly;

static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
@@ -70,7 +69,7 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
return tmp & i_hash_mask;
}

-static inline struct hlist_head *i_hash_head(struct super_block *sb,
+static inline struct hlist_bl_head *i_hash_head(struct super_block *sb,
unsigned int hashval)
{
return inode_hashtable + hash(sb, hashval);
@@ -433,7 +432,7 @@ EXPORT_SYMBOL(address_space_init_once);
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_HLIST_NODE(&inode->i_hash);
+ INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_wb_list);
@@ -522,6 +521,17 @@ static inline void inode_sb_list_del(struct inode *inode)
}
}

+/*
+ * Ensure that we store the hash head in the inode when we insert the inode into
+ * the hlist_bl_head...
+ */
+static inline void
+__insert_inode_hash_head(struct inode *inode, struct hlist_bl_head *b)
+{
+ hlist_bl_add_head_rcu(&inode->i_hash, b);
+ inode->i_hash_head = b;
+}
+
/**
* __insert_inode_hash - hash an inode
* @inode: unhashed inode
@@ -532,13 +542,13 @@ static inline void inode_sb_list_del(struct inode *inode)
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);

- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
spin_lock(&inode->i_lock);
- hlist_add_head_rcu(&inode->i_hash, b);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
}
EXPORT_SYMBOL(__insert_inode_hash);

@@ -550,11 +560,44 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_hash_lock);
- spin_lock(&inode->i_lock);
- hlist_del_init_rcu(&inode->i_hash);
- spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ struct hlist_bl_head *b = inode->i_hash_head;
+
+ /*
+ * There are some callers that come through here without synchronisation
+ * and potentially with multiple references to the inode. Hence we have
+ * to handle the case that we might race with a remove and insert to a
+ * different list. Coda, in particular, seems to have a userspace API
+ * that can directly trigger "unhash/rehash to different list" behaviour
+ * without any serialisation at all.
+ *
+ * Hence we have to handle the situation where the inode->i_hash_head
+ * might point to a different list than what we expect, indicating that
+ * we raced with another unhash and potentially a new insertion. This
+ * means we have to retest the head once we have everything locked up
+ * and loop again if it doesn't match.
+ */
+ while (b) {
+ hlist_bl_lock(b);
+ spin_lock(&inode->i_lock);
+ if (b != inode->i_hash_head) {
+ hlist_bl_unlock(b);
+ b = inode->i_hash_head;
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+ /*
+ * Need to set the pprev pointer to NULL after list removal so
+ * that both RCU traversals and hlist_bl_unhashed() work
+ * correctly at this point.
+ */
+ hlist_bl_del_rcu(&inode->i_hash);
+ inode->i_hash.pprev = NULL;
+ inode->i_hash_head = NULL;
+ spin_unlock(&inode->i_lock);
+ hlist_bl_unlock(b);
+ break;
+ }
+
}
EXPORT_SYMBOL(__remove_inode_hash);

@@ -904,26 +947,28 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
return freed;
}

-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+ struct inode *inode);
/*
* Called with the inode lock held.
*/
static struct inode *find_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct hlist_bl_head *b,
int (*test)(struct inode *, void *),
void *data)
{
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(b, inode);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
@@ -942,19 +987,20 @@ static struct inode *find_inode(struct super_block *sb,
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct hlist_bl_head *b, unsigned long ino)
{
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(b, inode);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
@@ -1162,25 +1208,25 @@ EXPORT_SYMBOL(unlock_two_nondirectories);
* return it locked, hashed, and with the I_NEW flag set. The file system gets
* to fill it in before unlocking it via unlock_new_inode().
*
- * Note both @test and @set are called with the inode_hash_lock held, so can't
- * sleep.
+ * Note both @test and @set are called with the inode hash chain lock held,
+ * so can't sleep.
*/
struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
struct inode *old;

again:
- spin_lock(&inode_hash_lock);
- old = find_inode(inode->i_sb, head, test, data);
+ hlist_bl_lock(b);
+ old = find_inode(inode->i_sb, b, test, data);
if (unlikely(old)) {
/*
* Uhhuh, somebody else created the same inode under us.
* Use the old inode instead of the preallocated one.
*/
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
if (IS_ERR(old))
return NULL;
wait_on_inode(old);
@@ -1202,7 +1248,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
*/
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);

/*
@@ -1212,7 +1258,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
if (list_empty(&inode->i_sb_list))
inode_sb_list_add(inode);
unlock:
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);

return inode;
}
@@ -1273,12 +1319,12 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode_fast(sb, b, ino);
+ hlist_bl_unlock(b);
if (inode) {
if (IS_ERR(inode))
return NULL;
@@ -1294,17 +1340,17 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
if (inode) {
struct inode *old;

- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, b, ino);
if (!old) {
inode->i_ino = ino;
spin_lock(&inode->i_lock);
inode->i_state = I_NEW;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
inode_sb_list_add(inode);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);

/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -1317,7 +1363,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
* us. Use the old inode instead of the one we just
* allocated.
*/
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
destroy_inode(inode);
if (IS_ERR(old))
return NULL;
@@ -1341,10 +1387,11 @@ EXPORT_SYMBOL(iget_locked);
*/
static int test_inode_iunique(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_node *node;
struct inode *inode;

- hlist_for_each_entry_rcu(inode, b, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_ino == ino && inode->i_sb == sb)
return 0;
}
@@ -1428,12 +1475,12 @@ EXPORT_SYMBOL(igrab);
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
struct inode *inode;

- spin_lock(&inode_hash_lock);
- inode = find_inode(sb, head, test, data);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode(sb, b, test, data);
+ hlist_bl_unlock(b);

return IS_ERR(inode) ? NULL : inode;
}
@@ -1483,12 +1530,12 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode_fast(sb, b, ino);
+ hlist_bl_unlock(b);

if (inode) {
if (IS_ERR(inode))
@@ -1532,12 +1579,13 @@ struct inode *find_inode_nowait(struct super_block *sb,
void *),
void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
+ struct hlist_bl_node *node;
struct inode *inode, *ret_inode = NULL;
int mval;

- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_lock(b);
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_sb != sb)
continue;
mval = match(inode, hashval, data);
@@ -1548,7 +1596,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
goto out;
}
out:
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return ret_inode;
}
EXPORT_SYMBOL(find_inode_nowait);
@@ -1577,13 +1625,14 @@ EXPORT_SYMBOL(find_inode_nowait);
struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
+ struct hlist_bl_node *node;
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
"suspicious find_inode_rcu() usage");

- hlist_for_each_entry_rcu(inode, head, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_sb == sb &&
!(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
test(inode, data))
@@ -1615,13 +1664,14 @@ EXPORT_SYMBOL(find_inode_rcu);
struct inode *find_inode_by_ino_rcu(struct super_block *sb,
unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_node *node;
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
"suspicious find_inode_by_ino_rcu() usage");

- hlist_for_each_entry_rcu(inode, head, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_ino == ino &&
inode->i_sb == sb &&
!(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
@@ -1635,39 +1685,42 @@ int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);

while (1) {
- struct inode *old = NULL;
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, head, i_hash) {
- if (old->i_ino != ino)
+ struct hlist_bl_node *node;
+ struct inode *old = NULL, *t;
+
+ hlist_bl_lock(b);
+ hlist_bl_for_each_entry(t, node, b, i_hash) {
+ if (t->i_ino != ino)
continue;
- if (old->i_sb != sb)
+ if (t->i_sb != sb)
continue;
- spin_lock(&old->i_lock);
- if (old->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&old->i_lock);
+ spin_lock(&t->i_lock);
+ if (t->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&t->i_lock);
continue;
}
+ old = t;
break;
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW | I_CREATING;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return 0;
}
if (unlikely(old->i_state & I_CREATING)) {
spin_unlock(&old->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return -EBUSY;
}
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
wait_on_inode(old);
if (unlikely(!inode_unhashed(old))) {
iput(old);
@@ -2192,17 +2245,18 @@ EXPORT_SYMBOL(inode_needs_sync);
* wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
* will DTRT.
*/
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+ struct inode *inode)
{
wait_queue_head_t *wq;
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
schedule();
finish_wait(wq, &wait.wq_entry);
- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
}

static __initdata unsigned long ihash_entries;
@@ -2228,7 +2282,7 @@ void __init inode_init_early(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct hlist_bl_head),
ihash_entries,
14,
HASH_EARLY | HASH_ZERO,
@@ -2254,7 +2308,7 @@ void __init inode_init(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct hlist_bl_head),
ihash_entries,
14,
HASH_ZERO,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a6f951942..db8d49cbf7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -647,7 +647,8 @@ struct inode {
unsigned long dirtied_when; /* jiffies of first dirtying */
unsigned long dirtied_time_when;

- struct hlist_node i_hash;
+ struct hlist_bl_node i_hash;
+ struct hlist_bl_head *i_hash_head;
struct list_head i_io_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *i_wb; /* the associated cgroup wb */
@@ -713,7 +714,7 @@ static inline unsigned int i_blocksize(const struct inode *node)

static inline int inode_unhashed(struct inode *inode)
{
- return hlist_unhashed(&inode->i_hash);
+ return hlist_bl_unhashed(&inode->i_hash);
}

/*
@@ -724,7 +725,7 @@ static inline int inode_unhashed(struct inode *inode)
*/
static inline void inode_fake_hash(struct inode *inode)
{
- hlist_add_fake(&inode->i_hash);
+ hlist_bl_add_fake(&inode->i_hash);
}

/*
@@ -2695,7 +2696,7 @@ static inline void insert_inode_hash(struct inode *inode)
extern void __remove_inode_hash(struct inode *);
static inline void remove_inode_hash(struct inode *inode)
{
- if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
+ if (!inode_unhashed(inode) && !hlist_bl_fake(&inode->i_hash))
__remove_inode_hash(inode);
}

--
2.40.1
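
For clarity, the insertion-side locking pattern after this conversion,
condensed from the diff above - each hash chain is serialized by the
bit lock embedded in its own hlist_bl_head instead of by the global
inode_hash_lock:

struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);

hlist_bl_lock(b);                       /* per-chain bit lock */
spin_lock(&inode->i_lock);
__insert_inode_hash_head(inode, b);     /* also records b in i_hash_head */
spin_unlock(&inode->i_lock);
hlist_bl_unlock(b);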

2023-05-09 17:09:39

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 01/32] Compiler Attributes: add __flatten

From: Kent Overstreet <[email protected]>

This makes __attribute__((flatten)) available, which is used by
bcachefs.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]> (maintainer:COMPILER ATTRIBUTES)
Cc: Nick Desaulniers <[email protected]> (reviewer:COMPILER ATTRIBUTES)
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/compiler_attributes.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index e659cb6fde..e56793bc08 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -366,4 +366,9 @@
*/
#define __fix_address noinline __noclone

+/*
+ * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-flatten-function-attribute
+ */
+#define __flatten __attribute__((flatten))
+
#endif /* __LINUX_COMPILER_ATTRIBUTES_H */
--
2.40.1
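
A hypothetical example of the attribute (struct foo and helper() are
placeholders): flatten asks the compiler to inline every call made from
the annotated function, which is useful for hot paths built out of many
small helpers:

static __flatten void walk_hot_path(struct foo *v, unsigned nr)
{
        unsigned i;

        /* every call to helper() below is inlined into this function */
        for (i = 0; i < nr; i++)
                helper(&v[i]);
}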

2023-05-09 17:09:41

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include

From: Kent Overstreet <[email protected]>

We now need linux/limits.h for SIZE_MAX.

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/generic-radix-tree.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 63080822dc..f6cd0f909d 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -38,6 +38,7 @@

#include <asm/page.h>
#include <linux/bug.h>
+#include <linux/limits.h>
#include <linux/log2.h>
#include <linux/math.h>
#include <linux/types.h>
--
2.40.1

2023-05-09 17:10:01

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset

bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi-device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.

Thus this is needed to avoid a null ptr deref.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
---
block/bio.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1e75840d17..e74a04ea14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,7 +1245,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
struct page **pages = (struct page **)bv;
ssize_t size, left;
unsigned len, i = 0;
- size_t offset, trim;
+ size_t offset;
int ret = 0;

/*
@@ -1274,10 +1274,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);

- trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
- iov_iter_revert(iter, trim);
+ if (bio->bi_bdev) {
+ size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+ iov_iter_revert(iter, trim);
+ size -= trim;
+ }

- size -= trim;
if (unlikely(!size)) {
ret = -EFAULT;
goto out;
--
2.40.1

2023-05-09 17:11:22

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 17/32] closures: closure_wait_event()

From: Kent Overstreet <[email protected]>

Like wait_event(), except that because it uses closures and closure
waitlists, it doesn't have wait_event()'s restriction on modifying task
state inside the condition check.

Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/closure.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 0ec9e7bc8d..36b4a83f9b 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -374,4 +374,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
continue_at_nobarrier(cl, fn, wq);
}

+#define __closure_wait_event(waitlist, _cond) \
+do { \
+ struct closure cl; \
+ \
+ closure_init_stack(&cl); \
+ \
+ while (1) { \
+ closure_wait(waitlist, &cl); \
+ if (_cond) \
+ break; \
+ closure_sync(&cl); \
+ } \
+ closure_wake_up(waitlist); \
+ closure_sync(&cl); \
+} while (0)
+
+#define closure_wait_event(waitlist, _cond) \
+do { \
+ if (!(_cond)) \
+ __closure_wait_event(waitlist, _cond); \
+} while (0)
+
#endif /* _LINUX_CLOSURE_H */
--
2.40.1
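
A usage sketch - the structure, waitlist and limit are hypothetical:
wait until the number of IOs in flight drops below a limit, where the
condition is re-evaluated each time the closure's waiters are woken:

/* c->io_wait is a struct closure_waitlist (hypothetical) */
closure_wait_event(&c->io_wait,
                   atomic_read(&c->io_in_flight) < c->io_in_flight_max);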

2023-05-09 17:11:33

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 11/32] block: Bring back zero_fill_bio_iter

From: Kent Overstreet <[email protected]>

This reverts the commit that deleted it; it's used by bcachefs.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
---
block/bio.c | 6 +++---
include/linux/bio.h | 7 ++++++-
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e74a04ea14..70b5c987bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -606,15 +606,15 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
}
EXPORT_SYMBOL(bio_kmalloc);

-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
{
struct bio_vec bv;
struct bvec_iter iter;

- bio_for_each_segment(bv, bio, iter)
+ __bio_for_each_segment(bv, bio, iter, start)
memzero_bvec(&bv);
}
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);

/**
* bio_truncate - truncate the bio to small size of @new_size
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d766be7152..3536f28c05 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,12 @@ extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
extern void bio_copy_data(struct bio *dst, struct bio *src);
extern void bio_free_pages(struct bio *bio);
void guard_bio_eod(struct bio *bio);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+ zero_fill_bio_iter(bio, bio->bi_iter);
+}

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
--
2.40.1
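
A sketch of the kind of use bcachefs has for the iter variant - zeroing
only part of a bio from a saved iterator position; the read helper is
hypothetical:

struct bvec_iter start = bio->bi_iter;  /* save position before reading */

ret = read_from_device(bio);            /* hypothetical helper */
if (ret)
        /* hand back zeroes for the range we failed to read */
        zero_fill_bio_iter(bio, start);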

2023-05-09 17:11:39

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 19/32] closures: Add a missing include

Fixes building in userspace.

Signed-off-by: Kent Overstreet <[email protected]>
---
lib/closure.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/lib/closure.c b/lib/closure.c
index b38ded00b9..0855e698ce 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -9,6 +9,7 @@
#include <linux/closure.h>
#include <linux/debugfs.h>
#include <linux/export.h>
+#include <linux/rcupdate.h>
#include <linux/seq_file.h>
#include <linux/sched/debug.h>

--
2.40.1

2023-05-09 17:11:45

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs

bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
it.

Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dbf3c33c31..0ac2b432f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3509,6 +3509,13 @@ W: http://bcache.evilpiepirate.org
C: irc://irc.oftc.net/bcache
F: drivers/md/bcache/

+BCACHEFS:
+M: Kent Overstreet <[email protected]>
+L: [email protected]
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: fs/bcachefs/
+
BDISP ST MEDIA DRIVER
M: Fabien Dessenne <[email protected]>
L: [email protected]
--
2.40.1

2023-05-09 17:12:10

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree

lib/generic-radix-tree.c is a simple radix tree that supports storing
arbitrary types. Add a maintainers entry for it.

Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5d76169140..c550f5909e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8615,6 +8615,13 @@ F: Documentation/devicetree/bindings/power/power?domain*
F: drivers/base/power/domain*.c
F: include/linux/pm_domain.h

+GENERIC RADIX TREE:
+M: Kent Overstreet <[email protected]>
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: include/linux/generic-radix-tree.h
+F: lib/generic-radix-tree.c
+
GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
M: Eugen Hristev <[email protected]>
L: [email protected]
--
2.40.1
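
For context, a minimal sketch of the API being picked up here
(struct foo and use() are placeholders):

static GENRADIX(struct foo) foos;

static int fill_and_walk(size_t idx)
{
        struct genradix_iter iter;
        struct foo *p;

        genradix_init(&foos);

        p = genradix_ptr_alloc(&foos, idx, GFP_KERNEL);
        if (!p)
                return -ENOMEM;

        genradix_for_each(&foos, iter, p)
                use(p);

        genradix_free(&foos);
        return 0;
}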

2023-05-09 17:12:14

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 08/32] fs: factor out d_mark_tmpfile()

From: Kent Overstreet <[email protected]>

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile() does; it handles i_nlink
on its own, atomically with other btree updates.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
---
fs/dcache.c | 12 ++++++++++--
include/linux/dcache.h | 1 +
2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5fdab..dbdafa2617 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)

EXPORT_SYMBOL(d_genocide);

-void d_tmpfile(struct file *file, struct inode *inode)
+void d_mark_tmpfile(struct file *file, struct inode *inode)
{
struct dentry *dentry = file->f_path.dentry;

- inode_dec_link_count(inode);
BUG_ON(dentry->d_name.name != dentry->d_iname ||
!hlist_unhashed(&dentry->d_u.d_alias) ||
!d_unlinked(dentry));
@@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
(unsigned long long)inode->i_ino);
spin_unlock(&dentry->d_lock);
spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct file *file, struct inode *inode)
+{
+ struct dentry *dentry = file->f_path.dentry;
+
+ inode_dec_link_count(inode);
+ d_mark_tmpfile(file, inode);
d_instantiate(dentry, inode);
}
EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b351e009f..3da2f0545d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
/* <clickety>-<click> the ramfs-type tree */
extern void d_genocide(struct dentry *);

+extern void d_mark_tmpfile(struct file *, struct inode *);
extern void d_tmpfile(struct file *, struct inode *);

extern struct dentry *d_find_alias(struct inode *);
--
2.40.1
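
A sketch of the intended call pattern in a filesystem's ->tmpfile()
that manages i_nlink itself; error handling and the filesystem's own
transaction are omitted, and finish_open_simple() is assumed to be how
the implementation returns:

/*
 * i_nlink was already dropped as part of the filesystem's own atomic
 * update, so skip the inode_dec_link_count() that d_tmpfile() does:
 */
d_mark_tmpfile(file, inode);
d_instantiate(file->f_path.dentry, inode);
return finish_open_simple(file, 0);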

2023-05-09 17:12:20

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev()

From: Kent Overstreet <[email protected]>

This patch adds genradix_iter_peek_prev(), genradix_iter_rewind(), and
genradix_for_each_reverse(), for iterating backwards over a generic
radix tree.

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/generic-radix-tree.h | 61 +++++++++++++++++++++++++++++-
lib/generic-radix-tree.c | 59 +++++++++++++++++++++++++++++
2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index f6cd0f909d..c74b737699 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -117,6 +117,11 @@ static inline size_t __idx_to_offset(size_t idx, size_t obj_size)

#define __genradix_cast(_radix) (typeof((_radix)->type[0]) *)
#define __genradix_obj_size(_radix) sizeof((_radix)->type[0])
+#define __genradix_objs_per_page(_radix) \
+ (PAGE_SIZE / sizeof((_radix)->type[0]))
+#define __genradix_page_remainder(_radix) \
+ (PAGE_SIZE % sizeof((_radix)->type[0]))
+
#define __genradix_idx_to_offset(_radix, _idx) \
__idx_to_offset(_idx, __genradix_obj_size(_radix))

@@ -180,7 +185,25 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
#define genradix_iter_peek(_iter, _radix) \
(__genradix_cast(_radix) \
__genradix_iter_peek(_iter, &(_radix)->tree, \
- PAGE_SIZE / __genradix_obj_size(_radix)))
+ __genradix_objs_per_page(_radix)))
+
+void *__genradix_iter_peek_prev(struct genradix_iter *, struct __genradix *,
+ size_t, size_t);
+
+/**
+ * genradix_iter_peek_prev - get first entry at or below iterator's current
+ * position
+ * @_iter: a genradix_iter
+ * @_radix: genradix being iterated over
+ *
+ * If no more entries exist at or below @_iter's current position, returns NULL
+ */
+#define genradix_iter_peek_prev(_iter, _radix) \
+ (__genradix_cast(_radix) \
+ __genradix_iter_peek_prev(_iter, &(_radix)->tree, \
+ __genradix_objs_per_page(_radix), \
+ __genradix_obj_size(_radix) + \
+ __genradix_page_remainder(_radix)))

static inline void __genradix_iter_advance(struct genradix_iter *iter,
size_t obj_size)
@@ -203,6 +226,25 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
#define genradix_iter_advance(_iter, _radix) \
__genradix_iter_advance(_iter, __genradix_obj_size(_radix))

+static inline void __genradix_iter_rewind(struct genradix_iter *iter,
+ size_t obj_size)
+{
+ if (iter->offset == 0 ||
+ iter->offset == SIZE_MAX) {
+ iter->offset = SIZE_MAX;
+ return;
+ }
+
+ if ((iter->offset & (PAGE_SIZE - 1)) == 0)
+ iter->offset -= PAGE_SIZE % obj_size;
+
+ iter->offset -= obj_size;
+ iter->pos--;
+}
+
+#define genradix_iter_rewind(_iter, _radix) \
+ __genradix_iter_rewind(_iter, __genradix_obj_size(_radix))
+
#define genradix_for_each_from(_radix, _iter, _p, _start) \
for (_iter = genradix_iter_init(_radix, _start); \
(_p = genradix_iter_peek(&_iter, _radix)) != NULL; \
@@ -220,6 +262,23 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
#define genradix_for_each(_radix, _iter, _p) \
genradix_for_each_from(_radix, _iter, _p, 0)

+#define genradix_last_pos(_radix) \
+ (SIZE_MAX / PAGE_SIZE * __genradix_objs_per_page(_radix) - 1)
+
+/**
+ * genradix_for_each_reverse - iterate over entry in a genradix, reverse order
+ * @_radix: genradix to iterate over
+ * @_iter: a genradix_iter to track current position
+ * @_p: pointer to genradix entry type
+ *
+ * On every iteration, @_p will point to the current entry, and @_iter.pos
+ * will be the current entry's index.
+ */
+#define genradix_for_each_reverse(_radix, _iter, _p) \
+ for (_iter = genradix_iter_init(_radix, genradix_last_pos(_radix));\
+ (_p = genradix_iter_peek_prev(&_iter, _radix)) != NULL;\
+ genradix_iter_rewind(&_iter, _radix))
+
int __genradix_prealloc(struct __genradix *, size_t, gfp_t);

/**
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index 7dfa88282b..41f1bcdc44 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -1,4 +1,5 @@

+#include <linux/atomic.h>
#include <linux/export.h>
#include <linux/generic-radix-tree.h>
#include <linux/gfp.h>
@@ -212,6 +213,64 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
}
EXPORT_SYMBOL(__genradix_iter_peek);

+void *__genradix_iter_peek_prev(struct genradix_iter *iter,
+ struct __genradix *radix,
+ size_t objs_per_page,
+ size_t obj_size_plus_page_remainder)
+{
+ struct genradix_root *r;
+ struct genradix_node *n;
+ unsigned level, i;
+
+ if (iter->offset == SIZE_MAX)
+ return NULL;
+
+restart:
+ r = READ_ONCE(radix->root);
+ if (!r)
+ return NULL;
+
+ n = genradix_root_to_node(r);
+ level = genradix_root_to_depth(r);
+
+ if (ilog2(iter->offset) >= genradix_depth_shift(level)) {
+ iter->offset = genradix_depth_size(level);
+ iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+ iter->offset -= obj_size_plus_page_remainder;
+ iter->pos--;
+ }
+
+ while (level) {
+ level--;
+
+ i = (iter->offset >> genradix_depth_shift(level)) &
+ (GENRADIX_ARY - 1);
+
+ while (!n->children[i]) {
+ size_t objs_per_ptr = genradix_depth_size(level);
+
+ iter->offset = round_down(iter->offset, objs_per_ptr);
+ iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+ if (!iter->offset)
+ return NULL;
+
+ iter->offset -= obj_size_plus_page_remainder;
+ iter->pos--;
+
+ if (!i)
+ goto restart;
+ --i;
+ }
+
+ n = n->children[i];
+ }
+
+ return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek_prev);
+
static void genradix_free_recurse(struct genradix_node *n, unsigned level)
{
if (level) {
--
2.40.1
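
Usage sketch of the new reverse iteration (the radix, entry type and
the 'live' field are placeholders):

struct genradix_iter iter;
struct foo *p;

/* walk from the last possible index down towards 0 */
genradix_for_each_reverse(&foos, iter, p)
        if (p->live)
                break;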

2023-05-09 17:13:52

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 14/32] block: Don't block on s_umount from __invalidate_super()

__invalidate_device() is used to flush any filesystem mounted on a
device, generally on some sort of media change event.

However, when unmounting a filesystem and closing the underlying block
devices, we can deadlock if the block driver then calls
__invalidate_device() (e.g. because the block device goes away when it
is no longer in use).

This happens with bcachefs on top of loopback, and can be triggered by
fstests generic/042:

put_super
-> blkdev_put
-> lo_release
-> disk_force_media_change
-> __invalidate_device
-> get_super

This isn't inherently specific to bcachefs - it hasn't shown up with
other filesystems before because most other filesystems use the sget()
mechanism for opening/closing block devices (and enforcing exclusion).
However, sget() has its own downsides and weird/sketchy behaviour
w.r.t. block device open lifetime; if that ever gets fixed, more code
will run into this issue.

The __invalidate_device() call here is really a best effort "I just
yanked the device for a mounted filesystem, please try not to lose my
data" - if it's ever actually needed the user has already done something
crazy, and we probably shouldn't make things worse by deadlocking.
Switching to a trylock seems in keeping with what the code is trying to
do.

If we ever get revoke() at the block layer, perhaps we would look at
rearchitecting to use that instead.

Signed-off-by: Kent Overstreet <[email protected]>
---
block/bdev.c | 2 +-
fs/super.c | 40 +++++++++++++++++++++++++++++++---------
include/linux/fs.h | 1 +
3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 1795c7d4b9..743e969b7b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -922,7 +922,7 @@ EXPORT_SYMBOL(lookup_bdev);

int __invalidate_device(struct block_device *bdev, bool kill_dirty)
{
- struct super_block *sb = get_super(bdev);
+ struct super_block *sb = try_get_super(bdev);
int res = 0;

if (sb) {
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7d..a2decce02f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -791,14 +791,7 @@ void iterate_supers_type(struct file_system_type *type,

EXPORT_SYMBOL(iterate_supers_type);

-/**
- * get_super - get the superblock of a device
- * @bdev: device to get the superblock for
- *
- * Scans the superblock list and finds the superblock of the file system
- * mounted on the device given. %NULL is returned if no match is found.
- */
-struct super_block *get_super(struct block_device *bdev)
+static struct super_block *__get_super(struct block_device *bdev, bool try)
{
struct super_block *sb;

@@ -813,7 +806,12 @@ struct super_block *get_super(struct block_device *bdev)
if (sb->s_bdev == bdev) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+
+ if (!try)
+ down_read(&sb->s_umount);
+ else if (!down_read_trylock(&sb->s_umount))
+ return NULL;
+
/* still alive? */
if (sb->s_root && (sb->s_flags & SB_BORN))
return sb;
@@ -828,6 +826,30 @@ struct super_block *get_super(struct block_device *bdev)
return NULL;
}

+/**
+ * get_super - get the superblock of a device
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *get_super(struct block_device *bdev)
+{
+ return __get_super(bdev, false);
+}
+
+/**
+ * try_get_super - get the superblock of a device, using trylock on sb->s_umount
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *try_get_super(struct block_device *bdev)
+{
+ return __get_super(bdev, true);
+}
+
/**
* get_active_super - get an active reference to the superblock of a device
* @bdev: device to get the superblock for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7..1a6f951942 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,7 @@ extern struct file_system_type *get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
extern struct super_block *get_super(struct block_device *);
+extern struct super_block *try_get_super(struct block_device *);
extern struct super_block *get_active_super(struct block_device *bdev);
extern void drop_super(struct super_block *sb);
extern void drop_super_exclusive(struct super_block *sb);
--
2.40.1

2023-05-09 17:13:53

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

This adds a method to tell lockdep not to check lock ordering within a
lock class - but to still check lock ordering w.r.t. other lock types.

This is for bcachefs, where for btree node locks we have our own
deadlock avoidance strategy w.r.t. other btree node locks (cycle
detection), but we still want lockdep to check lock ordering w.r.t.
other lock types.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
---
include/linux/lockdep.h | 6 ++++++
include/linux/lockdep_types.h | 2 +-
kernel/locking/lockdep.c | 26 ++++++++++++++++++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index e858c288c7..f6cc8709e2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -665,4 +665,10 @@ lockdep_rcu_suspicious(const char *file, const int line, const char *s)
}
#endif

+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *map);
+#else
+static inline void lockdep_set_no_check_recursion(struct lockdep_map *map) {}
+#endif
+
#endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index d22430840b..506e769b4a 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -128,7 +128,7 @@ struct lock_class {
u8 wait_type_inner;
u8 wait_type_outer;
u8 lock_type;
- /* u8 hole; */
+ u8 no_check_recursion;

#ifdef CONFIG_LOCK_STAT
unsigned long contention_point[LOCKSTAT_POINTS];
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e631464070..f022b58dfa 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3024,6 +3024,9 @@ check_deadlock(struct task_struct *curr, struct held_lock *next)
if ((next->read == 2) && prev->read)
continue;

+ if (hlock_class(next)->no_check_recursion)
+ continue;
+
/*
* We're holding the nest_lock, which serializes this lock's
* nesting behaviour.
@@ -3085,6 +3088,10 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
return 2;
}

+ if (hlock_class(prev) == hlock_class(next) &&
+ hlock_class(prev)->no_check_recursion)
+ return 2;
+
/*
* Prove that the new <prev> -> <next> dependency would not
* create a circular dependency in the graph. (We do this by
@@ -6620,3 +6627,22 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
warn_rcu_exit(rcu);
}
EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *lock)
+{
+ struct lock_class *class = lock->class_cache[0];
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
+ lockdep_recursion_inc();
+
+ if (!class)
+ class = register_lock_class(lock, 0, 0);
+ if (class)
+ class->no_check_recursion = true;
+ lockdep_recursion_finish();
+ raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lockdep_set_no_check_recursion);
+#endif
--
2.40.1

2023-05-09 17:14:38

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 16/32] MAINTAINERS: Add entry for closures

closures, from bcache, are async widgets with a variety of uses.
bcachefs also uses them, so they're being moved to lib/; mark them as
maintained.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Coly Li <[email protected]>
---
MAINTAINERS | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3fc37de3d6..5d76169140 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5044,6 +5044,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
F: Documentation/devicetree/bindings/timer/
F: drivers/clocksource/

+CLOSURES:
+M: Kent Overstreet <[email protected]>
+L: [email protected]
+S: Supported
+C: irc://irc.oftc.net/bcache
+F: include/linux/closure.h
+F: lib/closure.c
+
CMPC ACPI DRIVER
M: Thadeu Lima de Souza Cascardo <[email protected]>
M: Daniel Oliveira Nascimento <[email protected]>
--
2.40.1

2023-05-09 17:15:54

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()

Add a foliated version of copy_page_from_iter_atomic()
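
A rough sketch of the intended caller pattern (illustrative only, not part of
this patch; folio, offset, len and iter are assumed caller state):

	size_t copied = copy_folio_from_iter_atomic(folio, offset, len, iter);

	if (copied != len) {
		/* Short copy: user buffer not resident; fault it in and retry. */
		fault_in_iov_iter_readable(iter, len - copied);
		/* ... caller loops back and redoes the copy ... */
	}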

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
---
include/linux/uio.h | 2 ++
lib/iov_iter.c | 53 ++++++++++++++++++++++++++++++++++++---------
2 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 27e3fd9429..b2c281cb10 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -154,6 +154,8 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)

size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
size_t bytes, struct iov_iter *i);
+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+ size_t bytes, struct iov_iter *i);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
void iov_iter_revert(struct iov_iter *i, size_t bytes);
size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 274014e4ea..27ba7e9f9e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -800,18 +800,10 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(iov_iter_zero);

-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
- struct iov_iter *i)
+static inline size_t __copy_page_from_iter_atomic(struct page *page, unsigned offset,
+ size_t bytes, struct iov_iter *i)
{
char *kaddr = kmap_atomic(page), *p = kaddr + offset;
- if (!page_copy_sane(page, offset, bytes)) {
- kunmap_atomic(kaddr);
- return 0;
- }
- if (WARN_ON_ONCE(!i->data_source)) {
- kunmap_atomic(kaddr);
- return 0;
- }
iterate_and_advance(i, bytes, base, len, off,
copyin(p + off, base, len),
memcpy(p + off, base, len)
@@ -819,8 +811,49 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
kunmap_atomic(kaddr);
return bytes;
}
+
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
+ struct iov_iter *i)
+{
+ if (!page_copy_sane(page, offset, bytes))
+ return 0;
+ if (WARN_ON_ONCE(!i->data_source))
+ return 0;
+ return __copy_page_from_iter_atomic(page, offset, bytes, i);
+}
EXPORT_SYMBOL(copy_page_from_iter_atomic);

+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ size_t ret = 0;
+
+ if (WARN_ON(offset + bytes > folio_size(folio)))
+ return 0;
+ if (WARN_ON_ONCE(!i->data_source))
+ return 0;
+
+#ifdef CONFIG_HIGHMEM
+ while (bytes) {
+ struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
+ unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
+ unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
+
+ offset += r;
+ bytes -= r;
+ ret += r;
+
+ if (r != b)
+ break;
+ }
+#else
+ ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
+#endif
+
+ return ret;
+}
+EXPORT_SYMBOL(copy_folio_from_iter_atomic);
+
static void pipe_advance(struct iov_iter *i, size_t size)
{
struct pipe_inode_info *pipe = i->pipe;
--
2.40.1

2023-05-09 17:20:13

by Miguel Ojeda

[permalink] [raw]
Subject: Re: [PATCH 01/32] Compiler Attributes: add __flatten

On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
<[email protected]> wrote:
>
> This makes __attribute__((flatten)) available, which is used by
> bcachefs.

We already have it in mainline, so I think it is one less patch you
need to care about! :)

Cheers,
Miguel

2023-05-09 17:32:59

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 28/32] stacktrace: Export stack_trace_save_tsk

From: Christopher James Halse Rogers <[email protected]>

The bcachefs module wants it, and there doesn't seem to be any
reason it shouldn't be exported like the other functions.
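
For reference, module code would use it roughly like so (sketch only; @task
stands for whatever task the caller cares about):

	unsigned long entries[16];
	unsigned int nr;

	/* Save up to 16 entries of @task's stack, skipping no extra frames. */
	nr = stack_trace_save_tsk(task, entries, ARRAY_SIZE(entries), 0);
	stack_trace_print(entries, nr, 0);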

Signed-off-by: Christopher James Halse Rogers <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
kernel/stacktrace.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9ed5ce9894..4f65824879 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
put_task_stack(tsk);
return c.len;
}
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);

/**
* stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
@@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
save_stack_trace_tsk(task, &trace);
return trace.nr_entries;
}
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);

/**
* stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
--
2.40.1

2023-05-09 17:33:22

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 31/32] lib: add mean and variance module.

From: Daniel Hill <[email protected]>

This module provides a fast 64-bit implementation of basic statistics
functions, including mean, variance and standard deviation in both
weighted and unweighted variants. The unweighted variant has a 32-bit
limit per sample to prevent overflow when squaring.
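
Typical usage, as exercised by the kunit tests added below (sketch only):

	struct mean_and_variance s = {};
	struct mean_and_variance_weighted w = { .w = 2 };	/* weight = 2^2 */

	s = mean_and_variance_update(s, 2);
	s = mean_and_variance_update(s, 4);
	pr_info("mean %lld variance %llu\n",
		mean_and_variance_get_mean(s),
		mean_and_variance_get_variance(s));

	w = mean_and_variance_weighted_update(w, 10);
	w = mean_and_variance_weighted_update(w, 20);
	pr_info("weighted mean %lld\n", mean_and_variance_weighted_get_mean(w));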

Signed-off-by: Daniel Hill <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 9 ++
include/linux/mean_and_variance.h | 219 ++++++++++++++++++++++++++++++
lib/Kconfig.debug | 9 ++
lib/math/Kconfig | 3 +
lib/math/Makefile | 2 +
lib/math/mean_and_variance.c | 136 +++++++++++++++++++
lib/math/mean_and_variance_test.c | 155 +++++++++++++++++++++
7 files changed, 533 insertions(+)
create mode 100644 include/linux/mean_and_variance.h
create mode 100644 lib/math/mean_and_variance.c
create mode 100644 lib/math/mean_and_variance_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c550f5909e..dbf3c33c31 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12767,6 +12767,15 @@ F: Documentation/devicetree/bindings/net/ieee802154/mcr20a.txt
F: drivers/net/ieee802154/mcr20a.c
F: drivers/net/ieee802154/mcr20a.h

+MEAN AND VARIANCE LIBRARY
+M: Daniel B. Hill <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+T: git https://github.com/YellowOnion/linux/
+F: include/linux/mean_and_variance.h
+F: lib/math/mean_and_variance.c
+F: lib/math/mean_and_variance_test.c
+
MEASUREMENT COMPUTING CIO-DAC IIO DRIVER
M: William Breathitt Gray <[email protected]>
L: [email protected]
diff --git a/include/linux/mean_and_variance.h b/include/linux/mean_and_variance.h
new file mode 100644
index 0000000000..89540628e8
--- /dev/null
+++ b/include/linux/mean_and_variance.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef MEAN_AND_VARIANCE_H_
+#define MEAN_AND_VARIANCE_H_
+
+#include <linux/types.h>
+#include <linux/limits.h>
+#include <linux/math64.h>
+
+#define SQRT_U64_MAX 4294967295ULL
+
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+
+typedef unsigned __int128 u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+ return (u128)a;
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+ return (u64)a;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+ return (u64)(a >> 64);
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+ return a + b;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+ return a - b;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+ return i << shift;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+ return ((u128)a << 64) + b;
+}
+
+static inline u128 u128_square(u64 i)
+{
+ return i*i;
+}
+
+#else
+
+typedef struct {
+ u64 hi, lo;
+} u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+ return (u128){ .lo = a };
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+ return a.lo;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+ return a.hi;
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+ u128 c;
+
+ c.lo = a.lo + b.lo;
+ c.hi = a.hi + b.hi + (c.lo < a.lo);
+ return c;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+ u128 c;
+
+ c.lo = a.lo - b.lo;
+ c.hi = a.hi - b.hi - (c.lo > a.lo);
+ return c;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+ u128 r;
+
+ r.lo = i.lo << shift;
+ if (shift < 64)
+ r.hi = (i.hi << shift) | (i.lo >> (64 - shift));
+ else {
+ r.hi = i.lo << (shift - 64);
+ r.lo = 0;
+ }
+ return r;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+ return u128_add(u128_shl(u64_to_u128(a), 64), u64_to_u128(b));
+}
+
+static inline u128 u128_square(u64 i)
+{
+ u128 r;
+ u64 h = i >> 32, l = i & (u64)U32_MAX;
+
+ r = u128_shl(u64_to_u128(h*h), 64);
+ r = u128_add(r, u128_shl(u64_to_u128(h*l), 32));
+ r = u128_add(r, u128_shl(u64_to_u128(l*h), 32));
+ r = u128_add(r, u64_to_u128(l*l));
+ return r;
+}
+
+#endif
+
+static inline u128 u128_div(u128 n, u64 d)
+{
+ u128 r;
+ u64 rem;
+ u64 hi = u128_shr64_to_u64(n);
+ u64 lo = u128_to_u64(n);
+ u64 h = hi & ((u64)U32_MAX << 32);
+ u64 l = (hi & (u64)U32_MAX) << 32;
+
+ r = u128_shl(u64_to_u128(div64_u64_rem(h, d, &rem)), 64);
+ r = u128_add(r, u128_shl(u64_to_u128(div64_u64_rem(l + (rem << 32), d, &rem)), 32));
+ r = u128_add(r, u64_to_u128(div64_u64_rem(lo + (rem << 32), d, &rem)));
+ return r;
+}
+
+struct mean_and_variance {
+ s64 n;
+ s64 sum;
+ u128 sum_squares;
+};
+
+/* exponentially weighted variant */
+struct mean_and_variance_weighted {
+ bool init;
+ u8 w;
+ s64 mean;
+ u64 variance;
+};
+
+/**
+ * fast_divpow2() - fast approximation for n / (1 << d)
+ * @n: numerator
+ * @d: the power of 2 denominator.
+ *
+ * note: this rounds towards 0.
+ */
+static inline s64 fast_divpow2(s64 n, u8 d)
+{
+ return (n + ((n < 0) ? ((1 << d) - 1) : 0)) >> d;
+}
+
+static inline struct mean_and_variance
+mean_and_variance_update_inlined(struct mean_and_variance s1, s64 v1)
+{
+ struct mean_and_variance s2;
+ u64 v2 = abs(v1);
+
+ s2.n = s1.n + 1;
+ s2.sum = s1.sum + v1;
+ s2.sum_squares = u128_add(s1.sum_squares, u128_square(v2));
+ return s2;
+}
+
+static inline struct mean_and_variance_weighted
+mean_and_variance_weighted_update_inlined(struct mean_and_variance_weighted s1, s64 x)
+{
+ struct mean_and_variance_weighted s2;
+ // previous weighted variance.
+ u64 var_w0 = s1.variance;
+ u8 w = s2.w = s1.w;
+ // new value weighted.
+ s64 x_w = x << w;
+ s64 diff_w = x_w - s1.mean;
+ s64 diff = fast_divpow2(diff_w, w);
+ // new mean weighted.
+ s64 u_w1 = s1.mean + diff;
+
+ BUG_ON(w % 2 != 0);
+
+ if (!s1.init) {
+ s2.mean = x_w;
+ s2.variance = 0;
+ } else {
+ s2.mean = u_w1;
+ s2.variance = ((var_w0 << w) - var_w0 + ((diff_w * (x_w - u_w1)) >> w)) >> w;
+ }
+ s2.init = true;
+
+ return s2;
+}
+
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1);
+ s64 mean_and_variance_get_mean(struct mean_and_variance s);
+ u64 mean_and_variance_get_variance(struct mean_and_variance s1);
+ u32 mean_and_variance_get_stddev(struct mean_and_variance s);
+
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1, s64 v1);
+ s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s);
+ u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s);
+ u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s);
+
+#endif // MEAN_AND_VARIANCE_H_
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3dba7a9aff..9ca88e0027 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2101,6 +2101,15 @@ config CPUMASK_KUNIT_TEST

If unsure, say N.

+config MEAN_AND_VARIANCE_UNIT_TEST
+ tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
+ depends on KUNIT
+ select MEAN_AND_VARIANCE
+ default KUNIT_ALL_TESTS
+ help
+ This option enables the kunit tests for mean_and_variance module.
+ If unsure, say N.
+
config TEST_LIST_SORT
tristate "Linked list sorting test" if !KUNIT_ALL_TESTS
depends on KUNIT
diff --git a/lib/math/Kconfig b/lib/math/Kconfig
index 0634b428d0..7530ae9a35 100644
--- a/lib/math/Kconfig
+++ b/lib/math/Kconfig
@@ -15,3 +15,6 @@ config PRIME_NUMBERS

config RATIONAL
tristate
+
+config MEAN_AND_VARIANCE
+ tristate
diff --git a/lib/math/Makefile b/lib/math/Makefile
index bfac26ddfc..2ef1487e01 100644
--- a/lib/math/Makefile
+++ b/lib/math/Makefile
@@ -4,6 +4,8 @@ obj-y += div64.o gcd.o lcm.o int_pow.o int_sqrt.o reciprocal_div.o
obj-$(CONFIG_CORDIC) += cordic.o
obj-$(CONFIG_PRIME_NUMBERS) += prime_numbers.o
obj-$(CONFIG_RATIONAL) += rational.o
+obj-$(CONFIG_MEAN_AND_VARIANCE) += mean_and_variance.o

obj-$(CONFIG_TEST_DIV64) += test_div64.o
obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
+obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST) += mean_and_variance_test.o
diff --git a/lib/math/mean_and_variance.c b/lib/math/mean_and_variance.c
new file mode 100644
index 0000000000..6e315d3a13
--- /dev/null
+++ b/lib/math/mean_and_variance.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Functions for incremental mean and variance.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Copyright © 2022 Daniel B. Hill
+ *
+ * Author: Daniel B. Hill <[email protected]>
+ *
+ * Description:
+ *
+ * This includes some incremental algorithms for mean and variance calculation
+ *
+ * Derived from the paper: https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ *
+ * Create a struct and if it's the weighted variant set the w field (weight = 2^k).
+ *
+ * Use mean_and_variance[_weighted]_update() on the struct to update its state.
+ *
+ * Use the mean_and_variance[_weighted]_get_* functions to calculate the mean and variance; some
+ * computation is deferred to these functions for performance reasons.
+ *
+ * see lib/math/mean_and_variance_test.c for examples of usage.
+ *
+ * DO NOT access the mean and variance fields of the weighted variants directly.
+ * DO NOT change the weight after calling update.
+ */
+
+#include <linux/bug.h>
+#include <linux/compiler.h>
+#include <linux/export.h>
+#include <linux/limits.h>
+#include <linux/math.h>
+#include <linux/math64.h>
+#include <linux/mean_and_variance.h>
+#include <linux/module.h>
+
+/**
+ * mean_and_variance_update() - update a mean_and_variance struct @s1 with a new sample @v1
+ * and return it.
+ * @s1: the mean_and_variance to update.
+ * @v1: the new sample.
+ *
+ * see linked pdf equation 12.
+ */
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1)
+{
+ return mean_and_variance_update_inlined(s1, v1);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_update);
+
+/**
+ * mean_and_variance_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_get_mean(struct mean_and_variance s)
+{
+ return div64_u64(s.sum, s.n);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_mean);
+
+/**
+ * mean_and_variance_get_variance() - get variance from @s1
+ *
+ * see linked pdf equation 12.
+ */
+u64 mean_and_variance_get_variance(struct mean_and_variance s1)
+{
+ u128 s2 = u128_div(s1.sum_squares, s1.n);
+ u64 s3 = abs(mean_and_variance_get_mean(s1));
+
+ return u128_to_u64(u128_sub(s2, u128_square(s3)));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_variance);
+
+/**
+ * mean_and_variance_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_get_stddev(struct mean_and_variance s)
+{
+ return int_sqrt64(mean_and_variance_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_stddev);
+
+/**
+ * mean_and_variance_weighted_update() - exponentially weighted variant of mean_and_variance_update()
+ * @s1: the state to update
+ * @x: the new sample
+ *
+ * see linked pdf: function derived from equations 140-143 where alpha = 2^w.
+ * values are stored bitshifted for performance and added precision.
+ */
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1,
+ s64 x)
+{
+ return mean_and_variance_weighted_update_inlined(s1, x);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_update);
+
+/**
+ * mean_and_variance_weighted_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s)
+{
+ return fast_divpow2(s.mean, s.w);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_mean);
+
+/**
+ * mean_and_variance_weighted_get_variance() -- get variance from @s
+ */
+u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s)
+{
+ // always positive don't need fast divpow2
+ return s.variance >> s.w;
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_variance);
+
+/**
+ * mean_and_variance_weighted_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s)
+{
+ return int_sqrt64(mean_and_variance_weighted_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_stddev);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
diff --git a/lib/math/mean_and_variance_test.c b/lib/math/mean_and_variance_test.c
new file mode 100644
index 0000000000..79a96d7307
--- /dev/null
+++ b/lib/math/mean_and_variance_test.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+#include <linux/mean_and_variance.h>
+
+#define MAX_SQR (SQRT_U64_MAX*SQRT_U64_MAX)
+
+static void mean_and_variance_basic_test(struct kunit *test)
+{
+ struct mean_and_variance s = {};
+
+ s = mean_and_variance_update(s, 2);
+ s = mean_and_variance_update(s, 2);
+
+ KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 2);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 0);
+ KUNIT_EXPECT_EQ(test, s.n, 2);
+
+ s = mean_and_variance_update(s, 4);
+ s = mean_and_variance_update(s, 4);
+
+ KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 3);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 1);
+ KUNIT_EXPECT_EQ(test, s.n, 4);
+}
+
+/*
+ * Test values computed using a spreadsheet from the pseudocode at the bottom:
+ * https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ */
+
+static void mean_and_variance_weighted_test(struct kunit *test)
+{
+ struct mean_and_variance_weighted s = {};
+
+ s.w = 2;
+
+ s = mean_and_variance_weighted_update(s, 10);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 10);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+ s = mean_and_variance_weighted_update(s, 20);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 12);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+ s = mean_and_variance_weighted_update(s, 30);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 16);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+ s = (struct mean_and_variance_weighted){};
+ s.w = 2;
+
+ s = mean_and_variance_weighted_update(s, -10);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -10);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+ s = mean_and_variance_weighted_update(s, -20);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -12);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+ s = mean_and_variance_weighted_update(s, -30);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -16);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+}
+
+static void mean_and_variance_weighted_advanced_test(struct kunit *test)
+{
+ struct mean_and_variance_weighted s = {};
+ s64 i;
+
+ s.w = 8;
+ for (i = 10; i <= 100; i += 10)
+ s = mean_and_variance_weighted_update(s, i);
+
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 11);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+ s = (struct mean_and_variance_weighted){};
+
+ s.w = 8;
+ for (i = -10; i >= -100; i -= 10)
+ s = mean_and_variance_weighted_update(s, i);
+
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -11);
+ KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+}
+
+static void mean_and_variance_fast_divpow2(struct kunit *test)
+{
+ s64 i;
+ u8 d;
+
+ for (i = 0; i < 100; i++) {
+ d = 0;
+ KUNIT_EXPECT_EQ(test, fast_divpow2(i, d), div_u64(i, 1LLU << d));
+ KUNIT_EXPECT_EQ(test, abs(fast_divpow2(-i, d)), div_u64(i, 1LLU << d));
+ for (d = 1; d < 32; d++) {
+ KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(i, d)),
+ div_u64(i, 1 << d), "%lld %u", i, d);
+ KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(-i, d)),
+ div_u64(i, 1 << d), "%lld %u", -i, d);
+ }
+ }
+}
+
+static void mean_and_variance_u128_basic_test(struct kunit *test)
+{
+ u128 a = u128_shl64_add(0, U64_MAX);
+ u128 a1 = u128_shl64_add(0, 1);
+ u128 b = u128_shl64_add(1, 0);
+ u128 c = u128_shl64_add(0, 1LLU << 63);
+ u128 c2 = u128_shl64_add(U64_MAX, U64_MAX);
+
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a, a1)), 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a, a1)), 0);
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a1, a)), 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a1, a)), 0);
+
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_sub(b, a1)), U64_MAX);
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_sub(b, a1)), 0);
+
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_shl(c, 1)), 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_shl(c, 1)), 0);
+
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_square(U64_MAX)), U64_MAX - 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_square(U64_MAX)), 1);
+
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(b, 2)), 1LLU << 63);
+
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(c2, 2)), U64_MAX >> 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(c2, 2)), U64_MAX);
+
+ KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U32_MAX >> 1);
+ KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U64_MAX << 31);
+}
+
+static struct kunit_case mean_and_variance_test_cases[] = {
+ KUNIT_CASE(mean_and_variance_fast_divpow2),
+ KUNIT_CASE(mean_and_variance_u128_basic_test),
+ KUNIT_CASE(mean_and_variance_basic_test),
+ KUNIT_CASE(mean_and_variance_weighted_test),
+ KUNIT_CASE(mean_and_variance_weighted_advanced_test),
+ {}
+};
+
+static struct kunit_suite mean_and_variance_test_suite = {
+.name = "mean and variance tests",
+.test_cases = mean_and_variance_test_cases
+};
+
+kunit_test_suite(mean_and_variance_test_suite);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
--
2.40.1

2023-05-09 17:37:57

by Coly Li

[permalink] [raw]
Subject: Re: [PATCH 16/32] MAINTAINERS: Add entry for closures



> 2023年5月10日 00:56,Kent Overstreet <[email protected]> 写道:
>
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Coly Li <[email protected]>

Acked-by: Coly Li <[email protected]>

Thanks.

Coly Li

> ---
> MAINTAINERS | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> F: Documentation/devicetree/bindings/timer/
> F: drivers/clocksource/
>
> +CLOSURES:
> +M: Kent Overstreet <[email protected]>
> +L: [email protected]
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: include/linux/closure.h
> +F: lib/closure.c
> +
> CMPC ACPI DRIVER
> M: Thadeu Lima de Souza Cascardo <[email protected]>
> M: Daniel Oliveira Nascimento <[email protected]>
> --
> 2.40.1
>

2023-05-09 17:42:43

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 01/32] Compiler Attributes: add __flatten

On Tue, May 09, 2023 at 07:04:43PM +0200, Miguel Ojeda wrote:
> On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > This makes __attribute__((flatten)) available, which is used by
> > bcachefs.
>
> We already have it in mainline, so I think it is one less patch you
> need to care about! :)
>
> Cheers,
> Miguel

Wonderful :)

Cheers,
Kent

2023-05-09 19:40:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()

On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
>
> This patch adds lock_class_is_held(), which can be used to assert that a
> particular type of lock is not held.

How is lock_is_held_type() not sufficient? Which is what's used to
implement lockdep_assert_held*().

2023-05-09 19:53:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> This adds a method to tell lockdep not to check lock ordering within a
> lock class - but to still check lock ordering w.r.t. other lock types.
>
> This is for bcachefs, where for btree node locks we have our own
> deadlock avoidance strategy w.r.t. other btree node locks (cycle
> detection), but we still want lockdep to check lock ordering w.r.t.
> other lock types.
>

ISTR you had a much nicer version of this where you gave a custom order
function -- what happend to that?

2023-05-09 20:11:25

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> >
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> >
>
> ISTR you had a much nicer version of this where you gave a custom order
> function -- what happend to that?

Probably in the other branch that I was meaning to re-mail you separately,
clearly I hadn't pulled the latest versions back into here... expect
that shortly :)

2023-05-09 20:25:47

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()

On Tue, May 09, 2023 at 09:30:39PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This patch adds lock_class_is_held(), which can be used to assert that a
> > particular type of lock is not held.
>
> How is lock_is_held_type() not sufficient? Which is what's used to
> implement lockdep_assert_held*().

I should've looked at that before - it returns a tristate, so it's
closer than I thought, but this is used in contexts where we don't have
a lock or lockdep_map to pass and need to pass the lock_class_key
instead.

e.g, when initializing a btree_trans, or waiting on btree node IO, we
need to assert that no btree node locks are held.
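
Roughly (sketch; the key name here is made up):

	/* e.g. at btree_trans init time: */
	WARN_ON(lock_class_is_held(&btree_node_lock_key));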

Looking at the code, __lock_is_held() -> match_held_lock() has to care
about a bunch of stuff related to subclasses that doesn't seem relevant
to lock_class_is_held() - lock_class_is_held() is practically no code in
comparison, so I'm inclined to think they should just be separate.

But I'm not the lockdep expert :) Thoughts?

2023-05-09 20:29:32

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> >
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> >
>
> ISTR you had a much nicer version of this where you gave a custom order
> function -- what happend to that?

Actually, I spoke too soon; this patch and the other series with the
comparison function solve different problems.

For bcachefs btree node locks, we don't have a defined lock ordering at
all - we do full runtime cycle detection, so we don't want lockdep
checking for self deadlock because we're handling that but we _do_ want
lockdep checking lock ordering of btree node locks w.r.t. other locks in
the system.

2023-05-09 20:38:04

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()


On 5/9/23 16:18, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>> This adds a method to tell lockdep not to check lock ordering within a
>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>
>>> This is for bcachefs, where for btree node locks we have our own
>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>> other lock types.
>>>
>> ISTR you had a much nicer version of this where you gave a custom order
>> function -- what happend to that?
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
>
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Maybe you can use lock_set_novalidate_class() instead.

Cheers,
Longman

2023-05-09 20:49:52

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
>
> On 5/9/23 16:18, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > >
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > >
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> >
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
>
> Maybe you can use lock_set_novalidate_class() instead.

No, we want that to go away, this is the replacement.

2023-05-09 21:18:11

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs



On 5/9/23 09:56, Kent Overstreet wrote:
> bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> it.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> MAINTAINERS | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dbf3c33c31..0ac2b432f0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3509,6 +3509,13 @@ W: http://bcache.evilpiepirate.org
> C: irc://irc.oftc.net/bcache
> F: drivers/md/bcache/
>
> +BCACHEFS:

No colon at the end of the line.


> +M: Kent Overstreet <[email protected]>
> +L: [email protected]
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: fs/bcachefs/
> +
> BDISP ST MEDIA DRIVER
> M: Fabien Dessenne <[email protected]>
> L: [email protected]

--
~Randy

2023-05-09 22:07:33

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree



On 5/9/23 09:56, Kent Overstreet wrote:
> lib/generic-radix-tree.c is a simple radix tree that supports storing
> arbitrary types. Add a maintainers entry for it.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> MAINTAINERS | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5d76169140..c550f5909e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8615,6 +8615,13 @@ F: Documentation/devicetree/bindings/power/power?domain*
> F: drivers/base/power/domain*.c
> F: include/linux/pm_domain.h
>
> +GENERIC RADIX TREE:

No colon at the end of the line.

> +M: Kent Overstreet <[email protected]>
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: include/linux/generic-radix-tree.h
> +F: lib/generic-radix-tree.c
> +
> GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
> M: Eugen Hristev <[email protected]>
> L: [email protected]

--
~Randy

2023-05-09 22:10:49

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs

On Tue, May 09, 2023 at 02:04:00PM -0700, Randy Dunlap wrote:
>
>
> On 5/9/23 09:56, Kent Overstreet wrote:
> > bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> > it.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > MAINTAINERS | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index dbf3c33c31..0ac2b432f0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3509,6 +3509,13 @@ W: http://bcache.evilpiepirate.org
> > C: irc://irc.oftc.net/bcache
> > F: drivers/md/bcache/
> >
> > +BCACHEFS:
>
> No colon at the end of the line.

Thanks, updated.

2023-05-09 22:15:54

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 16/32] MAINTAINERS: Add entry for closures



On 5/9/23 09:56, Kent Overstreet wrote:
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Coly Li <[email protected]>
> ---
> MAINTAINERS | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> F: Documentation/devicetree/bindings/timer/
> F: drivers/clocksource/
>
> +CLOSURES:

No colon at the end of the line.

> +M: Kent Overstreet <[email protected]>
> +L: [email protected]
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: include/linux/closure.h
> +F: lib/closure.c
> +
> CMPC ACPI DRIVER
> M: Thadeu Lima de Souza Cascardo <[email protected]>
> M: Daniel Oliveira Nascimento <[email protected]>

--
~Randy

2023-05-09 22:34:29

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On 5/9/23 16:35, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
>> On 5/9/23 16:18, Kent Overstreet wrote:
>>> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>>>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>>>> This adds a method to tell lockdep not to check lock ordering within a
>>>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>>>
>>>>> This is for bcachefs, where for btree node locks we have our own
>>>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>>>> other lock types.
>>>>>
>>>> ISTR you had a much nicer version of this where you gave a custom order
>>>> function -- what happend to that?
>>> Actually, I spoke too soon; this patch and the other series with the
>>> comparison function solve different problems.
>>>
>>> For bcachefs btree node locks, we don't have a defined lock ordering at
>>> all - we do full runtime cycle detection, so we don't want lockdep
>>> checking for self deadlock because we're handling that but we _do_ want
>>> lockdep checking lock ordering of btree node locks w.r.t. other locks in
>>> the system.
>> Maybe you can use lock_set_novalidate_class() instead.
> No, we want that to go away, this is the replacement.

OK, you can mention that in the commit log then.

Cheers,
Longman

2023-05-10 01:25:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
>
> This is used by bcachefs to fix a page cache coherency issue with
> O_DIRECT writes.
>
> Also relevant: mapping->invalidate_lock, see below.
>
> O_DIRECT writes (and other filesystem operations that modify file data
> while bypassing the page cache) need to shoot down ranges of the page
> cache - and additionally, need locking to prevent those pages from
> pulled back in.
>
> But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> and the page fault handler will need to take that same lock - this is a
> classic recursive deadlock if userspace has mmaped the file they're DIO
> writing to and uses those pages for the buffer to write from, and it's a
> lock ordering deadlock in general.
>
> Thus we need a way to signal from the dio code to the page fault handler
> when we already are holding the pagecache add lock on an address space -
> this patch just adds a member to task_struct for this purpose. For now
> only bcachefs is implementing this locking, though it may be moved out
> of bcachefs and made available to other filesystems in the future.

It would be nice to have at least a link to the code that's actually using
the field you are adding.

Also I think we were already through this discussion [1] and we ended up
agreeing that your scheme actually solves only the AA deadlock but a
malicious userspace can easily create AB BA deadlock by running direct IO
to file A using mapped file B as a buffer *and* direct IO to file B using
mapped file A as a buffer.

[1] https://lore.kernel.org/all/[email protected]

> ---------------------------------
>
> The closest current VFS equivalent is mapping->invalidate_lock, which
> comes from XFS. However, it's not used by direct IO. Instead, direct IO
> paths shoot down the page cache twice - before starting the IO and at
> the end, and they're still technically racy w.r.t. page cache coherency.
>
> This is a more complete approach: in the future we might consider
> replacing mapping->invalidate_lock with the bcachefs code.

Yes, and this is because we never provided 100% consistent buffered VS
direct IO behavior on the same file exactly because we never found the
complexity worth the usefulness...

Honza

>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Cc: [email protected]
> ---
> include/linux/sched.h | 1 +
> init/init_task.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b..f2a56f64f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -869,6 +869,7 @@ struct task_struct {
>
> struct mm_struct *mm;
> struct mm_struct *active_mm;
> + struct address_space *faults_disabled_mapping;
>
> int exit_state;
> int exit_code;
> diff --git a/init/init_task.c b/init/init_task.c
> index ff6c4b9bfe..f703116e05 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -85,6 +85,7 @@ struct task_struct init_task
> .nr_cpus_allowed= NR_CPUS,
> .mm = NULL,
> .active_mm = &init_mm,
> + .faults_disabled_mapping = NULL,
> .restart_block = {
> .fn = do_no_restart_syscall,
> },
> --
> 2.40.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-05-10 01:54:38

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 15/32] bcache: move closures to lib/



On 5/9/23 09:56, Kent Overstreet wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 39d1d93164..3dba7a9aff 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
> This is a relatively cheap check but if you care about maximum
> performance, say N.
>
> +config DEBUG_CLOSURES
> + bool "Debug closures (bcache async widgits)"
> + depends on CLOSURES
> + select DEBUG_FS
> + help
> + Keeps all active closures in a linked list and provides a debugfs
> + interface to list them, which makes it possible to see asynchronous
> + operations that get stuck.

According to coding-style.rst, the help text (3 lines above) should be
indented with 2 additional spaces.

> + help
> + Keeps all active closures in a linked list and provides a debugfs
> + interface to list them, which makes it possible to see asynchronous
> + operations that get stuck.

--
~Randy

2023-05-10 02:35:25

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230509]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base: tip/locking/core
patch link: https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: i386-randconfig-a002 (https://download.01.org/0day-ci/archive/20230510/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> lib/iov_iter.c:839:16: warning: comparison of distinct pointer types ('typeof (bytes) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - (offset & (~(((1UL) << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/minmax.h:67:19: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^~~~~~~~~~~~~~~~~~~~~~
include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~~~~~~~
include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~~~~~~~
include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
1 warning generated.


vim +839 lib/iov_iter.c

825
826 size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
827 size_t bytes, struct iov_iter *i)
828 {
829 size_t ret = 0;
830
831 if (WARN_ON(offset + bytes > folio_size(folio)))
832 return 0;
833 if (WARN_ON_ONCE(!i->data_source))
834 return 0;
835
836 #ifdef CONFIG_HIGHMEM
837 while (bytes) {
838 struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
> 839 unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
840 unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
841
842 offset += r;
843 bytes -= r;
844 ret += r;
845
846 if (r != b)
847 break;
848 }
849 #else
850 ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
851 #endif
852
853 return ret;
854 }
855 EXPORT_SYMBOL(copy_folio_from_iter_atomic);
856

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-05-10 04:48:50

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <[email protected]>
>
> Because scalability of the global inode_hash_lock really, really
> sucks.
>
> 32-way concurrent create on a couple of different filesystems
> before:
>
> - 52.13% 0.04% [kernel] [k] ext4_create
> - 52.09% ext4_create
> - 41.03% __ext4_new_inode
> - 29.92% insert_inode_locked
> - 25.35% _raw_spin_lock
> - do_raw_spin_lock
> - 24.97% __pv_queued_spin_lock_slowpath
>
> - 72.33% 0.02% [kernel] [k] do_filp_open
> - 72.31% do_filp_open
> - 72.28% path_openat
> - 57.03% bch2_create
> - 56.46% __bch2_create
> - 40.43% inode_insert5
> - 36.07% _raw_spin_lock
> - do_raw_spin_lock
> 35.86% __pv_queued_spin_lock_slowpath
> 4.02% find_inode
>
> Convert the inode hash table to a RCU-aware hash-bl table just like
> the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because the hash value that is used to hash the inode is
> generated from the inode itself - filesystems can provide this
> themselves so we have to either store the hash or the head pointer
> in the inode to be able to find the right list head for removal...
>
> Same workload after:
>
> Signed-off-by: Dave Chinner <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Kent Overstreet <[email protected]>

I have been maintaining this patchset uptodate in my own local trees
and the code in this patch looks the same. The commit message above,
however, has been mangled. The full commit message should be:

vfs: inode cache conversion to hash-bl

Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.

Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a 5.10 kernel:

- 52.13% 0.04% [kernel] [k] ext4_create
- 52.09% ext4_create
- 41.03% __ext4_new_inode
- 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
- do_raw_spin_lock
- 24.97% __pv_queued_spin_lock_slowpath


- 72.33% 0.02% [kernel] [k] do_filp_open
- 72.31% do_filp_open
- 72.28% path_openat
- 57.03% bch2_create
- 56.46% __bch2_create
- 40.43% inode_insert5
- 36.07% _raw_spin_lock
- do_raw_spin_lock
35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode

btrfs was tested but it is limited by internal lock contention at
>=2 threads on this workload, so never hammers the inode cache lock
hard enough for this change to matter to its performance.

However, both bcachefs and ext4 demonstrate poor scaling at >=8
threads on concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value that is
used to hash the inode is not generated from the inode itself. i.e.
filesystems can provide this themselves so we have to either store
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...

Concurrent create with varying thread count (files/s):

                     ext4                 bcachefs
    threads     vanilla  patched     vanilla  patched
       2          117k     112k        80k      85k
       4          185k     190k       133k     145k
       8          303k     346k       185k     255k
      16          389k     465k       190k     420k
      32          360k     437k       142k     481k

CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
halved on the patched kernel, while performance has increased
marginally on ext4 and massively on bcachefs. Internal filesystem
algorithms now limit performance on these workloads, not the global
inode_hash_lock.

Profile of the workloads on the patched kernels:

- 35.94% 0.07% [kernel] [k] ext4_create
- 35.87% ext4_create
- 20.45% __ext4_new_inode
...
3.36% insert_inode_locked

- 78.43% do_filp_open
- 78.36% path_openat
- 53.95% bch2_create
- 47.99% __bch2_create
....
- 7.57% inode_insert5
6.94% find_inode

Spinlock contention is largely gone from the inode hash operations
and the filesystems are limited by contention in their internal
algorithms.

Signed-off-by: Dave Chinner <[email protected]>
---

Other than that, the diffstat is the same and I don't see any obvious
differences in the code comapred to what I've been running locally.

-Dave.
--
Dave Chinner
[email protected]

2023-05-10 04:57:08

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 21/32] hlist-bl: add hlist_bl_fake()

On Tue, May 09, 2023 at 12:56:46PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <[email protected]>
>
> in preparation for switching the VFS inode cache over the hlist_bl
In

> lists, we nee dto be able to fake a list node that looks like it is
need to

> hased for correct operation of filesystems that don't directly use
hashed

> the VFS indoe cache.
inode cache hash index.

-Dave.

--
Dave Chinner
[email protected]

2023-05-10 06:25:09

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This is used by bcachefs to fix a page cache coherency issue with
> > O_DIRECT writes.
> >
> > Also relevant: mapping->invalidate_lock, see below.
> >
> > O_DIRECT writes (and other filesystem operations that modify file data
> > while bypassing the page cache) need to shoot down ranges of the page
> > cache - and additionally, need locking to prevent those pages from
> > pulled back in.
> >
> > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > and the page fault handler will need to take that same lock - this is a
> > classic recursive deadlock if userspace has mmaped the file they're DIO
> > writing to and uses those pages for the buffer to write from, and it's a
> > lock ordering deadlock in general.
> >
> > Thus we need a way to signal from the dio code to the page fault handler
> > when we already are holding the pagecache add lock on an address space -
> > this patch just adds a member to task_struct for this purpose. For now
> > only bcachefs is implementing this locking, though it may be moved out
> > of bcachefs and made available to other filesystems in the future.
>
> It would be nice to have at least a link to the code that's actually using
> the field you are adding.

Bit of a trick to link to a _later_ patch in the series from a commit
message, but...

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

> Also I think we were already through this discussion [1] and we ended up
> agreeing that your scheme actually solves only the AA deadlock but a
> malicious userspace can easily create AB BA deadlock by running direct IO
> to file A using mapped file B as a buffer *and* direct IO to file B using
> mapped file A as a buffer.

No, that's definitely handled (and you can see it in the code I linked),
and I wrote a torture test for fstests as well.

David Howells was also just running into a strange locking situation with
iov_iters and recursive gups - I don't recall all the details, but it
sounded like this might be a solution for that. David, did you have
thoughts on that?

2023-05-10 09:35:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > This adds a method to tell lockdep not to check lock ordering within a
> > > lock class - but to still check lock ordering w.r.t. other lock types.
> > >
> > > This is for bcachefs, where for btree node locks we have our own
> > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > other lock types.
> > >
> >
> > ISTR you had a much nicer version of this where you gave a custom order
> > function -- what happend to that?
>
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
>
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Have you read the ww_mutex code? If not, please do so, it does similar
things.

The way it gets around the self-nesting check is by using the nest_lock
annotation, the acquire context itself also has a dep_map for this
purpose.
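
Roughly, the pattern being referred to (sketch only; the -EDEADLK
backoff/retry loop is omitted):

static DEFINE_WW_CLASS(demo_ww_class);

static void demo(struct ww_mutex *a, struct ww_mutex *b)
{
	struct ww_acquire_ctx ctx;
	int ret;

	ww_acquire_init(&ctx, &demo_ww_class);

	/*
	 * The acquire context provides the nest_lock annotation, so taking
	 * two locks of the same class doesn't trip lockdep's recursion
	 * check; the wound/wait protocol handles deadlock avoidance.
	 */
	ret = ww_mutex_lock(a, &ctx);
	if (!ret) {
		ret = ww_mutex_lock(b, &ctx);
		if (ret)
			ww_mutex_unlock(a);
	}

	if (!ret) {
		ww_acquire_done(&ctx);
		/* ... both locks held ... */
		ww_mutex_unlock(b);
		ww_mutex_unlock(a);
	}
	ww_acquire_fini(&ctx);
}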

2023-05-10 20:43:53

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > >
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > >
> > >
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> >
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> >
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
>
> Have you read the ww_mutex code? If not, please do so, it does similar
> things.
>
> The way it gets around the self-nesting check is by using the nest_lock
> annotation, the acquire context itself also has a dep_map for this
> purpose.

This might work.

I was confused for a good bit when reading the code to figure out how
it works - nest_lock seems to be a pretty bad name, it's really not a
lock. acquire_ctx?

2023-05-11 02:47:47

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base: tip/locking/core
patch link: https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: powerpc-randconfig-s042-20230509 (https://download.01.org/0day-ci/archive/20230511/[email protected]/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> lib/iov_iter.c:839:30: sparse: sparse: incompatible types in comparison expression (different type sizes):
>> lib/iov_iter.c:839:30: sparse: unsigned int *
>> lib/iov_iter.c:839:30: sparse: unsigned long *

vim +839 lib/iov_iter.c

825
826 size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
827 size_t bytes, struct iov_iter *i)
828 {
829 size_t ret = 0;
830
831 if (WARN_ON(offset + bytes > folio_size(folio)))
832 return 0;
833 if (WARN_ON_ONCE(!i->data_source))
834 return 0;
835
836 #ifdef CONFIG_HIGHMEM
837 while (bytes) {
838 struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
> 839 unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
840 unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
841
842 offset += r;
843 bytes -= r;
844 ret += r;
845
846 if (r != b)
847 break;
848 }
849 #else
850 ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
851 #endif
852
853 return ret;
854 }
855 EXPORT_SYMBOL(copy_folio_from_iter_atomic);
856

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
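
Sparse is complaining that min() on the flagged line is comparing two
differently sized unsigned types. One possible fix, untested, is to make the
comparison type explicit with min_t():

                unsigned b = min_t(size_t, bytes,
                                   PAGE_SIZE - (offset & PAGE_MASK));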

2023-05-11 08:38:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:

> > Have you read the ww_mutex code? If not, please do so, it does similar
> > things.
> >
> > The way it gets around the self-nesting check is by using the nest_lock
> > annotation, the acquire context itself also has a dep_map for this
> > purpose.
>
> This might work.
>
> I was confused for a good bit when reading the code to figure out how
> it works - nest_lock seems to be a pretty bad name, it's really not a
> lock. acquire_ctx?

That's just how ww_mutex uses it, the annotation itself comes from
mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
mmap_sem) is used to serialize multi acquisition of vma locks.

That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
and therefore by holding mmap_lock you serialize the whole thing and can
take them in any order you like.

Perhaps, now, all these many years later another name would've made more
sense, but I don't think it's worth the hassle of the tree-wide rename
(there's a few other users since).
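
Schematically, the annotation looks like this (a made-up example rather than
the real mm_take_all_locks(), which uses down_write_nest_lock() on the vma
rwsems): every inner lock is taken with the outer lock passed as nest_lock,
so lockdep only checks that the outer lock is held instead of complaining
about many same-class acquisitions:

#include <linux/list.h>
#include <linux/mutex.h>

struct outer {
        struct mutex     big_lock;      /* serializes "take them all" */
        struct list_head inners;
};

struct inner {
        struct mutex     lock;
        struct list_head list;
};

static void take_all_inner_locks(struct outer *o)
{
        struct inner *i;

        mutex_lock(&o->big_lock);
        list_for_each_entry(i, &o->inners, list)
                mutex_lock_nest_lock(&i->lock, &o->big_lock);

        /* ... every inner object is now locked ... */

        list_for_each_entry(i, &o->inners, list)
                mutex_unlock(&i->lock);
        mutex_unlock(&o->big_lock);
}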

2023-05-11 09:46:58

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Thu, May 11, 2023 at 10:25:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
>
> > > Have you read the ww_mutex code? If not, please do so, it does similar
> > > things.
> > >
> > > The way it gets around the self-nesting check is by using the nest_lock
> > > annotation, the acquire context itself also has a dep_map for this
> > > purpose.
> >
> > This might work.
> >
> > I was confused for a good bit when reading the code to figure out how
> > it works - nest_lock seems to be a pretty bad name, it's really not a
> > lock. acquire_ctx?
>
> That's just how ww_mutex uses it, the annotation itself comes from
> mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
> mmap_sem) is used to serialize multi acquisition of vma locks.
>
> That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
> anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
> and therefore by holding mmap_lock you serialize the whole thing and can
> take them in any order you like.
>
> Perhaps, now, all these many years later another name would've made more
> sense, but I don't think it's worth the hassle of the tree-wide rename
> (there's a few other users since).

Thanks for the history lesson :)

2023-05-12 21:02:40

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()

On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > >
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > >
> > >
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happened to that?
> >
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> >
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
>
> Have you read the ww_mutex code? If not, please do so, it does similar
> things.
>
> The way it gets around the self-nesting check is by using the nest_lock
> annotation, the acquire context itself also has a dep_map for this
> purpose.

So, spent some time plumbing this through the six lock code and seeing
how it'd work.

I like it in theory, it's got the right semantics and it would allow for
lockdep to check that we're not taking locks with more than one
btree_trans in the same thread. Unfortunately, we've got code paths that
are meant to be called from contexts that may or may not have a
btree_trans - and this is fine right now, because they just use
trylock(), but having to plumb nest_lock through is going to make a mess
of things.

(The relevant codepaths include shrinker callbacks, where we definitely
cannot just init a new btree_trans, and also the btree node write path
which can be kicked off from all sorts of places).

Can we go with lockdep_set_no_check_recursion() for now? It's a pretty
small addition to the lockdep code.

2023-05-15 11:21:24

by Askar Safin

[permalink] [raw]
Subject: Re: [PATCH 00/32] bcachefs - a new COW filesystem

Kent, please make sure you have dealt with the problems specific to another
fs: btrfs: https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/
. In particular, I dislike these btrfs problems mentioned in the
article:

- "Yes, you read that correctly—you mount the array using the name of
any given disk in the array. No, it doesn't matter which one"
- "Even though our array is technically "redundant," it refuses to
mount with /dev/vdc missing... In the worst-case scenario—a root
filesystem that itself is stored "redundantly" on btrfs-raid1 or
btrfs-raid10—the entire system refuses to boot... If you're thinking,
"Well, the obvious step here is just to always mount degraded," the
btrfs devs would like to have a word with you... If you lose a drive
from a conventional RAID array, or an mdraid array, or a ZFS zpool,
that array keeps on trucking without needing any special flags to
mount it. If you then add the failed drive back to the array, your
RAID manager will similarly automatically begin "resilvering" or
"rebuilding" the array... That, unfortunately, is not the case with
btrfs-native RAID"

I suggest reading the article in full, at least from section "Btrfs
RAID array management is a mess" till the end.

Please ensure that bcachefs does not have these problems! These problems
scare me away from btrfs.

Please CC me when answering.
--
Askar Safin

2023-05-16 15:53:18

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Because scalability of the global inode_hash_lock really, really
> > sucks.
> >
> > 32-way concurrent create on a couple of different filesystems
> > before:
> >
> >   - 52.13% 0.04% [kernel] [k] ext4_create
> >      - 52.09% ext4_create
> >         - 41.03% __ext4_new_inode
> >            - 29.92% insert_inode_locked
> >               - 25.35% _raw_spin_lock
> >                  - do_raw_spin_lock
> >                     - 24.97% __pv_queued_spin_lock_slowpath
> >
> >   - 72.33% 0.02% [kernel] [k] do_filp_open
> >      - 72.31% do_filp_open
> >         - 72.28% path_openat
> >            - 57.03% bch2_create
> >               - 56.46% __bch2_create
> >                  - 40.43% inode_insert5
> >                     - 36.07% _raw_spin_lock
> >                        - do_raw_spin_lock
> >                             35.86% __pv_queued_spin_lock_slowpath
> >                    4.02% find_inode
> >
> > Convert the inode hash table to a RCU-aware hash-bl table just like
> > the dentry cache. Note that we need to store a pointer to the
> > hlist_bl_head the inode has been added to in the inode so that when
> > it comes to unhash the inode we know what list to lock. We need to
> > do this because the hash value that is used to hash the inode is
> > generated from the inode itself - filesystems can provide this
> > themselves so we have to either store the hash or the head pointer
> > in the inode to be able to find the right list head for removal...
> >
> > Same workload after:
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > Cc: Alexander Viro <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> I have been maintaining this patchset uptodate in my own local trees
> and the code in this patch looks the same. The commit message above,
> however, has been mangled. The full commit message should be:
>
> vfs: inode cache conversion to hash-bl
>
> Because scalability of the global inode_hash_lock really, really
> sucks and prevents me from doing scalability characterisation and
> analysis of bcachefs algorithms.
>
> Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
> on a couple of different filesystems on a 5.10 kernel:
>
>   - 52.13% 0.04% [kernel] [k] ext4_create
>      - 52.09% ext4_create
>         - 41.03% __ext4_new_inode
>            - 29.92% insert_inode_locked
>               - 25.35% _raw_spin_lock
>                  - do_raw_spin_lock
>                     - 24.97% __pv_queued_spin_lock_slowpath
>
>
>   - 72.33% 0.02% [kernel] [k] do_filp_open
>      - 72.31% do_filp_open
>         - 72.28% path_openat
>            - 57.03% bch2_create
>               - 56.46% __bch2_create
>                  - 40.43% inode_insert5
>                     - 36.07% _raw_spin_lock
>                        - do_raw_spin_lock
>                             35.86% __pv_queued_spin_lock_slowpath
>                    4.02% find_inode
>
> btrfs was tested but it is limited by internal lock contention at
> >=2 threads on this workload, so never hammers the inode cache lock
> hard enough for this change to matter to its performance.
>
> However, both bcachefs and ext4 demonstrate poor scaling at >=8
> threads on concurrent lookup or create workloads.
>
> Hence convert the inode hash table to a RCU-aware hash-bl table just
> like the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because, unlike the dentry cache, the hash value that is
> used to hash the inode is not generated from the inode itself. i.e.
> filesystems can provide this themselves so we have to either store
> the hashval or the hlist head pointer in the inode to be able to
> find the right list head for removal...
>
> Concurrent create with varying thread count (files/s):
>
>                     ext4                bcachefs
> threads      vanilla   patched     vanilla   patched
>  2              117k      112k         80k       85k
>  4              185k      190k        133k      145k
>  8              303k      346k        185k      255k
> 16              389k      465k        190k      420k
> 32              360k      437k        142k      481k
>
> CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
> halved on the patched kernel, while performance has increased
> marginally on ext4 and massively on bcachefs. Internal filesystem
> algorithms now limit performance on these workloads, not the global
> inode_hash_lock.
>
> Profile of the workloads on the patched kernels:
>
>   - 35.94% 0.07% [kernel] [k] ext4_create
>      - 35.87% ext4_create
>         - 20.45% __ext4_new_inode
>           ...
>              3.36% insert_inode_locked
>
>   - 78.43% do_filp_open
>      - 78.36% path_openat
>         - 53.95% bch2_create
>            - 47.99% __bch2_create
>              ....
>               - 7.57% inode_insert5
>                 6.94% find_inode
>
> Spinlock contention is largely gone from the inode hash operations
> and the filesystems are limited by contention in their internal
> algorithms.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
>
> Other than that, the diffstat is the same and I don't see any obvious
> differences in the code comapred to what I've been running locally.

There's a bit of a backlog before I get around to looking at this but
it'd be great if we'd have a few reviewers for this change.
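
For reviewers who haven't used the list_bl API, a rough sketch of the
per-bucket bit-lock pattern the conversion is built on (the demo_* names are
invented; the actual patch stores the head pointer in struct inode for the
reason given in the commit message above):

#include <linux/list_bl.h>
#include <linux/rculist_bl.h>

#define DEMO_HASH_BITS 10

static struct hlist_bl_head demo_inode_hashtable[1 << DEMO_HASH_BITS];

struct demo_inode {
        struct hlist_bl_node  i_hash_node;
        struct hlist_bl_head *i_hash_head;      /* bucket we were added to */
};

static void demo_hash_inode(struct demo_inode *inode, struct hlist_bl_head *b)
{
        hlist_bl_lock(b);               /* bit 0 of the head pointer is the lock */
        hlist_bl_add_head_rcu(&inode->i_hash_node, b);
        inode->i_hash_head = b;         /* the hash isn't derivable from the inode alone */
        hlist_bl_unlock(b);
}

static void demo_unhash_inode(struct demo_inode *inode)
{
        struct hlist_bl_head *b = inode->i_hash_head;

        hlist_bl_lock(b);
        hlist_bl_del_rcu(&inode->i_hash_node);
        inode->i_hash_head = NULL;
        hlist_bl_unlock(b);
}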

2023-05-16 16:32:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> There's a bit of a backlog before I get around to looking at this but
> it'd be great if we'd have a few reviewers for this change.

It is well tested - it's been in the bcachefs tree for ages with zero
issues. I'm pulling it out of the bcachefs-prerequisites series though
since Dave's still got it in his tree, he's got a newer version with
better commit messages.

It's a significant performance boost on metadata heavy workloads for any
non-XFS filesystem, we should definitely get it in.

2023-05-16 23:24:29

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Tue, May 16, 2023 at 12:17:04PM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> > On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> > There's a bit of a backlog before I get around to looking at this but
> > it'd be great if we'd have a few reviewers for this change.
>
> It is well tested - it's been in the bcachefs tree for ages with zero
> issues. I'm pulling it out of the bcachefs-prerequisites series though
> since Dave's still got it in his tree, he's got a newer version with
> better commit messages.
>
> It's a significant performance boost on metadata heavy workloads for any
> non-XFS filesystem, we should definitely get it in.

I've got an up to date vfs-scale tree here (6.4-rc1) but I have not
been able to test it effectively right now because my local
performance test server is broken. I'll do what I can on the old
small machine that I have to validate it when I get time, but that
might be a few weeks away....

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

As it is, the inode hash-bl changes have zero impact on XFS because
it has its own highly scalable lockless, sharded inode cache. So
unless I'm explicitly testing ext4 or btrfs scalability (rare) it's
not getting a lot of scalability exercise. It is being used by the
root filesystems on all those test VMs, but that's about it...

That said, my vfs-scale tree also has Waiman Long's old dlist code
(per cpu linked list) which converts the sb inode list and removes
the global lock there. This does make a huge impact for XFS - the
current code limits inode cache cycling to about 600,000 inodes/sec
on >=16p machines. With dlists, however:

| 5.17.0 on a XFS filesystem with 50 million inodes in it on a 32p
| machine with a 1.6MIOPS/6.5GB/s block device.
|
| Fully concurrent full filesystem bulkstat:
|
|                wall time    sys time      IOPS   BW        rate
| unpatched:     1m56.035s    56m12.234s    8k     200MB/s   0.4M/s
| patched:       0m15.710s    3m45.164s     70k    1.9GB/s   3.4M/s
|
| Unpatched flat kernel profile:
|
| 81.97% [kernel] [k] __pv_queued_spin_lock_slowpath
| 1.84% [kernel] [k] do_raw_spin_lock
| 1.33% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
| 0.50% [kernel] [k] memset_erms
| 0.42% [kernel] [k] do_raw_spin_unlock
| 0.42% [kernel] [k] xfs_perag_get
| 0.40% [kernel] [k] xfs_buf_find
| 0.39% [kernel] [k] __raw_spin_lock_init
|
| Patched flat kernel profile:
|
| 10.90% [kernel] [k] do_raw_spin_lock
| 7.21% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
| 3.16% [kernel] [k] xfs_buf_find
| 3.06% [kernel] [k] rcu_segcblist_enqueue
| 2.73% [kernel] [k] memset_erms
| 2.31% [kernel] [k] __pv_queued_spin_lock_slowpath
| 2.15% [kernel] [k] __raw_spin_lock_init
| 2.15% [kernel] [k] do_raw_spin_unlock
| 2.12% [kernel] [k] xfs_perag_get
| 1.93% [kernel] [k] xfs_btree_lookup

Cheers,

Dave.
--
Dave Chinner
[email protected]
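
The dlist idea, in very rough outline (a sketch of the concept only, not
Waiman Long's actual dlock-list code): shard the global list into a per-CPU
list and lock, so adds and removes from different CPUs stop contending on one
lock and cache line, while a full walk simply visits every shard:

#include <linux/cpumask.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct pcpu_list_shard {
        spinlock_t       lock;
        struct list_head head;
};

struct pcpu_list {
        struct pcpu_list_shard __percpu *shards; /* alloc_percpu() + init elided */
};

struct pcpu_list_node {
        struct list_head node;
        spinlock_t      *lock;  /* lock of the shard we were added to */
};

static void pcpu_list_add(struct pcpu_list *l, struct pcpu_list_node *n)
{
        struct pcpu_list_shard *s = get_cpu_ptr(l->shards);

        spin_lock(&s->lock);
        list_add(&n->node, &s->head);
        n->lock = &s->lock;     /* remember the shard for deletion */
        spin_unlock(&s->lock);
        put_cpu_ptr(l->shards);
}

static void pcpu_list_del(struct pcpu_list_node *n)
{
        spin_lock(n->lock);
        list_del_init(&n->node);
        spin_unlock(n->lock);
}

/* A full walk (e.g. eviction) takes each shard lock in turn. */
static void pcpu_list_walk(struct pcpu_list *l,
                           void (*fn)(struct pcpu_list_node *))
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct pcpu_list_shard *s = per_cpu_ptr(l->shards, cpu);
                struct pcpu_list_node *n, *tmp;

                spin_lock(&s->lock);
                list_for_each_entry_safe(n, tmp, &s->head, node)
                        fn(n);
                spin_unlock(&s->lock);
        }
}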

2023-05-22 13:30:09

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Wed, May 17, 2023 at 09:15:34AM +1000, Dave Chinner wrote:
> On Tue, May 16, 2023 at 12:17:04PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 05:45:19PM +0200, Christian Brauner wrote:
> > > On Wed, May 10, 2023 at 02:45:57PM +1000, Dave Chinner wrote:
> > > There's a bit of a backlog before I get around to looking at this but
> > > it'd be great if we'd have a few reviewers for this change.
> >
> > It is well tested - it's been in the bcachefs tree for ages with zero
> > issues. I'm pulling it out of the bcachefs-prerequisites series though
> > since Dave's still got it in his tree, he's got a newer version with
> > better commit messages.
> >
> > It's a significant performance boost on metadata heavy workloads for any
> > non-XFS filesystem, we should definitely get it in.
>
> I've got an up to date vfs-scale tree here (6.4-rc1) but I have not
> been able to test it effectively right now because my local
> performance test server is broken. I'll do what I can on the old
> small machine that I have to validate it when I get time, but that
> might be a few weeks away....
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
>
> As it is, the inode hash-bl changes have zero impact on XFS because
> it has its own highly scalable lockless, sharded inode cache. So
> unless I'm explicitly testing ext4 or btrfs scalability (rare) it's
> not getting a lot of scalability exercise. It is being used by the
> root filesystems on all those test VMs, but that's about it...

I think there's a bunch of perf tests being run on -next. So we can
stuff it into a vfs.unstable.* branch and see what -next thinks of this
performance wise.

2023-05-23 09:39:54

by Christian Brauner

[permalink] [raw]
Subject: Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> Because scalability of the global inode_hash_lock really, really
> sucks.
>
> 32-way concurrent create on a couple of different filesystems
> before:
>
>   - 52.13% 0.04% [kernel] [k] ext4_create
>      - 52.09% ext4_create
>         - 41.03% __ext4_new_inode
>            - 29.92% insert_inode_locked
>               - 25.35% _raw_spin_lock
>                  - do_raw_spin_lock
>                     - 24.97% __pv_queued_spin_lock_slowpath
>
> [...]

This is interesting completely independent of bcachefs so we should give
it some testing.

I updated a few places that had outdated comments.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[22/32] vfs: inode cache conversion to hash-bl
https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1

2023-05-23 09:43:53

by Christian Brauner

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Tue, 09 May 2023 12:56:45 -0400, Kent Overstreet wrote:
> In preparation for changing the inode hash table implementation.
>
>

This is interesting completely independent of bcachefs so we should give
it some testing.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[20/32] vfs: factor out inode hash head calculation
https://git.kernel.org/vfs/vfs/c/b54a4516146d

2023-05-23 09:45:39

by Christian Brauner

[permalink] [raw]
Subject: Re: (subset) [PATCH 21/32] hlist-bl: add hlist_bl_fake()

On Tue, 09 May 2023 12:56:46 -0400, Kent Overstreet wrote:
> in preparation for switching the VFS inode cache over to the hlist_bl
> lists, we need to be able to fake a list node that looks like it is
> hashed for correct operation of filesystems that don't directly use
> the VFS inode cache.
>
>

This is interesting completely independent of bcachefs so we should give
it some testing.

---

Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.unstable.inode-hash

[21/32] hlist-bl: add hlist_bl_fake()
https://git.kernel.org/vfs/vfs/c/0ef99590b01f
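
Going by the patch title and the existing non-bl helpers (hlist_add_fake()
and hlist_fake() in include/linux/list.h), the new helper presumably makes a
node's pprev point at its own next field so it passes hashed-ness checks
without being on any real chain; something along these lines (a guess at the
shape, not a quote of the patch, which may define only one of these):

/* Guessed shape, mirroring hlist_add_fake()/hlist_fake(): */
static inline void hlist_bl_add_fake(struct hlist_bl_node *n)
{
        n->pprev = &n->next;
}

static inline bool hlist_bl_fake(struct hlist_bl_node *n)
{
        return n->pprev == &n->next;
}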

2023-05-23 13:42:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > From: Kent Overstreet <[email protected]>
> > >
> > > This is used by bcachefs to fix a page cache coherency issue with
> > > O_DIRECT writes.
> > >
> > > Also relevant: mapping->invalidate_lock, see below.
> > >
> > > O_DIRECT writes (and other filesystem operations that modify file data
> > > while bypassing the page cache) need to shoot down ranges of the page
> > > cache - and additionally, need locking to prevent those pages from
> > > being pulled back in.
> > >
> > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > and the page fault handler will need to take that same lock - this is a
> > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > writing to and uses those pages for the buffer to write from, and it's a
> > > lock ordering deadlock in general.
> > >
> > > Thus we need a way to signal from the dio code to the page fault handler
> > > when we already are holding the pagecache add lock on an address space -
> > > this patch just adds a member to task_struct for this purpose. For now
> > > only bcachefs is implementing this locking, though it may be moved out
> > > of bcachefs and made available to other filesystems in the future.
> >
> > It would be nice to have at least a link to the code that's actually using
> > the field you are adding.
>
> Bit of a trick to link to a _later_ patch in the series from a commit
> message, but...
>
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

Thanks and I'm sorry for the delay.

> > Also I think we were already through this discussion [1] and we ended up
> > agreeing that your scheme actually solves only the AA deadlock but a
> > malicious userspace can easily create AB BA deadlock by running direct IO
> > to file A using mapped file B as a buffer *and* direct IO to file B using
> > mapped file A as a buffer.
>
> No, that's definitely handled (and you can see it in the code I linked),
> and I wrote a torture test for fstests as well.

I've checked the code and AFAICT it is all indeed handled. BTW, I've now
remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
way (by prefaulting pages from the iter before grabbing the problematic
lock and then disabling page faults for the iomap_dio_rw() call). I guess
we should somehow unify these schemes so that we don't have two mechanisms
for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Also good that you've written a fstest for this, that is definitely a useful
addition, although I suspect GFS2 guys added a test for this not so long
ago when testing their stuff. Maybe they have a pointer handy?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
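
For comparison, the gfs2-style scheme described above looks roughly like this
(heavily simplified; demo_fs_lock() and do_the_direct_io() are hypothetical
stand-ins for gfs2's glock and iomap_dio_rw() calls, and the real code bounds
how much of the buffer it prefaults per iteration):

#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/uio.h>

static ssize_t demo_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t ret;

        do {
                /* Fault the user buffer in while no fs locks are held. */
                if (fault_in_iov_iter_readable(from, iov_iter_count(from)) ==
                    iov_iter_count(from))
                        return -EFAULT;

                demo_fs_lock(iocb->ki_filp);            /* hypothetical */

                pagefault_disable();
                ret = do_the_direct_io(iocb, from);     /* hypothetical */
                pagefault_enable();

                demo_fs_unlock(iocb->ki_filp);

                /*
                 * -EFAULT here means a page was evicted between the
                 * prefault and the copy; fault it in again and retry.
                 */
        } while (ret == -EFAULT);

        return ret;
}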

2023-05-23 16:33:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

generic/708 is the btrfs version of this.

But I think all of the file systems that have this deadlock are actually
fundamentally broken because they have a messed-up locking hierarchy
where page faults take the same lock that is held over the direct I/O
operation. And the right thing is to fix this. I have work in progress
for btrfs, and something similar should apply to gfs2, with the added
complication that it probably means a revision to their network
protocol.

I'm absolutely not in favour of adding workarounds for these kinds of locking
problems to the core kernel. I already feel bad for allowing the
small workaround in iomap for btrfs, as just fixing the locking back
then would have avoided massive ratholing.

2023-05-23 17:00:27

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue, May 23, 2023 at 09:21:56AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation. And the right thing is to fix this. I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

No, this is fundamentally because userspace controls the ordering of
locking because the buffer passed to dio can point into any address
space. You can't solve this by changing the locking hierarchy.

If you want to be able to have locking around adding things to the
pagecache so that things that bypass the pagecache can prevent
inconsistencies (and we do, the big one is fcollapse), and if you want
dio to be able to use that same locking (because otherwise dio will also
cause page cache inconsistency), this is the way to do it.
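
Roughly, the scheme the new task_struct field enables (heavily simplified
from the bcachefs code linked earlier in the thread; the pagecache_* helpers
and do_dio() are invented names, and the field is assumed here to be a
struct address_space pointer): the DIO path records which mapping it already
holds the pagecache-add lock on before it can recurse into the fault handler
via gup, and the fault handler consults that instead of blindly blocking:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/uio.h>

/* DIO write path: */
static ssize_t demo_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
        struct address_space *mapping = iocb->ki_filp->f_mapping;
        ssize_t ret;

        pagecache_block_get(mapping);           /* hypothetical: excludes page cache adds */
        current->faults_disabled_mapping = mapping;

        ret = do_dio(iocb, from);               /* may fault in the user buffer */

        current->faults_disabled_mapping = NULL;
        pagecache_block_put(mapping);
        return ret;
}

/* Page fault path: */
static vm_fault_t demo_page_fault(struct vm_fault *vmf)
{
        struct address_space *mapping = vmf->vma->vm_file->f_mapping;
        struct address_space *fdm = current->faults_disabled_mapping;

        if (fdm == mapping)
                return VM_FAULT_SIGBUS; /* AA: the DIO buffer points into the file being written */

        /*
         * The AB-BA case (DIO to A buffered in a mapping of B, and vice
         * versa from another thread) is what the linked bcachefs code
         * handles by comparing the two mapping pointers and only taking
         * the locks in one globally consistent order.
         */

        pagecache_add_get(mapping);             /* hypothetical: shared with other faults */
        /* ... fill and map the page ... */
        pagecache_add_put(mapping);
        return 0;
}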

2023-05-23 17:20:44

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Oof, that sounds a bit sketchy. What happens if the dio call passes in
an address from the same address space? What happens if we race with the
pages we faulted in being evicted?

> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

More tests more good.

So if we want to lift this scheme to the VFS layer, we'd start by
replacing the lock you added (grepping for it, the name escapes me) with
a different type of lock - two_state_shared_lock in my code, it's like a
rw lock except writers don't exclude other writers. That way the DIO
path can use it without singlethreading writes to a single file.
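
For context, a minimal sketch of a two-state shared lock of the kind being
described (not the bcachefs implementation, and deliberately ignoring the
fairness problem raised later in the thread): holders of the same state share
the lock, and the two states exclude each other:

#include <linux/atomic.h>
#include <linux/wait.h>

struct two_state_lock {
        atomic_long_t     v;    /* > 0: state-1 holders, < 0: state-0 holders */
        wait_queue_head_t wait;
};

static void two_state_lock_init(struct two_state_lock *l)
{
        atomic_long_set(&l->v, 0);
        init_waitqueue_head(&l->wait);
}

static bool two_state_trylock(struct two_state_lock *l, int state)
{
        long old = atomic_long_read(&l->v);
        long new;

        do {
                if (state ? old < 0 : old > 0)
                        return false;   /* held in the other state */
                new = old + (state ? 1 : -1);
        } while (!atomic_long_try_cmpxchg(&l->v, &old, new));

        return true;
}

static void two_state_lock(struct two_state_lock *l, int state)
{
        wait_event(l->wait, two_state_trylock(l, state));
}

static void two_state_unlock(struct two_state_lock *l, int state)
{
        atomic_long_add(state ? -1 : 1, &l->v);
        wake_up_all(&l->wait);
}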

2023-05-23 22:58:10

by Dave Chinner

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Tue, May 23, 2023 at 11:27:06AM +0200, Christian Brauner wrote:
> On Tue, 09 May 2023 12:56:45 -0400, Kent Overstreet wrote:
> > In preparation for changing the inode hash table implementation.
> >
> >
>
> This is interesting completely independent of bcachefs so we should give
> it some testing.
>
> ---
>
> Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.unstable.inode-hash
>
> [20/32] vfs: factor out inode hash head calculation
> https://git.kernel.org/vfs/vfs/c/b54a4516146d

Hi Christian - I suspect you should pull the latest version of these
patches from:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

The commit messages are more recent and complete, and I've been
testing the branch in all my test kernels since 6.4-rc1 without
issues.

There's also the dlist-lock stuff for avoiding s_inode_list_lock
contention in that branch. Once the global hash lock is removed,
the s_inode_list_lock is the only global lock in the inode
instantiation and reclaim paths. It nests inside the hash locks, so
all the contention is currently taken on the hash locks - remove the
global hash locks and we just contend on the next global cache
line and the workload doesn't go any faster.

i.e. to see the full benefit of the inode hash lock contention
reduction, we also need the sb->s_inode_list_lock contention to be
fixed....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-05-24 06:47:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> Hi Christian - I suspect you should pull the latest version of these
> patches from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
>
> The commit messages are more recent and complete, and I've been
> testing the branch in all my test kernels since 6.4-rc1 without
> issues.

Can you please send the series to linux-fsdevel for review?

2023-05-24 06:56:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> No, this is fundamentally because userspace controls the ordering of
> locking because the buffer passed to dio can point into any address
> space. You can't solve this by changing the locking hierarchy.
>
> If you want to be able to have locking around adding things to the
> pagecache so that things that bypass the pagecache can prevent
> inconsistencies (and we do, the big one is fcollapse), and if you want
> dio to be able to use that same locking (because otherwise dio will also
> cause page cache inconsistency), this is the way to do it.

Well, it seems like you are talking about something else than the
existing cases in gfs2 and btrfs, that is you want full consistency
between direct I/O and buffered I/O. That's something nothing in the
kernel has ever provided, so I'd be curious why you think you need it
and want different semantics from everyone else?

2023-05-24 07:40:05

by Dave Chinner

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Tue, May 23, 2023 at 11:44:03PM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> > Hi Christian - I suspect you should pull the latest version of these
> > patches from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> >
> > The commit messages are more recent and complete, and I've been
> > testing the branch in all my test kernels since 6.4-rc1 without
> > issues.
>
> Can you please send the series to linux-fsdevel for review?

When it gets back to the top of my priority pile. Last time I sent
it there was zero interest in reviewing it from fs/vfs developers
but it attracted lots of obnoxious shouting from some RTPREEMPT
people about using bit locks. If there's interest in getting it
merged, then I can add it to my backlog of stuff to do...

As it is, I'm buried layers deep right now, so I really have no
bandwidth to deal with this in the foreseeable future. The code is
there, it works just fine, if you want to push it through the
process of getting it merged, you're more than welcome to do so.

-Dave.
--
Dave Chinner
[email protected]

2023-05-24 08:12:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue, May 23, 2023 at 11:43:32PM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> > No, this is fundamentally because userspace controls the ordering of
> > locking because the buffer passed to dio can point into any address
> > space. You can't solve this by changing the locking hierarchy.
> >
> > If you want to be able to have locking around adding things to the
> > pagecache so that things that bypass the pagecache can prevent
> > inconsistencies (and we do, the big one is fcollapse), and if you want
> > dio to be able to use that same locking (because otherwise dio will also
> > cause page cache inconsistency), this is the way to do it.
>
> Well, it seems like you are talking about something else than the
> existing cases in gfs2 and btrfs, that is you want full consistency
> between direct I/O and buffered I/O. That's something nothing in the
> kernel has ever provided, so I'd be curious why you think you need it
> and want different semantics from everyone else?

Because I like code that is correct.

2023-05-24 08:33:12

by Christian Brauner

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Wed, May 24, 2023 at 05:35:02PM +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 11:44:03PM -0700, Christoph Hellwig wrote:
> > On Wed, May 24, 2023 at 08:53:22AM +1000, Dave Chinner wrote:
> > > Hi Christian - I suspect you should pull the latest version of these
> > > patches from:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale
> > >
> > > The commit messages are more recent and complete, and I've been
> > > testing the branch in all my test kernels since 6.4-rc1 without
> > > issues.
> >
> > Can you please send the series to linux-fsdevel for review?
>
> When it gets back to the top of my priority pile. Last time I sent
> it there was zero interest in reviewing it from fs/vfs developers
> but it attracted lots of obnoxious shouting from some RTPREEMPT
> people about using bit locks. If there's interest in getting it

I think there is given that it seems to have nice perf gains.

> merged, then I can add it to my backlog of stuff to do...
>
> As it is, I'm buried layers deep right now, so I really have no
> bandwidth to deal with this in the foreseeable future. The code is
> there, it works just fine, if you want to push it through the
> process of getting it merged, you're more than welcome to do so.

I'm here to help get more review done and pick stuff up. I won't be able
to do it without additional reviewers such as Christoph helping of
course as this isn't a one-man show.

Let's see if we can get this reviewed. If you have the bandwidth to send
it to fsdevel that'd be great. If it takes you a while to get back to it
then that's fine too.

2023-05-24 09:22:39

by Kent Overstreet

[permalink] [raw]
Subject: Re: (subset) [PATCH 20/32] vfs: factor out inode hash head calculation

On Wed, May 24, 2023 at 10:31:14AM +0200, Christian Brauner wrote:
> I'm here to help get more review done and pick stuff up. I won't be able
> to do it without additional reviewers such as Christoph helping of
> course as this isn't a one-man show.
>
> Let's see if we can get this reviewed. If you have the bandwidth to send
> it to fsdevel that'd be great. If it takes you a while to get back to it
> then that's fine too.

These patches really should have my Reviewed-by on them already; I
stared at them quite a bit (and fs/inode.c in general) a while back.

(I was attempting to convert fs/inode.c to an rhashtable up until I
realized the inode lifetime rules are completely insane, so when I saw
Dave's much simpler approach I was _more_ than happy to not have to
contemplate that mess anymore...)

2023-05-25 09:09:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > Well, it seems like you are talking about something else than the
> > existing cases in gfs2 and btrfs, that is you want full consistency
> > between direct I/O and buffered I/O. That's something nothing in the
> > kernel has ever provided, so I'd be curious why you think you need it
> > and want different semantics from everyone else?
>
> Because I like code that is correct.

Well, start with explaining your definition of correctness, why everyone
else is "not correct", an how you can help fixing this correctness
problem in the existing kernel. Thanks for your cooperation!

2023-05-25 09:10:36

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > No, that's definitely handled (and you can see it in the code I linked),
> > > and I wrote a torture test for fstests as well.
> >
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Oof, that sounds a bit sketchy. What happens if the dio call passes in
> an address from the same address space?

If we submit direct IO that uses mapped file F at offset O as a buffer for
direct IO from file F, offset O, it will currently livelock in an
indefinite retry loop. It should rather return error or fall back to
buffered IO. But that should be fixable. Andreas?

But if the buffer and direct IO range does not overlap, it will just
happily work - iomap_dio_rw() invalidates only the range direct IO is done
to.

> What happens if we race with the pages we faulted in being evicted?

We fault them in again and retry.

> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> More tests more good.
>
> So if we want to lift this scheme to the VFS layer, we'd start by
> replacing the lock you added (grepping for it, the name escapes me) with
> a different type of lock - two_state_shared_lock in my code, it's like a
> rw lock except writers don't exclude other writers. That way the DIO
> path can use it without singlethreading writes to a single file.

Yes, I've noticed that you are introducing in bcachefs a lock with very
similar semantics to mapping->invalidate_lock, just with this special lock
type. What I'm kind of worried about with two_state_shared_lock as
implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
heavily faulting pages on a file, direct IO to that file can be starved
indefinitely. That is IMHO not a good thing and I would not like to use
this type of lock in VFS until this problem is resolved. But it should be
fixable e.g. by introducing some kind of deadline for a waiter after which
it will block acquisitions of the other lock state.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-05-25 20:54:41

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O. That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> >
> > Because I like code that is correct.
>
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel. Thanks for your cooperation!

A cache that isn't actually consistent is a _bug_. You're being
obsequious. And any time this has come up in previous discussions
(including at LSF), that was never up for debate; the only question has
been whether it was even possible to practically fix it.

The DIO code recognizes cache incoherency as something to be avoided by
shooting down the page cache both at the beginning of the IO _and again
at the end_. That's the kind of obvious hackery for a race condition
that we would like to avoid.

Regarding the consequences of this kind of bug - stale data exposed to
userspace, possibly stale data overwriting a write we acked, and, worse,
any filesystem state that hangs off the page cache being inconsistent
with the data on disk.

And look, we've been over all this before, so I don't see what this adds
to the discussion.
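
The before-and-after shootdown being referred to is, schematically, this
(simplified; submit_the_dio() is a hypothetical stand-in for the real
iomap/direct-io submission path, and the error handling differs in the real
code):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t demo_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
        struct address_space *mapping = iocb->ki_filp->f_mapping;
        loff_t pos = iocb->ki_pos;
        loff_t end = pos + iov_iter_count(from) - 1;
        ssize_t ret;

        /* Shoot down cached pages over the range before issuing the IO... */
        ret = filemap_write_and_wait_range(mapping, pos, end);
        if (ret)
                return ret;
        ret = invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                            end >> PAGE_SHIFT);
        if (ret)
                return ret;     /* real callers typically fall back to buffered IO */

        ret = submit_the_dio(iocb, from);       /* hypothetical */

        /*
         * ...and again afterwards, because a read or page fault may have
         * pulled pages back in while the IO was in flight - the race
         * window under discussion here.
         */
        invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                      end >> PAGE_SHIFT);
        return ret;
}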

2023-05-25 21:44:08

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Thu, May 25, 2023 at 10:47:31AM +0200, Jan Kara wrote:
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?
>
> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.

*nod*

readahead triggered from the page fault path is another consideration.
No idea how that interacts with the gfs2 method; IIRC there's a hack in
the page fault path that says somewhere "we may be getting called via
gup(), don't invoke readahead".

We could potentially kill that hack if we lifted this to the VFS layer.

>
> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code, it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.

Yeah, my two_state_shared_lock is definitely at the quick-and-dirty
prototype level; the implementation would need work. Lockdep support
would be another hard requirement.

The deadline might be a good idea, OTOH it'd want tuning. Maybe
something like what rwsem does where we block new read acquirers if
there's a writer waiting would work.

2023-05-25 21:51:48

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O. That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> >
> > Because I like code that is correct.
>
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel. Thanks for your cooperation!

BTW, if you wanted a more serious answer, just asking for the commit
message to be expanded would be a better way to ask...

2023-05-25 22:38:53

by Andreas Grünbacher

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

Am Di., 23. Mai 2023 um 18:28 Uhr schrieb Christoph Hellwig <[email protected]>:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation. And the right thing is to fix this. I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

We do disable page faults, and there can be deadlocks in page fault
handlers while no page faults are allowed.

I'm roughly aware of the locking hierarchy that other filesystems use,
and that's something we want to avoid because of two reasons: (1) it
would be an incompatible change, and (2) we want to avoid cluster-wide
locking operations as much as possible because they are very slow.

These kinds of locking conflicts are so rare in practice that the
theoretical inefficiency of having to retry the operation doesn't
matter.

> I'm absolutely not in favour of adding workarounds for these kinds of locking
> problems to the core kernel. I already feel bad for allowing the
> small workaround in iomap for btrfs, as just fixing the locking back
> then would have avoided massive ratholing.

Please let me know when those btrfs changes are in a presentable shape ...

Thanks,
Andreas

2023-05-25 22:59:39

by Andreas Grünbacher

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

Am Do., 25. Mai 2023 um 10:56 Uhr schrieb Jan Kara <[email protected]>:
> On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > > No, that's definitely handled (and you can see it in the code I linked),
> > > > and I wrote a torture test for fstests as well.
> > >
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Oof, that sounds a bit sketchy. What happens if the dio call passes in
> > an address from the same address space?
>
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?

Yes, I guess. Thanks for the heads-up.

Andreas

> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.
>
> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written a fstest for this, that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code, it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2023-05-25 22:59:54

by Andreas Grünbacher

[permalink] [raw]
Subject: Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

Am Di., 23. Mai 2023 um 15:37 Uhr schrieb Jan Kara <[email protected]>:
> On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > > From: Kent Overstreet <[email protected]>
> > > >
> > > > This is used by bcachefs to fix a page cache coherency issue with
> > > > O_DIRECT writes.
> > > >
> > > > Also relevant: mapping->invalidate_lock, see below.
> > > >
> > > > O_DIRECT writes (and other filesystem operations that modify file data
> > > > while bypassing the page cache) need to shoot down ranges of the page
> > > > cache - and additionally, need locking to prevent those pages from
> > > > being pulled back in.
> > > >
> > > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > > and the page fault handler will need to take that same lock - this is a
> > > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > > writing to and uses those pages for the buffer to write from, and it's a
> > > > lock ordering deadlock in general.
> > > >
> > > > Thus we need a way to signal from the dio code to the page fault handler
> > > > when we already are holding the pagecache add lock on an address space -
> > > > this patch just adds a member to task_struct for this purpose. For now
> > > > only bcachefs is implementing this locking, though it may be moved out
> > > > of bcachefs and made available to other filesystems in the future.
> > >
> > > It would be nice to have at least a link to the code that's actually using
> > > the field you are adding.
> >
> > Bit of a trick to link to a _later_ patch in the series from a commit
> > message, but...
> >
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
>
> Thanks and I'm sorry for the delay.
>
> > > Also I think we were already through this discussion [1] and we ended up
> > > agreeing that your scheme actually solves only the AA deadlock but a
> > > malicious userspace can easily create AB BA deadlock by running direct IO
> > > to file A using mapped file B as a buffer *and* direct IO to file B using
> > > mapped file A as a buffer.
> >
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written an fstest for this; that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

Ah yes, that's xfstests commit d3cbdabf ("generic: Test page faults
during read and write").

Thanks,
Andreas

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2023-05-25 23:31:51

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig <[email protected]> wrote:
> > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > >
> > > Also good that you've written an fstest for this; that is definitely a useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > generic/708 is the btrfs version of this.
> >
> > But I think all of the file systems that have this deadlock are actually
> > fundamentally broken because they have a messed-up locking hierarchy
> > where page faults take the same lock that is held over the direct I/O
> > operation. And the right thing is to fix this. I have work in progress
> > for btrfs, and something similar should apply to gfs2, with the added
> > complication that it probably means a revision to their network
> > protocol.
>
> We do disable page faults, and there can be deadlocks in page fault
> handlers while no page faults are allowed.
>
> I'm roughly aware of the locking hierarchy that other filesystems use,
> and that's something we want to avoid because of two reasons: (1) it
> would be an incompatible change, and (2) we want to avoid cluster-wide
> locking operations as much as possible because they are very slow.
>
> These kinds of locking conflicts are so rare in practice that the
> theoretical inefficiency of having to retry the operation doesn't
> matter.

Would you be willing to expand on that? I'm wondering if this would
simplify things for gfs2, but you mention locking hierarchy being an
incompatible change - how does that work?

>
> > I'm absolutely not in favour of adding workarounds for this kind of locking
> > problem to the core kernel. I already feel bad for allowing the
> > small workaround in iomap for btrfs, as just fixing the locking back
> > then would have avoided massive ratholing.
>
> Please let me know when those btrfs changes are in a presentable shape ...

I would also be curious to know what btrfs needs and what the approach
is there.

2023-05-26 00:31:48

by Andreas Grünbacher

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Fri, May 26, 2023 at 1:20 AM Kent Overstreet
<[email protected]> wrote:
> On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> > On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig <[email protected]> wrote:
> > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > > way (by prefaulting pages from the iter before grabbing the problematic
> > > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > > we should somehow unify these schemes so that we don't have two mechanisms
> > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > > >
> > > > Also good that you've written an fstest for this; that is definitely a useful
> > > > addition, although I suspect GFS2 guys added a test for this not so long
> > > > ago when testing their stuff. Maybe they have a pointer handy?
> > >
> > > generic/708 is the btrfs version of this.
> > >
> > > But I think all of the file systems that have this deadlock are actually
> > > fundamentally broken because they have a messed-up locking hierarchy
> > > where page faults take the same lock that is held over the direct I/O
> > > operation. And the right thing is to fix this. I have work in progress
> > > for btrfs, and something similar should apply to gfs2, with the added
> > > complication that it probably means a revision to their network
> > > protocol.
> >
> > We do disable page faults, and there can be deadlocks in page fault
> > handlers while no page faults are allowed.
> >
> > I'm roughly aware of the locking hierarchy that other filesystems use,
> > and that's something we want to avoid because of two reasons: (1) it
> > would be an incompatible change, and (2) we want to avoid cluster-wide
> > locking operations as much as possible because they are very slow.
> >
> > These kinds of locking conflicts are so rare in practice that the
> > theoretical inefficiency of having to retry the operation doesn't
> > matter.
>
> Would you be willing to expand on that? I'm wondering if this would
> simplify things for gfs2, but you mention locking hierarchy being an
> incompatible change - how does that work?

Oh, it's just that gfs2 uses one dlm lock per inode to control access
to that inode. In the code, this is called the "inode glock" ---
glocks being an abstraction above dlm locks --- but it boils down to
dlm locks in the end. An additional layer of locking will only work
correctly if all cluster nodes use the new locks consistently, so old
cluster nodes will become incompatible. Those kinds of changes are
hard.

But the additional lock taking would also hurt performance, forever,
and I'd really like to avoid taking that hit.

It may not be obvious to everyone, but allowing page faults during
reads and writes (i.e., while holding dlm locks) can lead to
distributed deadlocks. We cannot just pretend to be a local
filesystem.

Thanks,
Andreas

> > > I'm absolutely not in favour of adding workarounds for this kind of locking
> > > problem to the core kernel. I already feel bad for allowing the
> > > small workaround in iomap for btrfs, as just fixing the locking back
> > > then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.

2023-05-26 00:47:57

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
>
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
>
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.

Ah, gotcha. Same basic issue then, dio -> page fault means you're taking
two DLM locks simultaneously, so without lock ordering you get deadlock.

So that means promoting this to the VFS won't be useful to you
(because the VFS lock will be a local lock, not your DLM lock).

Do you have trylock() and a defined lock ordering you can check for your
glocks (inode number)? Here's the new and expanded commit message, in
case my approach is of interest to you:

commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet <[email protected]>
Date: Wed Oct 16 15:03:50 2019 -0400

sched: Add task_struct->faults_disabled_mapping

There has been a long standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to.

For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.

However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads cache disk block mappings (!) and attach them
to the pagecache, and bcachefs attaches disk reservations to pagecache
pages.

This issue has been hard to fix, because
- we need to add a lock (henceforth called pagecache_add_lock), which
would be held for the duration of the direct IO
- page faults add pages to the page cache, thus need to take the same
lock
- dio -> gup -> page fault thus can deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical, this adds a field to task_struct.

Then the full method is:
- in the dio path, when we take the first pagecache_add_lock, note the
mapping in task_struct
- in the page fault handler, if faults_disabled_mapping is set, we
check if it's the same mapping as the one we're taking a page fault for,
and if so return an error.

Then we check lock ordering: if there's a lock ordering violation and
trylock fails, we'll have to cycle the locks and return an error that
tells the DIO path to retry: faults_disabled_mapping is also used for
signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive with
each other - we'll need a new type of lock to pair with this approach.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: [email protected]
Cc: Andreas Grünbacher <[email protected]>
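
To make the scheme above concrete, here is a small userspace model of the
two code paths the commit message describes. Everything in it is
illustrative: the struct, field and function names are stand-ins, a plain
rwlock stands in for pagecache_add_lock (the real thing wants the
two-state lock discussed earlier in the thread so that DIOs don't exclude
each other), and ordering by inode number is just one possible ordering
rule.

#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct mapping {
	unsigned long		inum;			/* used for lock ordering */
	pthread_rwlock_t	pagecache_add_lock;	/* stand-in for the real lock */
};

/* stand-in for the new task_struct->faults_disabled_mapping field */
static __thread struct mapping *faults_disabled_mapping;

/* DIO path: block pagecache adds to this mapping for the duration of the IO */
static void dio_lock_pagecache(struct mapping *m)
{
	pthread_rwlock_wrlock(&m->pagecache_add_lock);
	faults_disabled_mapping = m;
}

static void dio_unlock_pagecache(struct mapping *m)
{
	faults_disabled_mapping = NULL;
	pthread_rwlock_unlock(&m->pagecache_add_lock);
}

/*
 * Page fault path (reached via gup() from the DIO): must take the other
 * side of the lock before it may add pages to m's pagecache.
 */
static int fault_lock_pagecache(struct mapping *m)
{
	struct mapping *held = faults_disabled_mapping;

	if (!held) {
		/* ordinary fault, no DIO in flight on this task */
		pthread_rwlock_rdlock(&m->pagecache_add_lock);
		return 0;
	}

	if (held == m)
		return -EFAULT;	/* AA case: DIO buffer is mmapped from the same file */

	/*
	 * AB case: fault on some other mapping while a DIO lock is held.
	 * If the (illustrative) "lowest inum first" ordering is respected
	 * we may block; otherwise only trylock, and on failure tell the
	 * DIO path to drop its lock and retry.
	 */
	if (held->inum < m->inum) {
		pthread_rwlock_rdlock(&m->pagecache_add_lock);
		return 0;
	}
	if (pthread_rwlock_tryrdlock(&m->pagecache_add_lock))
		return -EAGAIN;
	return 0;
}

The -EFAULT return models the AA case (gup fails and the DIO path falls
back), and -EAGAIN models the "locks were dropped, please retry"
signalling described above.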

2023-05-26 08:18:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> A cache that isn't actually consistent is a _bug_. You're being
> obsequious. And any time this has come up in previous discussions
> (including at LSF), that was never up for debate, the only question has
> been whether it was even possible to practically fix it.

That is not my impression. But again, if you think it is useful,
go ahead and sell people on the idea. But please prepare a series
that includes the rationale, performance tradeoffs and real-life
implications for it. And do it on the existing code that people use
and not just your shiny new thing.


2023-05-26 08:26:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > I'm absolutely not in favour of adding workarounds for this kind of locking
> > > problem to the core kernel. I already feel bad for allowing the
> > > small workaround in iomap for btrfs, as just fixing the locking back
> > > then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.

btrfs has the extent locked, where "extent locked" is a somewhat magic
range lock that actually includes different lock bits. It does so
because it clears the page writeback bit when the data made it to the
media, but before the metadata required to find it is committed, and
the extent lock prevents it from trying to do a readpage on something
that has actually very recently been written back but not fully
committed. Once btrfs is changed to only clear the page writeback bit
once the write is fully committed, like in other file systems, this extra
level of locking can go away, and there are no more locks in the
readpage path that are also taken by the direct I/O code. With that
a lot of code in btrfs working around this can go away, including the
no fault direct I/O code.

2023-05-26 08:44:25

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Fri, May 26, 2023 at 01:06:46AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> > A cache that isn't actually consistent is a _bug_. You're being
> > obsequious. And any time this has come up in previous discussions
> > (including at LSF), that was never up for debate, the only question has
> > been whether it was even possible to practically fix it.
>
> That is not my impression. But again, if you think it is useful,
> go ahead and sell people on the idea. But please prepare a series
> that includes the rationale, performance tradeoffs and real-life
> implications for it. And do it on the existing code that people use
> and not just your shiny new thing.

When I'm ready to lift this to the VFS level I will; it should simplify
locking overall and it'll be one less thing for people to worry about.

(i.e. the fact that even _readahead_ can pull in pages a dio is
invalidating is a really nice footgun if not worked around).

Right now though I've got more than enough on my plate just trying to
finally get bcachefs merged; I'm happy to explain what this is for, but
I'm not ready for additional headaches or projects yet.

2023-05-26 08:51:33

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

On Fri, May 26, 2023 at 01:10:43AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > > I'm absolutely not in favour of adding workarounds for this kind of locking
> > > > problem to the core kernel. I already feel bad for allowing the
> > > > small workaround in iomap for btrfs, as just fixing the locking back
> > > > then would have avoided massive ratholing.
> > >
> > > Please let me know when those btrfs changes are in a presentable shape ...
> >
> > I would also be curious to know what btrfs needs and what the approach
> > is there.
>
> btrfs has the extent locked, where "extent locked" is a somewhat magic
> range lock that actually includes different lock bits. It does so
> because it clears the page writeback bit when the data made it to the
> media, but before the metadata required to find it is committed, and
> the extent lock prevents it from trying to do a readpage on something
> that has actually very recently been written back but not fully
> committed. Once btrfs is changed to only clear the page writeback bit
> once the write is fully committed, like in other file systems, this extra
> level of locking can go away, and there are no more locks in the
> readpage path that are also taken by the direct I/O code. With that
> a lot of code in btrfs working around this can go away, including the
> no fault direct I/O code.

wow, yeah, that is not how that is supposed to work...

2023-06-15 20:49:06

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 00/32] bcachefs - a new COW filesystem

Hi!

> I'm submitting the bcachefs filesystem for review and inclusion.
>
> Included in this patch series are all the non fs/bcachefs/ patches. The
> entire tree, based on v6.3, may be found at:
>
> http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
>
> ----------------------------------------------------------------
>
> bcachefs overview, status:
>
> Features:
> - too many to list
>
> Known bugs:
> - too many to list


Documentation: missing.

Dunno. I guess it would help review if the feature and known-bug lists were included.

BR,
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2023-06-15 21:45:00

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 00/32] bcachefs - a new COW filesystem

On Thu, Jun 15, 2023 at 10:41:56PM +0200, Pavel Machek wrote:
> Hi!
>
> > I'm submitting the bcachefs filesystem for review and inclusion.
> >
> > Included in this patch series are all the non fs/bcachefs/ patches. The
> > entire tree, based on v6.3, may be found at:
> >
> > http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream
> >
> > ----------------------------------------------------------------
> >
> > bcachefs overview, status:
> >
> > Features:
> > - too many to list
> >
> > Known bugs:
> > - too many to list
>
>
> Documentation: missing.

https://bcachefs.org/bcachefs-principles-of-operation.pdf

> Dunno. I guess it would help review if the feature and known-bug lists were included.

https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

https://github.com/koverstreet/bcachefs/issues/

Hope that helps...

2023-06-19 09:22:02

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 28/32] stacktrace: Export stack_trace_save_tsk

On Tue, May 09, 2023 at 12:56:53PM -0400, Kent Overstreet wrote:
> From: Christopher James Halse Rogers <[email protected]>
>
> The bcachefs module wants it, and there doesn't seem to be any
> reason it shouldn't be exported like the other functions.

What is the bcachefs module using this for?

Is that just for debug purposes? Assuming so, mentioning that in the commit
message would be helpful.

Thanks,
Mark.

> Signed-off-by: Christopher James Halse Rogers <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> kernel/stacktrace.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 9ed5ce9894..4f65824879 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
> put_task_stack(tsk);
> return c.len;
> }
> +EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
>
> /**
> * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
> @@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
> save_stack_trace_tsk(task, &trace);
> return trace.nr_entries;
> }
> +EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
>
> /**
> * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
> --
> 2.40.1
>
>

2023-06-19 11:40:46

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 28/32] stacktrace: Export stack_trace_save_tsk

On Mon, Jun 19, 2023 at 10:10:56AM +0100, Mark Rutland wrote:
> On Tue, May 09, 2023 at 12:56:53PM -0400, Kent Overstreet wrote:
> > From: Christopher James Halse Rogers <[email protected]>
> >
> > The bcachefs module wants it, and there doesn't seem to be any
> > reason it shouldn't be exported like the other functions.
>
> What is the bcachefs module using this for?
>
> Is that just for debug purposes? Assuming so, mentioning that in the commit
> message would be helpful.

Yes, the output ends up in debugfs.

2023-07-12 20:40:09

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the return value of snprintf().
>
Unfortunately, snprintf doesn't return characters written; it returns
> what it TRIED to write, and can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

...All of which would be solved if we were converting code away from raw
char * buffers to a proper string building type.

Which I tried to address when I tried to push printbufs upstream, but
that turned into a giant exercise in frustration in dealing with
maintainers.

2023-07-12 20:41:22

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the return value of snprintf().
>
> Unfortunately, snprintf doesn't return characters written; it returns
> what it TRIED to write, and can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

Anyways, I can't use scnprintf here, printbufs/seq_buf both need the
number of characters that would have been written, but I'll update the
comment.

2023-07-12 20:47:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote

On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
>
> printbuf now needs to know the number of characters that would have been
> written if the buffer was too small, like snprintf(); this changes
> string_get_size() to return the return value of snprintf().

Unfortunately, snprintf doesn't return characters written; it returns
what it TRIED to write, and can cause a lot of problems[1]. This patch
would be fine with me if the snprintf was also replaced by scnprintf,
which will return the actual string length copied (or 0) *not* including
the trailing %NUL.

> [...]
> @@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
> else
> unit = units_str[units][i];
>
> - snprintf(buf, len, "%u%s %s", (u32)size,
> - tmp, unit);
> + return snprintf(buf, len, "%u%s %s", (u32)size,
> + tmp, unit);

-Kees

[1] https://github.com/KSPP/linux/issues/105

--
Kees Cook
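
The distinction being drawn here is easy to see with a few lines of
userspace C; the kernel's scnprintf() is modelled below by taking
strlen() of the truncated result (illustrative only):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[8];

	/*
	 * snprintf() returns the length the formatted string *would* have
	 * had, even when it didn't fit - useful for sizing a buffer,
	 * dangerous when treated as "how much did I actually write".
	 */
	int want = snprintf(buf, sizeof(buf), "%u%s %s", 123, ".4", "MiB");

	/* what the kernel's scnprintf() would report: what actually fit */
	int wrote = strlen(buf);

	printf("snprintf returned %d, buffer holds %d chars: \"%s\"\n",
	       want, wrote, buf);	/* prints 9, 7, "123.4 M" */
	return 0;
}

A caller that feeds the larger value into subsequent offset or length
arithmetic can walk past the data it actually has, which is the class of
bug the KSPP issue above is about.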

2023-07-12 23:02:39

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote

On Wed, Jul 12, 2023 at 04:19:31PM -0400, Kent Overstreet wrote:
> On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> > On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <[email protected]>
> > >
> > > printbuf now needs to know the number of characters that would have been
> > > written if the buffer was too small, like snprintf(); this changes
> > > string_get_size() to return the return value of snprintf().
> >
> > Unfortunately, snprintf doesn't return characters written; it returns
> > what it TRIED to write, and can cause a lot of problems[1]. This patch
> > would be fine with me if the snprintf was also replaced by scnprintf,
> > which will return the actual string length copied (or 0) *not* including
> > the trailing %NUL.
>
> ...All of which would be solved if we were converting code away from raw
> char * buffers to a proper string building type.
>
> Which I tried to address when I tried to push printbufs upstream, but
> that turned into a giant exercise in frustration in dealing with
> maintainers.

Heh, yeah, I've been trying to aim people at using seq_buf instead of
a long series of snprintf/strlcat/etc calls. Where can I look at how
you wired this up to seq_buf/printbuf? I had trouble finding it when I
looked before. I'd really like to find a way to do it without leaving
around foot-guns for future callers of string_get_size(). :)

I found the printbuf series:
https://lore.kernel.org/lkml/[email protected]/
It seems there are some nice improvements in there. It'd be really nice
if seq_buf could just grow those changes. Adding a static version of
seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
(or even a statically sized initializer). And much of the conversion
is just changing types and functions. If we can leave all that alone,
things become MUCH easier to review, etc, etc. I'd *love* to see an
incremental improvement for seq_buf, especially the heap-allocation
part.

--
Kees Cook

2023-07-13 00:07:01

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote

On Wed, Jul 12, 2023 at 03:38:44PM -0700, Kees Cook wrote:
> Heh, yeah, I've been trying to aim people at using seq_buf instead of
> a long series of snprintf/strlcat/etc calls. Where can I look at how
> you wired this up to seq_buf/printbuf? I had trouble finding it when I
> looked before. I'd really like to find a way to do it without leaving
> around foot-guns for future callers of string_get_size(). :)
>
> I found the printbuf series:
> https://lore.kernel.org/lkml/[email protected]/
> It seems there are some nice improvements in there. It'd be really nice
> if seq_buf could just grow those changes. Adding a static version of
> seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
> (or even a statically sized initializer). And much of the conversion
> is just changing types and functions. If we can leave all that alone,
> things become MUCH easier to review, etc, etc. I'd *love* to see an
> incremental improvement for seq_buf, especially the heap-allocation
> part.

Well, I raised that with Steve way back when I was starting on the
conversions of existing code, and I couldn't get any communication out
him regarding making those changes to seq_buf.

So, I'd _love_ to resurrect that patch series and get it in after the
bcachefs merge, but don't expect me to go back and redo everything :)
the amount of code in existing seq_buf users is fairly small compared to
bcachefs's printbuf usage, and what that patch series does in the rest of
the kernel anyways.

I'd rather save that energy for ditching the seq_file interface and
making that just use a printbuf - clean up that bit of API
fragmentation.

2023-10-19 15:31:24

by Mateusz Guzik

[permalink] [raw]
Subject: Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > Because scalability of the global inode_hash_lock really, really
> > sucks.
> >
> > 32-way concurrent create on a couple of different filesystems
> > before:
> >
> > - 52.13% 0.04% [kernel] [k] ext4_create
> > - 52.09% ext4_create
> > - 41.03% __ext4_new_inode
> > - 29.92% insert_inode_locked
> > - 25.35% _raw_spin_lock
> > - do_raw_spin_lock
> > - 24.97% __pv_queued_spin_lock_slowpath
> >
> > [...]
>
> This is interesting completely independent of bcachefs so we should give
> it some testing.
>
> I updated a few places that had outdated comments.
>
> ---
>
> Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.unstable.inode-hash
>
> [22/32] vfs: inode cache conversion to hash-bl
> https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1

What, if anything, is blocking this? It has been over 5 months now, and I don't
see it in master or -next.

To be clear, there is no urgency as far as I'm concerned, but I did run
into something which is primarily bottlenecked by the inode hash lock, and
it looks like the above should sort it out.

Looks like the patch was simply forgotten.

tl;dr can this land in -next please
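
For context, the gist of the hash-bl conversion is replacing the single
global inode_hash_lock with a lock per hash bucket (hlist_bl hides that
lock in the low bit of the bucket head pointer). A toy userspace model of
the idea, with illustrative names and sizes rather than the kernel code:

#include <pthread.h>

#define HASH_BUCKETS	1024

struct inode_stub {
	unsigned long		ino;
	struct inode_stub	*next;
};

struct hash_bucket {
	pthread_mutex_t		lock;	/* per bucket: unrelated inserts don't contend */
	struct inode_stub	*head;
};

static struct hash_bucket inode_hash[HASH_BUCKETS];

static void inode_hash_init(void)
{
	for (int i = 0; i < HASH_BUCKETS; i++)
		pthread_mutex_init(&inode_hash[i].lock, NULL);
}

/*
 * Concurrent creates now only serialize when two inodes happen to hash to
 * the same bucket, instead of all of them queueing on one global lock as
 * in the profile quoted above.
 */
static void insert_inode(struct inode_stub *inode)
{
	struct hash_bucket *b = &inode_hash[inode->ino % HASH_BUCKETS];

	pthread_mutex_lock(&b->lock);
	inode->next = b->head;
	b->head = inode;
	pthread_mutex_unlock(&b->lock);
}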

2023-10-19 16:00:41

by Mateusz Guzik

[permalink] [raw]
Subject: Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl

On Thu, Oct 19, 2023 at 05:30:40PM +0200, Mateusz Guzik wrote:
> On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> > On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > > Because scalability of the global inode_hash_lock really, really
> > > sucks.
> > >
> > > 32-way concurrent create on a couple of different filesystems
> > > before:
> > >
> > > - 52.13% 0.04% [kernel] [k] ext4_create
> > > - 52.09% ext4_create
> > > - 41.03% __ext4_new_inode
> > > - 29.92% insert_inode_locked
> > > - 25.35% _raw_spin_lock
> > > - do_raw_spin_lock
> > > - 24.97% __pv_queued_spin_lock_slowpath
> > >
> > > [...]
> >
> > This is interesting completely independent of bcachefs so we should give
> > it some testing.
> >
> > I updated a few places that had outdated comments.
> >
> > ---
> >
> > Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> > Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> >
> > Please report any outstanding bugs that were missed during review in a
> > new review to the original patch series allowing us to drop it.
> >
> > It's encouraged to provide Acked-bys and Reviewed-bys even though the
> > patch has now been applied. If possible patch trailers will be updated.
> >
> > tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> > branch: vfs.unstable.inode-hash
> >
> > [22/32] vfs: inode cache conversion to hash-bl
> > https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1
>
> What, if anything, is blocking this? It has been over 5 months now, and I don't
> see it in master or -next.
>
> To be clear, there is no urgency as far as I'm concerned, but I did run
> into something which is primarily bottlenecked by the inode hash lock, and
> it looks like the above should sort it out.
>
> Looks like the patch was simply forgotten.
>
> tl;dr can this land in -next please

In case you can't be arsed, here is something funny which may convince
you to expedite. ;)

I did some benching by running 20 processes in parallel, each doing stat
on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
so 20 mln inodes in total). Box had 24 cores and 24G RAM.

Best times:
Linux: 7.60s user 1306.90s system 1863% cpu 1:10.55 total
FreeBSD: 3.49s user 345.12s system 1983% cpu 17.573 total
OpenBSD: 5.01s user 6463.66s system 2000% cpu 5:23.42 total
DragonflyBSD: 11.73s user 1316.76s system 1023% cpu 2:09.78 total
OmniosCE: 9.17s user 516.53s system 1550% cpu 33.905 total

NetBSD failed to complete the run, OOM-killing workers:
http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
time.

So what I find funny is that Linux needed more time than OmniosCE (an
Illumos variant, fork of Solaris).

It also needed more time than FreeBSD, which is not necessarily funny
but not that great either.

All systems were mostly busy contending on locks, and in particular Linux
was almost exclusively busy waiting on the inode hash lock.