2024-02-25 02:38:52

by Kent Overstreet

Subject: [PATCH 00/21] bcachefs disk accounting rewrite

here it is: the disk accounting rewrite I've been talking about since
forever.

git link:
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-disk-accounting-rewrite

test dashboard (just rebased; results are regenerating as of this
writing, but there shouldn't be any regressions left):
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-disk-accounting-rewrite

The old disk accounting scheme was fast, but had some limitations:

- lack of scalability: it was based on percpu counters, additionally
sharded by outstanding journal buffer; just prior to each journal
write we'd roll up the counters and add them to the journal entry.

But this meant that every counter was added to every journal write,
so it could never support per-snapshot counters.

- it was a pain to extend
this is why, until now, we didn't have proper compression accounting,
and getting the compression ratio required a full btree scan

In the new scheme:
- every set of counters is a bkey, a key in a btree
(BTREE_ID_accounting).

this means they aren't pinned in the journal

- the key has structure, and is extensible
disk_accounting_key is a tagged union, and it's just union'd over
bpos

- counters are deltas, until flushed to the underlying btree

this means counter updates are normal btree updates; the btree write
buffer makes them efficient (see the sketch below).
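
As a minimal userspace sketch of the delta semantics (toy types and
names, not the kernel code - the real helper is
bch2_accounting_accumulate(), added in patch 01):

#include <stdint.h>

#define ACCT_MAX_COUNTERS	3

struct acct_delta {
	uint64_t	version;		/* total ordering of updates */
	int64_t		d[ACCT_MAX_COUNTERS];	/* deltas, not absolute values */
};

/* merging two updates to the same key just sums the deltas: */
static void acct_accumulate(struct acct_delta *dst, const struct acct_delta *src)
{
	for (unsigned i = 0; i < ACCT_MAX_COUNTERS; i++)
		dst->d[i] += src->d[i];
	if (dst->version < src->version)
		dst->version = src->version;	/* newest version wins */
}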

Since reading counters from the btree would be expensive - it'd require
a write buffer flush to get up-to-date counters - we also maintain a
parallel set of accounting in memory, a bit like the old scheme but
without the per-journal-buffer sharding. The in-memory counters are
indexed in an eytzinger tree by disk_accounting_key/bpos, with the
counters themselves being percpu u64s.
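
As a rough sketch of the read side (again a toy model - the kernel uses
a real eytzinger0-ordered array and real percpu variables):

#include <stdint.h>

#define NR_CPUS	4

struct mem_acct_entry {
	uint64_t	key;		/* stands in for the bpos */
	uint64_t	d[NR_CPUS];	/* stands in for a percpu u64 */
};

/* eytzinger (BFS-order) search: children of node i are 2i+1 and 2i+2 */
static int eytzinger0_find(const struct mem_acct_entry *tbl, int nr, uint64_t key)
{
	int i = 0;

	while (i < nr) {
		if (tbl[i].key == key)
			return i;
		i = 2 * i + 1 + (key > tbl[i].key);
	}
	return -1;
}

/* reading a counter is a lookup plus a sum over cpus - no btree access */
static uint64_t mem_acct_read(const struct mem_acct_entry *tbl, int nr, uint64_t key)
{
	int idx = eytzinger0_find(tbl, nr, key);
	uint64_t sum = 0;

	if (idx < 0)
		return 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += tbl[idx].d[cpu];
	return sum;
}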

Reviewers: do an "is this adequately documented, can I find my way
around, do things make sense" review, not a line-by-line "does this
have bugs" review.

Compatibility: this is in no way compatible with the old on-disk
accounting format, and it's not feasible to write out accounting in the
old format - that means we have to regenerate accounting when upgrading
or downgrading past this version.

That should work more or less seamlessly with the most recent compat
bits (bch_sb_field_downgrade, so we can tell older versions what
recovery passes to run and what to fix); additionally, userspace fsck
now checks if the kernel's bcachefs version matches the on-disk version
better than its own does, and if so uses the kernel fsck implementation
via the OFFLINE_FSCK ioctl - so we shouldn't be bouncing back and forth
between versions if your tools and kernel don't match.

upgrade/downgrade still need a bit more testing, but transparently using
kernel fsck is well tested as of latest versions.

but: 6.7 users (& possibly 6.8) beware - the sb_downgrade section is in
6.7 but BCH_IOCTL_OFFLINE_FSCK is not, and backporting it doesn't look
likely given the current -stable process fiasco.

merge ETA: this stuff may make the next merge window; I'd like to get
per-snapshot-id accounting done with it - that should be the biggest
item left.

Cheers,
Kent

Kent Overstreet (21):
bcachefs: KEY_TYPE_accounting
bcachefs: Accumulate accounting keys in journal replay
bcachefs: btree write buffer knows how to accumulate bch_accounting
keys
bcachefs: Disk space accounting rewrite
bcachefs: dev_usage updated by new accounting
bcachefs: Kill bch2_fs_usage_initialize()
bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
bcachefs: kill bch2_fs_usage_read()
bcachefs: Kill writing old accounting to journal
bcachefs: Delete journal-buf-sharded old style accounting
bcachefs: Kill bch2_fs_usage_to_text()
bcachefs: Kill fs_usage_online
bcachefs: Kill replicas_journal_res
bcachefs: Convert gc to new accounting
bcachefs: Convert bch2_replicas_gc2() to new accounting
bcachefs: bch2_verify_accounting_clean()
bcachefs: Eytzinger accumulation for accounting keys
bcachefs: bch_acct_compression
bcachefs: Convert bch2_compression_stats_to_text() to new accounting
bcachefs: bch2_fs_accounting_to_text()
bcachefs: bch2_fs_usage_base_to_text()

fs/bcachefs/Makefile | 3 +-
fs/bcachefs/alloc_background.c | 137 +++--
fs/bcachefs/alloc_background.h | 2 +
fs/bcachefs/bcachefs.h | 22 +-
fs/bcachefs/bcachefs_format.h | 81 +--
fs/bcachefs/bcachefs_ioctl.h | 7 +-
fs/bcachefs/bkey_methods.c | 1 +
fs/bcachefs/btree_gc.c | 259 ++++------
fs/bcachefs/btree_iter.c | 9 -
fs/bcachefs/btree_journal_iter.c | 23 +-
fs/bcachefs/btree_journal_iter.h | 15 +
fs/bcachefs/btree_trans_commit.c | 71 ++-
fs/bcachefs/btree_types.h | 1 -
fs/bcachefs/btree_update.h | 22 +-
fs/bcachefs/btree_write_buffer.c | 120 ++++-
fs/bcachefs/btree_write_buffer.h | 50 +-
fs/bcachefs/btree_write_buffer_types.h | 2 +
fs/bcachefs/buckets.c | 663 ++++---------------------
fs/bcachefs/buckets.h | 70 +--
fs/bcachefs/buckets_types.h | 14 +-
fs/bcachefs/chardev.c | 75 +--
fs/bcachefs/disk_accounting.c | 584 ++++++++++++++++++++++
fs/bcachefs/disk_accounting.h | 203 ++++++++
fs/bcachefs/disk_accounting_format.h | 145 ++++++
fs/bcachefs/disk_accounting_types.h | 20 +
fs/bcachefs/ec.c | 166 ++++---
fs/bcachefs/inode.c | 42 +-
fs/bcachefs/journal_io.c | 13 +-
fs/bcachefs/recovery.c | 126 +++--
fs/bcachefs/recovery_types.h | 1 +
fs/bcachefs/replicas.c | 242 ++-------
fs/bcachefs/replicas.h | 16 +-
fs/bcachefs/replicas_format.h | 21 +
fs/bcachefs/replicas_types.h | 16 -
fs/bcachefs/sb-clean.c | 62 ---
fs/bcachefs/sb-downgrade.c | 12 +-
fs/bcachefs/sb-errors_types.h | 4 +-
fs/bcachefs/super.c | 74 ++-
fs/bcachefs/sysfs.c | 109 ++--
39 files changed, 1873 insertions(+), 1630 deletions(-)
create mode 100644 fs/bcachefs/disk_accounting.c
create mode 100644 fs/bcachefs/disk_accounting.h
create mode 100644 fs/bcachefs/disk_accounting_format.h
create mode 100644 fs/bcachefs/disk_accounting_types.h
create mode 100644 fs/bcachefs/replicas_format.h

--
2.43.0



2024-02-25 02:39:00

by Kent Overstreet

Subject: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

New key type for the disk space accounting rewrite.

- Holds a variable sized array of u64s (there may be more than one,
e.g. compressed and uncompressed size, or buckets and sectors for a
given data type)

- Updates are deltas, not new versions of the key: this means updates
to accounting can happen via the btree write buffer, which we'll be
teaching to accumulate deltas.
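
As a usage sketch (hypothetical caller; the types and helpers are the
ones introduced in this patch), constructing the btree position for a
device's stripe-buckets counter looks roughly like:

	struct disk_accounting_key acc = {
		.type			= BCH_DISK_ACCOUNTING_dev_stripe_buckets,
		.dev_stripe_buckets.dev	= dev_idx,	/* dev_idx assumed */
	};
	struct bpos pos = disk_accounting_key_to_bpos(&acc);

The delta itself goes in the value, as a bch_accounting holding the
appropriate number of counters for the accounting type.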

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/Makefile | 3 +-
fs/bcachefs/bcachefs.h | 1 +
fs/bcachefs/bcachefs_format.h | 80 +++------------
fs/bcachefs/bkey_methods.c | 1 +
fs/bcachefs/disk_accounting.c | 70 ++++++++++++++
fs/bcachefs/disk_accounting.h | 52 ++++++++++
fs/bcachefs/disk_accounting_format.h | 139 +++++++++++++++++++++++++++
fs/bcachefs/replicas_format.h | 21 ++++
fs/bcachefs/sb-downgrade.c | 12 ++-
fs/bcachefs/sb-errors_types.h | 3 +-
10 files changed, 311 insertions(+), 71 deletions(-)
create mode 100644 fs/bcachefs/disk_accounting.c
create mode 100644 fs/bcachefs/disk_accounting.h
create mode 100644 fs/bcachefs/disk_accounting_format.h
create mode 100644 fs/bcachefs/replicas_format.h

diff --git a/fs/bcachefs/Makefile b/fs/bcachefs/Makefile
index f42f6d256945..94b2edb4155f 100644
--- a/fs/bcachefs/Makefile
+++ b/fs/bcachefs/Makefile
@@ -27,10 +27,11 @@ bcachefs-y := \
checksum.o \
clock.o \
compress.o \
+ data_update.o \
debug.o \
dirent.o \
+ disk_accounting.o \
disk_groups.o \
- data_update.o \
ec.o \
errcode.o \
error.o \
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 0bee9dab6068..62812fc1cad0 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -509,6 +509,7 @@ enum gc_phase {
GC_PHASE_BTREE_logged_ops,
GC_PHASE_BTREE_rebalance_work,
GC_PHASE_BTREE_subvolume_children,
+ GC_PHASE_BTREE_accounting,

GC_PHASE_PENDING_DELETE,
};
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index bff8750ac0d7..313ca7dc370d 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -416,7 +416,8 @@ static inline void bkey_init(struct bkey *k)
x(bucket_gens, 30) \
x(snapshot_tree, 31) \
x(logged_op_truncate, 32) \
- x(logged_op_finsert, 33)
+ x(logged_op_finsert, 33) \
+ x(accounting, 34)

enum bch_bkey_type {
#define x(name, nr) KEY_TYPE_##name = nr,
@@ -501,17 +502,19 @@ struct bch_sb_field {
x(downgrade, 14)

#include "alloc_background_format.h"
+#include "dirent_format.h"
+#include "disk_accounting_format.h"
#include "extents_format.h"
-#include "reflink_format.h"
#include "ec_format.h"
#include "inode_format.h"
-#include "dirent_format.h"
-#include "xattr_format.h"
-#include "quota_format.h"
#include "logged_ops_format.h"
+#include "quota_format.h"
+#include "reflink_format.h"
+#include "replicas_format.h"
+#include "sb-counters_format.h"
#include "snapshot_format.h"
#include "subvolume_format.h"
-#include "sb-counters_format.h"
+#include "xattr_format.h"

enum bch_sb_field_type {
#define x(f, nr) BCH_SB_FIELD_##f = nr,
@@ -680,69 +683,11 @@ LE64_BITMASK(BCH_KDF_SCRYPT_P, struct bch_sb_field_crypt, kdf_flags, 32, 48);

/* BCH_SB_FIELD_replicas: */

-#define BCH_DATA_TYPES() \
- x(free, 0) \
- x(sb, 1) \
- x(journal, 2) \
- x(btree, 3) \
- x(user, 4) \
- x(cached, 5) \
- x(parity, 6) \
- x(stripe, 7) \
- x(need_gc_gens, 8) \
- x(need_discard, 9)
-
-enum bch_data_type {
-#define x(t, n) BCH_DATA_##t,
- BCH_DATA_TYPES()
-#undef x
- BCH_DATA_NR
-};
-
-static inline bool data_type_is_empty(enum bch_data_type type)
-{
- switch (type) {
- case BCH_DATA_free:
- case BCH_DATA_need_gc_gens:
- case BCH_DATA_need_discard:
- return true;
- default:
- return false;
- }
-}
-
-static inline bool data_type_is_hidden(enum bch_data_type type)
-{
- switch (type) {
- case BCH_DATA_sb:
- case BCH_DATA_journal:
- return true;
- default:
- return false;
- }
-}
-
-struct bch_replicas_entry_v0 {
- __u8 data_type;
- __u8 nr_devs;
- __u8 devs[];
-} __packed;
-
struct bch_sb_field_replicas_v0 {
struct bch_sb_field field;
struct bch_replicas_entry_v0 entries[];
} __packed __aligned(8);

-struct bch_replicas_entry_v1 {
- __u8 data_type;
- __u8 nr_devs;
- __u8 nr_required;
- __u8 devs[];
-} __packed;
-
-#define replicas_entry_bytes(_i) \
- (offsetof(typeof(*(_i)), devs) + (_i)->nr_devs)
-
struct bch_sb_field_replicas {
struct bch_sb_field field;
struct bch_replicas_entry_v1 entries[];
@@ -875,7 +820,8 @@ struct bch_sb_field_downgrade {
x(rebalance_work, BCH_VERSION(1, 3)) \
x(member_seq, BCH_VERSION(1, 4)) \
x(subvolume_fs_parent, BCH_VERSION(1, 5)) \
- x(btree_subvolume_children, BCH_VERSION(1, 6))
+ x(btree_subvolume_children, BCH_VERSION(1, 6)) \
+ x(disk_accounting_v2, BCH_VERSION(1, 7))

enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
@@ -1525,7 +1471,9 @@ enum btree_id_flags {
x(rebalance_work, 18, BTREE_ID_SNAPSHOT_FIELD, \
BIT_ULL(KEY_TYPE_set)|BIT_ULL(KEY_TYPE_cookie)) \
x(subvolume_children, 19, 0, \
- BIT_ULL(KEY_TYPE_set))
+ BIT_ULL(KEY_TYPE_set)) \
+ x(accounting, 20, BTREE_ID_SNAPSHOT_FIELD, \
+ BIT_ULL(KEY_TYPE_accounting)) \

enum btree_id {
#define x(name, nr, ...) BTREE_ID_##name = nr,
diff --git a/fs/bcachefs/bkey_methods.c b/fs/bcachefs/bkey_methods.c
index 5e52684764eb..da25bdd1e8a6 100644
--- a/fs/bcachefs/bkey_methods.c
+++ b/fs/bcachefs/bkey_methods.c
@@ -7,6 +7,7 @@
#include "btree_types.h"
#include "alloc_background.h"
#include "dirent.h"
+#include "disk_accounting.h"
#include "ec.h"
#include "error.h"
#include "extents.h"
diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
new file mode 100644
index 000000000000..209f59e87b34
--- /dev/null
+++ b/fs/bcachefs/disk_accounting.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "bcachefs.h"
+#include "btree_update.h"
+#include "buckets.h"
+#include "disk_accounting.h"
+#include "replicas.h"
+
+static const char * const disk_accounting_type_strs[] = {
+#define x(t, n, ...) [n] = #t,
+ BCH_DISK_ACCOUNTING_TYPES()
+#undef x
+ NULL
+};
+
+int bch2_accounting_invalid(struct bch_fs *c, struct bkey_s_c k,
+ enum bkey_invalid_flags flags,
+ struct printbuf *err)
+{
+ return 0;
+}
+
+void bch2_accounting_key_to_text(struct printbuf *out, struct disk_accounting_key *k)
+{
+ if (k->type >= BCH_DISK_ACCOUNTING_TYPE_NR) {
+ prt_printf(out, "unknown type %u", k->type);
+ return;
+ }
+
+ prt_str(out, disk_accounting_type_strs[k->type]);
+ prt_str(out, " ");
+
+ switch (k->type) {
+ case BCH_DISK_ACCOUNTING_nr_inodes:
+ break;
+ case BCH_DISK_ACCOUNTING_persistent_reserved:
+ prt_printf(out, "replicas=%u", k->persistent_reserved.nr_replicas);
+ break;
+ case BCH_DISK_ACCOUNTING_replicas:
+ bch2_replicas_entry_to_text(out, &k->replicas);
+ break;
+ case BCH_DISK_ACCOUNTING_dev_data_type:
+ prt_printf(out, "dev=%u data_type=", k->dev_data_type.dev);
+ bch2_prt_data_type(out, k->dev_data_type.data_type);
+ break;
+ case BCH_DISK_ACCOUNTING_dev_stripe_buckets:
+ prt_printf(out, "dev=%u", k->dev_stripe_buckets.dev);
+ break;
+ }
+}
+
+void bch2_accounting_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c k)
+{
+ struct bkey_s_c_accounting acc = bkey_s_c_to_accounting(k);
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, k.k->p);
+
+ bch2_accounting_key_to_text(out, &acc_k);
+
+ for (unsigned i = 0; i < bch2_accounting_counters(k.k); i++)
+ prt_printf(out, " %lli", acc.v->d[i]);
+}
+
+void bch2_accounting_swab(struct bkey_s k)
+{
+ for (u64 *p = (u64 *) k.v;
+ p < (u64 *) bkey_val_end(k);
+ p++)
+ *p = swab64(*p);
+}
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
new file mode 100644
index 000000000000..e15299665859
--- /dev/null
+++ b/fs/bcachefs/disk_accounting.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_DISK_ACCOUNTING_H
+#define _BCACHEFS_DISK_ACCOUNTING_H
+
+static inline unsigned bch2_accounting_counters(const struct bkey *k)
+{
+ return bkey_val_u64s(k) - offsetof(struct bch_accounting, d) / sizeof(u64);
+}
+
+static inline void bch2_accounting_accumulate(struct bkey_i_accounting *dst,
+ struct bkey_s_c_accounting src)
+{
+ EBUG_ON(dst->k.u64s != src.k->u64s);
+
+ for (unsigned i = 0; i < bch2_accounting_counters(&dst->k); i++)
+ dst->v.d[i] += src.v->d[i];
+ if (bversion_cmp(dst->k.version, src.k->version) < 0)
+ dst->k.version = src.k->version;
+}
+
+static inline void bpos_to_disk_accounting_key(struct disk_accounting_key *acc, struct bpos p)
+{
+ acc->_pad = p;
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+ bch2_bpos_swab(&acc->_pad);
+#endif
+}
+
+static inline struct bpos disk_accounting_key_to_bpos(struct disk_accounting_key *k)
+{
+ struct bpos ret = k->_pad;
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+ bch2_bpos_swab(&ret);
+#endif
+ return ret;
+}
+
+int bch2_accounting_invalid(struct bch_fs *, struct bkey_s_c,
+ enum bkey_invalid_flags, struct printbuf *);
+void bch2_accounting_key_to_text(struct printbuf *, struct disk_accounting_key *);
+void bch2_accounting_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
+void bch2_accounting_swab(struct bkey_s);
+
+#define bch2_bkey_ops_accounting ((struct bkey_ops) { \
+ .key_invalid = bch2_accounting_invalid, \
+ .val_to_text = bch2_accounting_to_text, \
+ .swab = bch2_accounting_swab, \
+ .min_val_size = 8, \
+})
+
+#endif /* _BCACHEFS_DISK_ACCOUNTING_H */
diff --git a/fs/bcachefs/disk_accounting_format.h b/fs/bcachefs/disk_accounting_format.h
new file mode 100644
index 000000000000..e06a42f0d578
--- /dev/null
+++ b/fs/bcachefs/disk_accounting_format.h
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
+#define _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
+
+#include "replicas_format.h"
+
+/*
+ * Disk accounting - KEY_TYPE_accounting - on disk format:
+ *
+ * Here, the key has considerably more structure than a typical key (bpos); an
+ * accounting key is 'struct disk_accounting_key', which is a union of bpos.
+ *
+ * This is a type-tagged union of all our various subtypes; a disk accounting
+ * key can be device counters, replicas counters, et cetera - it's extensible.
+ *
+ * The value is a list of u64s or s64s; the number of counters is specific to a
+ * given accounting type.
+ *
+ * Unlike with other key types, updates are _deltas_, and the deltas are not
+ * resolved until the update to the underlying btree, done by btree write buffer
+ * flush or journal replay.
+ *
+ * Journal replay in particular requires special handling. The journal tracks a
+ * range of entries which may possibly have not yet been applied to the btree
+ * yet - it does not know definitively whether individual entries are dirty and
+ * still need to be applied.
+ *
+ * To handle this, we use the version field of struct bkey, and give every
+ * accounting update a unique version number - a total ordering in time; the
+ * version number is derived from the key's position in the journal. Then
+ * journal replay can compare the version number of the key from the journal
+ * with the version number of the key in the btree to determine if a key needs
+ * to be replayed.
+ *
+ * For this to work, we must maintain this strict time ordering of updates as
+ * they are flushed to the btree, both via write buffer flush and via journal
+ * replay. This has complications for the write buffer code while journal replay
+ * is still in progress; the write buffer cannot flush any accounting keys to
+ * the btree until journal replay has finished replaying its accounting keys, or
+ * the (newer) version number of the keys from the write buffer will cause
+ * updates from journal replay to be lost.
+ */
+
+struct bch_accounting {
+ struct bch_val v;
+ __u64 d[];
+};
+
+#define BCH_ACCOUNTING_MAX_COUNTERS 3
+
+#define BCH_DATA_TYPES() \
+ x(free, 0) \
+ x(sb, 1) \
+ x(journal, 2) \
+ x(btree, 3) \
+ x(user, 4) \
+ x(cached, 5) \
+ x(parity, 6) \
+ x(stripe, 7) \
+ x(need_gc_gens, 8) \
+ x(need_discard, 9)
+
+enum bch_data_type {
+#define x(t, n) BCH_DATA_##t,
+ BCH_DATA_TYPES()
+#undef x
+ BCH_DATA_NR
+};
+
+static inline bool data_type_is_empty(enum bch_data_type type)
+{
+ switch (type) {
+ case BCH_DATA_free:
+ case BCH_DATA_need_gc_gens:
+ case BCH_DATA_need_discard:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static inline bool data_type_is_hidden(enum bch_data_type type)
+{
+ switch (type) {
+ case BCH_DATA_sb:
+ case BCH_DATA_journal:
+ return true;
+ default:
+ return false;
+ }
+}
+
+#define BCH_DISK_ACCOUNTING_TYPES() \
+ x(nr_inodes, 0) \
+ x(persistent_reserved, 1) \
+ x(replicas, 2) \
+ x(dev_data_type, 3) \
+ x(dev_stripe_buckets, 4)
+
+enum disk_accounting_type {
+#define x(f, nr) BCH_DISK_ACCOUNTING_##f = nr,
+ BCH_DISK_ACCOUNTING_TYPES()
+#undef x
+ BCH_DISK_ACCOUNTING_TYPE_NR,
+};
+
+struct bch_nr_inodes {
+};
+
+struct bch_persistent_reserved {
+ __u8 nr_replicas;
+};
+
+struct bch_dev_data_type {
+ __u8 dev;
+ __u8 data_type;
+};
+
+struct bch_dev_stripe_buckets {
+ __u8 dev;
+};
+
+struct disk_accounting_key {
+ union {
+ struct {
+ __u8 type;
+ union {
+ struct bch_nr_inodes nr_inodes;
+ struct bch_persistent_reserved persistent_reserved;
+ struct bch_replicas_entry_v1 replicas;
+ struct bch_dev_data_type dev_data_type;
+ struct bch_dev_stripe_buckets dev_stripe_buckets;
+ };
+ };
+ struct bpos _pad;
+ };
+};
+
+#endif /* _BCACHEFS_DISK_ACCOUNTING_FORMAT_H */
diff --git a/fs/bcachefs/replicas_format.h b/fs/bcachefs/replicas_format.h
new file mode 100644
index 000000000000..ed94f8c636b3
--- /dev/null
+++ b/fs/bcachefs/replicas_format.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_REPLICAS_FORMAT_H
+#define _BCACHEFS_REPLICAS_FORMAT_H
+
+struct bch_replicas_entry_v0 {
+ __u8 data_type;
+ __u8 nr_devs;
+ __u8 devs[];
+} __packed;
+
+struct bch_replicas_entry_v1 {
+ __u8 data_type;
+ __u8 nr_devs;
+ __u8 nr_required;
+ __u8 devs[];
+} __packed;
+
+#define replicas_entry_bytes(_i) \
+ (offsetof(typeof(*(_i)), devs) + (_i)->nr_devs)
+
+#endif /* _BCACHEFS_REPLICAS_FORMAT_H */
diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
index 3337419faeff..33db8d7ca8c4 100644
--- a/fs/bcachefs/sb-downgrade.c
+++ b/fs/bcachefs/sb-downgrade.c
@@ -52,9 +52,15 @@
BCH_FSCK_ERR_subvol_fs_path_parent_wrong) \
x(btree_subvolume_children, \
BIT_ULL(BCH_RECOVERY_PASS_check_subvols), \
- BCH_FSCK_ERR_subvol_children_not_set)
+ BCH_FSCK_ERR_subvol_children_not_set) \
+ x(disk_accounting_v2, \
+ BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
+ BCH_FSCK_ERR_accounting_mismatch)

-#define DOWNGRADE_TABLE()
+#define DOWNGRADE_TABLE() \
+ x(disk_accounting_v2, \
+ BIT_ULL(BCH_RECOVERY_PASS_check_alloc_info), \
+ BCH_FSCK_ERR_dev_usage_buckets_wrong)

struct upgrade_downgrade_entry {
u64 recovery_passes;
@@ -108,7 +114,7 @@ void bch2_sb_set_upgrade(struct bch_fs *c,
}
}

-#define x(ver, passes, ...) static const u16 downgrade_ver_##errors[] = { __VA_ARGS__ };
+#define x(ver, passes, ...) static const u16 downgrade_##ver##_errors[] = { __VA_ARGS__ };
DOWNGRADE_TABLE()
#undef x

diff --git a/fs/bcachefs/sb-errors_types.h b/fs/bcachefs/sb-errors_types.h
index 0df4b0e7071a..383e13711001 100644
--- a/fs/bcachefs/sb-errors_types.h
+++ b/fs/bcachefs/sb-errors_types.h
@@ -264,7 +264,8 @@
x(subvol_children_not_set, 256) \
x(subvol_children_bad, 257) \
x(subvol_loop, 258) \
- x(subvol_unreachable, 259)
+ x(subvol_unreachable, 259) \
+ x(accounting_mismatch, 260)

enum bch_sb_error_id {
#define x(t, n) BCH_FSCK_ERR_##t = n,
--
2.43.0


2024-02-25 02:39:08

by Kent Overstreet

Subject: [PATCH 02/21] bcachefs: Accumulate accounting keys in journal replay

Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.

Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.

That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.

There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version numbers
from the newer updates and the updates from the journal will be lost.

To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.
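
The replay rule, as a simplified standalone model (not the actual
replay path):

#include <stdint.h>

struct acct_val {
	uint64_t	version;	/* version of the last applied update */
	int64_t		d[3];
};

/* apply one journal delta to the btree copy, only if not yet applied: */
static void replay_acct_key(struct acct_val *btree, const struct acct_val *jrnl)
{
	if (btree->version >= jrnl->version)
		return;			/* already flushed to the btree */

	for (unsigned i = 0; i < 3; i++)
		btree->d[i] += jrnl->d[i];
	btree->version = jrnl->version;
}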

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/btree_journal_iter.c | 23 +++-------
fs/bcachefs/btree_journal_iter.h | 15 +++++++
fs/bcachefs/btree_trans_commit.c | 9 +++-
fs/bcachefs/btree_update.h | 14 +++++-
fs/bcachefs/recovery.c | 76 +++++++++++++++++++++++++++++++-
5 files changed, 117 insertions(+), 20 deletions(-)

diff --git a/fs/bcachefs/btree_journal_iter.c b/fs/bcachefs/btree_journal_iter.c
index 207dd32e2ecc..164a316d8995 100644
--- a/fs/bcachefs/btree_journal_iter.c
+++ b/fs/bcachefs/btree_journal_iter.c
@@ -16,21 +16,6 @@
* operations for the regular btree iter code to use:
*/

-static int __journal_key_cmp(enum btree_id l_btree_id,
- unsigned l_level,
- struct bpos l_pos,
- const struct journal_key *r)
-{
- return (cmp_int(l_btree_id, r->btree_id) ?:
- cmp_int(l_level, r->level) ?:
- bpos_cmp(l_pos, r->k->k.p));
-}
-
-static int journal_key_cmp(const struct journal_key *l, const struct journal_key *r)
-{
- return __journal_key_cmp(l->btree_id, l->level, l->k->k.p, r);
-}
-
static inline size_t idx_to_pos(struct journal_keys *keys, size_t idx)
{
size_t gap_size = keys->size - keys->nr;
@@ -492,7 +477,13 @@ static void __journal_keys_sort(struct journal_keys *keys)
struct journal_key *dst = keys->data;

darray_for_each(*keys, src) {
- if (src + 1 < &darray_top(*keys) &&
+ /*
+ * We don't accumulate accounting keys here because we have to
+ * compare each individual accounting key against the version in
+ * the btree during replay:
+ */
+ if (src->k->k.type != KEY_TYPE_accounting &&
+ src + 1 < &darray_top(*keys) &&
!journal_key_cmp(src, src + 1))
continue;

diff --git a/fs/bcachefs/btree_journal_iter.h b/fs/bcachefs/btree_journal_iter.h
index c9d19da3ea04..8f3d9a3f1969 100644
--- a/fs/bcachefs/btree_journal_iter.h
+++ b/fs/bcachefs/btree_journal_iter.h
@@ -26,6 +26,21 @@ struct btree_and_journal_iter {
bool prefetch;
};

+static inline int __journal_key_cmp(enum btree_id l_btree_id,
+ unsigned l_level,
+ struct bpos l_pos,
+ const struct journal_key *r)
+{
+ return (cmp_int(l_btree_id, r->btree_id) ?:
+ cmp_int(l_level, r->level) ?:
+ bpos_cmp(l_pos, r->k->k.p));
+}
+
+static inline int journal_key_cmp(const struct journal_key *l, const struct journal_key *r)
+{
+ return __journal_key_cmp(l->btree_id, l->level, l->k->k.p, r);
+}
+
struct bkey_i *bch2_journal_keys_peek_upto(struct bch_fs *, enum btree_id,
unsigned, struct bpos, struct bpos, size_t *);
struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *, enum btree_id,
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 30d69a6d133e..60f6255367b9 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -760,8 +760,15 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,

static noinline void bch2_drop_overwrites_from_journal(struct btree_trans *trans)
{
+ /*
+ * Accounting keys aren't deduped in the journal: we have to compare
+ * each individual update against what's in the btree to see if it has
+ * been applied yet, and accounting updates also don't overwrite,
+ * they're deltas that accumulate.
+ */
trans_for_each_update(trans, i)
- bch2_journal_key_overwritten(trans->c, i->btree_id, i->level, i->k->k.p);
+ if (i->k->k.type != KEY_TYPE_accounting)
+ bch2_journal_key_overwritten(trans->c, i->btree_id, i->level, i->k->k.p);
}

static noinline int bch2_trans_commit_bkey_invalid(struct btree_trans *trans,
diff --git a/fs/bcachefs/btree_update.h b/fs/bcachefs/btree_update.h
index cc7c53e83f89..21f887fe857c 100644
--- a/fs/bcachefs/btree_update.h
+++ b/fs/bcachefs/btree_update.h
@@ -128,7 +128,19 @@ static inline int __must_check bch2_trans_update_buffered(struct btree_trans *tr
enum btree_id btree,
struct bkey_i *k)
{
- if (unlikely(trans->journal_replay_not_finished))
+ /*
+ * Most updates skip the btree write buffer until journal replay is
+ * finished because synchronization with journal replay relies on having
+ * a btree node locked - if we're overwriting a key in the journal that
+ * journal replay hasn't yet replayed, we have to mark it as
+ * overwritten.
+ *
+ * But accounting updates don't overwrite, they're deltas, and they have
+ * to be flushed to the btree strictly in order for journal replay to be
+ * able to tell which updates need to be applied:
+ */
+ if (k->k.type != KEY_TYPE_accounting &&
+ unlikely(trans->journal_replay_not_finished))
return bch2_btree_insert_clone_trans(trans, btree, k);

struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(k->k.u64s));
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 96e7a1ec7091..6829d80bd181 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -11,6 +11,7 @@
#include "btree_io.h"
#include "buckets.h"
#include "dirent.h"
+#include "disk_accounting.h"
#include "ec.h"
#include "errcode.h"
#include "error.h"
@@ -87,6 +88,56 @@ static void replay_now_at(struct journal *j, u64 seq)
bch2_journal_pin_put(j, j->replay_journal_seq++);
}

+static int bch2_journal_replay_accounting_key(struct btree_trans *trans,
+ struct journal_key *k)
+{
+ struct journal_keys *keys = &trans->c->journal_keys;
+
+ struct btree_iter iter;
+ bch2_trans_node_iter_init(trans, &iter, k->btree_id, k->k->k.p,
+ BTREE_MAX_DEPTH, k->level,
+ BTREE_ITER_INTENT);
+ int ret = bch2_btree_iter_traverse(&iter);
+ if (ret)
+ goto out;
+
+ struct bkey u;
+ struct bkey_s_c old = bch2_btree_path_peek_slot(btree_iter_path(trans, &iter), &u);
+
+ if (bversion_cmp(old.k->version, k->k->k.version) >= 0) {
+ ret = 0;
+ goto out;
+ }
+
+ if (k + 1 < &darray_top(*keys) &&
+ !journal_key_cmp(k, k + 1)) {
+ BUG_ON(bversion_cmp(k[0].k->k.version, k[1].k->k.version) > 0);
+
+ bch2_accounting_accumulate(bkey_i_to_accounting(k[1].k),
+ bkey_i_to_s_c_accounting(k[0].k));
+ ret = 0;
+ goto out;
+ }
+
+ struct bkey_i *new = k->k;
+ if (old.k->type == KEY_TYPE_accounting) {
+ new = bch2_bkey_make_mut_noupdate(trans, bkey_i_to_s_c(k->k));
+ ret = PTR_ERR_OR_ZERO(new);
+ if (ret)
+ goto out;
+
+ bch2_accounting_accumulate(bkey_i_to_accounting(new),
+ bkey_s_c_to_accounting(old));
+ }
+
+ trans->journal_res.seq = k->journal_seq;
+
+ ret = bch2_trans_update(trans, &iter, new, BTREE_TRIGGER_NORUN);
+out:
+ bch2_trans_iter_exit(trans, &iter);
+ return ret;
+}
+
static int bch2_journal_replay_key(struct btree_trans *trans,
struct journal_key *k)
{
@@ -159,12 +210,33 @@ static int bch2_journal_replay(struct bch_fs *c)

BUG_ON(!atomic_read(&keys->ref));

+ /*
+ * Replay accounting keys first: we can't allow the write buffer to
+ * flush accounting keys until we're done
+ */
+ darray_for_each(*keys, k) {
+ if (!(k->k->k.type == KEY_TYPE_accounting && !k->allocated))
+ continue;
+
+ cond_resched();
+
+ ret = commit_do(trans, NULL, NULL,
+ BCH_TRANS_COMMIT_no_enospc|
+ BCH_TRANS_COMMIT_no_journal_res,
+ bch2_journal_replay_accounting_key(trans, k));
+ if (bch2_fs_fatal_err_on(ret, c, "error replaying accounting; %s", bch2_err_str(ret)))
+ goto err;
+ }
+
/*
* First, attempt to replay keys in sorted order. This is more
* efficient - better locality of btree access - but some might fail if
* that would cause a journal deadlock.
*/
darray_for_each(*keys, k) {
+ if (k->k->k.type == KEY_TYPE_accounting && !k->allocated)
+ continue;
+
cond_resched();

/* Skip fastpath if we're low on space in the journal */
@@ -174,7 +246,7 @@ static int bch2_journal_replay(struct bch_fs *c)
BCH_TRANS_COMMIT_journal_reclaim|
(!k->allocated ? BCH_TRANS_COMMIT_no_journal_res : 0),
bch2_journal_replay_key(trans, k));
- BUG_ON(!ret && !k->overwritten);
+ BUG_ON(!ret && !k->overwritten && k->k->k.type != KEY_TYPE_accounting);
if (ret) {
ret = darray_push(&keys_sorted, k);
if (ret)
@@ -208,7 +280,7 @@ static int bch2_journal_replay(struct bch_fs *c)
if (ret)
goto err;

- BUG_ON(!k->overwritten);
+ BUG_ON(k->btree_id != BTREE_ID_accounting && !k->overwritten);
}

/*
--
2.43.0


2024-02-25 02:39:16

by Kent Overstreet

Subject: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.

Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.
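
The merge rule, as a simplified standalone sketch (toy types, not the
actual write buffer code):

#include <stdbool.h>
#include <stdint.h>

struct wb_key {
	uint64_t	pos;		/* stands in for (btree, bpos) */
	bool		is_accounting;
	int64_t		delta;
};

/*
 * When two buffered updates sort to the same position, ordinary keys
 * are deduplicated (newest wins); accounting keys are summed instead:
 */
static bool wb_try_merge(struct wb_key *newer, const struct wb_key *older)
{
	if (newer->pos != older->pos)
		return false;

	if (newer->is_accounting && older->is_accounting)
		newer->delta += older->delta;

	return true;	/* either way, the older entry is dropped */
}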

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/bcachefs.h | 1 +
fs/bcachefs/btree_write_buffer.c | 66 +++++++++++++++++++++++++++-----
fs/bcachefs/recovery.c | 3 ++
3 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 62812fc1cad0..9a24989c9a6a 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -616,6 +616,7 @@ struct bch_dev {

#define BCH_FS_FLAGS() \
x(started) \
+ x(accounting_replay_done) \
x(may_go_rw) \
x(rw) \
x(was_rw) \
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index b77e7b382b66..002a0762fc85 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -5,6 +5,7 @@
#include "btree_update.h"
#include "btree_update_interior.h"
#include "btree_write_buffer.h"
+#include "disk_accounting.h"
#include "error.h"
#include "journal.h"
#include "journal_io.h"
@@ -123,7 +124,9 @@ static noinline int wb_flush_one_slowpath(struct btree_trans *trans,

static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *iter,
struct btree_write_buffered_key *wb,
- bool *write_locked, size_t *fast)
+ bool *write_locked,
+ bool *accounting_accumulated,
+ size_t *fast)
{
struct btree_path *path;
int ret;
@@ -136,6 +139,16 @@ static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *ite
if (ret)
return ret;

+ if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
+ struct bkey u;
+ struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
+
+ if (k.k->type == KEY_TYPE_accounting)
+ bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
+ bkey_s_c_to_accounting(k));
+ }
+ *accounting_accumulated = true;
+
/*
* We can't clone a path that has write locks: unshare it now, before
* set_pos and traverse():
@@ -248,8 +261,9 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
struct journal *j = &c->journal;
struct btree_write_buffer *wb = &c->btree_write_buffer;
struct btree_iter iter = { NULL };
- size_t skipped = 0, fast = 0, slowpath = 0;
+ size_t overwritten = 0, fast = 0, slowpath = 0, could_not_insert = 0;
bool write_locked = false;
+ bool accounting_replay_done = test_bit(BCH_FS_accounting_replay_done, &c->flags);
int ret = 0;

bch2_trans_unlock(trans);
@@ -284,17 +298,29 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)

darray_for_each(wb->sorted, i) {
struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
+ bool accounting_accumulated = false;

for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
prefetch(&wb->flushing.keys.data[n->idx]);

BUG_ON(!k->journal_seq);

+ if (!accounting_replay_done &&
+ k->k.k.type == KEY_TYPE_accounting) {
+ slowpath++;
+ continue;
+ }
+
if (i + 1 < &darray_top(wb->sorted) &&
wb_key_eq(i, i + 1)) {
struct btree_write_buffered_key *n = &wb->flushing.keys.data[i[1].idx];

- skipped++;
+ if (k->k.k.type == KEY_TYPE_accounting &&
+ n->k.k.type == KEY_TYPE_accounting)
+ bch2_accounting_accumulate(bkey_i_to_accounting(&n->k),
+ bkey_i_to_s_c_accounting(&k->k));
+
+ overwritten++;
n->journal_seq = min_t(u64, n->journal_seq, k->journal_seq);
k->journal_seq = 0;
continue;
@@ -325,7 +351,8 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
break;
}

- ret = wb_flush_one(trans, &iter, k, &write_locked, &fast);
+ ret = wb_flush_one(trans, &iter, k, &write_locked,
+ &accounting_accumulated, &fast);
if (!write_locked)
bch2_trans_begin(trans);
} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
@@ -361,8 +388,15 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
if (!i->journal_seq)
continue;

- bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
- bch2_btree_write_buffer_journal_flush);
+ if (!accounting_replay_done &&
+ i->k.k.type == KEY_TYPE_accounting) {
+ could_not_insert++;
+ continue;
+ }
+
+ if (!could_not_insert)
+ bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
+ bch2_btree_write_buffer_journal_flush);

bch2_trans_begin(trans);

@@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
btree_write_buffered_insert(trans, i));
if (ret)
goto err;
+
+ i->journal_seq = 0;
+ }
+
+ if (could_not_insert) {
+ struct btree_write_buffered_key *dst = wb->flushing.keys.data;
+
+ darray_for_each(wb->flushing.keys, i)
+ if (i->journal_seq)
+ *dst++ = *i;
+ wb->flushing.keys.nr = dst - wb->flushing.keys.data;
}
}
err:
+ if (ret || !could_not_insert) {
+ bch2_journal_pin_drop(j, &wb->flushing.pin);
+ wb->flushing.keys.nr = 0;
+ }
+
bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
- trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
- bch2_journal_pin_drop(j, &wb->flushing.pin);
- wb->flushing.keys.nr = 0;
+ trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
return ret;
}

diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 6829d80bd181..b8289af66c8e 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
goto err;
}

+ set_bit(BCH_FS_accounting_replay_done, &c->flags);
+
/*
* First, attempt to replay keys in sorted order. This is more
* efficient - better locality of btree access - but some might fail if
@@ -1204,6 +1206,7 @@ int bch2_fs_initialize(struct bch_fs *c)
* set up the journal.pin FIFO and journal.cur pointer:
*/
bch2_fs_journal_start(&c->journal, 1);
+ set_bit(BCH_FS_accounting_replay_done, &c->flags);
bch2_journal_set_replay_done(&c->journal);

ret = bch2_fs_read_write_early(c);
--
2.43.0


2024-02-25 02:39:29

by Kent Overstreet

Subject: [PATCH 05/21] bcachefs: dev_usage updated by new accounting

Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.

This patch special-cases the device counters: when we update in-memory
accounting we also update the old-style percpu counters, if it's a
device counter update.
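
Roughly, with toy types (the real update path is
bch2_accounting_mem_add(), below): when a delta's key is a device
counter, it's also added to a flat percpu mirror, so reads stay a
simple percpu sum:

#include <stdint.h>

#define NR_CPUS	4

struct dev_usage { int64_t buckets, sectors, fragmented; };

struct toy_dev {
	/* one copy per "cpu"; stands in for a real percpu allocation */
	struct dev_usage usage[NR_CPUS];
};

static void dev_acct_mem_add(struct toy_dev *ca, int cpu, const int64_t d[3])
{
	/* (the general path also accumulates into the eytzinger tree) */
	struct dev_usage *u = &ca->usage[cpu];

	u->buckets	+= d[0];
	u->sectors	+= d[1];
	u->fragmented	+= d[2];
}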

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/bcachefs.h | 3 +--
fs/bcachefs/btree_gc.c | 2 +-
fs/bcachefs/buckets.c | 36 +++++------------------------------
fs/bcachefs/buckets_types.h | 2 +-
fs/bcachefs/disk_accounting.c | 14 ++++++++++++++
fs/bcachefs/disk_accounting.h | 11 ++++++++++-
fs/bcachefs/recovery.c | 14 --------------
fs/bcachefs/sb-clean.c | 17 -----------------
8 files changed, 32 insertions(+), 67 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 18c00051a8f6..91c40fde1925 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -576,8 +576,7 @@ struct bch_dev {
unsigned long *buckets_nouse;
struct rw_semaphore bucket_lock;

- struct bch_dev_usage *usage_base;
- struct bch_dev_usage __percpu *usage[JOURNAL_BUF_NR];
+ struct bch_dev_usage __percpu *usage;
struct bch_dev_usage __percpu *usage_gc;

/* Allocator: */
diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c
index 2dfa7ca95fc0..93826749356e 100644
--- a/fs/bcachefs/btree_gc.c
+++ b/fs/bcachefs/btree_gc.c
@@ -1233,7 +1233,7 @@ static int bch2_gc_done(struct bch_fs *c,
bch2_fs_usage_acc_to_base(c, i);

__for_each_member_device(c, ca) {
- struct bch_dev_usage *dst = ca->usage_base;
+ struct bch_dev_usage *dst = this_cpu_ptr(ca->usage);
struct bch_dev_usage *src = (void *)
bch2_acc_percpu_u64s((u64 __percpu *) ca->usage_gc,
dev_usage_u64s());
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index fb915c1b7844..7540486ae266 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -69,15 +69,8 @@ void bch2_fs_usage_initialize(struct bch_fs *c)

void bch2_dev_usage_read_fast(struct bch_dev *ca, struct bch_dev_usage *usage)
{
- struct bch_fs *c = ca->fs;
- unsigned seq, i, u64s = dev_usage_u64s();
-
- do {
- seq = read_seqcount_begin(&c->usage_lock);
- memcpy(usage, ca->usage_base, u64s * sizeof(u64));
- for (i = 0; i < ARRAY_SIZE(ca->usage); i++)
- acc_u64s_percpu((u64 *) usage, (u64 __percpu *) ca->usage[i], u64s);
- } while (read_seqcount_retry(&c->usage_lock, seq));
+ memset(usage, 0, sizeof(*usage));
+ acc_u64s_percpu((u64 *) usage, (u64 __percpu *) ca->usage, dev_usage_u64s());
}

u64 bch2_fs_usage_read_one(struct bch_fs *c, u64 *v)
@@ -147,16 +140,6 @@ void bch2_fs_usage_acc_to_base(struct bch_fs *c, unsigned idx)
(u64 __percpu *) c->usage[idx], u64s);
percpu_memset(c->usage[idx], 0, u64s * sizeof(u64));

- rcu_read_lock();
- for_each_member_device_rcu(c, ca, NULL) {
- u64s = dev_usage_u64s();
-
- acc_u64s_percpu((u64 *) ca->usage_base,
- (u64 __percpu *) ca->usage[idx], u64s);
- percpu_memset(ca->usage[idx], 0, u64s * sizeof(u64));
- }
- rcu_read_unlock();
-
write_seqcount_end(&c->usage_lock);
preempt_enable();
}
@@ -1214,23 +1197,14 @@ void bch2_dev_buckets_free(struct bch_dev *ca)
{
kvfree(ca->buckets_nouse);
kvfree(rcu_dereference_protected(ca->bucket_gens, 1));
-
- for (unsigned i = 0; i < ARRAY_SIZE(ca->usage); i++)
- free_percpu(ca->usage[i]);
- kfree(ca->usage_base);
+ free_percpu(ca->usage);
}

int bch2_dev_buckets_alloc(struct bch_fs *c, struct bch_dev *ca)
{
- ca->usage_base = kzalloc(sizeof(struct bch_dev_usage), GFP_KERNEL);
- if (!ca->usage_base)
+ ca->usage = alloc_percpu(struct bch_dev_usage);
+ if (!ca->usage)
return -BCH_ERR_ENOMEM_usage_init;

- for (unsigned i = 0; i < ARRAY_SIZE(ca->usage); i++) {
- ca->usage[i] = alloc_percpu(struct bch_dev_usage);
- if (!ca->usage[i])
- return -BCH_ERR_ENOMEM_usage_init;
- }
-
return bch2_dev_buckets_resize(c, ca, ca->mi.nbuckets);
}
diff --git a/fs/bcachefs/buckets_types.h b/fs/bcachefs/buckets_types.h
index 6a31740222a7..baa7e0924390 100644
--- a/fs/bcachefs/buckets_types.h
+++ b/fs/bcachefs/buckets_types.h
@@ -33,7 +33,7 @@ struct bucket_gens {
};

struct bch_dev_usage {
- struct {
+ struct bch_dev_usage_type {
u64 buckets;
u64 sectors; /* _compressed_ sectors: */
/*
diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index 327c586ac661..e0114d8eb5a8 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -330,6 +330,20 @@ int bch2_accounting_read(struct bch_fs *c)
case BCH_DISK_ACCOUNTING_replicas:
fs_usage_data_type_to_base(usage, k.replicas.data_type, v[0]);
break;
+ case BCH_DISK_ACCOUNTING_dev_data_type:
+ if (bch2_dev_exists2(c, k.dev_data_type.dev)) {
+ struct bch_dev *ca = bch_dev_bkey_exists(c, k.dev_data_type.dev);
+ struct bch_dev_usage_type __percpu *d = &ca->usage->d[k.dev_data_type.data_type];
+
+ percpu_u64_set(&d->buckets, v[0]);
+ percpu_u64_set(&d->sectors, v[1]);
+ percpu_u64_set(&d->fragmented, v[2]);
+
+ if (k.dev_data_type.data_type == BCH_DATA_sb ||
+ k.dev_data_type.data_type == BCH_DATA_journal)
+ usage->hidden += v[0] * ca->mi.bucket_size;
+ }
+ break;
}
}
preempt_enable();
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index 5fd053a819df..a8526bf43207 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -3,6 +3,7 @@
#define _BCACHEFS_DISK_ACCOUNTING_H

#include <linux/eytzinger.h>
+#include "sb-members.h"

static inline void bch2_u64s_neg(u64 *v, unsigned nr)
{
@@ -126,6 +127,7 @@ static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_ac

static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a)
{
+ struct bch_fs *c = trans->c;
struct disk_accounting_key acc_k;
bpos_to_disk_accounting_key(&acc_k, a.k->p);

@@ -136,8 +138,15 @@ static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey
case BCH_DISK_ACCOUNTING_replicas:
fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
break;
+ case BCH_DISK_ACCOUNTING_dev_data_type: {
+ struct bch_dev *ca = bch_dev_bkey_exists(c, acc_k.dev_data_type.dev);
+
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].buckets, a.v->d[0]);
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].sectors, a.v->d[1]);
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].fragmented, a.v->d[2]);
+ }
}
- return __bch2_accounting_mem_add(trans->c, a);
+ return __bch2_accounting_mem_add(c, a);
}

static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 140393256f32..5a0ab3920382 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -368,20 +368,6 @@ static int journal_replay_entry_early(struct bch_fs *c,
le64_to_cpu(u->v));
break;
}
- case BCH_JSET_ENTRY_dev_usage: {
- struct jset_entry_dev_usage *u =
- container_of(entry, struct jset_entry_dev_usage, entry);
- struct bch_dev *ca = bch_dev_bkey_exists(c, le32_to_cpu(u->dev));
- unsigned i, nr_types = jset_entry_dev_usage_nr_types(u);
-
- for (i = 0; i < min_t(unsigned, nr_types, BCH_DATA_NR); i++) {
- ca->usage_base->d[i].buckets = le64_to_cpu(u->d[i].buckets);
- ca->usage_base->d[i].sectors = le64_to_cpu(u->d[i].sectors);
- ca->usage_base->d[i].fragmented = le64_to_cpu(u->d[i].fragmented);
- }
-
- break;
- }
case BCH_JSET_ENTRY_blacklist: {
struct jset_entry_blacklist *bl_entry =
container_of(entry, struct jset_entry_blacklist, entry);
diff --git a/fs/bcachefs/sb-clean.c b/fs/bcachefs/sb-clean.c
index 5980ba2563fe..a7f2cc774492 100644
--- a/fs/bcachefs/sb-clean.c
+++ b/fs/bcachefs/sb-clean.c
@@ -228,23 +228,6 @@ void bch2_journal_super_entries_add_common(struct bch_fs *c,
"embedded variable length struct");
}

- for_each_member_device(c, ca) {
- unsigned b = sizeof(struct jset_entry_dev_usage) +
- sizeof(struct jset_entry_dev_usage_type) * BCH_DATA_NR;
- struct jset_entry_dev_usage *u =
- container_of(jset_entry_init(end, b),
- struct jset_entry_dev_usage, entry);
-
- u->entry.type = BCH_JSET_ENTRY_dev_usage;
- u->dev = cpu_to_le32(ca->dev_idx);
-
- for (unsigned i = 0; i < BCH_DATA_NR; i++) {
- u->d[i].buckets = cpu_to_le64(ca->usage_base->d[i].buckets);
- u->d[i].sectors = cpu_to_le64(ca->usage_base->d[i].sectors);
- u->d[i].fragmented = cpu_to_le64(ca->usage_base->d[i].fragmented);
- }
- }
-
percpu_up_read(&c->mark_lock);

for (unsigned i = 0; i < 2; i++) {
--
2.43.0


2024-02-25 02:39:40

by Kent Overstreet

Subject: [PATCH 06/21] bcachefs: Kill bch2_fs_usage_initialize()

Deleting code for the old disk accounting scheme.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/buckets.c | 29 -----------------------------
fs/bcachefs/buckets.h | 2 --
fs/bcachefs/recovery.c | 2 --
3 files changed, 33 deletions(-)

diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 7540486ae266..054c4c8d9c1b 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -38,35 +38,6 @@ static inline struct bch_fs_usage *fs_usage_ptr(struct bch_fs *c,
: c->usage[journal_seq & JOURNAL_BUF_MASK]);
}

-void bch2_fs_usage_initialize(struct bch_fs *c)
-{
- percpu_down_write(&c->mark_lock);
- struct bch_fs_usage *usage = c->usage_base;
-
- for (unsigned i = 0; i < ARRAY_SIZE(c->usage); i++)
- bch2_fs_usage_acc_to_base(c, i);
-
- for (unsigned i = 0; i < BCH_REPLICAS_MAX; i++)
- usage->b.reserved += usage->persistent_reserved[i];
-
- for (unsigned i = 0; i < c->replicas.nr; i++) {
- struct bch_replicas_entry_v1 *e =
- cpu_replicas_entry(&c->replicas, i);
-
- fs_usage_data_type_to_base(&usage->b, e->data_type, usage->replicas[i]);
- }
-
- for_each_member_device(c, ca) {
- struct bch_dev_usage dev = bch2_dev_usage_read(ca);
-
- usage->b.hidden += (dev.d[BCH_DATA_sb].buckets +
- dev.d[BCH_DATA_journal].buckets) *
- ca->mi.bucket_size;
- }
-
- percpu_up_write(&c->mark_lock);
-}
-
void bch2_dev_usage_read_fast(struct bch_dev *ca, struct bch_dev_usage *usage)
{
memset(usage, 0, sizeof(*usage));
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index f9a1d24c997b..4e14615c770e 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -316,8 +316,6 @@ void bch2_dev_usage_update_m(struct bch_fs *, struct bch_dev *,
int bch2_update_replicas(struct bch_fs *, struct bkey_s_c,
struct bch_replicas_entry_v1 *, s64);

-void bch2_fs_usage_initialize(struct bch_fs *);
-
int bch2_check_bucket_ref(struct btree_trans *, struct bkey_s_c,
const struct bch_extent_ptr *,
s64, enum bch_data_type, u8, u8, u32);
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 5a0ab3920382..4936b18e5a58 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -426,8 +426,6 @@ static int journal_replay_early(struct bch_fs *c,
}
}

- bch2_fs_usage_initialize(c);
-
return 0;
}

--
2.43.0


2024-02-25 02:39:48

by Kent Overstreet

Subject: [PATCH 07/21] bcachefs: Convert bch2_ioctl_fs_usage() to new accounting

This converts bch2_ioctl_fs_usage() to read from the new disk
accounting, via bch2_fs_replicas_usage_read().

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/chardev.c | 68 ++++++++++++-------------------------------
1 file changed, 19 insertions(+), 49 deletions(-)

diff --git a/fs/bcachefs/chardev.c b/fs/bcachefs/chardev.c
index 992939152f01..13ea852be153 100644
--- a/fs/bcachefs/chardev.c
+++ b/fs/bcachefs/chardev.c
@@ -5,6 +5,7 @@
#include "bcachefs_ioctl.h"
#include "buckets.h"
#include "chardev.h"
+#include "disk_accounting.h"
#include "journal.h"
#include "move.h"
#include "recovery.h"
@@ -500,11 +501,11 @@ static long bch2_ioctl_data(struct bch_fs *c,
static long bch2_ioctl_fs_usage(struct bch_fs *c,
struct bch_ioctl_fs_usage __user *user_arg)
{
- struct bch_ioctl_fs_usage *arg = NULL;
- struct bch_replicas_usage *dst_e, *dst_end;
- struct bch_fs_usage_online *src;
- u32 replica_entries_bytes;
+ struct bch_ioctl_fs_usage arg;
+ struct bch_fs_usage_online *src = NULL;
+ darray_char replicas = {};
unsigned i;
+ u32 replica_entries_bytes;
int ret = 0;

if (!test_bit(BCH_FS_started, &c->flags))
@@ -513,9 +514,16 @@ static long bch2_ioctl_fs_usage(struct bch_fs *c,
if (get_user(replica_entries_bytes, &user_arg->replica_entries_bytes))
return -EFAULT;

- arg = kzalloc(size_add(sizeof(*arg), replica_entries_bytes), GFP_KERNEL);
- if (!arg)
- return -ENOMEM;
+ ret = bch2_fs_replicas_usage_read(c, &replicas) ?:
+ (replica_entries_bytes < replicas.nr ? -ERANGE : 0) ?:
+ copy_to_user_errcode(&user_arg->replicas, replicas.data, replicas.nr);
+ if (ret)
+ goto err;
+
+ arg.capacity = c->capacity;
+ arg.used = bch2_fs_sectors_used(c, src);
+ arg.online_reserved = src->online_reserved;
+ arg.replica_entries_bytes = replicas.nr;

src = bch2_fs_usage_read(c);
if (!src) {
@@ -523,52 +531,14 @@ static long bch2_ioctl_fs_usage(struct bch_fs *c,
goto err;
}

- arg->capacity = c->capacity;
- arg->used = bch2_fs_sectors_used(c, src);
- arg->online_reserved = src->online_reserved;
-
for (i = 0; i < BCH_REPLICAS_MAX; i++)
- arg->persistent_reserved[i] = src->u.persistent_reserved[i];
-
- dst_e = arg->replicas;
- dst_end = (void *) arg->replicas + replica_entries_bytes;
-
- for (i = 0; i < c->replicas.nr; i++) {
- struct bch_replicas_entry_v1 *src_e =
- cpu_replicas_entry(&c->replicas, i);
-
- /* check that we have enough space for one replicas entry */
- if (dst_e + 1 > dst_end) {
- ret = -ERANGE;
- break;
- }
-
- dst_e->sectors = src->u.replicas[i];
- dst_e->r = *src_e;
-
- /* recheck after setting nr_devs: */
- if (replicas_usage_next(dst_e) > dst_end) {
- ret = -ERANGE;
- break;
- }
-
- memcpy(dst_e->r.devs, src_e->devs, src_e->nr_devs);
-
- dst_e = replicas_usage_next(dst_e);
- }
-
- arg->replica_entries_bytes = (void *) dst_e - (void *) arg->replicas;
-
+ arg.persistent_reserved[i] = src->u.persistent_reserved[i];
percpu_up_read(&c->mark_lock);
- kfree(src);
-
- if (ret)
- goto err;

- ret = copy_to_user_errcode(user_arg, arg,
- sizeof(*arg) + arg->replica_entries_bytes);
+ ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
err:
- kfree(arg);
+ darray_exit(&replicas);
+ kfree(src);
return ret;
}

--
2.43.0


2024-02-25 02:40:36

by Kent Overstreet

Subject: [PATCH 08/21] bcachefs: kill bch2_fs_usage_read()

With bch2_ioctl_fs_usage() converted to the new accounting, this is now
dead code.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/bcachefs.h | 4 ----
fs/bcachefs/buckets.c | 34 ----------------------------------
fs/bcachefs/buckets.h | 2 --
fs/bcachefs/chardev.c | 25 ++++++++++++-------------
fs/bcachefs/replicas.c | 7 -------
fs/bcachefs/super.c | 2 --
6 files changed, 12 insertions(+), 62 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 91c40fde1925..5824cf57defd 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -912,10 +912,6 @@ struct bch_fs {
struct bch_fs_usage __percpu *usage_gc;
u64 __percpu *online_reserved;

- /* single element mempool: */
- struct mutex usage_scratch_lock;
- struct bch_fs_usage_online *usage_scratch;
-
struct io_clock io_clock[2];

/* JOURNAL SEQ BLACKLIST */
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 054c4c8d9c1b..24b53f449313 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -64,40 +64,6 @@ u64 bch2_fs_usage_read_one(struct bch_fs *c, u64 *v)
return ret;
}

-struct bch_fs_usage_online *bch2_fs_usage_read(struct bch_fs *c)
-{
- struct bch_fs_usage_online *ret;
- unsigned nr_replicas = READ_ONCE(c->replicas.nr);
- unsigned seq, i;
-retry:
- ret = kmalloc(__fs_usage_online_u64s(nr_replicas) * sizeof(u64), GFP_KERNEL);
- if (unlikely(!ret))
- return NULL;
-
- percpu_down_read(&c->mark_lock);
-
- if (nr_replicas != c->replicas.nr) {
- nr_replicas = c->replicas.nr;
- percpu_up_read(&c->mark_lock);
- kfree(ret);
- goto retry;
- }
-
- ret->online_reserved = percpu_u64_get(c->online_reserved);
-
- do {
- seq = read_seqcount_begin(&c->usage_lock);
- unsafe_memcpy(&ret->u, c->usage_base,
- __fs_usage_u64s(nr_replicas) * sizeof(u64),
- "embedded variable length struct");
- for (i = 0; i < ARRAY_SIZE(c->usage); i++)
- acc_u64s_percpu((u64 *) &ret->u, (u64 __percpu *) c->usage[i],
- __fs_usage_u64s(nr_replicas));
- } while (read_seqcount_retry(&c->usage_lock, seq));
-
- return ret;
-}
-
void bch2_fs_usage_acc_to_base(struct bch_fs *c, unsigned idx)
{
unsigned u64s = fs_usage_u64s(c);
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index 4e14615c770e..356f725a4fad 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -296,8 +296,6 @@ static inline unsigned dev_usage_u64s(void)

u64 bch2_fs_usage_read_one(struct bch_fs *, u64 *);

-struct bch_fs_usage_online *bch2_fs_usage_read(struct bch_fs *);
-
void bch2_fs_usage_acc_to_base(struct bch_fs *, unsigned);

void bch2_fs_usage_to_text(struct printbuf *,
diff --git a/fs/bcachefs/chardev.c b/fs/bcachefs/chardev.c
index 13ea852be153..03a1339e8f3b 100644
--- a/fs/bcachefs/chardev.c
+++ b/fs/bcachefs/chardev.c
@@ -502,9 +502,7 @@ static long bch2_ioctl_fs_usage(struct bch_fs *c,
struct bch_ioctl_fs_usage __user *user_arg)
{
struct bch_ioctl_fs_usage arg;
- struct bch_fs_usage_online *src = NULL;
darray_char replicas = {};
- unsigned i;
u32 replica_entries_bytes;
int ret = 0;

@@ -520,25 +518,26 @@ static long bch2_ioctl_fs_usage(struct bch_fs *c,
if (ret)
goto err;

+ struct bch_fs_usage_short u = bch2_fs_usage_read_short(c);
arg.capacity = c->capacity;
- arg.used = bch2_fs_sectors_used(c, src);
- arg.online_reserved = src->online_reserved;
+ arg.used = u.used;
+ arg.online_reserved = percpu_u64_get(c->online_reserved);
arg.replica_entries_bytes = replicas.nr;

- src = bch2_fs_usage_read(c);
- if (!src) {
- ret = -ENOMEM;
- goto err;
- }
+ for (unsigned i = 0; i < BCH_REPLICAS_MAX; i++) {
+ struct disk_accounting_key k = {
+ .type = BCH_DISK_ACCOUNTING_persistent_reserved,
+ .persistent_reserved.nr_replicas = i,
+ };

- for (i = 0; i < BCH_REPLICAS_MAX; i++)
- arg.persistent_reserved[i] = src->u.persistent_reserved[i];
- percpu_up_read(&c->mark_lock);
+ bch2_accounting_mem_read(c,
+ disk_accounting_key_to_bpos(&k),
+ &arg.persistent_reserved[i], 1);
+ }

ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
err:
darray_exit(&replicas);
- kfree(src);
return ret;
}

diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index dde581a49e28..d02eb03d2ebd 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -319,13 +319,10 @@ static int replicas_table_update(struct bch_fs *c,
struct bch_replicas_cpu *new_r)
{
struct bch_fs_usage __percpu *new_usage[JOURNAL_BUF_NR];
- struct bch_fs_usage_online *new_scratch = NULL;
struct bch_fs_usage __percpu *new_gc = NULL;
struct bch_fs_usage *new_base = NULL;
unsigned i, bytes = sizeof(struct bch_fs_usage) +
sizeof(u64) * new_r->nr;
- unsigned scratch_bytes = sizeof(struct bch_fs_usage_online) +
- sizeof(u64) * new_r->nr;
int ret = 0;

memset(new_usage, 0, sizeof(new_usage));
@@ -336,7 +333,6 @@ static int replicas_table_update(struct bch_fs *c,
goto err;

if (!(new_base = kzalloc(bytes, GFP_KERNEL)) ||
- !(new_scratch = kmalloc(scratch_bytes, GFP_KERNEL)) ||
(c->usage_gc &&
!(new_gc = __alloc_percpu_gfp(bytes, sizeof(u64), GFP_KERNEL))))
goto err;
@@ -355,12 +351,10 @@ static int replicas_table_update(struct bch_fs *c,
for (i = 0; i < ARRAY_SIZE(new_usage); i++)
swap(c->usage[i], new_usage[i]);
swap(c->usage_base, new_base);
- swap(c->usage_scratch, new_scratch);
swap(c->usage_gc, new_gc);
swap(c->replicas, *new_r);
out:
free_percpu(new_gc);
- kfree(new_scratch);
for (i = 0; i < ARRAY_SIZE(new_usage); i++)
free_percpu(new_usage[i]);
kfree(new_base);
@@ -1024,7 +1018,6 @@ void bch2_fs_replicas_exit(struct bch_fs *c)
{
unsigned i;

- kfree(c->usage_scratch);
for (i = 0; i < ARRAY_SIZE(c->usage); i++)
free_percpu(c->usage[i]);
kfree(c->usage_base);
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 685d54d0ddbb..a26472f4620a 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -783,8 +783,6 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)

INIT_LIST_HEAD(&c->list);

- mutex_init(&c->usage_scratch_lock);
-
mutex_init(&c->bio_bounce_pages_lock);
mutex_init(&c->snapshot_table_lock);
init_rwsem(&c->snapshot_create_lock);
--
2.43.0


2024-02-25 02:40:48

by Kent Overstreet

Subject: [PATCH 09/21] bcachefs: Kill writing old accounting to journal

More ripping out of the old disk space accounting.

Note that the new disk space accounting is incompatible with the old,
and writing out old-style disk space accounting with the new code is
infeasible.

This means upgrading and downgrading past this version requires
regenerating accounting.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/sb-clean.c | 45 ------------------------------------------
1 file changed, 45 deletions(-)

diff --git a/fs/bcachefs/sb-clean.c b/fs/bcachefs/sb-clean.c
index a7f2cc774492..1af2785653f6 100644
--- a/fs/bcachefs/sb-clean.c
+++ b/fs/bcachefs/sb-clean.c
@@ -175,25 +175,6 @@ void bch2_journal_super_entries_add_common(struct bch_fs *c,
struct jset_entry **end,
u64 journal_seq)
{
- percpu_down_read(&c->mark_lock);
-
- if (!journal_seq) {
- for (unsigned i = 0; i < ARRAY_SIZE(c->usage); i++)
- bch2_fs_usage_acc_to_base(c, i);
- } else {
- bch2_fs_usage_acc_to_base(c, journal_seq & JOURNAL_BUF_MASK);
- }
-
- {
- struct jset_entry_usage *u =
- container_of(jset_entry_init(end, sizeof(*u)),
- struct jset_entry_usage, entry);
-
- u->entry.type = BCH_JSET_ENTRY_usage;
- u->entry.btree_id = BCH_FS_USAGE_inodes;
- u->v = cpu_to_le64(c->usage_base->b.nr_inodes);
- }
-
{
struct jset_entry_usage *u =
container_of(jset_entry_init(end, sizeof(*u)),
@@ -204,32 +185,6 @@ void bch2_journal_super_entries_add_common(struct bch_fs *c,
u->v = cpu_to_le64(atomic64_read(&c->key_version));
}

- for (unsigned i = 0; i < BCH_REPLICAS_MAX; i++) {
- struct jset_entry_usage *u =
- container_of(jset_entry_init(end, sizeof(*u)),
- struct jset_entry_usage, entry);
-
- u->entry.type = BCH_JSET_ENTRY_usage;
- u->entry.btree_id = BCH_FS_USAGE_reserved;
- u->entry.level = i;
- u->v = cpu_to_le64(c->usage_base->persistent_reserved[i]);
- }
-
- for (unsigned i = 0; i < c->replicas.nr; i++) {
- struct bch_replicas_entry_v1 *e =
- cpu_replicas_entry(&c->replicas, i);
- struct jset_entry_data_usage *u =
- container_of(jset_entry_init(end, sizeof(*u) + e->nr_devs),
- struct jset_entry_data_usage, entry);
-
- u->entry.type = BCH_JSET_ENTRY_data_usage;
- u->v = cpu_to_le64(c->usage_base->replicas[i]);
- unsafe_memcpy(&u->r, e, replicas_entry_bytes(e),
- "embedded variable length struct");
- }
-
- percpu_up_read(&c->mark_lock);
-
for (unsigned i = 0; i < 2; i++) {
struct jset_entry_clock *clock =
container_of(jset_entry_init(end, sizeof(*clock)),
--
2.43.0


2024-02-25 02:40:48

by Kent Overstreet

Subject: [PATCH 11/21] bcachefs: Kill bch2_fs_usage_to_text()

Dead code.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/buckets.c | 39 ---------------------------------------
fs/bcachefs/buckets.h | 3 ---
2 files changed, 42 deletions(-)

diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 8476bd5cb3af..c261fa3a0273 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -32,45 +32,6 @@ void bch2_dev_usage_read_fast(struct bch_dev *ca, struct bch_dev_usage *usage)
acc_u64s_percpu((u64 *) usage, (u64 __percpu *) ca->usage, dev_usage_u64s());
}

-void bch2_fs_usage_to_text(struct printbuf *out,
- struct bch_fs *c,
- struct bch_fs_usage_online *fs_usage)
-{
- unsigned i;
-
- prt_printf(out, "capacity:\t\t\t%llu\n", c->capacity);
-
- prt_printf(out, "hidden:\t\t\t\t%llu\n",
- fs_usage->u.b.hidden);
- prt_printf(out, "data:\t\t\t\t%llu\n",
- fs_usage->u.b.data);
- prt_printf(out, "cached:\t\t\t\t%llu\n",
- fs_usage->u.b.cached);
- prt_printf(out, "reserved:\t\t\t%llu\n",
- fs_usage->u.b.reserved);
- prt_printf(out, "nr_inodes:\t\t\t%llu\n",
- fs_usage->u.b.nr_inodes);
- prt_printf(out, "online reserved:\t\t%llu\n",
- fs_usage->online_reserved);
-
- for (i = 0;
- i < ARRAY_SIZE(fs_usage->u.persistent_reserved);
- i++) {
- prt_printf(out, "%u replicas:\n", i + 1);
- prt_printf(out, "\treserved:\t\t%llu\n",
- fs_usage->u.persistent_reserved[i]);
- }
-
- for (i = 0; i < c->replicas.nr; i++) {
- struct bch_replicas_entry_v1 *e =
- cpu_replicas_entry(&c->replicas, i);
-
- prt_printf(out, "\t");
- bch2_replicas_entry_to_text(out, e);
- prt_printf(out, ":\t%llu\n", fs_usage->u.replicas[i]);
- }
-}
-
static u64 reserve_factor(u64 r)
{
return r + (round_up(r, (1 << RESERVE_FACTOR)) >> RESERVE_FACTOR);
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index dfdf1b3ee817..ccf9813c65e7 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -294,9 +294,6 @@ static inline unsigned dev_usage_u64s(void)
return sizeof(struct bch_dev_usage) / sizeof(u64);
}

-void bch2_fs_usage_to_text(struct printbuf *,
- struct bch_fs *, struct bch_fs_usage_online *);
-
u64 bch2_fs_sectors_used(struct bch_fs *, struct bch_fs_usage_online *);

struct bch_fs_usage_short
--
2.43.0


2024-02-25 02:40:55

by Kent Overstreet

Subject: [PATCH 10/21] bcachefs: Delete journal-buf-sharded old style accounting

More deletion of dead code.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/bcachefs.h | 3 +-
fs/bcachefs/btree_gc.c | 9 ++---
fs/bcachefs/buckets.c | 61 ++++-----------------------------
fs/bcachefs/buckets.h | 4 ---
fs/bcachefs/disk_accounting.c | 2 +-
fs/bcachefs/recovery.c | 20 +----------
fs/bcachefs/replicas.c | 63 +++--------------------------------
fs/bcachefs/replicas.h | 4 ---
fs/bcachefs/super.c | 2 ++
9 files changed, 21 insertions(+), 147 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 5824cf57defd..2e7c4d10c951 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -907,8 +907,7 @@ struct bch_fs {
struct percpu_rw_semaphore mark_lock;

seqcount_t usage_lock;
- struct bch_fs_usage *usage_base;
- struct bch_fs_usage __percpu *usage[JOURNAL_BUF_NR];
+ struct bch_fs_usage_base __percpu *usage;
struct bch_fs_usage __percpu *usage_gc;
u64 __percpu *online_reserved;

diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c
index 93826749356e..15a8796197f3 100644
--- a/fs/bcachefs/btree_gc.c
+++ b/fs/bcachefs/btree_gc.c
@@ -1229,10 +1229,8 @@ static int bch2_gc_done(struct bch_fs *c,
#define copy_fs_field(_err, _f, _msg, ...) \
copy_field(_err, _f, "fs has wrong " _msg, ##__VA_ARGS__)

- for (i = 0; i < ARRAY_SIZE(c->usage); i++)
- bch2_fs_usage_acc_to_base(c, i);
-
__for_each_member_device(c, ca) {
+ /* XXX */
struct bch_dev_usage *dst = this_cpu_ptr(ca->usage);
struct bch_dev_usage *src = (void *)
bch2_acc_percpu_u64s((u64 __percpu *) ca->usage_gc,
@@ -1249,8 +1247,10 @@ static int bch2_gc_done(struct bch_fs *c,
}

{
+#if 0
unsigned nr = fs_usage_u64s(c);
- struct bch_fs_usage *dst = c->usage_base;
+ /* XX: */
+ struct bch_fs_usage *dst = this_cpu_ptr(c->usage);
struct bch_fs_usage *src = (void *)
bch2_acc_percpu_u64s((u64 __percpu *) c->usage_gc, nr);

@@ -1290,6 +1290,7 @@ static int bch2_gc_done(struct bch_fs *c,
copy_fs_field(fs_usage_replicas_wrong,
replicas[i], "%s", buf.buf);
}
+#endif
}

#undef copy_fs_field
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 24b53f449313..8476bd5cb3af 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -26,61 +26,12 @@

#include <linux/preempt.h>

-static inline struct bch_fs_usage *fs_usage_ptr(struct bch_fs *c,
- unsigned journal_seq,
- bool gc)
-{
- percpu_rwsem_assert_held(&c->mark_lock);
- BUG_ON(!gc && !journal_seq);
-
- return this_cpu_ptr(gc
- ? c->usage_gc
- : c->usage[journal_seq & JOURNAL_BUF_MASK]);
-}
-
void bch2_dev_usage_read_fast(struct bch_dev *ca, struct bch_dev_usage *usage)
{
memset(usage, 0, sizeof(*usage));
acc_u64s_percpu((u64 *) usage, (u64 __percpu *) ca->usage, dev_usage_u64s());
}

-u64 bch2_fs_usage_read_one(struct bch_fs *c, u64 *v)
-{
- ssize_t offset = v - (u64 *) c->usage_base;
- unsigned i, seq;
- u64 ret;
-
- BUG_ON(offset < 0 || offset >= fs_usage_u64s(c));
- percpu_rwsem_assert_held(&c->mark_lock);
-
- do {
- seq = read_seqcount_begin(&c->usage_lock);
- ret = *v;
-
- for (i = 0; i < ARRAY_SIZE(c->usage); i++)
- ret += percpu_u64_get((u64 __percpu *) c->usage[i] + offset);
- } while (read_seqcount_retry(&c->usage_lock, seq));
-
- return ret;
-}
-
-void bch2_fs_usage_acc_to_base(struct bch_fs *c, unsigned idx)
-{
- unsigned u64s = fs_usage_u64s(c);
-
- BUG_ON(idx >= ARRAY_SIZE(c->usage));
-
- preempt_disable();
- write_seqcount_begin(&c->usage_lock);
-
- acc_u64s_percpu((u64 *) c->usage_base,
- (u64 __percpu *) c->usage[idx], u64s);
- percpu_memset(c->usage[idx], 0, u64s * sizeof(u64));
-
- write_seqcount_end(&c->usage_lock);
- preempt_enable();
-}
-
void bch2_fs_usage_to_text(struct printbuf *out,
struct bch_fs *c,
struct bch_fs_usage_online *fs_usage)
@@ -142,17 +93,17 @@ __bch2_fs_usage_read_short(struct bch_fs *c)
u64 data, reserved;

ret.capacity = c->capacity -
- bch2_fs_usage_read_one(c, &c->usage_base->b.hidden);
+ percpu_u64_get(&c->usage->hidden);

- data = bch2_fs_usage_read_one(c, &c->usage_base->b.data) +
- bch2_fs_usage_read_one(c, &c->usage_base->b.btree);
- reserved = bch2_fs_usage_read_one(c, &c->usage_base->b.reserved) +
+ data = percpu_u64_get(&c->usage->data) +
+ percpu_u64_get(&c->usage->btree);
+ reserved = percpu_u64_get(&c->usage->reserved) +
percpu_u64_get(c->online_reserved);

ret.used = min(ret.capacity, data + reserve_factor(reserved));
ret.free = ret.capacity - ret.used;

- ret.nr_inodes = bch2_fs_usage_read_one(c, &c->usage_base->b.nr_inodes);
+ ret.nr_inodes = percpu_u64_get(&c->usage->nr_inodes);

return ret;
}
@@ -461,7 +412,7 @@ void bch2_trans_account_disk_usage_change(struct btree_trans *trans)

percpu_down_read(&c->mark_lock);
preempt_disable();
- struct bch_fs_usage_base *dst = &fs_usage_ptr(c, trans->journal_res.seq, false)->b;
+ struct bch_fs_usage_base *dst = this_cpu_ptr(c->usage);
struct bch_fs_usage_base *src = &trans->fs_usage_delta;

s64 added = src->btree + src->data + src->reserved;
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index 356f725a4fad..dfdf1b3ee817 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -294,10 +294,6 @@ static inline unsigned dev_usage_u64s(void)
return sizeof(struct bch_dev_usage) / sizeof(u64);
}

-u64 bch2_fs_usage_read_one(struct bch_fs *, u64 *);
-
-void bch2_fs_usage_acc_to_base(struct bch_fs *, unsigned);
-
void bch2_fs_usage_to_text(struct printbuf *,
struct bch_fs *, struct bch_fs_usage_online *);

diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index e0114d8eb5a8..f898323f72c7 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -314,7 +314,7 @@ int bch2_accounting_read(struct bch_fs *c)

percpu_down_read(&c->mark_lock);
preempt_disable();
- struct bch_fs_usage_base *usage = &c->usage_base->b;
+ struct bch_fs_usage_base *usage = this_cpu_ptr(c->usage);

for (unsigned i = 0; i < acc->k.nr; i++) {
struct disk_accounting_key k;
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 4936b18e5a58..18fd71960d2e 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -344,28 +344,10 @@ static int journal_replay_entry_early(struct bch_fs *c,
container_of(entry, struct jset_entry_usage, entry);

switch (entry->btree_id) {
- case BCH_FS_USAGE_reserved:
- if (entry->level < BCH_REPLICAS_MAX)
- c->usage_base->persistent_reserved[entry->level] =
- le64_to_cpu(u->v);
- break;
- case BCH_FS_USAGE_inodes:
- c->usage_base->b.nr_inodes = le64_to_cpu(u->v);
- break;
case BCH_FS_USAGE_key_version:
- atomic64_set(&c->key_version,
- le64_to_cpu(u->v));
+ atomic64_set(&c->key_version, le64_to_cpu(u->v));
break;
}
-
- break;
- }
- case BCH_JSET_ENTRY_data_usage: {
- struct jset_entry_data_usage *u =
- container_of(entry, struct jset_entry_data_usage, entry);
-
- ret = bch2_replicas_set_usage(c, &u->r,
- le64_to_cpu(u->v));
break;
}
case BCH_JSET_ENTRY_blacklist: {
diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index d02eb03d2ebd..6dca705eaf1f 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -318,46 +318,23 @@ static void __replicas_table_update_pcpu(struct bch_fs_usage __percpu *dst_p,
static int replicas_table_update(struct bch_fs *c,
struct bch_replicas_cpu *new_r)
{
- struct bch_fs_usage __percpu *new_usage[JOURNAL_BUF_NR];
struct bch_fs_usage __percpu *new_gc = NULL;
- struct bch_fs_usage *new_base = NULL;
- unsigned i, bytes = sizeof(struct bch_fs_usage) +
+ unsigned bytes = sizeof(struct bch_fs_usage) +
sizeof(u64) * new_r->nr;
int ret = 0;

- memset(new_usage, 0, sizeof(new_usage));
-
- for (i = 0; i < ARRAY_SIZE(new_usage); i++)
- if (!(new_usage[i] = __alloc_percpu_gfp(bytes,
- sizeof(u64), GFP_KERNEL)))
- goto err;
-
- if (!(new_base = kzalloc(bytes, GFP_KERNEL)) ||
- (c->usage_gc &&
+ if ((c->usage_gc &&
!(new_gc = __alloc_percpu_gfp(bytes, sizeof(u64), GFP_KERNEL))))
goto err;

- for (i = 0; i < ARRAY_SIZE(new_usage); i++)
- if (c->usage[i])
- __replicas_table_update_pcpu(new_usage[i], new_r,
- c->usage[i], &c->replicas);
- if (c->usage_base)
- __replicas_table_update(new_base, new_r,
- c->usage_base, &c->replicas);
if (c->usage_gc)
__replicas_table_update_pcpu(new_gc, new_r,
c->usage_gc, &c->replicas);

- for (i = 0; i < ARRAY_SIZE(new_usage); i++)
- swap(c->usage[i], new_usage[i]);
- swap(c->usage_base, new_base);
swap(c->usage_gc, new_gc);
swap(c->replicas, *new_r);
out:
free_percpu(new_gc);
- for (i = 0; i < ARRAY_SIZE(new_usage); i++)
- free_percpu(new_usage[i]);
- kfree(new_base);
return ret;
err:
bch_err(c, "error updating replicas table: memory allocation failure");
@@ -544,6 +521,8 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
*/
int bch2_replicas_gc2(struct bch_fs *c)
{
+ return 0;
+#if 0
struct bch_replicas_cpu new = { 0 };
unsigned i, nr;
int ret = 0;
@@ -598,34 +577,7 @@ int bch2_replicas_gc2(struct bch_fs *c)
mutex_unlock(&c->sb_lock);

return ret;
-}
-
-int bch2_replicas_set_usage(struct bch_fs *c,
- struct bch_replicas_entry_v1 *r,
- u64 sectors)
-{
- int ret, idx = bch2_replicas_entry_idx(c, r);
-
- if (idx < 0) {
- struct bch_replicas_cpu n;
-
- n = cpu_replicas_add_entry(c, &c->replicas, r);
- if (!n.entries)
- return -BCH_ERR_ENOMEM_cpu_replicas;
-
- ret = replicas_table_update(c, &n);
- if (ret)
- return ret;
-
- kfree(n.entries);
-
- idx = bch2_replicas_entry_idx(c, r);
- BUG_ON(ret < 0);
- }
-
- c->usage_base->replicas[idx] = sectors;
-
- return 0;
+#endif
}

/* Replicas tracking - superblock: */
@@ -1016,11 +968,6 @@ unsigned bch2_dev_has_data(struct bch_fs *c, struct bch_dev *ca)

void bch2_fs_replicas_exit(struct bch_fs *c)
{
- unsigned i;
-
- for (i = 0; i < ARRAY_SIZE(c->usage); i++)
- free_percpu(c->usage[i]);
- kfree(c->usage_base);
kfree(c->replicas.entries);
kfree(c->replicas_gc.entries);
}
diff --git a/fs/bcachefs/replicas.h b/fs/bcachefs/replicas.h
index f00c586f8cd9..eac2dff20423 100644
--- a/fs/bcachefs/replicas.h
+++ b/fs/bcachefs/replicas.h
@@ -54,10 +54,6 @@ int bch2_replicas_gc_end(struct bch_fs *, int);
int bch2_replicas_gc_start(struct bch_fs *, unsigned);
int bch2_replicas_gc2(struct bch_fs *);

-int bch2_replicas_set_usage(struct bch_fs *,
- struct bch_replicas_entry_v1 *,
- u64);
-
#define for_each_cpu_replicas_entry(_r, _i) \
for (_i = (_r)->entries; \
(void *) (_i) < (void *) (_r)->entries + (_r)->nr * (_r)->entry_size;\
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index a26472f4620a..30b41c8de309 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -567,6 +567,7 @@ static void __bch2_fs_free(struct bch_fs *c)

darray_exit(&c->btree_roots_extra);
free_percpu(c->pcpu);
+ free_percpu(c->usage);
mempool_exit(&c->large_bkey_pool);
mempool_exit(&c->btree_bounce_pool);
bioset_exit(&c->btree_bio);
@@ -893,6 +894,7 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
offsetof(struct btree_write_bio, wbio.bio)),
BIOSET_NEED_BVECS) ||
!(c->pcpu = alloc_percpu(struct bch_fs_pcpu)) ||
+ !(c->usage = alloc_percpu(struct bch_fs_usage_base)) ||
!(c->online_reserved = alloc_percpu(u64)) ||
mempool_init_kvmalloc_pool(&c->btree_bounce_pool, 1,
c->opts.btree_node_size) ||
--
2.43.0


2024-02-25 02:40:58

by Kent Overstreet

Subject: [PATCH 12/21] bcachefs: Kill fs_usage_online

More dead code deletion.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/buckets.c | 10 ----------
fs/bcachefs/buckets.h | 12 ------------
fs/bcachefs/buckets_types.h | 5 -----
3 files changed, 27 deletions(-)

diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index c261fa3a0273..5e2b9aa93241 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -37,16 +37,6 @@ static u64 reserve_factor(u64 r)
return r + (round_up(r, (1 << RESERVE_FACTOR)) >> RESERVE_FACTOR);
}

-u64 bch2_fs_sectors_used(struct bch_fs *c, struct bch_fs_usage_online *fs_usage)
-{
- return min(fs_usage->u.b.hidden +
- fs_usage->u.b.btree +
- fs_usage->u.b.data +
- reserve_factor(fs_usage->u.b.reserved +
- fs_usage->online_reserved),
- c->capacity);
-}
-
static struct bch_fs_usage_short
__bch2_fs_usage_read_short(struct bch_fs *c)
{
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index ccf9813c65e7..f9d8d7b9fbd1 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -279,23 +279,11 @@ static inline unsigned fs_usage_u64s(struct bch_fs *c)
return __fs_usage_u64s(READ_ONCE(c->replicas.nr));
}

-static inline unsigned __fs_usage_online_u64s(unsigned nr_replicas)
-{
- return sizeof(struct bch_fs_usage_online) / sizeof(u64) + nr_replicas;
-}
-
-static inline unsigned fs_usage_online_u64s(struct bch_fs *c)
-{
- return __fs_usage_online_u64s(READ_ONCE(c->replicas.nr));
-}
-
static inline unsigned dev_usage_u64s(void)
{
return sizeof(struct bch_dev_usage) / sizeof(u64);
}

-u64 bch2_fs_sectors_used(struct bch_fs *, struct bch_fs_usage_online *);
-
struct bch_fs_usage_short
bch2_fs_usage_read_short(struct bch_fs *);

diff --git a/fs/bcachefs/buckets_types.h b/fs/bcachefs/buckets_types.h
index baa7e0924390..570acdf455bb 100644
--- a/fs/bcachefs/buckets_types.h
+++ b/fs/bcachefs/buckets_types.h
@@ -61,11 +61,6 @@ struct bch_fs_usage {
u64 replicas[];
};

-struct bch_fs_usage_online {
- u64 online_reserved;
- struct bch_fs_usage u;
-};
-
struct bch_fs_usage_short {
u64 capacity;
u64 used;
--
2.43.0


2024-02-25 02:41:00

by Kent Overstreet

Subject: [PATCH 04/21] bcachefs: Disk space accounting rewrite

Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percpu counters that are sharded by journal buffer, then
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
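
To make "every set of counters is a distinct key" concrete, here's a
minimal, illustrative sketch of the key encoding; the field names match
how this patch builds keys, but the authoritative definition lives in
the new disk_accounting_types.h, and packing/endian details are elided:

/*
 * Illustrative only: a disk_accounting_key is a tagged union laid out
 * over a struct bpos, so bpos_to_disk_accounting_key() and
 * disk_accounting_key_to_bpos() are just copies, plus a byteswap on
 * big endian machines.
 */
struct disk_accounting_key {
	union {
	struct {
		u8			type;	/* BCH_DISK_ACCOUNTING_* */
		union {
		struct {
			u8		nr_replicas;
		}			persistent_reserved;
		struct bch_replicas_entry_v1 replicas;
		struct {
			u8		dev;
			u8		data_type;
		}			dev_data_type;
		struct {
			u8		dev;
		}			dev_stripe_buckets;
		};
	};
	struct bpos			_pad;
	};
};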

Now, in-memory accounting requires only a single set of percpu counters -
not one per in-flight journal buffer - and in the future we'll probably
also have counters that don't use in-memory percpu counters at all,
since they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in-memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
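
To give a feel for that commit-time apply without digging through the
diff, here it is condensed from the code added below: an eytzinger
lookup from the key's bpos to an offset into a single percpu u64 array,
followed by a this_cpu_add() per counter:

static inline int __bch2_accounting_mem_add(struct bch_fs *c,
					    struct bkey_s_c_accounting a)
{
	struct bch_accounting_mem *acc = &c->accounting;
	unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr,
				       sizeof(acc->k.data[0]),
				       accounting_pos_cmp, &a.k->p);

	/* Key not seen before: grow the percpu counter array */
	if (unlikely(idx >= acc->k.nr))
		return bch2_accounting_mem_add_slowpath(c, a);

	unsigned offset = acc->k.data[idx].offset;

	for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
		this_cpu_add(acc->v[offset + i], a.v->d[i]);
	return 0;
}

The slowpath handles keys we haven't seen yet: it retakes mark_lock for
write, reallocates the percpu array with room for the new counters, and
re-sorts the eytzinger index.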

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/alloc_background.c | 68 +++++-
fs/bcachefs/bcachefs.h | 6 +-
fs/bcachefs/bcachefs_format.h | 1 -
fs/bcachefs/bcachefs_ioctl.h | 7 +-
fs/bcachefs/btree_gc.c | 3 +-
fs/bcachefs/btree_iter.c | 9 -
fs/bcachefs/btree_trans_commit.c | 62 ++++--
fs/bcachefs/btree_types.h | 1 -
fs/bcachefs/btree_update.h | 8 -
fs/bcachefs/buckets.c | 289 +++++---------------------
fs/bcachefs/buckets.h | 33 +--
fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
fs/bcachefs/disk_accounting.h | 126 ++++++++++++
fs/bcachefs/disk_accounting_types.h | 20 ++
fs/bcachefs/ec.c | 24 ++-
fs/bcachefs/inode.c | 9 +-
fs/bcachefs/recovery.c | 12 +-
fs/bcachefs/recovery_types.h | 1 +
fs/bcachefs/replicas.c | 42 ++--
fs/bcachefs/replicas.h | 11 +-
fs/bcachefs/replicas_types.h | 16 --
fs/bcachefs/sb-errors_types.h | 3 +-
fs/bcachefs/super.c | 49 +++--
23 files changed, 704 insertions(+), 404 deletions(-)
create mode 100644 fs/bcachefs/disk_accounting_types.h

diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index ccd6cbfd470e..d8ad5bb28a7f 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -14,6 +14,7 @@
#include "buckets_waiting_for_journal.h"
#include "clock.h"
#include "debug.h"
+#include "disk_accounting.h"
#include "ec.h"
#include "error.h"
#include "lru.h"
@@ -813,8 +814,60 @@ int bch2_trigger_alloc(struct btree_trans *trans,

if ((flags & BTREE_TRIGGER_BUCKET_INVALIDATE) &&
old_a->cached_sectors) {
- ret = bch2_update_cached_sectors_list(trans, new.k->p.inode,
- -((s64) old_a->cached_sectors));
+ ret = bch2_mod_dev_cached_sectors(trans, new.k->p.inode,
+ -((s64) old_a->cached_sectors));
+ if (ret)
+ return ret;
+ }
+
+
+ if (old_a->data_type != new_a->data_type ||
+ old_a->dirty_sectors != new_a->dirty_sectors) {
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_data_type,
+ .dev_data_type.dev = new.k->p.inode,
+ .dev_data_type.data_type = new_a->data_type,
+ };
+ s64 d[3];
+
+ if (old_a->data_type == new_a->data_type) {
+ d[0] = 0;
+ d[1] = (s64) new_a->dirty_sectors - (s64) old_a->dirty_sectors;
+ d[2] = bucket_sectors_fragmented(ca, *new_a) -
+ bucket_sectors_fragmented(ca, *old_a);
+
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
+ if (ret)
+ return ret;
+ } else {
+ d[0] = 1;
+ d[1] = new_a->dirty_sectors;
+ d[2] = bucket_sectors_fragmented(ca, *new_a);
+
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
+ if (ret)
+ return ret;
+
+ acc.dev_data_type.data_type = old_a->data_type;
+ d[0] = -1;
+ d[1] = -(s64) old_a->dirty_sectors;
+ d[2] = -bucket_sectors_fragmented(ca, *old_a);
+
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
+ if (ret)
+ return ret;
+ }
+ }
+
+ if (!!old_a->stripe != !!new_a->stripe) {
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_stripe_buckets,
+ .dev_stripe_buckets.dev = new.k->p.inode,
+ };
+ u64 d[1];
+
+ d[0] = (s64) !!new_a->stripe - (s64) !!old_a->stripe;
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 1);
if (ret)
return ret;
}
@@ -857,12 +910,11 @@ int bch2_trigger_alloc(struct btree_trans *trans,
}
}

- percpu_down_read(&c->mark_lock);
- if (new_a->gen != old_a->gen)
+ if (new_a->gen != old_a->gen) {
+ percpu_down_read(&c->mark_lock);
*bucket_gen(ca, new.k->p.offset) = new_a->gen;
-
- bch2_dev_usage_update(c, ca, old_a, new_a, journal_seq, false);
- percpu_up_read(&c->mark_lock);
+ percpu_up_read(&c->mark_lock);
+ }

#define eval_state(_a, expr) ({ const struct bch_alloc_v4 *a = _a; expr; })
#define statechange(expr) !eval_state(old_a, expr) && eval_state(new_a, expr)
@@ -906,6 +958,8 @@ int bch2_trigger_alloc(struct btree_trans *trans,

bucket_unlock(g);
percpu_up_read(&c->mark_lock);
+
+ bch2_dev_usage_update(c, ca, old_a, new_a);
}

return 0;
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 9a24989c9a6a..18c00051a8f6 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -207,6 +207,7 @@
#include <linux/zstd.h>

#include "bcachefs_format.h"
+#include "disk_accounting_types.h"
#include "errcode.h"
#include "fifo.h"
#include "nocow_locking_types.h"
@@ -695,8 +696,6 @@ struct btree_trans_buf {
struct btree_trans *trans;
};

-#define REPLICAS_DELTA_LIST_MAX (1U << 16)
-
#define BCACHEFS_ROOT_SUBVOL_INUM \
((subvol_inum) { BCACHEFS_ROOT_SUBVOL, BCACHEFS_ROOT_INO })

@@ -763,10 +762,11 @@ struct bch_fs {

struct bch_dev __rcu *devs[BCH_SB_MEMBERS_MAX];

+ struct bch_accounting_mem accounting;
+
struct bch_replicas_cpu replicas;
struct bch_replicas_cpu replicas_gc;
struct mutex replicas_gc_lock;
- mempool_t replicas_delta_pool;

struct journal_entry_res btree_root_journal_res;
struct journal_entry_res replicas_journal_res;
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index 313ca7dc370d..6edd3fd63bfa 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -1271,7 +1271,6 @@ static inline bool jset_entry_is_key(struct jset_entry *e)
switch (e->type) {
case BCH_JSET_ENTRY_btree_keys:
case BCH_JSET_ENTRY_btree_root:
- case BCH_JSET_ENTRY_overwrite:
case BCH_JSET_ENTRY_write_buffer_keys:
return true;
}
diff --git a/fs/bcachefs/bcachefs_ioctl.h b/fs/bcachefs/bcachefs_ioctl.h
index 4b8fba754b1c..0b82a4dd099f 100644
--- a/fs/bcachefs/bcachefs_ioctl.h
+++ b/fs/bcachefs/bcachefs_ioctl.h
@@ -251,10 +251,15 @@ struct bch_replicas_usage {
struct bch_replicas_entry_v1 r;
} __packed;

+static inline unsigned replicas_usage_bytes(struct bch_replicas_usage *u)
+{
+ return offsetof(struct bch_replicas_usage, r) + replicas_entry_bytes(&u->r);
+}
+
static inline struct bch_replicas_usage *
replicas_usage_next(struct bch_replicas_usage *u)
{
- return (void *) u + replicas_entry_bytes(&u->r) + 8;
+ return (void *) u + replicas_usage_bytes(u);
}

/*
diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c
index 6c52f116098f..2dfa7ca95fc0 100644
--- a/fs/bcachefs/btree_gc.c
+++ b/fs/bcachefs/btree_gc.c
@@ -827,7 +827,8 @@ static int bch2_gc_mark_key(struct btree_trans *trans, enum btree_id btree_id,
if (ret)
goto err;

- if (fsck_err_on(k->k->version.lo > atomic64_read(&c->key_version), c,
+ if (fsck_err_on(btree_id != BTREE_ID_accounting &&
+ k->k->version.lo > atomic64_read(&c->key_version), c,
bkey_version_in_future,
"key version number higher than recorded: %llu > %llu",
k->k->version.lo,
diff --git a/fs/bcachefs/btree_iter.c b/fs/bcachefs/btree_iter.c
index 2357af3e6757..ef7cb7174c8b 100644
--- a/fs/bcachefs/btree_iter.c
+++ b/fs/bcachefs/btree_iter.c
@@ -3072,15 +3072,6 @@ void bch2_trans_put(struct btree_trans *trans)
srcu_read_unlock(&c->btree_trans_barrier, trans->srcu_idx);
}

- if (trans->fs_usage_deltas) {
- if (trans->fs_usage_deltas->size + sizeof(trans->fs_usage_deltas) ==
- REPLICAS_DELTA_LIST_MAX)
- mempool_free(trans->fs_usage_deltas,
- &c->replicas_delta_pool);
- else
- kfree(trans->fs_usage_deltas);
- }
-
if (unlikely(trans->journal_replay_not_finished))
bch2_journal_keys_put(c);

diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 60f6255367b9..b005e20039bb 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -9,6 +9,7 @@
#include "btree_update_interior.h"
#include "btree_write_buffer.h"
#include "buckets.h"
+#include "disk_accounting.h"
#include "errcode.h"
#include "error.h"
#include "journal.h"
@@ -598,6 +599,14 @@ static noinline int bch2_trans_commit_run_gc_triggers(struct btree_trans *trans)
return 0;
}

+static struct bversion journal_pos_to_bversion(struct journal_res *res, unsigned offset)
+{
+ return (struct bversion) {
+ .hi = res->seq >> 32,
+ .lo = (res->seq << 32) | (res->offset + offset),
+ };
+}
+
static inline int
bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
struct btree_insert_entry **stopped_at,
@@ -606,7 +615,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
struct bch_fs *c = trans->c;
struct btree_trans_commit_hook *h;
unsigned u64s = 0;
- int ret;
+ int ret = 0;

if (race_fault()) {
trace_and_count(c, trans_restart_fault_inject, trans, trace_ip);
@@ -668,21 +677,35 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
i->k->k.version = MAX_VERSION;
}

- if (trans->fs_usage_deltas &&
- bch2_trans_fs_usage_apply(trans, trans->fs_usage_deltas))
- return -BCH_ERR_btree_insert_need_mark_replicas;
-
- /* XXX: we only want to run this if deltas are nonzero */
- bch2_trans_account_disk_usage_change(trans);
-
h = trans->hooks;
while (h) {
ret = h->fn(trans, h);
if (ret)
- goto revert_fs_usage;
+ return ret;
h = h->next;
}

+ percpu_down_read(&c->mark_lock);
+ struct jset_entry *entry = trans->journal_entries;
+
+ for (entry = trans->journal_entries;
+ entry != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s);
+ entry = vstruct_next(entry))
+ if (jset_entry_is_key(entry) && entry->start->k.type == KEY_TYPE_accounting) {
+ struct bkey_i_accounting *a = bkey_i_to_accounting(entry->start);
+
+ a->k.version = journal_pos_to_bversion(&trans->journal_res,
+ (u64 *) entry - (u64 *) trans->journal_entries);
+ BUG_ON(bversion_zero(a->k.version));
+ ret = bch2_accounting_mem_add(trans, accounting_i_to_s_c(a));
+ if (ret)
+ goto revert_fs_usage;
+ }
+ percpu_up_read(&c->mark_lock);
+
+ /* XXX: we only want to run this if deltas are nonzero */
+ bch2_trans_account_disk_usage_change(trans);
+
trans_for_each_update(trans, i)
if (BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS & (1U << i->bkey_type)) {
ret = run_one_mem_trigger(trans, i, BTREE_TRIGGER_ATOMIC|i->flags);
@@ -751,10 +774,20 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,

return 0;
fatal_err:
- bch2_fatal_error(c);
+ bch2_fs_fatal_error(c, "fatal error in transaction commit: %s", bch2_err_str(ret));
+ percpu_down_read(&c->mark_lock);
revert_fs_usage:
- if (trans->fs_usage_deltas)
- bch2_trans_fs_usage_revert(trans, trans->fs_usage_deltas);
+ for (struct jset_entry *entry2 = trans->journal_entries;
+ entry2 != entry;
+ entry2 = vstruct_next(entry2))
+ if (jset_entry_is_key(entry2) && entry2->start->k.type == KEY_TYPE_accounting) {
+ struct bkey_s_accounting a = bkey_i_to_s_accounting(entry2->start);
+
+ bch2_accounting_neg(a);
+ bch2_accounting_mem_add(trans, a.c);
+ bch2_accounting_neg(a);
+ }
+ percpu_up_read(&c->mark_lock);
return ret;
}

@@ -904,7 +937,7 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
break;
case -BCH_ERR_btree_insert_need_mark_replicas:
ret = drop_locks_do(trans,
- bch2_replicas_delta_list_mark(c, trans->fs_usage_deltas));
+ bch2_accounting_update_sb(trans));
break;
case -BCH_ERR_journal_res_get_blocked:
/*
@@ -996,8 +1029,6 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
!trans->journal_entries_u64s)
goto out_reset;

- memset(&trans->fs_usage_delta, 0, sizeof(trans->fs_usage_delta));
-
ret = bch2_trans_commit_run_triggers(trans);
if (ret)
goto out_reset;
@@ -1093,6 +1124,7 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
bch2_trans_verify_not_in_restart(trans);
if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res)))
memset(&trans->journal_res, 0, sizeof(trans->journal_res));
+ memset(&trans->fs_usage_delta, 0, sizeof(trans->fs_usage_delta));

ret = do_bch2_trans_commit(trans, flags, &errored_at, _RET_IP_);

diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index b2ebf143c3b7..2acca37eb831 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -441,7 +441,6 @@ struct btree_trans {

unsigned journal_u64s;
unsigned extra_disk_res; /* XXX kill */
- struct replicas_delta_list *fs_usage_deltas;

/* Entries before this are zeroed out on every bch2_trans_get() call */

diff --git a/fs/bcachefs/btree_update.h b/fs/bcachefs/btree_update.h
index 21f887fe857c..6f8812f21444 100644
--- a/fs/bcachefs/btree_update.h
+++ b/fs/bcachefs/btree_update.h
@@ -213,14 +213,6 @@ static inline void bch2_trans_reset_updates(struct btree_trans *trans)
trans->journal_entries_u64s = 0;
trans->hooks = NULL;
trans->extra_disk_res = 0;
-
- if (trans->fs_usage_deltas) {
- trans->fs_usage_deltas->used = 0;
- memset((void *) trans->fs_usage_deltas +
- offsetof(struct replicas_delta_list, memset_start), 0,
- (void *) &trans->fs_usage_deltas->memset_end -
- (void *) &trans->fs_usage_deltas->memset_start);
- }
}

static inline struct bkey_i *__bch2_bkey_make_mut_noupdate(struct btree_trans *trans, struct bkey_s_c k,
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index c2f46b267b3a..fb915c1b7844 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -13,6 +13,7 @@
#include "btree_update.h"
#include "buckets.h"
#include "buckets_waiting_for_journal.h"
+#include "disk_accounting.h"
#include "ec.h"
#include "error.h"
#include "inode.h"
@@ -25,24 +26,16 @@

#include <linux/preempt.h>

-static inline void fs_usage_data_type_to_base(struct bch_fs_usage_base *fs_usage,
- enum bch_data_type data_type,
- s64 sectors)
+static inline struct bch_fs_usage *fs_usage_ptr(struct bch_fs *c,
+ unsigned journal_seq,
+ bool gc)
{
- switch (data_type) {
- case BCH_DATA_btree:
- fs_usage->btree += sectors;
- break;
- case BCH_DATA_user:
- case BCH_DATA_parity:
- fs_usage->data += sectors;
- break;
- case BCH_DATA_cached:
- fs_usage->cached += sectors;
- break;
- default:
- break;
- }
+ percpu_rwsem_assert_held(&c->mark_lock);
+ BUG_ON(!gc && !journal_seq);
+
+ return this_cpu_ptr(gc
+ ? c->usage_gc
+ : c->usage[journal_seq & JOURNAL_BUF_MASK]);
}

void bch2_fs_usage_initialize(struct bch_fs *c)
@@ -67,24 +60,13 @@ void bch2_fs_usage_initialize(struct bch_fs *c)
struct bch_dev_usage dev = bch2_dev_usage_read(ca);

usage->b.hidden += (dev.d[BCH_DATA_sb].buckets +
- dev.d[BCH_DATA_journal].buckets) *
+ dev.d[BCH_DATA_journal].buckets) *
ca->mi.bucket_size;
}

percpu_up_write(&c->mark_lock);
}

-static inline struct bch_dev_usage *dev_usage_ptr(struct bch_dev *ca,
- unsigned journal_seq,
- bool gc)
-{
- BUG_ON(!gc && !journal_seq);
-
- return this_cpu_ptr(gc
- ? ca->usage_gc
- : ca->usage[journal_seq & JOURNAL_BUF_MASK]);
-}
-
void bch2_dev_usage_read_fast(struct bch_dev *ca, struct bch_dev_usage *usage)
{
struct bch_fs *c = ca->fs;
@@ -267,11 +249,6 @@ bch2_fs_usage_read_short(struct bch_fs *c)
return ret;
}

-void bch2_dev_usage_init(struct bch_dev *ca)
-{
- ca->usage_base->d[BCH_DATA_free].buckets = ca->mi.nbuckets - ca->mi.first_bucket;
-}
-
void bch2_dev_usage_to_text(struct printbuf *out, struct bch_dev_usage *usage)
{
prt_tab(out);
@@ -298,21 +275,20 @@ void bch2_dev_usage_to_text(struct printbuf *out, struct bch_dev_usage *usage)

void bch2_dev_usage_update(struct bch_fs *c, struct bch_dev *ca,
const struct bch_alloc_v4 *old,
- const struct bch_alloc_v4 *new,
- u64 journal_seq, bool gc)
+ const struct bch_alloc_v4 *new)
{
struct bch_fs_usage *fs_usage;
struct bch_dev_usage *u;

preempt_disable();
- fs_usage = fs_usage_ptr(c, journal_seq, gc);
+ fs_usage = this_cpu_ptr(c->usage_gc);

if (data_type_is_hidden(old->data_type))
fs_usage->b.hidden -= ca->mi.bucket_size;
if (data_type_is_hidden(new->data_type))
fs_usage->b.hidden += ca->mi.bucket_size;

- u = dev_usage_ptr(ca, journal_seq, gc);
+ u = this_cpu_ptr(ca->usage_gc);

u->d[old->data_type].buckets--;
u->d[new->data_type].buckets++;
@@ -346,27 +322,11 @@ void bch2_dev_usage_update_m(struct bch_fs *c, struct bch_dev *ca,
struct bch_alloc_v4 old_a = bucket_m_to_alloc(*old);
struct bch_alloc_v4 new_a = bucket_m_to_alloc(*new);

- bch2_dev_usage_update(c, ca, &old_a, &new_a, 0, true);
-}
-
-static inline int __update_replicas(struct bch_fs *c,
- struct bch_fs_usage *fs_usage,
- struct bch_replicas_entry_v1 *r,
- s64 sectors)
-{
- int idx = bch2_replicas_entry_idx(c, r);
-
- if (idx < 0)
- return -1;
-
- fs_usage_data_type_to_base(&fs_usage->b, r->data_type, sectors);
- fs_usage->replicas[idx] += sectors;
- return 0;
+ bch2_dev_usage_update(c, ca, &old_a, &new_a);
}

int bch2_update_replicas(struct bch_fs *c, struct bkey_s_c k,
- struct bch_replicas_entry_v1 *r, s64 sectors,
- unsigned journal_seq, bool gc)
+ struct bch_replicas_entry_v1 *r, s64 sectors)
{
struct bch_fs_usage *fs_usage;
int idx, ret = 0;
@@ -393,7 +353,7 @@ int bch2_update_replicas(struct bch_fs *c, struct bkey_s_c k,
}

preempt_disable();
- fs_usage = fs_usage_ptr(c, journal_seq, gc);
+ fs_usage = this_cpu_ptr(c->usage_gc);
fs_usage_data_type_to_base(&fs_usage->b, r->data_type, sectors);
fs_usage->replicas[idx] += sectors;
preempt_enable();
@@ -406,94 +366,13 @@ int bch2_update_replicas(struct bch_fs *c, struct bkey_s_c k,

static inline int update_cached_sectors(struct bch_fs *c,
struct bkey_s_c k,
- unsigned dev, s64 sectors,
- unsigned journal_seq, bool gc)
+ unsigned dev, s64 sectors)
{
struct bch_replicas_padded r;

bch2_replicas_entry_cached(&r.e, dev);

- return bch2_update_replicas(c, k, &r.e, sectors, journal_seq, gc);
-}
-
-static int __replicas_deltas_realloc(struct btree_trans *trans, unsigned more,
- gfp_t gfp)
-{
- struct replicas_delta_list *d = trans->fs_usage_deltas;
- unsigned new_size = d ? (d->size + more) * 2 : 128;
- unsigned alloc_size = sizeof(*d) + new_size;
-
- WARN_ON_ONCE(alloc_size > REPLICAS_DELTA_LIST_MAX);
-
- if (!d || d->used + more > d->size) {
- d = krealloc(d, alloc_size, gfp|__GFP_ZERO);
-
- if (unlikely(!d)) {
- if (alloc_size > REPLICAS_DELTA_LIST_MAX)
- return -ENOMEM;
-
- d = mempool_alloc(&trans->c->replicas_delta_pool, gfp);
- if (!d)
- return -ENOMEM;
-
- memset(d, 0, REPLICAS_DELTA_LIST_MAX);
-
- if (trans->fs_usage_deltas)
- memcpy(d, trans->fs_usage_deltas,
- trans->fs_usage_deltas->size + sizeof(*d));
-
- new_size = REPLICAS_DELTA_LIST_MAX - sizeof(*d);
- kfree(trans->fs_usage_deltas);
- }
-
- d->size = new_size;
- trans->fs_usage_deltas = d;
- }
-
- return 0;
-}
-
-int bch2_replicas_deltas_realloc(struct btree_trans *trans, unsigned more)
-{
- return allocate_dropping_locks_errcode(trans,
- __replicas_deltas_realloc(trans, more, _gfp));
-}
-
-int bch2_update_replicas_list(struct btree_trans *trans,
- struct bch_replicas_entry_v1 *r,
- s64 sectors)
-{
- struct replicas_delta_list *d;
- struct replicas_delta *n;
- unsigned b;
- int ret;
-
- if (!sectors)
- return 0;
-
- b = replicas_entry_bytes(r) + 8;
- ret = bch2_replicas_deltas_realloc(trans, b);
- if (ret)
- return ret;
-
- d = trans->fs_usage_deltas;
- n = (void *) d->d + d->used;
- n->delta = sectors;
- unsafe_memcpy((void *) n + offsetof(struct replicas_delta, r),
- r, replicas_entry_bytes(r),
- "flexible array member embedded in strcuct with padding");
- bch2_replicas_entry_sort(&n->r);
- d->used += b;
- return 0;
-}
-
-int bch2_update_cached_sectors_list(struct btree_trans *trans, unsigned dev, s64 sectors)
-{
- struct bch_replicas_padded r;
-
- bch2_replicas_entry_cached(&r.e, dev);
-
- return bch2_update_replicas_list(trans, &r.e, sectors);
+ return bch2_update_replicas(c, k, &r.e, sectors);
}

int bch2_mark_metadata_bucket(struct bch_fs *c, struct bch_dev *ca,
@@ -653,47 +532,6 @@ int bch2_check_bucket_ref(struct btree_trans *trans,
goto out;
}

-void bch2_trans_fs_usage_revert(struct btree_trans *trans,
- struct replicas_delta_list *deltas)
-{
- struct bch_fs *c = trans->c;
- struct bch_fs_usage *dst;
- struct replicas_delta *d, *top = (void *) deltas->d + deltas->used;
- s64 added = 0;
- unsigned i;
-
- percpu_down_read(&c->mark_lock);
- preempt_disable();
- dst = fs_usage_ptr(c, trans->journal_res.seq, false);
-
- /* revert changes: */
- for (d = deltas->d; d != top; d = replicas_delta_next(d)) {
- switch (d->r.data_type) {
- case BCH_DATA_btree:
- case BCH_DATA_user:
- case BCH_DATA_parity:
- added += d->delta;
- }
- BUG_ON(__update_replicas(c, dst, &d->r, -d->delta));
- }
-
- dst->b.nr_inodes -= deltas->nr_inodes;
-
- for (i = 0; i < BCH_REPLICAS_MAX; i++) {
- added -= deltas->persistent_reserved[i];
- dst->b.reserved -= deltas->persistent_reserved[i];
- dst->persistent_reserved[i] -= deltas->persistent_reserved[i];
- }
-
- if (added > 0) {
- trans->disk_res->sectors += added;
- this_cpu_add(*c->online_reserved, added);
- }
-
- preempt_enable();
- percpu_up_read(&c->mark_lock);
-}
-
void bch2_trans_account_disk_usage_change(struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
@@ -747,43 +585,6 @@ void bch2_trans_account_disk_usage_change(struct btree_trans *trans)
should_not_have_added, disk_res_sectors);
}

-int bch2_trans_fs_usage_apply(struct btree_trans *trans,
- struct replicas_delta_list *deltas)
-{
- struct bch_fs *c = trans->c;
- struct replicas_delta *d, *d2;
- struct replicas_delta *top = (void *) deltas->d + deltas->used;
- struct bch_fs_usage *dst;
- unsigned i;
-
- percpu_down_read(&c->mark_lock);
- preempt_disable();
- dst = fs_usage_ptr(c, trans->journal_res.seq, false);
-
- for (d = deltas->d; d != top; d = replicas_delta_next(d))
- if (__update_replicas(c, dst, &d->r, d->delta))
- goto need_mark;
-
- dst->b.nr_inodes += deltas->nr_inodes;
-
- for (i = 0; i < BCH_REPLICAS_MAX; i++) {
- dst->b.reserved += deltas->persistent_reserved[i];
- dst->persistent_reserved[i] += deltas->persistent_reserved[i];
- }
-
- preempt_enable();
- percpu_up_read(&c->mark_lock);
- return 0;
-need_mark:
- /* revert changes: */
- for (d2 = deltas->d; d2 != d; d2 = replicas_delta_next(d2))
- BUG_ON(__update_replicas(c, dst, &d2->r, -d2->delta));
-
- preempt_enable();
- percpu_up_read(&c->mark_lock);
- return -1;
-}
-
/* KEY_TYPE_extent: */

static int __mark_pointer(struct btree_trans *trans,
@@ -911,10 +712,12 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
stripe_blockcount_get(&s->v, p.ec.block) +
sectors);

- struct bch_replicas_padded r;
- bch2_bkey_to_replicas(&r.e, bkey_i_to_s_c(&s->k_i));
- r.e.data_type = data_type;
- ret = bch2_update_replicas_list(trans, &r.e, sectors);
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+ bch2_bkey_to_replicas(&acc.replicas, bkey_i_to_s_c(&s->k_i));
+ acc.replicas.data_type = data_type;
+ ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
@@ -951,7 +754,7 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
mutex_unlock(&c->ec_stripes_heap_lock);

r.e.data_type = data_type;
- bch2_update_replicas(c, k, &r.e, sectors, trans->journal_res.seq, true);
+ bch2_update_replicas(c, k, &r.e, sectors);
}

return 0;
@@ -966,16 +769,18 @@ static int __trigger_extent(struct btree_trans *trans,
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
- struct bch_replicas_padded r;
enum bch_data_type data_type = bkey_is_btree_ptr(k.k)
? BCH_DATA_btree
: BCH_DATA_user;
s64 dirty_sectors = 0;
int ret = 0;

- r.e.data_type = data_type;
- r.e.nr_devs = 0;
- r.e.nr_required = 1;
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ .replicas.data_type = data_type,
+ .replicas.nr_devs = 0,
+ .replicas.nr_required = 1,
+ };

bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
s64 disk_sectors;
@@ -988,8 +793,8 @@ static int __trigger_extent(struct btree_trans *trans,
if (p.ptr.cached) {
if (!stale) {
ret = !gc
- ? bch2_update_cached_sectors_list(trans, p.ptr.dev, disk_sectors)
- : update_cached_sectors(c, k, p.ptr.dev, disk_sectors, 0, true);
+ ? bch2_mod_dev_cached_sectors(trans, p.ptr.dev, disk_sectors)
+ : update_cached_sectors(c, k, p.ptr.dev, disk_sectors);
bch2_fs_fatal_err_on(ret && gc, c, "%s(): no replicas entry while updating cached sectors",
__func__);
if (ret)
@@ -997,7 +802,7 @@ static int __trigger_extent(struct btree_trans *trans,
}
} else if (!p.has_ec) {
dirty_sectors += disk_sectors;
- r.e.devs[r.e.nr_devs++] = p.ptr.dev;
+ acc.replicas.devs[acc.replicas.nr_devs++] = p.ptr.dev;
} else {
ret = bch2_trigger_stripe_ptr(trans, k, p, data_type, disk_sectors, flags);
if (ret)
@@ -1008,14 +813,14 @@ static int __trigger_extent(struct btree_trans *trans,
* if so they're not required for mounting if we have an
* erasure coded pointer in this extent:
*/
- r.e.nr_required = 0;
+ acc.replicas.nr_required = 0;
}
}

- if (r.e.nr_devs) {
+ if (acc.replicas.nr_devs) {
ret = !gc
- ? bch2_update_replicas_list(trans, &r.e, dirty_sectors)
- : bch2_update_replicas(c, k, &r.e, dirty_sectors, 0, true);
+ ? bch2_disk_accounting_mod(trans, &acc, &dirty_sectors, 1)
+ : bch2_update_replicas(c, k, &acc.replicas, dirty_sectors);
if (unlikely(ret && gc)) {
struct printbuf buf = PRINTBUF;

@@ -1074,23 +879,23 @@ static int __trigger_reservation(struct btree_trans *trans,
{
struct bch_fs *c = trans->c;
unsigned replicas = bkey_s_c_to_reservation(k).v->nr_replicas;
- s64 sectors = (s64) k.k->size * replicas;
+ s64 sectors = (s64) k.k->size;

if (flags & BTREE_TRIGGER_OVERWRITE)
sectors = -sectors;

if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
- int ret = bch2_replicas_deltas_realloc(trans, 0);
- if (ret)
- return ret;
-
- struct replicas_delta_list *d = trans->fs_usage_deltas;
- replicas = min(replicas, ARRAY_SIZE(d->persistent_reserved));
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_persistent_reserved,
+ .persistent_reserved.nr_replicas = replicas,
+ };

- d->persistent_reserved[replicas - 1] += sectors;
+ return bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
}

if (flags & BTREE_TRIGGER_GC) {
+ sectors *= replicas;
+
percpu_down_read(&c->mark_lock);
preempt_disable();

diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index 6387e039f789..f9a1d24c997b 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -202,7 +202,6 @@ static inline struct bch_dev_usage bch2_dev_usage_read(struct bch_dev *ca)
return ret;
}

-void bch2_dev_usage_init(struct bch_dev *);
void bch2_dev_usage_to_text(struct printbuf *, struct bch_dev_usage *);

static inline u64 bch2_dev_buckets_reserved(struct bch_dev *ca, enum bch_watermark watermark)
@@ -261,6 +260,13 @@ static inline u64 dev_buckets_available(struct bch_dev *ca,
return __dev_buckets_available(ca, bch2_dev_usage_read(ca), watermark);
}

+static inline s64 bucket_sectors_fragmented(struct bch_dev *ca, struct bch_alloc_v4 a)
+{
+ return a.dirty_sectors
+ ? max(0, (int) ca->mi.bucket_size - (int) a.dirty_sectors)
+ : 0;
+}
+
/* Filesystem usage: */

static inline unsigned __fs_usage_u64s(unsigned nr_replicas)
@@ -304,31 +310,11 @@ bch2_fs_usage_read_short(struct bch_fs *);

void bch2_dev_usage_update(struct bch_fs *, struct bch_dev *,
const struct bch_alloc_v4 *,
- const struct bch_alloc_v4 *, u64, bool);
+ const struct bch_alloc_v4 *);
void bch2_dev_usage_update_m(struct bch_fs *, struct bch_dev *,
struct bucket *, struct bucket *);
-
-/* key/bucket marking: */
-
-static inline struct bch_fs_usage *fs_usage_ptr(struct bch_fs *c,
- unsigned journal_seq,
- bool gc)
-{
- percpu_rwsem_assert_held(&c->mark_lock);
- BUG_ON(!gc && !journal_seq);
-
- return this_cpu_ptr(gc
- ? c->usage_gc
- : c->usage[journal_seq & JOURNAL_BUF_MASK]);
-}
-
int bch2_update_replicas(struct bch_fs *, struct bkey_s_c,
- struct bch_replicas_entry_v1 *, s64,
- unsigned, bool);
-int bch2_update_replicas_list(struct btree_trans *,
struct bch_replicas_entry_v1 *, s64);
-int bch2_update_cached_sectors_list(struct btree_trans *, unsigned, s64);
-int bch2_replicas_deltas_realloc(struct btree_trans *, unsigned);

void bch2_fs_usage_initialize(struct bch_fs *);

@@ -358,9 +344,6 @@ int bch2_trigger_reservation(struct btree_trans *, enum btree_id, unsigned,

void bch2_trans_account_disk_usage_change(struct btree_trans *);

-void bch2_trans_fs_usage_revert(struct btree_trans *, struct replicas_delta_list *);
-int bch2_trans_fs_usage_apply(struct btree_trans *, struct replicas_delta_list *);
-
int bch2_trans_mark_metadata_bucket(struct btree_trans *, struct bch_dev *,
size_t, enum bch_data_type, unsigned);
int bch2_trans_mark_dev_sb(struct bch_fs *, struct bch_dev *);
diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index 209f59e87b34..327c586ac661 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -1,9 +1,13 @@
// SPDX-License-Identifier: GPL-2.0

#include "bcachefs.h"
+#include "bcachefs_ioctl.h"
#include "btree_update.h"
+#include "btree_write_buffer.h"
#include "buckets.h"
#include "disk_accounting.h"
+#include "error.h"
+#include "journal_io.h"
#include "replicas.h"

static const char * const disk_accounting_type_strs[] = {
@@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
NULL
};

+int bch2_disk_accounting_mod(struct btree_trans *trans,
+ struct disk_accounting_key *k,
+ s64 *d, unsigned nr)
+{
+ /* Normalize: */
+ switch (k->type) {
+ case BCH_DISK_ACCOUNTING_replicas:
+ bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
+ break;
+ }
+
+ BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
+
+ struct {
+ __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
+ } k_i;
+ struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
+
+ acc->k.p = disk_accounting_key_to_bpos(k);
+ set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
+
+ memcpy_u64s_small(acc->v.d, d, nr);
+
+ return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
+}
+
+int bch2_mod_dev_cached_sectors(struct btree_trans *trans,
+ unsigned dev, s64 sectors)
+{
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+
+ bch2_replicas_entry_cached(&acc.replicas, dev);
+
+ return bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
+}
+
int bch2_accounting_invalid(struct bch_fs *c, struct bkey_s_c k,
enum bkey_invalid_flags flags,
struct printbuf *err)
@@ -68,3 +110,269 @@ void bch2_accounting_swab(struct bkey_s k)
p++)
*p = swab64(*p);
}
+
+static inline bool accounting_to_replicas(struct bch_replicas_entry_v1 *r, struct bpos p)
+{
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, p);
+
+ switch (acc_k.type) {
+ case BCH_DISK_ACCOUNTING_replicas:
+ memcpy(r, &acc_k.replicas, replicas_entry_bytes(&acc_k.replicas));
+ return true;
+ default:
+ return false;
+ }
+}
+
+static int bch2_accounting_update_sb_one(struct bch_fs *c, struct bpos p)
+{
+ struct bch_replicas_padded r;
+ return accounting_to_replicas(&r.e, p)
+ ? bch2_mark_replicas(c, &r.e)
+ : 0;
+}
+
+int bch2_accounting_update_sb(struct btree_trans *trans)
+{
+ for (struct jset_entry *i = trans->journal_entries;
+ i != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s);
+ i = vstruct_next(i))
+ if (jset_entry_is_key(i) && i->start->k.type == KEY_TYPE_accounting) {
+ int ret = bch2_accounting_update_sb_one(trans->c, i->start->k.p);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+{
+ struct bch_replicas_padded r;
+
+ if (accounting_to_replicas(&r.e, a.k->p) &&
+ !bch2_replicas_marked_locked(c, &r.e))
+ return -BCH_ERR_btree_insert_need_mark_replicas;
+
+ struct bch_accounting_mem *acc = &c->accounting;
+ unsigned new_nr_counters = acc->nr_counters + bch2_accounting_counters(a.k);
+
+ u64 __percpu *new_counters = __alloc_percpu_gfp(new_nr_counters * sizeof(u64),
+ sizeof(u64), GFP_KERNEL);
+ if (!new_counters)
+ return -BCH_ERR_ENOMEM_disk_accounting;
+
+ preempt_disable();
+ memcpy(this_cpu_ptr(new_counters),
+ bch2_acc_percpu_u64s(acc->v, acc->nr_counters),
+ acc->nr_counters * sizeof(u64));
+ preempt_enable();
+
+ struct accounting_pos_offset n = {
+ .pos = a.k->p,
+ .version = a.k->version,
+ .offset = acc->nr_counters,
+ .nr_counters = bch2_accounting_counters(a.k),
+ };
+ if (darray_push(&acc->k, n)) {
+ free_percpu(new_counters);
+ return -BCH_ERR_ENOMEM_disk_accounting;
+ }
+
+ eytzinger0_sort(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]), accounting_pos_cmp, NULL);
+
+ free_percpu(acc->v);
+ acc->v = new_counters;
+ acc->nr_counters = new_nr_counters;
+
+ for (unsigned i = 0; i < n.nr_counters; i++)
+ this_cpu_add(acc->v[n.offset + i], a.v->d[i]);
+ return 0;
+}
+
+int bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+{
+ percpu_up_read(&c->mark_lock);
+ percpu_down_write(&c->mark_lock);
+ int ret = __bch2_accounting_mem_add_slowpath(c, a);
+ percpu_up_write(&c->mark_lock);
+ percpu_down_read(&c->mark_lock);
+ return ret;
+}
+
+int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
+{
+ struct bch_accounting_mem *acc = &c->accounting;
+ int ret = 0;
+
+ darray_init(usage);
+
+ percpu_down_read(&c->mark_lock);
+ darray_for_each(acc->k, i) {
+ struct {
+ struct bch_replicas_usage r;
+ u8 pad[BCH_BKEY_PTRS_MAX];
+ } u;
+
+ if (!accounting_to_replicas(&u.r.r, i->pos))
+ continue;
+
+ bch2_accounting_mem_read(c, i->pos, &u.r.sectors, 1);
+
+ ret = darray_make_room(usage, replicas_usage_bytes(&u.r));
+ if (ret)
+ break;
+
+ memcpy(&darray_top(*usage), &u.r, replicas_usage_bytes(&u.r));
+ usage->nr += replicas_usage_bytes(&u.r);
+ }
+ percpu_up_read(&c->mark_lock);
+
+ if (ret)
+ darray_exit(usage);
+ return ret;
+}
+
+static bool accounting_key_is_zero(struct bkey_s_c_accounting a)
+{
+
+ for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
+ if (a.v->d[i])
+ return false;
+ return true;
+}
+
+static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
+{
+ struct printbuf buf = PRINTBUF;
+
+ if (k.k->type != KEY_TYPE_accounting)
+ return 0;
+
+ percpu_down_read(&c->mark_lock);
+ int ret = __bch2_accounting_mem_add(c, bkey_s_c_to_accounting(k));
+ percpu_up_read(&c->mark_lock);
+
+ if (accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
+ ret == -BCH_ERR_btree_insert_need_mark_replicas)
+ ret = 0;
+
+ struct disk_accounting_key acc;
+ bpos_to_disk_accounting_key(&acc, k.k->p);
+
+ if (fsck_err_on(ret == -BCH_ERR_btree_insert_need_mark_replicas,
+ c, accounting_replicas_not_marked,
+ "accounting not marked in superblock replicas\n %s",
+ (bch2_accounting_key_to_text(&buf, &acc),
+ buf.buf)))
+ ret = bch2_accounting_update_sb_one(c, k.k->p);
+fsck_err:
+ printbuf_exit(&buf);
+ return ret;
+}
+
+int bch2_accounting_read(struct bch_fs *c)
+{
+ struct bch_accounting_mem *acc = &c->accounting;
+
+ int ret = bch2_trans_run(c,
+ for_each_btree_key(trans, iter,
+ BTREE_ID_accounting, POS_MIN,
+ BTREE_ITER_PREFETCH|BTREE_ITER_ALL_SNAPSHOTS, k, ({
+ struct bkey u;
+ struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, &iter), &u);
+ accounting_read_key(c, k);
+ })));
+ if (ret)
+ goto err;
+
+ struct genradix_iter iter;
+ struct journal_replay *i, **_i;
+
+ genradix_for_each(&c->journal_entries, iter, _i) {
+ i = *_i;
+
+ if (!i || i->ignore)
+ continue;
+
+ for_each_jset_key(k, entry, &i->j)
+ if (k->k.type == KEY_TYPE_accounting) {
+ struct bkey_s_c_accounting a = bkey_i_to_s_c_accounting(k);
+ unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr,
+ sizeof(acc->k.data[0]),
+ accounting_pos_cmp, &a.k->p);
+ if (idx < acc->k.nr &&
+ bversion_cmp(acc->k.data[idx].version, a.k->version) >= 0)
+ continue;
+
+ ret = accounting_read_key(c, bkey_i_to_s_c(k));
+ if (ret)
+ goto err;
+ }
+ }
+
+ percpu_down_read(&c->mark_lock);
+ preempt_disable();
+ struct bch_fs_usage_base *usage = &c->usage_base->b;
+
+ for (unsigned i = 0; i < acc->k.nr; i++) {
+ struct disk_accounting_key k;
+ bpos_to_disk_accounting_key(&k, acc->k.data[i].pos);
+
+ u64 v[BCH_ACCOUNTING_MAX_COUNTERS];
+ bch2_accounting_mem_read_counters(c, i, v, ARRAY_SIZE(v));
+
+ switch (k.type) {
+ case BCH_DISK_ACCOUNTING_persistent_reserved:
+ usage->reserved += v[0] * k.persistent_reserved.nr_replicas;
+ break;
+ case BCH_DISK_ACCOUNTING_replicas:
+ fs_usage_data_type_to_base(usage, k.replicas.data_type, v[0]);
+ break;
+ }
+ }
+ preempt_enable();
+ percpu_up_read(&c->mark_lock);
+err:
+ bch_err_fn(c, ret);
+ return ret;
+}
+
+int bch2_dev_usage_remove(struct bch_fs *c, unsigned dev)
+{
+ return bch2_trans_run(c,
+ bch2_btree_write_buffer_flush_sync(trans) ?:
+ for_each_btree_key_commit(trans, iter, BTREE_ID_accounting, POS_MIN,
+ BTREE_ITER_ALL_SNAPSHOTS, k, NULL, NULL, 0, ({
+ struct disk_accounting_key acc;
+ bpos_to_disk_accounting_key(&acc, k.k->p);
+
+ acc.type == BCH_DISK_ACCOUNTING_dev_data_type &&
+ acc.dev_data_type.dev == dev
+ ? bch2_btree_bit_mod_buffered(trans, BTREE_ID_accounting, k.k->p, 0)
+ : 0;
+ })) ?:
+ bch2_btree_write_buffer_flush_sync(trans));
+}
+
+int bch2_dev_usage_init(struct bch_dev *ca)
+{
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_data_type,
+ .dev_data_type.dev = ca->dev_idx,
+ .dev_data_type.data_type = BCH_DATA_free,
+ };
+ u64 v[3] = { ca->mi.nbuckets - ca->mi.first_bucket, 0, 0 };
+
+ return bch2_trans_do(ca->fs, NULL, NULL, 0,
+ bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
+}
+
+void bch2_fs_accounting_exit(struct bch_fs *c)
+{
+ struct bch_accounting_mem *acc = &c->accounting;
+
+ darray_exit(&acc->k);
+ free_percpu(acc->v);
+}
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index e15299665859..5fd053a819df 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -2,11 +2,32 @@
#ifndef _BCACHEFS_DISK_ACCOUNTING_H
#define _BCACHEFS_DISK_ACCOUNTING_H

+#include <linux/eytzinger.h>
+
+static inline void bch2_u64s_neg(u64 *v, unsigned nr)
+{
+ for (unsigned i = 0; i < nr; i++)
+ v[i] = -v[i];
+}
+
static inline unsigned bch2_accounting_counters(const struct bkey *k)
{
return bkey_val_u64s(k) - offsetof(struct bch_accounting, d) / sizeof(u64);
}

+static inline void bch2_accounting_neg(struct bkey_s_accounting a)
+{
+ bch2_u64s_neg(a.v->d, bch2_accounting_counters(a.k));
+}
+
+static inline bool bch2_accounting_key_is_zero(struct bkey_s_c_accounting a)
+{
+ for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
+ if (a.v->d[i])
+ return false;
+ return true;
+}
+
static inline void bch2_accounting_accumulate(struct bkey_i_accounting *dst,
struct bkey_s_c_accounting src)
{
@@ -18,6 +39,26 @@ static inline void bch2_accounting_accumulate(struct bkey_i_accounting *dst,
dst->k.version = src.k->version;
}

+static inline void fs_usage_data_type_to_base(struct bch_fs_usage_base *fs_usage,
+ enum bch_data_type data_type,
+ s64 sectors)
+{
+ switch (data_type) {
+ case BCH_DATA_btree:
+ fs_usage->btree += sectors;
+ break;
+ case BCH_DATA_user:
+ case BCH_DATA_parity:
+ fs_usage->data += sectors;
+ break;
+ case BCH_DATA_cached:
+ fs_usage->cached += sectors;
+ break;
+ default:
+ break;
+ }
+}
+
static inline void bpos_to_disk_accounting_key(struct disk_accounting_key *acc, struct bpos p)
{
acc->_pad = p;
@@ -36,6 +77,12 @@ static inline struct bpos disk_accounting_key_to_bpos(struct disk_accounting_key
return ret;
}

+int bch2_disk_accounting_mod(struct btree_trans *,
+ struct disk_accounting_key *,
+ s64 *, unsigned);
+int bch2_mod_dev_cached_sectors(struct btree_trans *trans,
+ unsigned dev, s64 sectors);
+
int bch2_accounting_invalid(struct bch_fs *, struct bkey_s_c,
enum bkey_invalid_flags, struct printbuf *);
void bch2_accounting_key_to_text(struct printbuf *, struct disk_accounting_key *);
@@ -49,4 +96,83 @@ void bch2_accounting_swab(struct bkey_s);
.min_val_size = 8, \
})

+int bch2_accounting_update_sb(struct btree_trans *);
+
+static inline int accounting_pos_cmp(const void *_l, const void *_r)
+{
+ const struct bpos *l = _l, *r = _r;
+
+ return bpos_cmp(*l, *r);
+}
+
+int bch2_accounting_mem_add_slowpath(struct bch_fs *, struct bkey_s_c_accounting);
+
+static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_accounting a)
+{
+ struct bch_accounting_mem *acc = &c->accounting;
+ unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
+ accounting_pos_cmp, &a.k->p);
+ if (unlikely(idx >= acc->k.nr))
+ return bch2_accounting_mem_add_slowpath(c, a);
+
+ unsigned offset = acc->k.data[idx].offset;
+
+ EBUG_ON(bch2_accounting_counters(a.k) != acc->k.data[idx].nr_counters);
+
+ for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
+ this_cpu_add(acc->v[offset + i], a.v->d[i]);
+ return 0;
+}
+
+static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a)
+{
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, a.k->p);
+
+ switch (acc_k.type) {
+ case BCH_DISK_ACCOUNTING_persistent_reserved:
+ trans->fs_usage_delta.reserved += acc_k.persistent_reserved.nr_replicas * a.v->d[0];
+ break;
+ case BCH_DISK_ACCOUNTING_replicas:
+ fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
+ break;
+ }
+ return __bch2_accounting_mem_add(trans->c, a);
+}
+
+static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
+ unsigned idx,
+ u64 *v, unsigned nr)
+{
+ memset(v, 0, sizeof(*v) * nr);
+
+ struct bch_accounting_mem *acc = &c->accounting;
+ if (unlikely(idx >= acc->k.nr))
+ return;
+
+ unsigned offset = acc->k.data[idx].offset;
+ nr = min_t(unsigned, nr, acc->k.data[idx].nr_counters);
+
+ for (unsigned i = 0; i < nr; i++)
+ v[i] = percpu_u64_get(acc->v + offset + i);
+}
+
+static inline void bch2_accounting_mem_read(struct bch_fs *c, struct bpos p,
+ u64 *v, unsigned nr)
+{
+ struct bch_accounting_mem *acc = &c->accounting;
+ unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
+ accounting_pos_cmp, &p);
+
+ bch2_accounting_mem_read_counters(c, idx, v, nr);
+}
+
+int bch2_fs_replicas_usage_read(struct bch_fs *, darray_char *);
+
+int bch2_accounting_read(struct bch_fs *);
+
+int bch2_dev_usage_remove(struct bch_fs *, unsigned);
+int bch2_dev_usage_init(struct bch_dev *);
+void bch2_fs_accounting_exit(struct bch_fs *);
+
#endif /* _BCACHEFS_DISK_ACCOUNTING_H */
diff --git a/fs/bcachefs/disk_accounting_types.h b/fs/bcachefs/disk_accounting_types.h
new file mode 100644
index 000000000000..8da5ac182b33
--- /dev/null
+++ b/fs/bcachefs/disk_accounting_types.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_DISK_ACCOUNTING_TYPES_H
+#define _BCACHEFS_DISK_ACCOUNTING_TYPES_H
+
+#include <linux/darray.h>
+
+struct accounting_pos_offset {
+ struct bpos pos;
+ struct bversion version;
+ u32 offset:24,
+ nr_counters:8;
+};
+
+struct bch_accounting_mem {
+ DARRAY(struct accounting_pos_offset) k;
+ u64 __percpu *v;
+ unsigned nr_counters;
+};
+
+#endif /* _BCACHEFS_DISK_ACCOUNTING_TYPES_H */
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index b98e2c2b8bf0..38e5e882f4a4 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -13,6 +13,7 @@
#include "btree_write_buffer.h"
#include "buckets.h"
#include "checksum.h"
+#include "disk_accounting.h"
#include "disk_groups.h"
#include "ec.h"
#include "error.h"
@@ -324,21 +325,25 @@ int bch2_trigger_stripe(struct btree_trans *trans,
new_s->nr_redundant != old_s->nr_redundant));

if (new_s) {
- s64 sectors = le16_to_cpu(new_s->sectors);
+ s64 sectors = (u64) le16_to_cpu(new_s->sectors) * new_s->nr_redundant;

- struct bch_replicas_padded r;
- bch2_bkey_to_replicas(&r.e, new);
- int ret = bch2_update_replicas_list(trans, &r.e, sectors * new_s->nr_redundant);
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+ bch2_bkey_to_replicas(&acc.replicas, new);
+ int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
if (ret)
return ret;
}

if (old_s) {
- s64 sectors = -((s64) le16_to_cpu(old_s->sectors));
+ s64 sectors = -((s64) le16_to_cpu(old_s->sectors)) * old_s->nr_redundant;

- struct bch_replicas_padded r;
- bch2_bkey_to_replicas(&r.e, old);
- int ret = bch2_update_replicas_list(trans, &r.e, sectors * old_s->nr_redundant);
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+ bch2_bkey_to_replicas(&acc.replicas, old);
+ int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
if (ret)
return ret;
}
@@ -442,8 +447,7 @@ int bch2_trigger_stripe(struct btree_trans *trans,
}

int ret = bch2_update_replicas(c, new, &m->r.e,
- ((s64) m->sectors * m->nr_redundant),
- 0, true);
+ ((s64) m->sectors * m->nr_redundant));
if (ret) {
struct printbuf buf = PRINTBUF;

diff --git a/fs/bcachefs/inode.c b/fs/bcachefs/inode.c
index a3139bb66f77..3dfa9f77c739 100644
--- a/fs/bcachefs/inode.c
+++ b/fs/bcachefs/inode.c
@@ -8,6 +8,7 @@
#include "buckets.h"
#include "compress.h"
#include "dirent.h"
+#include "disk_accounting.h"
#include "error.h"
#include "extents.h"
#include "extent_update.h"
@@ -610,11 +611,13 @@ int bch2_trigger_inode(struct btree_trans *trans,

if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
if (nr) {
- int ret = bch2_replicas_deltas_realloc(trans, 0);
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_nr_inodes
+ };
+
+ int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1);
if (ret)
return ret;
-
- trans->fs_usage_deltas->nr_inodes += nr;
}

bool old_deleted = bkey_is_deleted_inode(old);
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index b8289af66c8e..140393256f32 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -1194,9 +1194,6 @@ int bch2_fs_initialize(struct bch_fs *c)
for (unsigned i = 0; i < BTREE_ID_NR; i++)
bch2_btree_root_alloc(c, i);

- for_each_member_device(c, ca)
- bch2_dev_usage_init(ca);
-
ret = bch2_fs_journal_alloc(c);
if (ret)
goto err;
@@ -1213,6 +1210,15 @@ int bch2_fs_initialize(struct bch_fs *c)
if (ret)
goto err;

+ for_each_member_device(c, ca) {
+ ret = bch2_dev_usage_init(ca);
+ bch_err_msg(c, ret, "initializing device usage");
+ if (ret) {
+ percpu_ref_put(&ca->ref);
+ goto err;
+ }
+ }
+
/*
* Write out the superblock and journal buckets, now that we can do
* btree updates
diff --git a/fs/bcachefs/recovery_types.h b/fs/bcachefs/recovery_types.h
index 1361e34d4e64..18582e2128ed 100644
--- a/fs/bcachefs/recovery_types.h
+++ b/fs/bcachefs/recovery_types.h
@@ -13,6 +13,7 @@
* must never change:
*/
#define BCH_RECOVERY_PASSES() \
+ x(accounting_read, 37, PASS_ALWAYS) \
x(alloc_read, 0, PASS_ALWAYS) \
x(stripes_read, 1, PASS_ALWAYS) \
x(initialize_subvolumes, 2, 0) \
diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index 678b9c20e251..dde581a49e28 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -254,23 +254,25 @@ static bool __replicas_has_entry(struct bch_replicas_cpu *r,
return __replicas_entry_idx(r, search) >= 0;
}

-bool bch2_replicas_marked(struct bch_fs *c,
+bool bch2_replicas_marked_locked(struct bch_fs *c,
struct bch_replicas_entry_v1 *search)
{
- bool marked;
-
- if (!search->nr_devs)
- return true;
-
verify_replicas_entry(search);

+ return !search->nr_devs ||
+ (__replicas_has_entry(&c->replicas, search) &&
+ (likely((!c->replicas_gc.entries)) ||
+ __replicas_has_entry(&c->replicas_gc, search)));
+}
+
+bool bch2_replicas_marked(struct bch_fs *c,
+ struct bch_replicas_entry_v1 *search)
+{
percpu_down_read(&c->mark_lock);
- marked = __replicas_has_entry(&c->replicas, search) &&
- (likely((!c->replicas_gc.entries)) ||
- __replicas_has_entry(&c->replicas_gc, search));
+ bool ret = bch2_replicas_marked_locked(c, search);
percpu_up_read(&c->mark_lock);

- return marked;
+ return ret;
}

static void __replicas_table_update(struct bch_fs_usage *dst,
@@ -468,20 +470,6 @@ int bch2_mark_replicas(struct bch_fs *c, struct bch_replicas_entry_v1 *r)
? 0 : bch2_mark_replicas_slowpath(c, r);
}

-/* replicas delta list: */
-
-int bch2_replicas_delta_list_mark(struct bch_fs *c,
- struct replicas_delta_list *r)
-{
- struct replicas_delta *d = r->d;
- struct replicas_delta *top = (void *) r->d + r->used;
- int ret = 0;
-
- for (d = r->d; !ret && d != top; d = replicas_delta_next(d))
- ret = bch2_mark_replicas(c, &d->r);
- return ret;
-}
-
/*
* Old replicas_gc mechanism: only used for journal replicas entries now, should
* die at some point:
@@ -1042,8 +1030,6 @@ void bch2_fs_replicas_exit(struct bch_fs *c)
kfree(c->usage_base);
kfree(c->replicas.entries);
kfree(c->replicas_gc.entries);
-
- mempool_exit(&c->replicas_delta_pool);
}

int bch2_fs_replicas_init(struct bch_fs *c)
@@ -1052,7 +1038,5 @@ int bch2_fs_replicas_init(struct bch_fs *c)
&c->replicas_journal_res,
reserve_journal_replicas(c, &c->replicas));

- return mempool_init_kmalloc_pool(&c->replicas_delta_pool, 1,
- REPLICAS_DELTA_LIST_MAX) ?:
- replicas_table_update(c, &c->replicas);
+ return replicas_table_update(c, &c->replicas);
}
diff --git a/fs/bcachefs/replicas.h b/fs/bcachefs/replicas.h
index 983cce782ac2..f00c586f8cd9 100644
--- a/fs/bcachefs/replicas.h
+++ b/fs/bcachefs/replicas.h
@@ -26,18 +26,13 @@ int bch2_replicas_entry_idx(struct bch_fs *,
void bch2_devlist_to_replicas(struct bch_replicas_entry_v1 *,
enum bch_data_type,
struct bch_devs_list);
+
+bool bch2_replicas_marked_locked(struct bch_fs *,
+ struct bch_replicas_entry_v1 *);
bool bch2_replicas_marked(struct bch_fs *, struct bch_replicas_entry_v1 *);
int bch2_mark_replicas(struct bch_fs *,
struct bch_replicas_entry_v1 *);

-static inline struct replicas_delta *
-replicas_delta_next(struct replicas_delta *d)
-{
- return (void *) d + replicas_entry_bytes(&d->r) + 8;
-}
-
-int bch2_replicas_delta_list_mark(struct bch_fs *, struct replicas_delta_list *);
-
void bch2_bkey_to_replicas(struct bch_replicas_entry_v1 *, struct bkey_s_c);

static inline void bch2_replicas_entry_cached(struct bch_replicas_entry_v1 *e,
diff --git a/fs/bcachefs/replicas_types.h b/fs/bcachefs/replicas_types.h
index ac90d142c4e8..fed71c861fe7 100644
--- a/fs/bcachefs/replicas_types.h
+++ b/fs/bcachefs/replicas_types.h
@@ -8,20 +8,4 @@ struct bch_replicas_cpu {
struct bch_replicas_entry_v1 *entries;
};

-struct replicas_delta {
- s64 delta;
- struct bch_replicas_entry_v1 r;
-} __packed;
-
-struct replicas_delta_list {
- unsigned size;
- unsigned used;
-
- struct {} memset_start;
- u64 nr_inodes;
- u64 persistent_reserved[BCH_REPLICAS_MAX];
- struct {} memset_end;
- struct replicas_delta d[];
-};
-
#endif /* _BCACHEFS_REPLICAS_TYPES_H */
diff --git a/fs/bcachefs/sb-errors_types.h b/fs/bcachefs/sb-errors_types.h
index 383e13711001..777a1adc38cf 100644
--- a/fs/bcachefs/sb-errors_types.h
+++ b/fs/bcachefs/sb-errors_types.h
@@ -265,7 +265,8 @@
x(subvol_children_bad, 257) \
x(subvol_loop, 258) \
x(subvol_unreachable, 259) \
- x(accounting_mismatch, 260)
+ x(accounting_mismatch, 260) \
+ x(accounting_replicas_not_marked, 261)

enum bch_sb_error_id {
#define x(t, n) BCH_FSCK_ERR_##t = n,
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index a7f9de220d90..685d54d0ddbb 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -24,6 +24,7 @@
#include "clock.h"
#include "compress.h"
#include "debug.h"
+#include "disk_accounting.h"
#include "disk_groups.h"
#include "ec.h"
#include "errcode.h"
@@ -535,6 +536,7 @@ static void __bch2_fs_free(struct bch_fs *c)
time_stats_exit(&c->times[i]);

bch2_free_pending_node_rewrites(c);
+ bch2_fs_accounting_exit(c);
bch2_fs_sb_errors_exit(c);
bch2_fs_counters_exit(c);
bch2_fs_snapshots_exit(c);
@@ -1581,7 +1583,8 @@ static int bch2_dev_remove_alloc(struct bch_fs *c, struct bch_dev *ca)
bch2_btree_delete_range(c, BTREE_ID_alloc, start, end,
BTREE_TRIGGER_NORUN, NULL) ?:
bch2_btree_delete_range(c, BTREE_ID_bucket_gens, start, end,
- BTREE_TRIGGER_NORUN, NULL);
+ BTREE_TRIGGER_NORUN, NULL) ?:
+ bch2_dev_usage_remove(c, ca->dev_idx);
bch_err_msg(c, ret, "removing dev alloc info");
return ret;
}
@@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
if (ret)
goto err;

+ /*
+ * We need to flush the entire journal to get rid of keys that reference
+ * the device being removed before removing the superblock entry
+ */
+ bch2_journal_flush_all_pins(&c->journal);
+
+ /*
+ * this is really just needed for the bch2_replicas_gc_(start|end)
+ * calls, and could be cleaned up:
+ */
ret = bch2_journal_flush_device_pins(&c->journal, ca->dev_idx);
bch_err_msg(ca, ret, "bch2_journal_flush_device_pins()");
if (ret)
@@ -1655,17 +1668,6 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)

bch2_dev_free(ca);

- /*
- * At this point the device object has been removed in-core, but the
- * on-disk journal might still refer to the device index via sb device
- * usage entries. Recovery fails if it sees usage information for an
- * invalid device. Flush journal pins to push the back of the journal
- * past now invalid device index references before we update the
- * superblock, but after the device object has been removed so any
- * further journal writes elide usage info for the device.
- */
- bch2_journal_flush_all_pins(&c->journal);
-
/*
* Free this device's slot in the bch_member array - all pointers to
* this device must be gone:
@@ -1727,8 +1729,6 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
goto err;
}

- bch2_dev_usage_init(ca);
-
ret = __bch2_dev_attach_bdev(ca, &sb);
if (ret)
goto err;
@@ -1793,6 +1793,10 @@ int bch2_dev_add(struct bch_fs *c, const char *path)

bch2_dev_usage_journal_reserve(c);

+ ret = bch2_dev_usage_init(ca);
+ if (ret)
+ goto err_late;
+
ret = bch2_trans_mark_dev_sb(c, ca);
bch_err_msg(ca, ret, "marking new superblock");
if (ret)
@@ -1956,15 +1960,18 @@ int bch2_dev_resize(struct bch_fs *c, struct bch_dev *ca, u64 nbuckets)
mutex_unlock(&c->sb_lock);

if (ca->mi.freespace_initialized) {
- ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets);
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_data_type,
+ .dev_data_type.dev = ca->dev_idx,
+ .dev_data_type.data_type = BCH_DATA_free,
+ };
+ u64 v[3] = { nbuckets - old_nbuckets, 0, 0 };
+
+ ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets) ?:
+ bch2_trans_do(ca->fs, NULL, NULL, 0,
+ bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
if (ret)
goto err;
-
- /*
- * XXX: this is all wrong transactionally - we'll be able to do
- * this correctly after the disk space accounting rewrite
- */
- ca->usage_base->d[BCH_DATA_free].buckets += nbuckets - old_nbuckets;
}

bch2_recalc_capacity(c);
--
2.43.0


2024-02-25 02:41:18

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 13/21] bcachefs: Kill replicas_journal_res

More dead code deletion: with accounting now flushed via the btree write
buffer, we no longer reserve space in each journal write for usage
entries, so the journal_entry_res machinery for replicas and device
usage can go.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/bcachefs.h | 2 --
fs/bcachefs/replicas.c | 34 ----------------------------------
fs/bcachefs/super.c | 21 ---------------------
3 files changed, 57 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 2e7c4d10c951..22dc455cb436 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -768,9 +768,7 @@ struct bch_fs {
struct mutex replicas_gc_lock;

struct journal_entry_res btree_root_journal_res;
- struct journal_entry_res replicas_journal_res;
struct journal_entry_res clock_journal_res;
- struct journal_entry_res dev_usage_journal_res;

struct bch_disk_groups_cpu __rcu *disk_groups;

diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index 6dca705eaf1f..427dc6711427 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -342,32 +342,6 @@ static int replicas_table_update(struct bch_fs *c,
goto out;
}

-static unsigned reserve_journal_replicas(struct bch_fs *c,
- struct bch_replicas_cpu *r)
-{
- struct bch_replicas_entry_v1 *e;
- unsigned journal_res_u64s = 0;
-
- /* nr_inodes: */
- journal_res_u64s +=
- DIV_ROUND_UP(sizeof(struct jset_entry_usage), sizeof(u64));
-
- /* key_version: */
- journal_res_u64s +=
- DIV_ROUND_UP(sizeof(struct jset_entry_usage), sizeof(u64));
-
- /* persistent_reserved: */
- journal_res_u64s +=
- DIV_ROUND_UP(sizeof(struct jset_entry_usage), sizeof(u64)) *
- BCH_REPLICAS_MAX;
-
- for_each_cpu_replicas_entry(r, e)
- journal_res_u64s +=
- DIV_ROUND_UP(sizeof(struct jset_entry_data_usage) +
- e->nr_devs, sizeof(u64));
- return journal_res_u64s;
-}
-
noinline
static int bch2_mark_replicas_slowpath(struct bch_fs *c,
struct bch_replicas_entry_v1 *new_entry)
@@ -401,10 +375,6 @@ static int bch2_mark_replicas_slowpath(struct bch_fs *c,
ret = bch2_cpu_replicas_to_sb_replicas(c, &new_r);
if (ret)
goto err;
-
- bch2_journal_entry_res_resize(&c->journal,
- &c->replicas_journal_res,
- reserve_journal_replicas(c, &new_r));
}

if (!new_r.entries &&
@@ -974,9 +944,5 @@ void bch2_fs_replicas_exit(struct bch_fs *c)

int bch2_fs_replicas_init(struct bch_fs *c)
{
- bch2_journal_entry_res_resize(&c->journal,
- &c->replicas_journal_res,
- reserve_journal_replicas(c, &c->replicas));
-
return replicas_table_update(c, &c->replicas);
}
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 30b41c8de309..89c481831608 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -221,22 +221,6 @@ struct bch_fs *bch2_uuid_to_fs(__uuid_t uuid)
return c;
}

-static void bch2_dev_usage_journal_reserve(struct bch_fs *c)
-{
- unsigned nr = 0, u64s =
- ((sizeof(struct jset_entry_dev_usage) +
- sizeof(struct jset_entry_dev_usage_type) * BCH_DATA_NR)) /
- sizeof(u64);
-
- rcu_read_lock();
- for_each_member_device_rcu(c, ca, NULL)
- nr++;
- rcu_read_unlock();
-
- bch2_journal_entry_res_resize(&c->journal,
- &c->dev_usage_journal_res, u64s * nr);
-}
-
/* Filesystem RO/RW: */

/*
@@ -940,7 +924,6 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
bch2_journal_entry_res_resize(&c->journal,
&c->btree_root_journal_res,
BTREE_ID_NR * (JSET_KEYS_U64s + BKEY_BTREE_PTR_U64s_MAX));
- bch2_dev_usage_journal_reserve(c);
bch2_journal_entry_res_resize(&c->journal,
&c->clock_journal_res,
(sizeof(struct jset_entry_clock) / sizeof(u64)) * 2);
@@ -1680,8 +1663,6 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)

mutex_unlock(&c->sb_lock);
up_write(&c->state_lock);
-
- bch2_dev_usage_journal_reserve(c);
return 0;
err:
if (ca->mi.state == BCH_MEMBER_STATE_rw &&
@@ -1791,8 +1772,6 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
bch2_write_super(c);
mutex_unlock(&c->sb_lock);

- bch2_dev_usage_journal_reserve(c);
-
ret = bch2_dev_usage_init(ca);
if (ret)
goto err_late;
--
2.43.0


2024-02-25 02:41:48

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 15/21] bcachefs: Convert bch2_replicas_gc2() to new accounting

bch2_replicas_gc2() is used for garbage collecting superblock replicas
entries that are empty - this converts it to the new accounting scheme.
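
For illustration, a minimal userspace sketch of the filter this patch
implements - an entry survives gc iff it's a journal entry or its
accounting counter is nonzero. This is only a model, not the kernel
code: 'struct entry' and its 'counter' field are hypothetical stand-ins
for a replicas entry and bch2_accounting_mem_read():

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum data_type { DATA_journal, DATA_user };

struct entry {
    enum data_type data_type;
    uint64_t       counter;  /* stand-in for bch2_accounting_mem_read() */
};

/* gc keeps an entry iff it's a journal entry or still referenced */
static bool entry_live(const struct entry *e)
{
    return e->data_type == DATA_journal || e->counter != 0;
}

int main(void)
{
    struct entry e = { .data_type = DATA_user, .counter = 0 };

    printf("live: %d\n", entry_live(&e));  /* 0 - entry gets dropped */
    return 0;
}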

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/replicas.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index cba5ba44cfd8..18137abb1857 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -2,6 +2,7 @@

#include "bcachefs.h"
#include "buckets.h"
+#include "disk_accounting.h"
#include "journal.h"
#include "replicas.h"
#include "super-io.h"
@@ -425,8 +426,6 @@ int bch2_replicas_gc_start(struct bch_fs *c, unsigned typemask)
*/
int bch2_replicas_gc2(struct bch_fs *c)
{
- return 0;
-#if 0
struct bch_replicas_cpu new = { 0 };
unsigned i, nr;
int ret = 0;
@@ -456,20 +455,26 @@ int bch2_replicas_gc2(struct bch_fs *c)
struct bch_replicas_entry_v1 *e =
cpu_replicas_entry(&c->replicas, i);

- if (e->data_type == BCH_DATA_journal ||
- c->usage_base->replicas[i] ||
- percpu_u64_get(&c->usage[0]->replicas[i]) ||
- percpu_u64_get(&c->usage[1]->replicas[i]) ||
- percpu_u64_get(&c->usage[2]->replicas[i]) ||
- percpu_u64_get(&c->usage[3]->replicas[i]))
+ struct disk_accounting_key k = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+
+ memcpy(&k.replicas, e, replicas_entry_bytes(e));
+
+ u64 v = 0;
+ bch2_accounting_mem_read(c, disk_accounting_key_to_bpos(&k), &v, 1);
+
+ if (e->data_type == BCH_DATA_journal || v)
memcpy(cpu_replicas_entry(&new, new.nr++),
e, new.entry_size);
}

bch2_cpu_replicas_sort(&new);

- ret = bch2_cpu_replicas_to_sb_replicas(c, &new) ?:
- replicas_table_update(c, &new);
+ ret = bch2_cpu_replicas_to_sb_replicas(c, &new);
+
+ if (!ret)
+ swap(c->replicas, new);

kfree(new.entries);

@@ -481,7 +486,6 @@ int bch2_replicas_gc2(struct bch_fs *c)
mutex_unlock(&c->sb_lock);

return ret;
-#endif
}

/* Replicas tracking - superblock: */
--
2.43.0


2024-02-25 02:41:55

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 16/21] bcachefs: bch2_verify_accounting_clean()

Verify that the in-memory accounting matches the on-disk accounting
after a clean shutdown.
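
A minimal sketch of the idea, assuming the per-key counters have already
been read into two plain arrays - 'disk' from the accounting btree and
'mem' from the in-memory percpu counters. In the patch this is a
memcmp() over the key's counters with a WARN() on mismatch; the key name
and values here are made up:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void check_key(const char *key, const uint64_t *disk,
                      const uint64_t *mem, unsigned nr)
{
    if (memcmp(disk, mem, nr * sizeof(*disk)))
        fprintf(stderr, "accounting mismatch: %s\n", key);
}

int main(void)
{
    uint64_t disk[3] = { 1, 128, 64 };
    uint64_t mem[3]  = { 1, 128, 64 };

    check_key("lz4", disk, mem, 3);    /* silent: counters agree */
    return 0;
}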

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/disk_accounting.c | 27 +++++++++++++++++++++++++++
fs/bcachefs/disk_accounting.h | 4 +++-
fs/bcachefs/super.c | 1 +
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index 2884615adc1e..8d7b6ab66e71 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -513,6 +513,33 @@ int bch2_dev_usage_init(struct bch_dev *ca, bool gc)
return ret;
}

+void bch2_verify_accounting_clean(struct bch_fs *c)
+{
+ bch2_trans_run(c,
+ for_each_btree_key(trans, iter,
+ BTREE_ID_accounting, POS_MIN,
+ BTREE_ITER_ALL_SNAPSHOTS, k, ({
+ u64 v[BCH_ACCOUNTING_MAX_COUNTERS];
+ struct bkey_s_c_accounting a = bkey_s_c_to_accounting(k);
+ unsigned nr = bch2_accounting_counters(k.k);
+
+ bch2_accounting_mem_read(c, k.k->p, v, nr);
+
+ if (memcmp(a.v->d, v, nr * sizeof(u64))) {
+ struct printbuf buf = PRINTBUF;
+
+ bch2_bkey_val_to_text(&buf, c, k);
+ prt_str(&buf, " in mem");
+ for (unsigned j = 0; j < nr; j++)
+ prt_printf(&buf, " %llu", v[j]);
+
+ WARN(1, "accounting mismatch: %s", buf.buf);
+ printbuf_exit(&buf);
+ }
+ 0;
+ })));
+}
+
void bch2_accounting_free(struct bch_accounting_mem *acc)
{
darray_exit(&acc->k);
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index 70ac67f4a3cb..a0cf7a0b84a7 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -164,7 +164,7 @@ static inline void bch2_accounting_mem_read_counters(struct bch_fs *c, unsigned
{
memset(v, 0, sizeof(*v) * nr);

- struct bch_accounting_mem *acc = &c->accounting[0];
+ struct bch_accounting_mem *acc = &c->accounting[gc];
if (unlikely(idx >= acc->k.nr))
return;

@@ -194,6 +194,8 @@ int bch2_accounting_read(struct bch_fs *);
int bch2_dev_usage_remove(struct bch_fs *, unsigned);
int bch2_dev_usage_init(struct bch_dev *, bool);

+void bch2_verify_accounting_clean(struct bch_fs *c);
+
void bch2_accounting_free(struct bch_accounting_mem *);
void bch2_fs_accounting_exit(struct bch_fs *);

diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 6617c8912e51..201d7767e478 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -355,6 +355,7 @@ void bch2_fs_read_only(struct bch_fs *c)
BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));
BUG_ON(c->btree_write_buffer.inc.keys.nr);
BUG_ON(c->btree_write_buffer.flushing.keys.nr);
+ bch2_verify_accounting_clean(c);

bch_verbose(c, "marking filesystem clean");
bch2_fs_mark_clean(c);
--
2.43.0


2024-02-25 02:42:17

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 17/21] bcachefs: Eytzinger accumulation for accounting keys

The btree write buffer takes as input keys from the journal, sorts them,
deduplicates them, and flushes them back to the btree in sorted order.

The disk space accounting rewrite is moving accounting to normal btree
keys, with updates (in this case deltas) accumulated in the write buffer
and then flushed to the btree; but this is going to increase the number
of keys handled by the write buffer by perhaps as much as a factor of
3x-5x.

The overhead from copying around and sorting this many keys would cause
a significant performance regression, but there is huge locality in
updates to accounting keys that we can take advantage of.

Instead of appending accounting keys to the list of keys to be sorted,
this patch adds an eytzinger search tree of recently seen accounting
keys. We look up the accounting key in the eytzinger search tree and
apply the delta directly, adding it if it doesn't exist, and
periodically prune the eytzinger tree of unused entries.
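
As a rough userspace model of the accumulate-or-append idea - a plain
sorted array with bsearch() standing in for the kernel's eytzinger
layout, and acct_delta/acct_buf as hypothetical names; like the real
code, it only re-sorts on the append slowpath:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct acct_delta {
    uint64_t pos;   /* stands in for the accounting key's bpos */
    int64_t  d[3];  /* delta counters */
};

struct acct_buf {
    struct acct_delta *data;
    size_t nr, size;
};

static int delta_cmp(const void *_l, const void *_r)
{
    const struct acct_delta *l = _l, *r = _r;

    return (l->pos > r->pos) - (l->pos < r->pos);
}

/* fast path: accumulate into an existing entry; slow path: append + re-sort */
static int acct_buf_add(struct acct_buf *buf, const struct acct_delta *k)
{
    struct acct_delta *e = bsearch(k, buf->data, buf->nr,
                                   sizeof(*k), delta_cmp);
    if (e) {
        for (unsigned i = 0; i < 3; i++)
            e->d[i] += k->d[i];
        return 0;
    }

    if (buf->nr == buf->size) {
        size_t new_size = buf->size ? buf->size * 2 : 8;
        void *p = realloc(buf->data, new_size * sizeof(*buf->data));

        if (!p)
            return -1;
        buf->data = p;
        buf->size = new_size;
    }

    buf->data[buf->nr++] = *k;
    qsort(buf->data, buf->nr, sizeof(*buf->data), delta_cmp);
    return 0;
}

int main(void)
{
    struct acct_buf buf = { 0 };
    struct acct_delta k1 = { .pos = 10, .d = { 1, 100, 40 } };
    struct acct_delta k2 = { .pos = 10, .d = { 1,  50, 20 } };

    acct_buf_add(&buf, &k1);
    acct_buf_add(&buf, &k2);    /* hits the fast path, accumulates */

    printf("nr=%zu d={%lld,%lld,%lld}\n", buf.nr,
           (long long) buf.data[0].d[0],
           (long long) buf.data[0].d[1],
           (long long) buf.data[0].d[2]);
    free(buf.data);
    return 0;
}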

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/btree_write_buffer.c | 54 +++++++++++++++++++++++++-
fs/bcachefs/btree_write_buffer.h | 50 ++++++++++++++++++++++--
fs/bcachefs/btree_write_buffer_types.h | 2 +
fs/bcachefs/journal_io.c | 13 +++++--
4 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 002a0762fc85..13f5f63e22b7 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -531,6 +531,29 @@ static void bch2_btree_write_buffer_flush_work(struct work_struct *work)
bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
}

+static void wb_accounting_sort(struct btree_write_buffer *wb)
+{
+ eytzinger0_sort(wb->accounting.data, wb->accounting.nr,
+ sizeof(wb->accounting.data[0]),
+ wb_key_cmp, NULL);
+}
+
+int bch2_accounting_key_to_wb_slowpath(struct bch_fs *c, enum btree_id btree,
+ struct bkey_i_accounting *k)
+{
+ struct btree_write_buffer *wb = &c->btree_write_buffer;
+ struct btree_write_buffered_key new = { .btree = btree };
+
+ bkey_copy(&new.k, &k->k_i);
+
+ int ret = darray_push(&wb->accounting, new);
+ if (ret)
+ return ret;
+
+ wb_accounting_sort(wb);
+ return 0;
+}
+
int bch2_journal_key_to_wb_slowpath(struct bch_fs *c,
struct journal_keys_to_wb *dst,
enum btree_id btree, struct bkey_i *k)
@@ -600,11 +623,35 @@ void bch2_journal_keys_to_write_buffer_start(struct bch_fs *c, struct journal_ke

bch2_journal_pin_add(&c->journal, seq, &dst->wb->pin,
bch2_btree_write_buffer_journal_flush);
+
+ darray_for_each(wb->accounting, i)
+ memset(&i->k.v, 0, bkey_val_bytes(&i->k.k));
}

-void bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys_to_wb *dst)
+int bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys_to_wb *dst)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
+ unsigned live_accounting_keys = 0;
+ int ret = 0;
+
+ darray_for_each(wb->accounting, i)
+ if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&i->k))) {
+ i->journal_seq = dst->seq;
+ live_accounting_keys++;
+ ret = __bch2_journal_key_to_wb(c, dst, i->btree, &i->k);
+ if (ret)
+ break;
+ }
+
+ if (live_accounting_keys * 2 < wb->accounting.nr) {
+ struct btree_write_buffered_key *dst = wb->accounting.data;
+
+ darray_for_each(wb->accounting, src)
+ if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&src->k)))
+ *dst++ = *src;
+ wb->accounting.nr = dst - wb->accounting.data;
+ wb_accounting_sort(wb);
+ }

if (!dst->wb->keys.nr)
bch2_journal_pin_drop(&c->journal, &dst->wb->pin);
@@ -617,6 +664,8 @@ void bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys
if (dst->wb == &wb->flushing)
mutex_unlock(&wb->flushing.lock);
mutex_unlock(&wb->inc.lock);
+
+ return ret;
}

static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
@@ -640,7 +689,7 @@ static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_bu
buf->need_flush_to_write_buffer = false;
spin_unlock(&c->journal.lock);
out:
- bch2_journal_keys_to_write_buffer_end(c, &dst);
+ ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
return ret;
}

@@ -672,6 +721,7 @@ void bch2_fs_btree_write_buffer_exit(struct bch_fs *c)
BUG_ON((wb->inc.keys.nr || wb->flushing.keys.nr) &&
!bch2_journal_error(&c->journal));

+ darray_exit(&wb->accounting);
darray_exit(&wb->sorted);
darray_exit(&wb->flushing.keys);
darray_exit(&wb->inc.keys);
diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h
index eebcd2b15249..828e2deaaa3d 100644
--- a/fs/bcachefs/btree_write_buffer.h
+++ b/fs/bcachefs/btree_write_buffer.h
@@ -3,6 +3,8 @@
#define _BCACHEFS_BTREE_WRITE_BUFFER_H

#include "bkey.h"
+#include "disk_accounting.h"
+#include <linux/eytzinger.h>

static inline bool bch2_btree_write_buffer_should_flush(struct bch_fs *c)
{
@@ -29,16 +31,45 @@ struct journal_keys_to_wb {
u64 seq;
};

+static inline int wb_key_cmp(const void *_l, const void *_r)
+{
+ const struct btree_write_buffered_key *l = _l;
+ const struct btree_write_buffered_key *r = _r;
+
+ return cmp_int(l->btree, r->btree) ?: bpos_cmp(l->k.k.p, r->k.k.p);
+}
+
+int bch2_accounting_key_to_wb_slowpath(struct bch_fs *,
+ enum btree_id, struct bkey_i_accounting *);
+
+static inline int bch2_accounting_key_to_wb(struct bch_fs *c,
+ enum btree_id btree, struct bkey_i_accounting *k)
+{
+ struct btree_write_buffer *wb = &c->btree_write_buffer;
+ struct btree_write_buffered_key search;
+ search.btree = btree;
+ search.k.k.p = k->k.p;
+
+ unsigned idx = eytzinger0_find(wb->accounting.data, wb->accounting.nr,
+ sizeof(wb->accounting.data[0]),
+ wb_key_cmp, &search);
+
+ if (idx >= wb->accounting.nr)
+ return bch2_accounting_key_to_wb_slowpath(c, btree, k);
+
+ struct bkey_i_accounting *dst = bkey_i_to_accounting(&wb->accounting.data[idx].k);
+ bch2_accounting_accumulate(dst, accounting_i_to_s_c(k));
+ return 0;
+}
+
int bch2_journal_key_to_wb_slowpath(struct bch_fs *,
struct journal_keys_to_wb *,
enum btree_id, struct bkey_i *);

-static inline int bch2_journal_key_to_wb(struct bch_fs *c,
+static inline int __bch2_journal_key_to_wb(struct bch_fs *c,
struct journal_keys_to_wb *dst,
enum btree_id btree, struct bkey_i *k)
{
- EBUG_ON(!dst->seq);
-
if (unlikely(!dst->room))
return bch2_journal_key_to_wb_slowpath(c, dst, btree, k);

@@ -51,8 +82,19 @@ static inline int bch2_journal_key_to_wb(struct bch_fs *c,
return 0;
}

+static inline int bch2_journal_key_to_wb(struct bch_fs *c,
+ struct journal_keys_to_wb *dst,
+ enum btree_id btree, struct bkey_i *k)
+{
+ EBUG_ON(!dst->seq);
+
+ return k->k.type == KEY_TYPE_accounting
+ ? bch2_accounting_key_to_wb(c, btree, bkey_i_to_accounting(k))
+ : __bch2_journal_key_to_wb(c, dst, btree, k);
+}
+
void bch2_journal_keys_to_write_buffer_start(struct bch_fs *, struct journal_keys_to_wb *, u64);
-void bch2_journal_keys_to_write_buffer_end(struct bch_fs *, struct journal_keys_to_wb *);
+int bch2_journal_keys_to_write_buffer_end(struct bch_fs *, struct journal_keys_to_wb *);

int bch2_btree_write_buffer_resize(struct bch_fs *, size_t);
void bch2_fs_btree_write_buffer_exit(struct bch_fs *);
diff --git a/fs/bcachefs/btree_write_buffer_types.h b/fs/bcachefs/btree_write_buffer_types.h
index 5f248873087c..d39d163c6ea9 100644
--- a/fs/bcachefs/btree_write_buffer_types.h
+++ b/fs/bcachefs/btree_write_buffer_types.h
@@ -52,6 +52,8 @@ struct btree_write_buffer {
struct btree_write_buffer_keys inc;
struct btree_write_buffer_keys flushing;
struct work_struct flush_work;
+
+ DARRAY(struct btree_write_buffered_key) accounting;
};

#endif /* _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H */
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index b37b75ccd602..3ea2be99d411 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1815,7 +1815,8 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
jset_entry_for_each_key(i, k) {
ret = bch2_journal_key_to_wb(c, &wb, i->btree_id, k);
if (ret) {
- bch2_fs_fatal_error(c, "-ENOMEM flushing journal keys to btree write buffer");
+ bch2_fs_fatal_error(c, "error flushing journal keys to btree write buffer: %s",
+ bch2_err_str(ret));
bch2_journal_keys_to_write_buffer_end(c, &wb);
return ret;
}
@@ -1825,8 +1826,14 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
}
}

- if (wb.wb)
- bch2_journal_keys_to_write_buffer_end(c, &wb);
+ if (wb.wb) {
+ ret = bch2_journal_keys_to_write_buffer_end(c, &wb);
+ if (ret) {
+ bch2_fs_fatal_error(c, "error flushing journal keys to btree write buffer: %s",
+ bch2_err_str(ret));
+ return ret;
+ }
+ }

spin_lock(&c->journal.lock);
w->need_flush_to_write_buffer = false;
--
2.43.0


2024-02-25 02:42:24

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 18/21] bcachefs: bch_acct_compression

This adds per-compression-type accounting of compressed and uncompressed
size as well as number of extents - meaning we can now see compression
ratio (without walking the whole filesystem).
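
With the counter layout used below (d[0] = number of extents, d[1] =
uncompressed sectors, d[2] = compressed sectors) the ratio falls out
directly; a small sketch with made-up values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* d[0] = nr extents, d[1] = uncompressed, d[2] = compressed (sectors) */
    uint64_t d[3] = { 1024, 819200, 204800 };

    if (d[2])
        printf("%llu extents, %.2fx compression ratio\n",
               (unsigned long long) d[0],
               (double) d[1] / (double) d[2]);
    return 0;
}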

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/buckets.c | 45 ++++++++++++++++++++++++----
fs/bcachefs/disk_accounting.c | 4 +++
fs/bcachefs/disk_accounting_format.h | 8 ++++-
3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 506bb580bff4..6078b67e51cf 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -503,6 +503,7 @@ static int __trigger_extent(struct btree_trans *trans,
: BCH_DATA_user;
s64 dirty_sectors = 0;
int ret = 0;
+ u64 compression_acct[3] = { 1, 0, 0 };

struct disk_accounting_key acc = {
.type = BCH_DISK_ACCOUNTING_replicas,
@@ -511,6 +512,10 @@ static int __trigger_extent(struct btree_trans *trans,
.replicas.nr_required = 1,
};

+ struct disk_accounting_key compression_key = {
+ .type = BCH_DISK_ACCOUNTING_compression,
+ };
+
bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
s64 disk_sectors;
ret = bch2_trigger_pointer(trans, btree_id, level, k, p, &disk_sectors, flags);
@@ -519,12 +524,13 @@ static int __trigger_extent(struct btree_trans *trans,

bool stale = ret > 0;

+ if (p.ptr.cached && stale)
+ continue;
+
if (p.ptr.cached) {
- if (!stale) {
- ret = bch2_mod_dev_cached_sectors(trans, p.ptr.dev, disk_sectors, gc);
- if (ret)
- return ret;
- }
+ ret = bch2_mod_dev_cached_sectors(trans, p.ptr.dev, disk_sectors, gc);
+ if (ret)
+ return ret;
} else if (!p.has_ec) {
dirty_sectors += disk_sectors;
acc.replicas.devs[acc.replicas.nr_devs++] = p.ptr.dev;
@@ -540,6 +546,26 @@ static int __trigger_extent(struct btree_trans *trans,
*/
acc.replicas.nr_required = 0;
}
+
+ if (compression_key.compression.type &&
+ compression_key.compression.type != p.crc.compression_type) {
+ if (flags & BTREE_TRIGGER_OVERWRITE)
+ bch2_u64s_neg(compression_acct, 3);
+
+ ret = bch2_disk_accounting_mod(trans, &compression_key, compression_acct, 2, gc);
+ if (ret)
+ return ret;
+
+ compression_acct[0] = 1;
+ compression_acct[1] = 0;
+ compression_acct[2] = 0;
+ }
+
+ compression_key.compression.type = p.crc.compression_type;
+ if (p.crc.compression_type) {
+ compression_acct[1] += p.crc.uncompressed_size;
+ compression_acct[2] += p.crc.compressed_size;
+ }
}

if (acc.replicas.nr_devs) {
@@ -548,6 +574,15 @@ static int __trigger_extent(struct btree_trans *trans,
return ret;
}

+ if (compression_key.compression.type) {
+ if (flags & BTREE_TRIGGER_OVERWRITE)
+ bch2_u64s_neg(compression_acct, 3);
+
+ ret = bch2_disk_accounting_mod(trans, &compression_key, compression_acct, 3, gc);
+ if (ret)
+ return ret;
+ }
+
return 0;
}

diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index 8d7b6ab66e71..dc020d651d0a 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -5,6 +5,7 @@
#include "btree_update.h"
#include "btree_write_buffer.h"
#include "buckets.h"
+#include "compress.h"
#include "disk_accounting.h"
#include "error.h"
#include "journal_io.h"
@@ -91,6 +92,9 @@ void bch2_accounting_key_to_text(struct printbuf *out, struct disk_accounting_ke
case BCH_DISK_ACCOUNTING_dev_stripe_buckets:
prt_printf(out, "dev=%u", k->dev_stripe_buckets.dev);
break;
+ case BCH_DISK_ACCOUNTING_compression:
+ bch2_prt_compression_type(out, k->compression.type);
+ break;
}
}

diff --git a/fs/bcachefs/disk_accounting_format.h b/fs/bcachefs/disk_accounting_format.h
index e06a42f0d578..75bfc9bce79f 100644
--- a/fs/bcachefs/disk_accounting_format.h
+++ b/fs/bcachefs/disk_accounting_format.h
@@ -95,7 +95,8 @@ static inline bool data_type_is_hidden(enum bch_data_type type)
x(persistent_reserved, 1) \
x(replicas, 2) \
x(dev_data_type, 3) \
- x(dev_stripe_buckets, 4)
+ x(dev_stripe_buckets, 4) \
+ x(compression, 5)

enum disk_accounting_type {
#define x(f, nr) BCH_DISK_ACCOUNTING_##f = nr,
@@ -120,6 +121,10 @@ struct bch_dev_stripe_buckets {
__u8 dev;
};

+struct bch_acct_compression {
+ __u8 type;
+};
+
struct disk_accounting_key {
union {
struct {
@@ -130,6 +135,7 @@ struct disk_accounting_key {
struct bch_replicas_entry_v1 replicas;
struct bch_dev_data_type dev_data_type;
struct bch_dev_stripe_buckets dev_stripe_buckets;
+ struct bch_acct_compression compression;
};
};
struct bpos _pad;
--
2.43.0


2024-02-25 02:42:39

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 20/21] bcachefs: bch2_fs_accounting_to_text()

Helper to show raw accounting in sysfs, mainly for debugging.
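
The output format, inferred from bch2_fs_accounting_to_text() below, is
one accounting key per line followed by its counters; with made-up
values, something like:

  dev=0: 37
  lz4: 1024 819200 204800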

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/disk_accounting.c | 26 ++++++++++++++++++++++++++
fs/bcachefs/disk_accounting.h | 1 +
fs/bcachefs/sysfs.c | 5 +++++
3 files changed, 32 insertions(+)

diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index dc020d651d0a..9d6ca2ea307b 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -286,6 +286,32 @@ int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
return ret;
}

+void bch2_fs_accounting_to_text(struct printbuf *out, struct bch_fs *c)
+{
+ struct bch_accounting_mem *acc = &c->accounting[0];
+
+ percpu_down_read(&c->mark_lock);
+ out->atomic++;
+
+ eytzinger0_for_each(i, acc->k.nr) {
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, acc->k.data[i].pos);
+
+ bch2_accounting_key_to_text(out, &acc_k);
+
+ u64 v[BCH_ACCOUNTING_MAX_COUNTERS];
+ bch2_accounting_mem_read_counters(c, i, v, ARRAY_SIZE(v), false);
+
+ prt_str(out, ":");
+ for (unsigned j = 0; j < acc->k.data[i].nr_counters; j++)
+ prt_printf(out, " %llu", v[j]);
+ prt_newline(out);
+ }
+
+ --out->atomic;
+ percpu_up_read(&c->mark_lock);
+}
+
static int accounting_write_key(struct btree_trans *trans, struct bpos pos, u64 *v, unsigned nr_counters)
{
struct bkey_i_accounting *a = bch2_trans_kmalloc(trans, sizeof(*a) + sizeof(*v) * nr_counters);
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index a0cf7a0b84a7..c4a8b9cce6ba 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -186,6 +186,7 @@ static inline void bch2_accounting_mem_read(struct bch_fs *c, struct bpos p,
}

int bch2_fs_replicas_usage_read(struct bch_fs *, darray_char *);
+void bch2_fs_accounting_to_text(struct printbuf *, struct bch_fs *);

int bch2_accounting_gc_done(struct bch_fs *);

diff --git a/fs/bcachefs/sysfs.c b/fs/bcachefs/sysfs.c
index 287a0bf920db..10470cef30f0 100644
--- a/fs/bcachefs/sysfs.c
+++ b/fs/bcachefs/sysfs.c
@@ -204,6 +204,7 @@ read_attribute(disk_groups);

read_attribute(has_data);
read_attribute(alloc_debug);
+read_attribute(accounting);

#define x(t, n, ...) read_attribute(t);
BCH_PERSISTENT_COUNTERS()
@@ -413,6 +414,9 @@ SHOW(bch2_fs)
if (attr == &sysfs_disk_groups)
bch2_disk_groups_to_text(out, c);

+ if (attr == &sysfs_accounting)
+ bch2_fs_accounting_to_text(out, c);
+
return 0;
}

@@ -625,6 +629,7 @@ struct attribute *bch2_fs_internal_files[] = {
&sysfs_internal_uuid,

&sysfs_disk_groups,
+ &sysfs_accounting,
NULL
};

--
2.43.0


2024-02-25 02:42:53

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 21/21] bcachefs: bch2_fs_usage_base_to_text()

Helper to show the summed bch_fs_usage_base counters in sysfs, mainly
for debugging.

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/sysfs.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/fs/bcachefs/sysfs.c b/fs/bcachefs/sysfs.c
index 10470cef30f0..27aca70cb385 100644
--- a/fs/bcachefs/sysfs.c
+++ b/fs/bcachefs/sysfs.c
@@ -205,6 +205,7 @@ read_attribute(disk_groups);
read_attribute(has_data);
read_attribute(alloc_debug);
read_attribute(accounting);
+read_attribute(usage_base);

#define x(t, n, ...) read_attribute(t);
BCH_PERSISTENT_COUNTERS()
@@ -329,6 +330,20 @@ static void bch2_btree_wakeup_all(struct bch_fs *c)
seqmutex_unlock(&c->btree_trans_lock);
}

+static void bch2_fs_usage_base_to_text(struct printbuf *out, struct bch_fs *c)
+{
+ struct bch_fs_usage_base b = {};
+
+ acc_u64s_percpu(&b.hidden, &c->usage->hidden, sizeof(b) / sizeof(u64));
+
+ prt_printf(out, "hidden:\t\t%llu\n", b.hidden);
+ prt_printf(out, "btree:\t\t%llu\n", b.btree);
+ prt_printf(out, "data:\t\t%llu\n", b.data);
+ prt_printf(out, "cached:\t%llu\n", b.cached);
+ prt_printf(out, "reserved:\t\t%llu\n", b.reserved);
+ prt_printf(out, "nr_inodes:\t%llu\n", b.nr_inodes);
+}
+
SHOW(bch2_fs)
{
struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
@@ -417,6 +432,9 @@ SHOW(bch2_fs)
if (attr == &sysfs_accounting)
bch2_fs_accounting_to_text(out, c);

+ if (attr == &sysfs_usage_base)
+ bch2_fs_usage_base_to_text(out, c);
+
return 0;
}

@@ -630,6 +648,7 @@ struct attribute *bch2_fs_internal_files[] = {

&sysfs_disk_groups,
&sysfs_accounting,
+ &sysfs_usage_base,
NULL
};

--
2.43.0


2024-02-25 02:43:00

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 14/21] bcachefs: Convert gc to new accounting

Rewrite fsck/gc for the new accounting scheme.

This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all triggers in TRIGGER_GC mode, then
compare what we calculated against the existing in-memory accounting at
the end.
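
A userspace sketch of that final comparison, assuming the live and gc
counter sets share an index (c->accounting[0] vs c->accounting[1] in the
patch); bch2_accounting_gc_done() does the real walk, emitting fsck
errors and repairing the live counters on mismatch, much like the old
copy_field() macro this patch removes:

#include <stdint.h>
#include <stdio.h>

static void gc_compare(uint64_t *live, const uint64_t *gc, unsigned nr)
{
    for (unsigned i = 0; i < nr; i++)
        if (live[i] != gc[i]) {
            fprintf(stderr, "counter %u: got %llu, should be %llu\n",
                    i, (unsigned long long) live[i],
                    (unsigned long long) gc[i]);
            live[i] = gc[i];    /* repair, like the old copy_field() */
        }
}

int main(void)
{
    uint64_t live[2] = { 10, 4 };
    uint64_t gc[2]   = { 10, 5 };

    gc_compare(live, gc, 2);
    return 0;
}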

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/alloc_background.c | 181 +++++++++++-----------
fs/bcachefs/alloc_background.h | 2 +
fs/bcachefs/bcachefs.h | 4 +-
fs/bcachefs/btree_gc.c | 257 ++++++++++---------------------
fs/bcachefs/btree_trans_commit.c | 4 +-
fs/bcachefs/buckets.c | 182 ++++------------------
fs/bcachefs/buckets.h | 20 +--
fs/bcachefs/buckets_types.h | 7 -
fs/bcachefs/disk_accounting.c | 171 +++++++++++++++++---
fs/bcachefs/disk_accounting.h | 81 ++++++----
fs/bcachefs/ec.c | 148 +++++++++---------
fs/bcachefs/inode.c | 43 ++----
fs/bcachefs/recovery.c | 3 +-
fs/bcachefs/replicas.c | 86 +----------
fs/bcachefs/replicas.h | 1 -
fs/bcachefs/super.c | 9 +-
16 files changed, 508 insertions(+), 691 deletions(-)

diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index d8ad5bb28a7f..54cb345b104c 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -731,6 +731,96 @@ static noinline int bch2_bucket_gen_update(struct btree_trans *trans,
return ret;
}

+static int bch2_alloc_key_to_dev_counters(struct btree_trans *trans, struct bch_dev *ca,
+ const struct bch_alloc_v4 *old_a,
+ const struct bch_alloc_v4 *new_a,
+ unsigned flags)
+{
+ bool gc = flags & BTREE_TRIGGER_GC;
+
+ if ((flags & BTREE_TRIGGER_BUCKET_INVALIDATE) &&
+ old_a->cached_sectors) {
+ int ret = bch2_mod_dev_cached_sectors(trans, ca->dev_idx,
+ -((s64) old_a->cached_sectors), gc);
+ if (ret)
+ return ret;
+ }
+
+ if (old_a->data_type != new_a->data_type ||
+ old_a->dirty_sectors != new_a->dirty_sectors) {
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_data_type,
+ .dev_data_type.dev = ca->dev_idx,
+ .dev_data_type.data_type = new_a->data_type,
+ };
+ s64 d[3];
+
+ if (old_a->data_type == new_a->data_type) {
+ d[0] = 0;
+ d[1] = (s64) new_a->dirty_sectors - (s64) old_a->dirty_sectors;
+ d[2] = bucket_sectors_fragmented(ca, *new_a) -
+ bucket_sectors_fragmented(ca, *old_a);
+
+ int ret = bch2_disk_accounting_mod(trans, &acc, d, 3, gc);
+ if (ret)
+ return ret;
+ } else {
+ d[0] = 1;
+ d[1] = new_a->dirty_sectors;
+ d[2] = bucket_sectors_fragmented(ca, *new_a);
+
+ int ret = bch2_disk_accounting_mod(trans, &acc, d, 3, gc);
+ if (ret)
+ return ret;
+
+ acc.dev_data_type.data_type = old_a->data_type;
+ d[0] = -1;
+ d[1] = -(s64) old_a->dirty_sectors;
+ d[2] = -bucket_sectors_fragmented(ca, *old_a);
+
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3, gc);
+ if (ret)
+ return ret;
+ }
+ }
+
+ if (!!old_a->stripe != !!new_a->stripe) {
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_stripe_buckets,
+ .dev_stripe_buckets.dev = ca->dev_idx,
+ };
+ u64 d[1];
+
+ d[0] = (s64) !!new_a->stripe - (s64) !!old_a->stripe;
+ int ret = bch2_disk_accounting_mod(trans, &acc, d, 1, gc);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static inline struct bch_alloc_v4 bucket_m_to_alloc(struct bucket b)
+{
+ return (struct bch_alloc_v4) {
+ .gen = b.gen,
+ .data_type = b.data_type,
+ .dirty_sectors = b.dirty_sectors,
+ .cached_sectors = b.cached_sectors,
+ .stripe = b.stripe,
+ };
+}
+
+int bch2_bucket_to_dev_counters(struct btree_trans *trans, struct bch_dev *ca,
+ struct bucket *old, struct bucket *new,
+ unsigned flags)
+{
+ struct bch_alloc_v4 old_a = bucket_m_to_alloc(*old);
+ struct bch_alloc_v4 new_a = bucket_m_to_alloc(*new);
+
+ return bch2_alloc_key_to_dev_counters(trans, ca, &old_a, &new_a, flags);
+}
+
int bch2_trigger_alloc(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bkey_s_c old, struct bkey_s new,
@@ -807,70 +897,9 @@ int bch2_trigger_alloc(struct btree_trans *trans,
return ret;
}

- /*
- * need to know if we're getting called from the invalidate path or
- * not:
- */
-
- if ((flags & BTREE_TRIGGER_BUCKET_INVALIDATE) &&
- old_a->cached_sectors) {
- ret = bch2_mod_dev_cached_sectors(trans, new.k->p.inode,
- -((s64) old_a->cached_sectors));
- if (ret)
- return ret;
- }
-
-
- if (old_a->data_type != new_a->data_type ||
- old_a->dirty_sectors != new_a->dirty_sectors) {
- struct disk_accounting_key acc = {
- .type = BCH_DISK_ACCOUNTING_dev_data_type,
- .dev_data_type.dev = new.k->p.inode,
- .dev_data_type.data_type = new_a->data_type,
- };
- s64 d[3];
-
- if (old_a->data_type == new_a->data_type) {
- d[0] = 0;
- d[1] = (s64) new_a->dirty_sectors - (s64) old_a->dirty_sectors;
- d[2] = bucket_sectors_fragmented(ca, *new_a) -
- bucket_sectors_fragmented(ca, *old_a);
-
- ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
- if (ret)
- return ret;
- } else {
- d[0] = 1;
- d[1] = new_a->dirty_sectors;
- d[2] = bucket_sectors_fragmented(ca, *new_a);
-
- ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
- if (ret)
- return ret;
-
- acc.dev_data_type.data_type = old_a->data_type;
- d[0] = -1;
- d[1] = -(s64) old_a->dirty_sectors;
- d[2] = -bucket_sectors_fragmented(ca, *old_a);
-
- ret = bch2_disk_accounting_mod(trans, &acc, d, 3);
- if (ret)
- return ret;
- }
- }
-
- if (!!old_a->stripe != !!new_a->stripe) {
- struct disk_accounting_key acc = {
- .type = BCH_DISK_ACCOUNTING_dev_stripe_buckets,
- .dev_stripe_buckets.dev = new.k->p.inode,
- };
- u64 d[1];
-
- d[0] = (s64) !!new_a->stripe - (s64) !!old_a->stripe;
- ret = bch2_disk_accounting_mod(trans, &acc, d, 1);
- if (ret)
- return ret;
- }
+ ret = bch2_alloc_key_to_dev_counters(trans, ca, old_a, new_a, flags);
+ if (ret)
+ return ret;
}

if ((flags & BTREE_TRIGGER_ATOMIC) && (flags & BTREE_TRIGGER_INSERT)) {
@@ -938,30 +967,6 @@ int bch2_trigger_alloc(struct btree_trans *trans,
bch2_do_gc_gens(c);
}

- if ((flags & BTREE_TRIGGER_GC) &&
- (flags & BTREE_TRIGGER_BUCKET_INVALIDATE)) {
- struct bch_alloc_v4 new_a_convert;
- const struct bch_alloc_v4 *new_a = bch2_alloc_to_v4(new.s_c, &new_a_convert);
-
- percpu_down_read(&c->mark_lock);
- struct bucket *g = gc_bucket(ca, new.k->p.offset);
-
- bucket_lock(g);
-
- g->gen_valid = 1;
- g->gen = new_a->gen;
- g->data_type = new_a->data_type;
- g->stripe = new_a->stripe;
- g->stripe_redundancy = new_a->stripe_redundancy;
- g->dirty_sectors = new_a->dirty_sectors;
- g->cached_sectors = new_a->cached_sectors;
-
- bucket_unlock(g);
- percpu_up_read(&c->mark_lock);
-
- bch2_dev_usage_update(c, ca, old_a, new_a);
- }
-
return 0;
}

diff --git a/fs/bcachefs/alloc_background.h b/fs/bcachefs/alloc_background.h
index 052b2fac25d6..6f273a456a6d 100644
--- a/fs/bcachefs/alloc_background.h
+++ b/fs/bcachefs/alloc_background.h
@@ -228,6 +228,8 @@ static inline bool bkey_is_alloc(const struct bkey *k)

int bch2_alloc_read(struct bch_fs *);

+int bch2_bucket_to_dev_counters(struct btree_trans *, struct bch_dev *,
+ struct bucket *, struct bucket *, unsigned);
int bch2_trigger_alloc(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_s, unsigned);
int bch2_check_alloc_info(struct bch_fs *);
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 22dc455cb436..41c436c608cf 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -577,7 +577,6 @@ struct bch_dev {
struct rw_semaphore bucket_lock;

struct bch_dev_usage __percpu *usage;
- struct bch_dev_usage __percpu *usage_gc;

/* Allocator: */
u64 new_fs_bucket_idx;
@@ -761,7 +760,7 @@ struct bch_fs {

struct bch_dev __rcu *devs[BCH_SB_MEMBERS_MAX];

- struct bch_accounting_mem accounting;
+ struct bch_accounting_mem accounting[2];

struct bch_replicas_cpu replicas;
struct bch_replicas_cpu replicas_gc;
@@ -906,7 +905,6 @@ struct bch_fs {

seqcount_t usage_lock;
struct bch_fs_usage_base __percpu *usage;
- struct bch_fs_usage __percpu *usage_gc;
u64 __percpu *online_reserved;

struct io_clock io_clock[2];
diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c
index 15a8796197f3..54a90e88f5b8 100644
--- a/fs/bcachefs/btree_gc.c
+++ b/fs/bcachefs/btree_gc.c
@@ -18,6 +18,7 @@
#include "buckets.h"
#include "clock.h"
#include "debug.h"
+#include "disk_accounting.h"
#include "ec.h"
#include "error.h"
#include "extents.h"
@@ -1115,10 +1116,10 @@ static int bch2_gc_btrees(struct bch_fs *c, bool initial, bool metadata_only)
return ret;
}

-static void mark_metadata_sectors(struct bch_fs *c, struct bch_dev *ca,
- u64 start, u64 end,
- enum bch_data_type type,
- unsigned flags)
+static int mark_metadata_sectors(struct btree_trans *trans, struct bch_dev *ca,
+ u64 start, u64 end,
+ enum bch_data_type type,
+ unsigned flags)
{
u64 b = sector_to_bucket(ca, start);

@@ -1126,48 +1127,68 @@ static void mark_metadata_sectors(struct bch_fs *c, struct bch_dev *ca,
unsigned sectors =
min_t(u64, bucket_to_sector(ca, b + 1), end) - start;

- bch2_mark_metadata_bucket(c, ca, b, type, sectors,
- gc_phase(GC_PHASE_SB), flags);
+ int ret = bch2_mark_metadata_bucket(trans, ca, b, type, sectors,
+ gc_phase(GC_PHASE_SB), flags);
+ if (ret)
+ return ret;
+
b++;
start += sectors;
} while (start < end);
+
+ return 0;
}

-static void bch2_mark_dev_superblock(struct bch_fs *c, struct bch_dev *ca,
- unsigned flags)
+static int bch2_mark_dev_superblock(struct btree_trans *trans, struct bch_dev *ca,
+ unsigned flags)
{
struct bch_sb_layout *layout = &ca->disk_sb.sb->layout;
- unsigned i;
- u64 b;

- for (i = 0; i < layout->nr_superblocks; i++) {
+ for (unsigned i = 0; i < layout->nr_superblocks; i++) {
u64 offset = le64_to_cpu(layout->sb_offset[i]);

- if (offset == BCH_SB_SECTOR)
- mark_metadata_sectors(c, ca, 0, BCH_SB_SECTOR,
- BCH_DATA_sb, flags);
+ if (offset == BCH_SB_SECTOR) {
+ int ret = mark_metadata_sectors(trans, ca, 0, BCH_SB_SECTOR,
+ BCH_DATA_sb, flags);
+ if (ret)
+ return ret;
+ }

- mark_metadata_sectors(c, ca, offset,
+ int ret = mark_metadata_sectors(trans, ca, offset,
offset + (1 << layout->sb_max_size_bits),
BCH_DATA_sb, flags);
+ if (ret)
+ return ret;
}

- for (i = 0; i < ca->journal.nr; i++) {
- b = ca->journal.buckets[i];
- bch2_mark_metadata_bucket(c, ca, b, BCH_DATA_journal,
- ca->mi.bucket_size,
+ for (unsigned i = 0; i < ca->journal.nr; i++) {
+ int ret = bch2_mark_metadata_bucket(trans, ca, ca->journal.buckets[i],
+ BCH_DATA_journal, ca->mi.bucket_size,
gc_phase(GC_PHASE_SB), flags);
+ if (ret)
+ return ret;
}
+
+ return 0;
}

-static void bch2_mark_superblocks(struct bch_fs *c)
+static int bch2_mark_superblocks(struct btree_trans *trans)
{
+ struct bch_fs *c = trans->c;
+
mutex_lock(&c->sb_lock);
gc_pos_set(c, gc_phase(GC_PHASE_SB));

- for_each_online_member(c, ca)
- bch2_mark_dev_superblock(c, ca, BTREE_TRIGGER_GC);
+ for_each_online_member(c, ca) {
+ int ret = bch2_mark_dev_superblock(trans, ca, BTREE_TRIGGER_GC);
+ if (ret) {
+ percpu_ref_put(&ca->io_ref);
+ return ret;
+ }
+ }
mutex_unlock(&c->sb_lock);
+
+ return 0;
}

#if 0
@@ -1190,146 +1211,25 @@ static void bch2_mark_pending_btree_node_frees(struct bch_fs *c)

static void bch2_gc_free(struct bch_fs *c)
{
+ bch2_accounting_free(&c->accounting[1]);
+
genradix_free(&c->reflink_gc_table);
genradix_free(&c->gc_stripes);

for_each_member_device(c, ca) {
kvfree(rcu_dereference_protected(ca->buckets_gc, 1));
ca->buckets_gc = NULL;
-
- free_percpu(ca->usage_gc);
- ca->usage_gc = NULL;
- }
-
- free_percpu(c->usage_gc);
- c->usage_gc = NULL;
-}
-
-static int bch2_gc_done(struct bch_fs *c,
- bool initial, bool metadata_only)
-{
- struct bch_dev *ca = NULL;
- struct printbuf buf = PRINTBUF;
- bool verify = !metadata_only &&
- !c->opts.reconstruct_alloc &&
- (!initial || (c->sb.compat & (1ULL << BCH_COMPAT_alloc_info)));
- unsigned i;
- int ret = 0;
-
- percpu_down_write(&c->mark_lock);
-
-#define copy_field(_err, _f, _msg, ...) \
- if (dst->_f != src->_f && \
- (!verify || \
- fsck_err(c, _err, _msg ": got %llu, should be %llu" \
- , ##__VA_ARGS__, dst->_f, src->_f))) \
- dst->_f = src->_f
-#define copy_dev_field(_err, _f, _msg, ...) \
- copy_field(_err, _f, "dev %u has wrong " _msg, ca->dev_idx, ##__VA_ARGS__)
-#define copy_fs_field(_err, _f, _msg, ...) \
- copy_field(_err, _f, "fs has wrong " _msg, ##__VA_ARGS__)
-
- __for_each_member_device(c, ca) {
- /* XXX */
- struct bch_dev_usage *dst = this_cpu_ptr(ca->usage);
- struct bch_dev_usage *src = (void *)
- bch2_acc_percpu_u64s((u64 __percpu *) ca->usage_gc,
- dev_usage_u64s());
-
- for (i = 0; i < BCH_DATA_NR; i++) {
- copy_dev_field(dev_usage_buckets_wrong,
- d[i].buckets, "%s buckets", bch2_data_type_str(i));
- copy_dev_field(dev_usage_sectors_wrong,
- d[i].sectors, "%s sectors", bch2_data_type_str(i));
- copy_dev_field(dev_usage_fragmented_wrong,
- d[i].fragmented, "%s fragmented", bch2_data_type_str(i));
- }
- }
-
- {
-#if 0
- unsigned nr = fs_usage_u64s(c);
- /* XX: */
- struct bch_fs_usage *dst = this_cpu_ptr(c->usage);
- struct bch_fs_usage *src = (void *)
- bch2_acc_percpu_u64s((u64 __percpu *) c->usage_gc, nr);
-
- copy_fs_field(fs_usage_hidden_wrong,
- b.hidden, "hidden");
- copy_fs_field(fs_usage_btree_wrong,
- b.btree, "btree");
-
- if (!metadata_only) {
- copy_fs_field(fs_usage_data_wrong,
- b.data, "data");
- copy_fs_field(fs_usage_cached_wrong,
- b.cached, "cached");
- copy_fs_field(fs_usage_reserved_wrong,
- b.reserved, "reserved");
- copy_fs_field(fs_usage_nr_inodes_wrong,
- b.nr_inodes,"nr_inodes");
-
- for (i = 0; i < BCH_REPLICAS_MAX; i++)
- copy_fs_field(fs_usage_persistent_reserved_wrong,
- persistent_reserved[i],
- "persistent_reserved[%i]", i);
- }
-
- for (i = 0; i < c->replicas.nr; i++) {
- struct bch_replicas_entry_v1 *e =
- cpu_replicas_entry(&c->replicas, i);
-
- if (metadata_only &&
- (e->data_type == BCH_DATA_user ||
- e->data_type == BCH_DATA_cached))
- continue;
-
- printbuf_reset(&buf);
- bch2_replicas_entry_to_text(&buf, e);
-
- copy_fs_field(fs_usage_replicas_wrong,
- replicas[i], "%s", buf.buf);
- }
-#endif
}
-
-#undef copy_fs_field
-#undef copy_dev_field
-#undef copy_stripe_field
-#undef copy_field
-fsck_err:
- if (ca)
- percpu_ref_put(&ca->ref);
- bch_err_fn(c, ret);
-
- percpu_up_write(&c->mark_lock);
- printbuf_exit(&buf);
- return ret;
}

static int bch2_gc_start(struct bch_fs *c)
{
- BUG_ON(c->usage_gc);
-
- c->usage_gc = __alloc_percpu_gfp(fs_usage_u64s(c) * sizeof(u64),
- sizeof(u64), GFP_KERNEL);
- if (!c->usage_gc) {
- bch_err(c, "error allocating c->usage_gc");
- return -BCH_ERR_ENOMEM_gc_start;
- }
-
for_each_member_device(c, ca) {
- BUG_ON(ca->usage_gc);
-
- ca->usage_gc = alloc_percpu(struct bch_dev_usage);
- if (!ca->usage_gc) {
- bch_err(c, "error allocating ca->usage_gc");
+ int ret = bch2_dev_usage_init(ca, true);
+ if (ret) {
percpu_ref_put(&ca->ref);
- return -BCH_ERR_ENOMEM_gc_start;
+ return ret;
}
-
- this_cpu_write(ca->usage_gc->d[BCH_DATA_free].buckets,
- ca->mi.nbuckets - ca->mi.first_bucket);
}

return 0;
@@ -1337,13 +1237,7 @@ static int bch2_gc_start(struct bch_fs *c)

static int bch2_gc_reset(struct bch_fs *c)
{
- for_each_member_device(c, ca) {
- free_percpu(ca->usage_gc);
- ca->usage_gc = NULL;
- }
-
- free_percpu(c->usage_gc);
- c->usage_gc = NULL;
+ bch2_accounting_free(&c->accounting[1]);

return bch2_gc_start(c);
}
@@ -1368,7 +1262,7 @@ static int bch2_alloc_write_key(struct btree_trans *trans,
{
struct bch_fs *c = trans->c;
struct bch_dev *ca = bch_dev_bkey_exists(c, iter->pos.inode);
- struct bucket gc, *b;
+ struct bucket gc;
struct bkey_i_alloc_v4 *a;
struct bch_alloc_v4 old_convert, new;
const struct bch_alloc_v4 *old;
@@ -1379,30 +1273,39 @@ static int bch2_alloc_write_key(struct btree_trans *trans,
new = *old;

percpu_down_read(&c->mark_lock);
- b = gc_bucket(ca, iter->pos.offset);
+ gc = *gc_bucket(ca, iter->pos.offset);
+ percpu_up_read(&c->mark_lock);

/*
* b->data_type doesn't yet include need_discard & need_gc_gen states -
* fix that here:
*/
- type = __alloc_data_type(b->dirty_sectors,
- b->cached_sectors,
- b->stripe,
+ type = __alloc_data_type(gc.dirty_sectors,
+ gc.cached_sectors,
+ gc.stripe,
*old,
- b->data_type);
- if (b->data_type != type) {
- struct bch_dev_usage *u;
-
- preempt_disable();
- u = this_cpu_ptr(ca->usage_gc);
- u->d[b->data_type].buckets--;
- b->data_type = type;
- u->d[b->data_type].buckets++;
- preempt_enable();
- }
+ gc.data_type);

- gc = *b;
- percpu_up_read(&c->mark_lock);
+ if (gc.data_type != type) {
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_dev_data_type,
+ .dev_data_type.dev = ca->dev_idx,
+ .dev_data_type.data_type = type,
+ };
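+ /* dev_data_type counters are { buckets, sectors, fragmented }: */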
+ u64 d[3] = { 1, 0, 0 };
+
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3, true);
+ if (ret)
+ return ret;
+
+ acc.dev_data_type.data_type = gc.data_type;
+ d[0] = -1;
+ ret = bch2_disk_accounting_mod(trans, &acc, d, 3, true);
+ if (ret)
+ return ret;
+
+ gc.data_type = type;
+ }

if (metadata_only &&
gc.data_type != BCH_DATA_sb &&
@@ -1778,10 +1681,12 @@ int bch2_gc(struct bch_fs *c, bool initial, bool metadata_only)
again:
gc_pos_set(c, gc_phase(GC_PHASE_START));

- bch2_mark_superblocks(c);
+ ret = bch2_trans_run(c, bch2_mark_superblocks(trans));
+ bch_err_msg(c, ret, "marking superblocks");
+ if (ret)
+ goto out;

ret = bch2_gc_btrees(c, initial, metadata_only);
-
if (ret)
goto out;

@@ -1823,7 +1728,7 @@ int bch2_gc(struct bch_fs *c, bool initial, bool metadata_only)
ret = bch2_gc_stripes_done(c, metadata_only) ?:
bch2_gc_reflink_done(c, metadata_only) ?:
bch2_gc_alloc_done(c, metadata_only) ?:
- bch2_gc_done(c, initial, metadata_only);
+ bch2_accounting_gc_done(c);

bch2_journal_unblock(&c->journal);
}
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index b005e20039bb..eac9d45bcc8c 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -697,7 +697,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
a->k.version = journal_pos_to_bversion(&trans->journal_res,
(u64 *) entry - (u64 *) trans->journal_entries);
BUG_ON(bversion_zero(a->k.version));
- ret = bch2_accounting_mem_add(trans, accounting_i_to_s_c(a));
+ ret = bch2_accounting_mem_add_locked(trans, accounting_i_to_s_c(a), false);
if (ret)
goto revert_fs_usage;
}
@@ -784,7 +784,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
struct bkey_s_accounting a = bkey_i_to_s_accounting(entry2->start);

bch2_accounting_neg(a);
- bch2_accounting_mem_add(trans, a.c);
+ bch2_accounting_mem_add_locked(trans, a.c, false);
bch2_accounting_neg(a);
}
percpu_up_read(&c->mark_lock);
diff --git a/fs/bcachefs/buckets.c b/fs/bcachefs/buckets.c
index 5e2b9aa93241..506bb580bff4 100644
--- a/fs/bcachefs/buckets.c
+++ b/fs/bcachefs/buckets.c
@@ -95,113 +95,12 @@ void bch2_dev_usage_to_text(struct printbuf *out, struct bch_dev_usage *usage)
}
}

-void bch2_dev_usage_update(struct bch_fs *c, struct bch_dev *ca,
- const struct bch_alloc_v4 *old,
- const struct bch_alloc_v4 *new)
-{
- struct bch_fs_usage *fs_usage;
- struct bch_dev_usage *u;
-
- preempt_disable();
- fs_usage = this_cpu_ptr(c->usage_gc);
-
- if (data_type_is_hidden(old->data_type))
- fs_usage->b.hidden -= ca->mi.bucket_size;
- if (data_type_is_hidden(new->data_type))
- fs_usage->b.hidden += ca->mi.bucket_size;
-
- u = this_cpu_ptr(ca->usage_gc);
-
- u->d[old->data_type].buckets--;
- u->d[new->data_type].buckets++;
-
- u->d[old->data_type].sectors -= bch2_bucket_sectors_dirty(*old);
- u->d[new->data_type].sectors += bch2_bucket_sectors_dirty(*new);
-
- u->d[BCH_DATA_cached].sectors += new->cached_sectors;
- u->d[BCH_DATA_cached].sectors -= old->cached_sectors;
-
- u->d[old->data_type].fragmented -= bch2_bucket_sectors_fragmented(ca, *old);
- u->d[new->data_type].fragmented += bch2_bucket_sectors_fragmented(ca, *new);
-
- preempt_enable();
-}
-
-static inline struct bch_alloc_v4 bucket_m_to_alloc(struct bucket b)
-{
- return (struct bch_alloc_v4) {
- .gen = b.gen,
- .data_type = b.data_type,
- .dirty_sectors = b.dirty_sectors,
- .cached_sectors = b.cached_sectors,
- .stripe = b.stripe,
- };
-}
-
-void bch2_dev_usage_update_m(struct bch_fs *c, struct bch_dev *ca,
- struct bucket *old, struct bucket *new)
-{
- struct bch_alloc_v4 old_a = bucket_m_to_alloc(*old);
- struct bch_alloc_v4 new_a = bucket_m_to_alloc(*new);
-
- bch2_dev_usage_update(c, ca, &old_a, &new_a);
-}
-
-int bch2_update_replicas(struct bch_fs *c, struct bkey_s_c k,
- struct bch_replicas_entry_v1 *r, s64 sectors)
-{
- struct bch_fs_usage *fs_usage;
- int idx, ret = 0;
- struct printbuf buf = PRINTBUF;
-
- percpu_down_read(&c->mark_lock);
-
- idx = bch2_replicas_entry_idx(c, r);
- if (idx < 0 &&
- fsck_err(c, ptr_to_missing_replicas_entry,
- "no replicas entry\n while marking %s",
- (bch2_bkey_val_to_text(&buf, c, k), buf.buf))) {
- percpu_up_read(&c->mark_lock);
- ret = bch2_mark_replicas(c, r);
- percpu_down_read(&c->mark_lock);
-
- if (ret)
- goto err;
- idx = bch2_replicas_entry_idx(c, r);
- }
- if (idx < 0) {
- ret = -1;
- goto err;
- }
-
- preempt_disable();
- fs_usage = this_cpu_ptr(c->usage_gc);
- fs_usage_data_type_to_base(&fs_usage->b, r->data_type, sectors);
- fs_usage->replicas[idx] += sectors;
- preempt_enable();
-err:
-fsck_err:
- percpu_up_read(&c->mark_lock);
- printbuf_exit(&buf);
- return ret;
-}
-
-static inline int update_cached_sectors(struct bch_fs *c,
- struct bkey_s_c k,
- unsigned dev, s64 sectors)
-{
- struct bch_replicas_padded r;
-
- bch2_replicas_entry_cached(&r.e, dev);
-
- return bch2_update_replicas(c, k, &r.e, sectors);
-}
-
-int bch2_mark_metadata_bucket(struct bch_fs *c, struct bch_dev *ca,
+int bch2_mark_metadata_bucket(struct btree_trans *trans, struct bch_dev *ca,
size_t b, enum bch_data_type data_type,
unsigned sectors, struct gc_pos pos,
unsigned flags)
{
+ struct bch_fs *c = trans->c;
struct bucket old, new, *g;
int ret = 0;

@@ -242,12 +141,15 @@ int bch2_mark_metadata_bucket(struct bch_fs *c, struct bch_dev *ca,
g->data_type = data_type;
g->dirty_sectors += sectors;
new = *g;
-err:
bucket_unlock(g);
- if (!ret)
- bch2_dev_usage_update_m(c, ca, &old, &new);
percpu_up_read(&c->mark_lock);
+ ret = bch2_bucket_to_dev_counters(trans, ca, &old, &new, flags);
+out:
return ret;
+err:
+ bucket_unlock(g);
+ percpu_up_read(&c->mark_lock);
+ goto out;
}

int bch2_check_bucket_ref(struct btree_trans *trans,
@@ -496,8 +398,11 @@ static int bch2_trigger_pointer(struct btree_trans *trans,
g->data_type = bucket_data_type;
struct bucket new = *g;
bucket_unlock(g);
- bch2_dev_usage_update_m(c, ca, &old, &new);
percpu_up_read(&c->mark_lock);
+
+ ret = bch2_bucket_to_dev_counters(trans, ca, &old, &new, flags);
+ if (ret)
+ return ret;
}

return 0;
@@ -539,7 +444,7 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
};
bch2_bkey_to_replicas(&acc.replicas, bkey_i_to_s_c(&s->k_i));
acc.replicas.data_type = data_type;
- ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
+ ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, false);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
@@ -548,8 +453,6 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,
if (flags & BTREE_TRIGGER_GC) {
struct bch_fs *c = trans->c;

- BUG_ON(!(flags & BTREE_TRIGGER_GC));
-
struct gc_stripe *m = genradix_ptr_alloc(&c->gc_stripes, p.ec.idx, GFP_KERNEL);
if (!m) {
bch_err(c, "error allocating memory for gc_stripes, idx %llu",
@@ -572,11 +475,16 @@ static int bch2_trigger_stripe_ptr(struct btree_trans *trans,

m->block_sectors[p.ec.block] += sectors;

- struct bch_replicas_padded r = m->r;
+ struct disk_accounting_key acc = {
+ .type = BCH_DISK_ACCOUNTING_replicas,
+ };
+ memcpy(&acc.replicas, &m->r.e, replicas_entry_bytes(&m->r.e));
mutex_unlock(&c->ec_stripes_heap_lock);

- r.e.data_type = data_type;
- bch2_update_replicas(c, k, &r.e, sectors);
+ acc.replicas.data_type = data_type;
+ int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, true);
+ if (ret)
+ return ret;
}

return 0;
@@ -587,7 +495,6 @@ static int __trigger_extent(struct btree_trans *trans,
struct bkey_s_c k, unsigned flags)
{
bool gc = flags & BTREE_TRIGGER_GC;
- struct bch_fs *c = trans->c;
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
@@ -614,11 +521,7 @@ static int __trigger_extent(struct btree_trans *trans,

if (p.ptr.cached) {
if (!stale) {
- ret = !gc
- ? bch2_mod_dev_cached_sectors(trans, p.ptr.dev, disk_sectors)
- : update_cached_sectors(c, k, p.ptr.dev, disk_sectors);
- bch2_fs_fatal_err_on(ret && gc, c, "%s(): no replicas entry while updating cached sectors",
- __func__);
+ ret = bch2_mod_dev_cached_sectors(trans, p.ptr.dev, disk_sectors, gc);
if (ret)
return ret;
}
@@ -640,16 +543,7 @@ static int __trigger_extent(struct btree_trans *trans,
}

if (acc.replicas.nr_devs) {
- ret = !gc
- ? bch2_disk_accounting_mod(trans, &acc, &dirty_sectors, 1)
- : bch2_update_replicas(c, k, &acc.replicas, dirty_sectors);
- if (unlikely(ret && gc)) {
- struct printbuf buf = PRINTBUF;
-
- bch2_bkey_val_to_text(&buf, c, k);
- bch2_fs_fatal_error(c, "%s(): no replicas entry for %s", __func__, buf.buf);
- printbuf_exit(&buf);
- }
+ ret = bch2_disk_accounting_mod(trans, &acc, &dirty_sectors, 1, gc);
if (ret)
return ret;
}
@@ -699,36 +593,18 @@ static int __trigger_reservation(struct btree_trans *trans,
enum btree_id btree_id, unsigned level,
struct bkey_s_c k, unsigned flags)
{
- struct bch_fs *c = trans->c;
- unsigned replicas = bkey_s_c_to_reservation(k).v->nr_replicas;
- s64 sectors = (s64) k.k->size;
+ if (flags & (BTREE_TRIGGER_TRANSACTIONAL|BTREE_TRIGGER_GC)) {
+ s64 sectors = k.k->size;

- if (flags & BTREE_TRIGGER_OVERWRITE)
- sectors = -sectors;
+ if (flags & BTREE_TRIGGER_OVERWRITE)
+ sectors = -sectors;

- if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
struct disk_accounting_key acc = {
.type = BCH_DISK_ACCOUNTING_persistent_reserved,
- .persistent_reserved.nr_replicas = replicas,
+ .persistent_reserved.nr_replicas = bkey_s_c_to_reservation(k).v->nr_replicas,
};

- return bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
- }
-
- if (flags & BTREE_TRIGGER_GC) {
- sectors *= replicas;
-
- percpu_down_read(&c->mark_lock);
- preempt_disable();
-
- struct bch_fs_usage *fs_usage = this_cpu_ptr(c->usage_gc);
-
- replicas = min(replicas, ARRAY_SIZE(fs_usage->persistent_reserved));
- fs_usage->b.reserved += sectors;
- fs_usage->persistent_reserved[replicas - 1] += sectors;
-
- preempt_enable();
- percpu_up_read(&c->mark_lock);
+ return bch2_disk_accounting_mod(trans, &acc, &sectors, 1, flags & BTREE_TRIGGER_GC);
}

return 0;
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index f9d8d7b9fbd1..7b8b10f74be0 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -269,16 +269,6 @@ static inline s64 bucket_sectors_fragmented(struct bch_dev *ca, struct bch_alloc

/* Filesystem usage: */

-static inline unsigned __fs_usage_u64s(unsigned nr_replicas)
-{
- return sizeof(struct bch_fs_usage) / sizeof(u64) + nr_replicas;
-}
-
-static inline unsigned fs_usage_u64s(struct bch_fs *c)
-{
- return __fs_usage_u64s(READ_ONCE(c->replicas.nr));
-}
-
static inline unsigned dev_usage_u64s(void)
{
return sizeof(struct bch_dev_usage) / sizeof(u64);
@@ -287,19 +277,11 @@ static inline unsigned dev_usage_u64s(void)
struct bch_fs_usage_short
bch2_fs_usage_read_short(struct bch_fs *);

-void bch2_dev_usage_update(struct bch_fs *, struct bch_dev *,
- const struct bch_alloc_v4 *,
- const struct bch_alloc_v4 *);
-void bch2_dev_usage_update_m(struct bch_fs *, struct bch_dev *,
- struct bucket *, struct bucket *);
-int bch2_update_replicas(struct bch_fs *, struct bkey_s_c,
- struct bch_replicas_entry_v1 *, s64);
-
int bch2_check_bucket_ref(struct btree_trans *, struct bkey_s_c,
const struct bch_extent_ptr *,
s64, enum bch_data_type, u8, u8, u32);

-int bch2_mark_metadata_bucket(struct bch_fs *, struct bch_dev *,
+int bch2_mark_metadata_bucket(struct btree_trans *, struct bch_dev *,
size_t, enum bch_data_type, unsigned,
struct gc_pos, unsigned);

diff --git a/fs/bcachefs/buckets_types.h b/fs/bcachefs/buckets_types.h
index 570acdf455bb..7bd1a117afe4 100644
--- a/fs/bcachefs/buckets_types.h
+++ b/fs/bcachefs/buckets_types.h
@@ -54,13 +54,6 @@ struct bch_fs_usage_base {
u64 nr_inodes;
};

-struct bch_fs_usage {
- /* all fields are in units of 512 byte sectors: */
- struct bch_fs_usage_base b;
- u64 persistent_reserved[BCH_REPLICAS_MAX];
- u64 replicas[];
-};
-
struct bch_fs_usage_short {
u64 capacity;
u64 used;
diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index f898323f72c7..2884615adc1e 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -19,7 +19,7 @@ static const char * const disk_accounting_type_strs[] = {

int bch2_disk_accounting_mod(struct btree_trans *trans,
struct disk_accounting_key *k,
- s64 *d, unsigned nr)
+ s64 *d, unsigned nr, bool gc)
{
/* Normalize: */
switch (k->type) {
@@ -40,11 +40,14 @@ int bch2_disk_accounting_mod(struct btree_trans *trans,

memcpy_u64s_small(acc->v.d, d, nr);

- return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
+ return likely(!gc)
+ ? bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i)
+ : bch2_accounting_mem_add(trans, accounting_i_to_s_c(acc), true);
}

int bch2_mod_dev_cached_sectors(struct btree_trans *trans,
- unsigned dev, s64 sectors)
+ unsigned dev, s64 sectors,
+ bool gc)
{
struct disk_accounting_key acc = {
.type = BCH_DISK_ACCOUNTING_replicas,
@@ -52,7 +55,7 @@ int bch2_mod_dev_cached_sectors(struct btree_trans *trans,

bch2_replicas_entry_cached(&acc.replicas, dev);

- return bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
+ return bch2_disk_accounting_mod(trans, &acc, &sectors, 1, gc);
}

int bch2_accounting_invalid(struct bch_fs *c, struct bkey_s_c k,
@@ -147,7 +150,7 @@ int bch2_accounting_update_sb(struct btree_trans *trans)
return 0;
}

-static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a, bool gc)
{
struct bch_replicas_padded r;

@@ -155,7 +158,7 @@ static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_
!bch2_replicas_marked_locked(c, &r.e))
return -BCH_ERR_btree_insert_need_mark_replicas;

- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[gc];
unsigned new_nr_counters = acc->nr_counters + bch2_accounting_counters(a.k);

u64 __percpu *new_counters = __alloc_percpu_gfp(new_nr_counters * sizeof(u64),
@@ -191,19 +194,64 @@ static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_
return 0;
}

-int bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+int bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a, bool gc)
{
percpu_up_read(&c->mark_lock);
percpu_down_write(&c->mark_lock);
- int ret = __bch2_accounting_mem_add_slowpath(c, a);
+ int ret = __bch2_accounting_mem_add_slowpath(c, a, gc);
percpu_up_write(&c->mark_lock);
percpu_down_read(&c->mark_lock);
return ret;
}

+/* Ensures all counters in @src exist in @dst: */
+static int copy_counters(struct bch_accounting_mem *dst,
+ struct bch_accounting_mem *src)
+{
+ unsigned orig_dst_k_nr = dst->k.nr;
+ unsigned dst_counters = dst->nr_counters;
+
+ darray_for_each(src->k, i)
+ if (eytzinger0_find(dst->k.data, orig_dst_k_nr, sizeof(dst->k.data[0]),
+ accounting_pos_cmp, &i->pos) >= orig_dst_k_nr) {
+ if (darray_push(&dst->k, ((struct accounting_pos_offset) {
+ .pos = i->pos,
+ .offset = dst_counters,
+ .nr_counters = i->nr_counters })))
+ goto err;
+
+ dst_counters += i->nr_counters;
+ }
+
+ if (dst->k.nr == orig_dst_k_nr)
+ return 0;
+
+ u64 __percpu *new_counters = __alloc_percpu_gfp(dst_counters * sizeof(u64),
+ sizeof(u64), GFP_KERNEL);
+ if (!new_counters)
+ goto err;
+
+ preempt_disable();
+ memcpy(this_cpu_ptr(new_counters),
+ bch2_acc_percpu_u64s(dst->v, dst->nr_counters),
+ dst->nr_counters * sizeof(u64));
+ preempt_enable();
+
+ free_percpu(dst->v);
+ dst->v = new_counters;
+ dst->nr_counters = dst_counters;
+
+ eytzinger0_sort(dst->k.data, dst->k.nr, sizeof(dst->k.data[0]), accounting_pos_cmp, NULL);
+
+ return 0;
+err:
+ dst->k.nr = orig_dst_k_nr;
+ return -BCH_ERR_ENOMEM_disk_accounting;
+}
+
int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
{
- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[0];
int ret = 0;

darray_init(usage);
@@ -234,6 +282,85 @@ int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
return ret;
}

+static int accounting_write_key(struct btree_trans *trans, struct bpos pos, u64 *v, unsigned nr_counters)
+{
+ struct bkey_i_accounting *a = bch2_trans_kmalloc(trans, sizeof(*a) + sizeof(*v) * nr_counters);
+ int ret = PTR_ERR_OR_ZERO(a);
+ if (ret)
+ return ret;
+
+ bkey_accounting_init(&a->k_i);
+ a->k.p = pos;
+ set_bkey_val_bytes(&a->k, sizeof(a->v) + sizeof(*v) * nr_counters);
+ memcpy(a->v.d, v, sizeof(*v) * nr_counters);
+
+ return bch2_btree_insert_trans(trans, BTREE_ID_accounting, &a->k_i, 0);
+}
+
+int bch2_accounting_gc_done(struct bch_fs *c)
+{
+ struct bch_accounting_mem *dst = &c->accounting[0];
+ struct bch_accounting_mem *src = &c->accounting[1];
+ struct btree_trans *trans = bch2_trans_get(c);
+ struct printbuf buf = PRINTBUF;
+ int ret = 0;
+
+ percpu_down_write(&c->mark_lock);
+
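+ /* make sure both tables contain the same set of counters, in the same order: */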
+ ret = copy_counters(dst, src) ?:
+ copy_counters(src, dst);
+ if (ret)
+ goto err;
+
+ BUG_ON(dst->k.nr != src->k.nr);
+
+ for (unsigned i = 0; i < src->k.nr; i++) {
+ BUG_ON(src->k.data[i].nr_counters != dst->k.data[i].nr_counters);
+ BUG_ON(!bpos_eq(dst->k.data[i].pos, src->k.data[i].pos));
+
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, src->k.data[i].pos);
+
+ unsigned nr = src->k.data[i].nr_counters;
+ u64 src_v[BCH_ACCOUNTING_MAX_COUNTERS];
+ u64 dst_v[BCH_ACCOUNTING_MAX_COUNTERS];
+
+ bch2_accounting_mem_read_counters(c, i, dst_v, nr, false);
+ bch2_accounting_mem_read_counters(c, i, src_v, nr, true);
+
+ if (memcmp(dst_v, src_v, nr * sizeof(u64))) {
+ printbuf_reset(&buf);
+ prt_str(&buf, "accounting mismatch for ");
+ bch2_accounting_key_to_text(&buf, &acc_k);
+
+ prt_str(&buf, ": got");
+ for (unsigned j = 0; j < nr; j++)
+ prt_printf(&buf, " %llu", dst_v[j]);
+
+ prt_str(&buf, " should be");
+ for (unsigned j = 0; j < nr; j++)
+ prt_printf(&buf, " %llu", src_v[j]);
+
+ if (fsck_err(c, accounting_mismatch, "%s", buf.buf)) {
+ for (unsigned j = 0; j < dst->k.data[i].nr_counters; j++)
+ percpu_u64_set(dst->v + dst->k.data[i].offset + j, src_v[j]);
+
+ ret = commit_do(trans, NULL, NULL, 0,
+ accounting_write_key(trans, src->k.data[i].pos, src_v, nr));
+ if (ret)
+ goto err;
+ }
+ }
+ }
+err:
+fsck_err:
+ percpu_up_write(&c->mark_lock);
+ printbuf_exit(&buf);
+ bch2_trans_put(trans);
+ bch_err_fn(c, ret);
+ return ret;
+}
+
static bool accounting_key_is_zero(struct bkey_s_c_accounting a)
{

@@ -251,7 +378,7 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
return 0;

percpu_down_read(&c->mark_lock);
- int ret = __bch2_accounting_mem_add(c, bkey_s_c_to_accounting(k));
+ int ret = __bch2_accounting_mem_add(c, bkey_s_c_to_accounting(k), false);
percpu_up_read(&c->mark_lock);

if (accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
@@ -274,7 +401,7 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)

int bch2_accounting_read(struct bch_fs *c)
{
- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[0];

int ret = bch2_trans_run(c,
for_each_btree_key(trans, iter,
@@ -321,7 +448,7 @@ int bch2_accounting_read(struct bch_fs *c)
bpos_to_disk_accounting_key(&k, acc->k.data[i].pos);

u64 v[BCH_ACCOUNTING_MAX_COUNTERS];
- bch2_accounting_mem_read_counters(c, i, v, ARRAY_SIZE(v));
+ bch2_accounting_mem_read_counters(c, i, v, ARRAY_SIZE(v), false);

switch (k.type) {
case BCH_DISK_ACCOUNTING_persistent_reserved:
@@ -370,8 +497,9 @@ int bch2_dev_usage_remove(struct bch_fs *c, unsigned dev)
bch2_btree_write_buffer_flush_sync(trans));
}

-int bch2_dev_usage_init(struct bch_dev *ca)
+int bch2_dev_usage_init(struct bch_dev *ca, bool gc)
{
+ struct bch_fs *c = ca->fs;
struct disk_accounting_key acc = {
.type = BCH_DISK_ACCOUNTING_dev_data_type,
.dev_data_type.dev = ca->dev_idx,
@@ -379,14 +507,21 @@ int bch2_dev_usage_init(struct bch_dev *ca)
};
u64 v[3] = { ca->mi.nbuckets - ca->mi.first_bucket, 0, 0 };

- return bch2_trans_do(ca->fs, NULL, NULL, 0,
- bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
+ int ret = bch2_trans_do(c, NULL, NULL, 0,
+ bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v), gc));
+ bch_err_fn(c, ret);
+ return ret;
}

-void bch2_fs_accounting_exit(struct bch_fs *c)
+void bch2_accounting_free(struct bch_accounting_mem *acc)
{
- struct bch_accounting_mem *acc = &c->accounting;
-
darray_exit(&acc->k);
free_percpu(acc->v);
+ acc->v = NULL;
+ acc->nr_counters = 0;
+}
+
+void bch2_fs_accounting_exit(struct bch_fs *c)
+{
+ bch2_accounting_free(&c->accounting[0]);
}
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index a8526bf43207..70ac67f4a3cb 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -78,11 +78,9 @@ static inline struct bpos disk_accounting_key_to_bpos(struct disk_accounting_key
return ret;
}

-int bch2_disk_accounting_mod(struct btree_trans *,
- struct disk_accounting_key *,
- s64 *, unsigned);
-int bch2_mod_dev_cached_sectors(struct btree_trans *trans,
- unsigned dev, s64 sectors);
+int bch2_disk_accounting_mod(struct btree_trans *, struct disk_accounting_key *,
+ s64 *, unsigned, bool);
+int bch2_mod_dev_cached_sectors(struct btree_trans *, unsigned, s64, bool);

int bch2_accounting_invalid(struct bch_fs *, struct bkey_s_c,
enum bkey_invalid_flags, struct printbuf *);
@@ -106,15 +104,15 @@ static inline int accounting_pos_cmp(const void *_l, const void *_r)
return bpos_cmp(*l, *r);
}

-int bch2_accounting_mem_add_slowpath(struct bch_fs *, struct bkey_s_c_accounting);
+int bch2_accounting_mem_add_slowpath(struct bch_fs *, struct bkey_s_c_accounting, bool);

-static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_accounting a)
+static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_accounting a, bool gc)
{
- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[gc];
unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
accounting_pos_cmp, &a.k->p);
if (unlikely(idx >= acc->k.nr))
- return bch2_accounting_mem_add_slowpath(c, a);
+ return bch2_accounting_mem_add_slowpath(c, a, gc);

unsigned offset = acc->k.data[idx].offset;

@@ -125,37 +123,48 @@ static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_ac
return 0;
}

-static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a)
+static inline int bch2_accounting_mem_add_locked(struct btree_trans *trans, struct bkey_s_c_accounting a, bool gc)
{
struct bch_fs *c = trans->c;
- struct disk_accounting_key acc_k;
- bpos_to_disk_accounting_key(&acc_k, a.k->p);

- switch (acc_k.type) {
- case BCH_DISK_ACCOUNTING_persistent_reserved:
- trans->fs_usage_delta.reserved += acc_k.persistent_reserved.nr_replicas * a.v->d[0];
- break;
- case BCH_DISK_ACCOUNTING_replicas:
- fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
- break;
- case BCH_DISK_ACCOUNTING_dev_data_type: {
- struct bch_dev *ca = bch_dev_bkey_exists(c, acc_k.dev_data_type.dev);
-
- this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].buckets, a.v->d[0]);
- this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].sectors, a.v->d[1]);
- this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].fragmented, a.v->d[2]);
+ if (!gc) {
+ struct disk_accounting_key acc_k;
+ bpos_to_disk_accounting_key(&acc_k, a.k->p);
+
+ switch (acc_k.type) {
+ case BCH_DISK_ACCOUNTING_persistent_reserved:
+ trans->fs_usage_delta.reserved += acc_k.persistent_reserved.nr_replicas * a.v->d[0];
+ break;
+ case BCH_DISK_ACCOUNTING_replicas:
+ fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
+ break;
+ case BCH_DISK_ACCOUNTING_dev_data_type: {
+ struct bch_dev *ca = bch_dev_bkey_exists(c, acc_k.dev_data_type.dev);
+
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].buckets, a.v->d[0]);
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].sectors, a.v->d[1]);
+ this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].fragmented, a.v->d[2]);
+ }
+ }
}
- }
- return __bch2_accounting_mem_add(c, a);
+
+ return __bch2_accounting_mem_add(c, a, gc);
}

-static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
- unsigned idx,
- u64 *v, unsigned nr)
+static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a, bool gc)
+{
+ percpu_down_read(&trans->c->mark_lock);
+ int ret = bch2_accounting_mem_add_locked(trans, a, gc);
+ percpu_up_read(&trans->c->mark_lock);
+ return ret;
+}
+
+static inline void bch2_accounting_mem_read_counters(struct bch_fs *c, unsigned idx,
+ u64 *v, unsigned nr, bool gc)
{
memset(v, 0, sizeof(*v) * nr);

- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[0];
if (unlikely(idx >= acc->k.nr))
return;

@@ -169,19 +178,23 @@ static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
static inline void bch2_accounting_mem_read(struct bch_fs *c, struct bpos p,
u64 *v, unsigned nr)
{
- struct bch_accounting_mem *acc = &c->accounting;
+ struct bch_accounting_mem *acc = &c->accounting[0];
unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
accounting_pos_cmp, &p);

- bch2_accounting_mem_read_counters(c, idx, v, nr);
+ bch2_accounting_mem_read_counters(c, idx, v, nr, false);
}

int bch2_fs_replicas_usage_read(struct bch_fs *, darray_char *);

+int bch2_accounting_gc_done(struct bch_fs *);
+
int bch2_accounting_read(struct bch_fs *);

int bch2_dev_usage_remove(struct bch_fs *, unsigned);
-int bch2_dev_usage_init(struct bch_dev *);
+int bch2_dev_usage_init(struct bch_dev *, bool);
+
+void bch2_accounting_free(struct bch_accounting_mem *);
void bch2_fs_accounting_exit(struct bch_fs *);

#endif /* _BCACHEFS_DISK_ACCOUNTING_H */
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 38e5e882f4a4..bd435d385559 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -238,10 +238,8 @@ static int bch2_trans_mark_stripe_bucket(struct btree_trans *trans,
return ret;
}

-static int mark_stripe_bucket(struct btree_trans *trans,
- struct bkey_s_c k,
- unsigned ptr_idx,
- unsigned flags)
+static int mark_stripe_bucket(struct btree_trans *trans, struct bkey_s_c k,
+ unsigned ptr_idx, unsigned flags)
{
struct bch_fs *c = trans->c;
const struct bch_stripe *s = bkey_s_c_to_stripe(k).v;
@@ -287,13 +285,16 @@ static int mark_stripe_bucket(struct btree_trans *trans,
g->stripe = k.k->p.offset;
g->stripe_redundancy = s->nr_redundant;
new = *g;
-err:
bucket_unlock(g);
- if (!ret)
- bch2_dev_usage_update_m(c, ca, &old, &new);
percpu_up_read(&c->mark_lock);
+ ret = bch2_bucket_to_dev_counters(trans, ca, &old, &new, flags);
+out:
printbuf_exit(&buf);
return ret;
+err:
+ bucket_unlock(g);
+ percpu_up_read(&c->mark_lock);
+ goto out;
}

int bch2_trigger_stripe(struct btree_trans *trans,
@@ -309,7 +310,12 @@ int bch2_trigger_stripe(struct btree_trans *trans,
const struct bch_stripe *new_s = new.k->type == KEY_TYPE_stripe
? bkey_s_c_to_stripe(new).v : NULL;

- if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
+ BUG_ON(new_s && old_s &&
+ (new_s->nr_blocks != old_s->nr_blocks ||
+ new_s->nr_redundant != old_s->nr_redundant));
+
+ if (flags & (BTREE_TRIGGER_TRANSACTIONAL|BTREE_TRIGGER_GC)) {
/*
* If the pointers aren't changing, we don't need to do anything:
*/
@@ -320,9 +326,34 @@ int bch2_trigger_stripe(struct btree_trans *trans,
new_s->nr_blocks * sizeof(struct bch_extent_ptr)))
return 0;

- BUG_ON(new_s && old_s &&
- (new_s->nr_blocks != old_s->nr_blocks ||
- new_s->nr_redundant != old_s->nr_redundant));
+ struct gc_stripe *gc = NULL;
+ if (flags & BTREE_TRIGGER_GC) {
+ gc = genradix_ptr_alloc(&c->gc_stripes, idx, GFP_KERNEL);
+ if (!gc) {
+ bch_err(c, "error allocating memory for gc_stripes, idx %llu", idx);
+ return -BCH_ERR_ENOMEM_mark_stripe;
+ }
+
+ /*
+ * This will be wrong when we bring back runtime gc: we should
+ * be unmarking the old key and then marking the new key
+ *
+ * Also: when we bring back runtime gc, locking
+ */
+ gc->alive = true;
+ gc->sectors = le16_to_cpu(new_s->sectors);
+ gc->nr_blocks = new_s->nr_blocks;
+ gc->nr_redundant = new_s->nr_redundant;
+
+ for (unsigned i = 0; i < new_s->nr_blocks; i++)
+ gc->ptrs[i] = new_s->ptrs[i];
+
+ /*
+ * gc recalculates this field from stripe ptr
+ * references:
+ */
+ memset(gc->block_sectors, 0, sizeof(gc->block_sectors));
+ }

if (new_s) {
s64 sectors = (u64) le16_to_cpu(new_s->sectors) * new_s->nr_redundant;
@@ -331,9 +362,12 @@ int bch2_trigger_stripe(struct btree_trans *trans,
.type = BCH_DISK_ACCOUNTING_replicas,
};
bch2_bkey_to_replicas(&acc.replicas, new);
- int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
+ int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, gc);
if (ret)
return ret;
+
+ if (gc)
+ memcpy(&gc->r.e, &acc.replicas, replicas_entry_bytes(&acc.replicas));
}

if (old_s) {
@@ -343,29 +377,42 @@ int bch2_trigger_stripe(struct btree_trans *trans,
.type = BCH_DISK_ACCOUNTING_replicas,
};
bch2_bkey_to_replicas(&acc.replicas, old);
- int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1);
+ int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, gc);
if (ret)
return ret;
}

unsigned nr_blocks = new_s ? new_s->nr_blocks : old_s->nr_blocks;
- for (unsigned i = 0; i < nr_blocks; i++) {
- if (new_s && old_s &&
- !memcmp(&new_s->ptrs[i],
- &old_s->ptrs[i],
- sizeof(new_s->ptrs[i])))
- continue;

- if (new_s) {
- int ret = bch2_trans_mark_stripe_bucket(trans,
- bkey_s_c_to_stripe(new), i, false);
- if (ret)
- return ret;
+ if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
+ for (unsigned i = 0; i < nr_blocks; i++) {
+ if (new_s && old_s &&
+ !memcmp(&new_s->ptrs[i],
+ &old_s->ptrs[i],
+ sizeof(new_s->ptrs[i])))
+ continue;
+
+ if (new_s) {
+ int ret = bch2_trans_mark_stripe_bucket(trans,
+ bkey_s_c_to_stripe(new), i, false);
+ if (ret)
+ return ret;
+ }
+
+ if (old_s) {
+ int ret = bch2_trans_mark_stripe_bucket(trans,
+ bkey_s_c_to_stripe(old), i, true);
+ if (ret)
+ return ret;
+ }
}
+ }

- if (old_s) {
- int ret = bch2_trans_mark_stripe_bucket(trans,
- bkey_s_c_to_stripe(old), i, true);
+ if (flags & BTREE_TRIGGER_GC) {
+ BUG_ON(old_s);
+
+ for (unsigned i = 0; i < nr_blocks; i++) {
+ int ret = mark_stripe_bucket(trans, new, i, flags);
if (ret)
return ret;
}
@@ -411,53 +458,6 @@ int bch2_trigger_stripe(struct btree_trans *trans,
}
}

- if (flags & BTREE_TRIGGER_GC) {
- struct gc_stripe *m =
- genradix_ptr_alloc(&c->gc_stripes, idx, GFP_KERNEL);
-
- if (!m) {
- bch_err(c, "error allocating memory for gc_stripes, idx %llu",
- idx);
- return -BCH_ERR_ENOMEM_mark_stripe;
- }
- /*
- * This will be wrong when we bring back runtime gc: we should
- * be unmarking the old key and then marking the new key
- */
- m->alive = true;
- m->sectors = le16_to_cpu(new_s->sectors);
- m->nr_blocks = new_s->nr_blocks;
- m->nr_redundant = new_s->nr_redundant;
-
- for (unsigned i = 0; i < new_s->nr_blocks; i++)
- m->ptrs[i] = new_s->ptrs[i];
-
- bch2_bkey_to_replicas(&m->r.e, new);
-
- /*
- * gc recalculates this field from stripe ptr
- * references:
- */
- memset(m->block_sectors, 0, sizeof(m->block_sectors));
-
- for (unsigned i = 0; i < new_s->nr_blocks; i++) {
- int ret = mark_stripe_bucket(trans, new, i, flags);
- if (ret)
- return ret;
- }
-
- int ret = bch2_update_replicas(c, new, &m->r.e,
- ((s64) m->sectors * m->nr_redundant));
- if (ret) {
- struct printbuf buf = PRINTBUF;
-
- bch2_bkey_val_to_text(&buf, c, new);
- bch2_fs_fatal_error(c, "no replicas entry for %s", buf.buf);
- printbuf_exit(&buf);
- return ret;
- }
- }
-
return 0;
}

diff --git a/fs/bcachefs/inode.c b/fs/bcachefs/inode.c
index 3dfa9f77c739..e8f128d6b703 100644
--- a/fs/bcachefs/inode.c
+++ b/fs/bcachefs/inode.c
@@ -607,41 +607,26 @@ int bch2_trigger_inode(struct btree_trans *trans,
struct bkey_s new,
unsigned flags)
{
- s64 nr = bkey_is_inode(new.k) - bkey_is_inode(old.k);
-
- if (flags & BTREE_TRIGGER_TRANSACTIONAL) {
- if (nr) {
- struct disk_accounting_key acc = {
- .type = BCH_DISK_ACCOUNTING_nr_inodes
- };
-
- int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1);
- if (ret)
- return ret;
- }
-
- bool old_deleted = bkey_is_deleted_inode(old);
- bool new_deleted = bkey_is_deleted_inode(new.s_c);
- if (old_deleted != new_deleted) {
- int ret = bch2_btree_bit_mod_buffered(trans, BTREE_ID_deleted_inodes,
- new.k->p, new_deleted);
- if (ret)
- return ret;
- }
- }
-
if ((flags & BTREE_TRIGGER_ATOMIC) && (flags & BTREE_TRIGGER_INSERT)) {
BUG_ON(!trans->journal_res.seq);
-
bkey_s_to_inode_v3(new).v->bi_journal_seq = cpu_to_le64(trans->journal_res.seq);
}

- if (flags & BTREE_TRIGGER_GC) {
- struct bch_fs *c = trans->c;
+ s64 nr = bkey_is_inode(new.k) - bkey_is_inode(old.k);
+ if ((flags & (BTREE_TRIGGER_TRANSACTIONAL|BTREE_TRIGGER_GC)) && nr) {
+ struct disk_accounting_key acc = { .type = BCH_DISK_ACCOUNTING_nr_inodes };
+ int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1, flags & BTREE_TRIGGER_GC);
+ if (ret)
+ return ret;
+ }

- percpu_down_read(&c->mark_lock);
- this_cpu_add(c->usage_gc->b.nr_inodes, nr);
- percpu_up_read(&c->mark_lock);
+ int deleted_delta = (int) bkey_is_deleted_inode(new.s_c) -
+ (int) bkey_is_deleted_inode(old);
+ if ((flags & BTREE_TRIGGER_TRANSACTIONAL) && deleted_delta) {
+ int ret = bch2_btree_bit_mod_buffered(trans, BTREE_ID_deleted_inodes,
+ new.k->p, deleted_delta > 0);
+ if (ret)
+ return ret;
}

return 0;
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 18fd71960d2e..6a8b2c753688 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -1177,8 +1177,7 @@ int bch2_fs_initialize(struct bch_fs *c)
goto err;

for_each_member_device(c, ca) {
- ret = bch2_dev_usage_init(ca);
- bch_err_msg(c, ret, "initializing device usage");
+ ret = bch2_dev_usage_init(ca, false);
if (ret) {
percpu_ref_put(&ca->ref);
goto err;
diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index 427dc6711427..cba5ba44cfd8 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -275,73 +275,6 @@ bool bch2_replicas_marked(struct bch_fs *c,
return ret;
}

-static void __replicas_table_update(struct bch_fs_usage *dst,
- struct bch_replicas_cpu *dst_r,
- struct bch_fs_usage *src,
- struct bch_replicas_cpu *src_r)
-{
- int src_idx, dst_idx;
-
- *dst = *src;
-
- for (src_idx = 0; src_idx < src_r->nr; src_idx++) {
- if (!src->replicas[src_idx])
- continue;
-
- dst_idx = __replicas_entry_idx(dst_r,
- cpu_replicas_entry(src_r, src_idx));
- BUG_ON(dst_idx < 0);
-
- dst->replicas[dst_idx] = src->replicas[src_idx];
- }
-}
-
-static void __replicas_table_update_pcpu(struct bch_fs_usage __percpu *dst_p,
- struct bch_replicas_cpu *dst_r,
- struct bch_fs_usage __percpu *src_p,
- struct bch_replicas_cpu *src_r)
-{
- unsigned src_nr = sizeof(struct bch_fs_usage) / sizeof(u64) + src_r->nr;
- struct bch_fs_usage *dst, *src = (void *)
- bch2_acc_percpu_u64s((u64 __percpu *) src_p, src_nr);
-
- preempt_disable();
- dst = this_cpu_ptr(dst_p);
- preempt_enable();
-
- __replicas_table_update(dst, dst_r, src, src_r);
-}
-
-/*
- * Resize filesystem accounting:
- */
-static int replicas_table_update(struct bch_fs *c,
- struct bch_replicas_cpu *new_r)
-{
- struct bch_fs_usage __percpu *new_gc = NULL;
- unsigned bytes = sizeof(struct bch_fs_usage) +
- sizeof(u64) * new_r->nr;
- int ret = 0;
-
- if ((c->usage_gc &&
- !(new_gc = __alloc_percpu_gfp(bytes, sizeof(u64), GFP_KERNEL))))
- goto err;
-
- if (c->usage_gc)
- __replicas_table_update_pcpu(new_gc, new_r,
- c->usage_gc, &c->replicas);
-
- swap(c->usage_gc, new_gc);
- swap(c->replicas, *new_r);
-out:
- free_percpu(new_gc);
- return ret;
-err:
- bch_err(c, "error updating replicas table: memory allocation failure");
- ret = -BCH_ERR_ENOMEM_replicas_table;
- goto out;
-}
-
noinline
static int bch2_mark_replicas_slowpath(struct bch_fs *c,
struct bch_replicas_entry_v1 *new_entry)
@@ -389,7 +322,7 @@ static int bch2_mark_replicas_slowpath(struct bch_fs *c,
/* don't update in memory replicas until changes are persistent */
percpu_down_write(&c->mark_lock);
if (new_r.entries)
- ret = replicas_table_update(c, &new_r);
+ swap(c->replicas, new_r);
if (new_gc.entries)
swap(new_gc, c->replicas_gc);
percpu_up_write(&c->mark_lock);
@@ -424,8 +357,9 @@ int bch2_replicas_gc_end(struct bch_fs *c, int ret)
percpu_down_write(&c->mark_lock);

ret = ret ?:
- bch2_cpu_replicas_to_sb_replicas(c, &c->replicas_gc) ?:
- replicas_table_update(c, &c->replicas_gc);
+ bch2_cpu_replicas_to_sb_replicas(c, &c->replicas_gc);
+ if (!ret)
+ swap(c->replicas, c->replicas_gc);

kfree(c->replicas_gc.entries);
c->replicas_gc.entries = NULL;
@@ -635,8 +569,7 @@ int bch2_sb_replicas_to_cpu_replicas(struct bch_fs *c)
bch2_cpu_replicas_sort(&new_r);

percpu_down_write(&c->mark_lock);
-
- ret = replicas_table_update(c, &new_r);
+ swap(c->replicas, new_r);
percpu_up_write(&c->mark_lock);

kfree(new_r.entries);
@@ -927,10 +860,8 @@ unsigned bch2_sb_dev_has_data(struct bch_sb *sb, unsigned dev)

unsigned bch2_dev_has_data(struct bch_fs *c, struct bch_dev *ca)
{
- unsigned ret;
-
mutex_lock(&c->sb_lock);
- ret = bch2_sb_dev_has_data(c->disk_sb.sb, ca->dev_idx);
+ unsigned ret = bch2_sb_dev_has_data(c->disk_sb.sb, ca->dev_idx);
mutex_unlock(&c->sb_lock);

return ret;
@@ -941,8 +872,3 @@ void bch2_fs_replicas_exit(struct bch_fs *c)
kfree(c->replicas.entries);
kfree(c->replicas_gc.entries);
}
-
-int bch2_fs_replicas_init(struct bch_fs *c)
-{
- return replicas_table_update(c, &c->replicas);
-}
diff --git a/fs/bcachefs/replicas.h b/fs/bcachefs/replicas.h
index eac2dff20423..ab2d00e4865c 100644
--- a/fs/bcachefs/replicas.h
+++ b/fs/bcachefs/replicas.h
@@ -80,6 +80,5 @@ extern const struct bch_sb_field_ops bch_sb_field_ops_replicas;
extern const struct bch_sb_field_ops bch_sb_field_ops_replicas_v0;

void bch2_fs_replicas_exit(struct bch_fs *);
-int bch2_fs_replicas_init(struct bch_fs *);

#endif /* _BCACHEFS_REPLICAS_H */
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 89c481831608..6617c8912e51 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -894,7 +894,6 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
bch2_io_clock_init(&c->io_clock[READ]) ?:
bch2_io_clock_init(&c->io_clock[WRITE]) ?:
bch2_fs_journal_init(&c->journal) ?:
- bch2_fs_replicas_init(c) ?:
bch2_fs_btree_cache_init(c) ?:
bch2_fs_btree_key_cache_init(&c->btree_key_cache) ?:
bch2_fs_btree_iter_init(c) ?:
@@ -1772,7 +1771,7 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
bch2_write_super(c);
mutex_unlock(&c->sb_lock);

- ret = bch2_dev_usage_init(ca);
+ ret = bch2_dev_usage_init(ca, false);
if (ret)
goto err_late;

@@ -1946,9 +1945,9 @@ int bch2_dev_resize(struct bch_fs *c, struct bch_dev *ca, u64 nbuckets)
};
u64 v[3] = { nbuckets - old_nbuckets, 0, 0 };

- ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets) ?:
- bch2_trans_do(ca->fs, NULL, NULL, 0,
- bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
+ ret = bch2_trans_do(ca->fs, NULL, NULL, 0,
+ bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v), false)) ?:
+ bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets);
if (ret)
goto err;
}
--
2.43.0


2024-02-25 02:43:02

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 19/21] bcachefs: Convert bch2_compression_stats_to_text() to new accounting

We no longer have to walk all the extent btrees to calculate compression
stats: they're now maintained as accounting counters and can be read
straight from in-memory accounting.
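
E.g., reading the counters for one compression type is now just an
in-memory lookup - a sketch (not verbatim from the diff below; counter
order as in bch2_compression_stats_to_text()):

  struct disk_accounting_key a = {
          .type             = BCH_DISK_ACCOUNTING_compression,
          .compression.type = BCH_COMPRESSION_TYPE_lz4,
  };
  u64 v[3]; /* nr_extents, sectors_uncompressed, sectors_compressed */

  bch2_accounting_mem_read(c, disk_accounting_key_to_bpos(&a), v, ARRAY_SIZE(v));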

Signed-off-by: Kent Overstreet <[email protected]>
---
fs/bcachefs/sysfs.c | 85 ++++++++++-----------------------------------
1 file changed, 18 insertions(+), 67 deletions(-)

diff --git a/fs/bcachefs/sysfs.c b/fs/bcachefs/sysfs.c
index c86a93a8d8fc..287a0bf920db 100644
--- a/fs/bcachefs/sysfs.c
+++ b/fs/bcachefs/sysfs.c
@@ -22,6 +22,7 @@
#include "buckets.h"
#include "clock.h"
#include "compress.h"
+#include "disk_accounting.h"
#include "disk_groups.h"
#include "ec.h"
#include "inode.h"
@@ -256,63 +257,6 @@ static size_t bch2_btree_cache_size(struct bch_fs *c)

static int bch2_compression_stats_to_text(struct printbuf *out, struct bch_fs *c)
{
- struct btree_trans *trans;
- enum btree_id id;
- struct compression_type_stats {
- u64 nr_extents;
- u64 sectors_compressed;
- u64 sectors_uncompressed;
- } s[BCH_COMPRESSION_TYPE_NR];
- u64 compressed_incompressible = 0;
- int ret = 0;
-
- memset(s, 0, sizeof(s));
-
- if (!test_bit(BCH_FS_started, &c->flags))
- return -EPERM;
-
- trans = bch2_trans_get(c);
-
- for (id = 0; id < BTREE_ID_NR; id++) {
- if (!btree_type_has_ptrs(id))
- continue;
-
- ret = for_each_btree_key(trans, iter, id, POS_MIN,
- BTREE_ITER_ALL_SNAPSHOTS, k, ({
- struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
- struct bch_extent_crc_unpacked crc;
- const union bch_extent_entry *entry;
- bool compressed = false, incompressible = false;
-
- bkey_for_each_crc(k.k, ptrs, crc, entry) {
- incompressible |= crc.compression_type == BCH_COMPRESSION_TYPE_incompressible;
- compressed |= crc_is_compressed(crc);
-
- if (crc_is_compressed(crc)) {
- s[crc.compression_type].nr_extents++;
- s[crc.compression_type].sectors_compressed += crc.compressed_size;
- s[crc.compression_type].sectors_uncompressed += crc.uncompressed_size;
- }
- }
-
- compressed_incompressible += compressed && incompressible;
-
- if (!compressed) {
- unsigned t = incompressible ? BCH_COMPRESSION_TYPE_incompressible : 0;
-
- s[t].nr_extents++;
- s[t].sectors_compressed += k.k->size;
- s[t].sectors_uncompressed += k.k->size;
- }
- 0;
- }));
- }
-
- bch2_trans_put(trans);
-
- if (ret)
- return ret;
-
prt_str(out, "type");
printbuf_tabstop_push(out, 12);
prt_tab(out);
@@ -330,28 +274,35 @@ static int bch2_compression_stats_to_text(struct printbuf *out, struct bch_fs *c
prt_tab_rjust(out);
prt_newline(out);

- for (unsigned i = 0; i < ARRAY_SIZE(s); i++) {
+ for (unsigned i = 1; i < BCH_COMPRESSION_TYPE_NR; i++) {
+ struct disk_accounting_key a = {
+ .type = BCH_DISK_ACCOUNTING_compression,
+ .compression.type = i,
+ };
+ struct bpos p = disk_accounting_key_to_bpos(&a);
+ u64 v[3];
+ bch2_accounting_mem_read(c, p, v, ARRAY_SIZE(v));
+
+ u64 nr_extents = v[0];
+ u64 sectors_uncompressed = v[1];
+ u64 sectors_compressed = v[2];
+
bch2_prt_compression_type(out, i);
prt_tab(out);

- prt_human_readable_u64(out, s[i].sectors_compressed << 9);
+ prt_human_readable_u64(out, sectors_compressed << 9);
prt_tab_rjust(out);

- prt_human_readable_u64(out, s[i].sectors_uncompressed << 9);
+ prt_human_readable_u64(out, sectors_uncompressed << 9);
prt_tab_rjust(out);

- prt_human_readable_u64(out, s[i].nr_extents
- ? div_u64(s[i].sectors_uncompressed << 9, s[i].nr_extents)
+ prt_human_readable_u64(out, nr_extents
+ ? div_u64(sectors_uncompressed << 9, nr_extents)
: 0);
prt_tab_rjust(out);
prt_newline(out);
}

- if (compressed_incompressible) {
- prt_printf(out, "%llu compressed & incompressible extents", compressed_incompressible);
- prt_newline(out);
- }
-
return 0;
}

--
2.43.0


2024-02-27 15:48:00

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Sat, Feb 24, 2024 at 09:38:03PM -0500, Kent Overstreet wrote:
> New key type for the disk space accounting rewrite.
>
> - Holds a variable sized array of u64s (may be more than one for
> accounting e.g. compressed and uncompressed size, or buckets and
> sectors for a given data type)
>
> - Updates are deltas, not new versions of the key: this means updates
> to accounting can happen via the btree write buffer, which we'll be
> teaching to accumulate deltas.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> fs/bcachefs/Makefile | 3 +-
> fs/bcachefs/bcachefs.h | 1 +
> fs/bcachefs/bcachefs_format.h | 80 +++------------
> fs/bcachefs/bkey_methods.c | 1 +
> fs/bcachefs/disk_accounting.c | 70 ++++++++++++++
> fs/bcachefs/disk_accounting.h | 52 ++++++++++
> fs/bcachefs/disk_accounting_format.h | 139 +++++++++++++++++++++++++++
> fs/bcachefs/replicas_format.h | 21 ++++
> fs/bcachefs/sb-downgrade.c | 12 ++-
> fs/bcachefs/sb-errors_types.h | 3 +-
> 10 files changed, 311 insertions(+), 71 deletions(-)
> create mode 100644 fs/bcachefs/disk_accounting.c
> create mode 100644 fs/bcachefs/disk_accounting.h
> create mode 100644 fs/bcachefs/disk_accounting_format.h
> create mode 100644 fs/bcachefs/replicas_format.h
>
..
> diff --git a/fs/bcachefs/disk_accounting_format.h b/fs/bcachefs/disk_accounting_format.h
> new file mode 100644
> index 000000000000..e06a42f0d578
> --- /dev/null
> +++ b/fs/bcachefs/disk_accounting_format.h
> @@ -0,0 +1,139 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> +#define _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> +
> +#include "replicas_format.h"
> +
> +/*
> + * Disk accounting - KEY_TYPE_accounting - on disk format:
> + *
> + * Here, the key has considerably more structure than a typical key (bpos); an
> + * accounting key is 'struct disk_accounting_key', which is a union of bpos.
> + *

First impression.. I'm a little confused why the key type is a union of
bpos. I'm possibly missing something fundamental/obvious, but could you
elaborate more on why that is here?

Brian
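
A sketch of what the union buys (illustrative only; names from the
quoted header and disk_accounting.h): filling in the typed fields *is*
constructing the btree position, since they're the same bytes -

  struct disk_accounting_key acc = {
          .type                    = BCH_DISK_ACCOUNTING_dev_data_type,
          .dev_data_type.dev       = ca->dev_idx,
          .dev_data_type.data_type = BCH_DATA_btree,
  };

  /* reinterpret those bytes as the key's position in BTREE_ID_accounting: */
  struct bpos pos = disk_accounting_key_to_bpos(&acc);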

> + * This is a type-tagged union of all our various subtypes; a disk accounting
> + * key can be device counters, replicas counters, et cetera - it's extensible.
> + *
> + * The value is a list of u64s or s64s; the number of counters is specific to a
> + * given accounting type.
> + *
> + * Unlike with other key types, updates are _deltas_, and the deltas are not
> + * resolved until the update to the underlying btree, done by btree write buffer
> + * flush or journal replay.
> + *
> + * Journal replay in particular requires special handling. The journal tracks a
> + * range of entries which may possibly have not yet been applied to the btree
> + * yet - it does not know definitively whether individual entries are dirty and
> + * still need to be applied.
> + *
> + * To handle this, we use the version field of struct bkey, and give every
> + * accounting update a unique version number - a total ordering in time; the
> + * version number is derived from the key's position in the journal. Then
> + * journal replay can compare the version number of the key from the journal
> + * with the version number of the key in the btree to determine if a key needs
> + * to be replayed.
> + *
> + * For this to work, we must maintain this strict time ordering of updates as
> + * they are flushed to the btree, both via write buffer flush and via journal
> + * replay. This has complications for the write buffer code while journal replay
> + * is still in progress; the write buffer cannot flush any accounting keys to
> + * the btree until journal replay has finished replaying its accounting keys, or
> + * the (newer) version number of the keys from the write buffer will cause
> + * updates from journal replay to be lost.
> + */
> +
> +struct bch_accounting {
> + struct bch_val v;
> + __u64 d[];
> +};
> +
> +#define BCH_ACCOUNTING_MAX_COUNTERS 3
> +
> +#define BCH_DATA_TYPES() \
> + x(free, 0) \
> + x(sb, 1) \
> + x(journal, 2) \
> + x(btree, 3) \
> + x(user, 4) \
> + x(cached, 5) \
> + x(parity, 6) \
> + x(stripe, 7) \
> + x(need_gc_gens, 8) \
> + x(need_discard, 9)
> +
> +enum bch_data_type {
> +#define x(t, n) BCH_DATA_##t,
> + BCH_DATA_TYPES()
> +#undef x
> + BCH_DATA_NR
> +};
> +
> +static inline bool data_type_is_empty(enum bch_data_type type)
> +{
> + switch (type) {
> + case BCH_DATA_free:
> + case BCH_DATA_need_gc_gens:
> + case BCH_DATA_need_discard:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> +static inline bool data_type_is_hidden(enum bch_data_type type)
> +{
> + switch (type) {
> + case BCH_DATA_sb:
> + case BCH_DATA_journal:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> +#define BCH_DISK_ACCOUNTING_TYPES() \
> + x(nr_inodes, 0) \
> + x(persistent_reserved, 1) \
> + x(replicas, 2) \
> + x(dev_data_type, 3) \
> + x(dev_stripe_buckets, 4)
> +
> +enum disk_accounting_type {
> +#define x(f, nr) BCH_DISK_ACCOUNTING_##f = nr,
> + BCH_DISK_ACCOUNTING_TYPES()
> +#undef x
> + BCH_DISK_ACCOUNTING_TYPE_NR,
> +};
> +
> +struct bch_nr_inodes {
> +};
> +
> +struct bch_persistent_reserved {
> + __u8 nr_replicas;
> +};
> +
> +struct bch_dev_data_type {
> + __u8 dev;
> + __u8 data_type;
> +};
> +
> +struct bch_dev_stripe_buckets {
> + __u8 dev;
> +};
> +
> +struct disk_accounting_key {
> + union {
> + struct {
> + __u8 type;
> + union {
> + struct bch_nr_inodes nr_inodes;
> + struct bch_persistent_reserved persistent_reserved;
> + struct bch_replicas_entry_v1 replicas;
> + struct bch_dev_data_type dev_data_type;
> + struct bch_dev_stripe_buckets dev_stripe_buckets;
> + };
> + };
> + struct bpos _pad;
> + };
> +};
> +
> +#endif /* _BCACHEFS_DISK_ACCOUNTING_FORMAT_H */
> diff --git a/fs/bcachefs/replicas_format.h b/fs/bcachefs/replicas_format.h
> new file mode 100644
> index 000000000000..ed94f8c636b3
> --- /dev/null
> +++ b/fs/bcachefs/replicas_format.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _BCACHEFS_REPLICAS_FORMAT_H
> +#define _BCACHEFS_REPLICAS_FORMAT_H
> +
> +struct bch_replicas_entry_v0 {
> + __u8 data_type;
> + __u8 nr_devs;
> + __u8 devs[];
> +} __packed;
> +
> +struct bch_replicas_entry_v1 {
> + __u8 data_type;
> + __u8 nr_devs;
> + __u8 nr_required;
> + __u8 devs[];
> +} __packed;
> +
> +#define replicas_entry_bytes(_i) \
> + (offsetof(typeof(*(_i)), devs) + (_i)->nr_devs)
> +
> +#endif /* _BCACHEFS_REPLICAS_FORMAT_H */
> diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
> index 3337419faeff..33db8d7ca8c4 100644
> --- a/fs/bcachefs/sb-downgrade.c
> +++ b/fs/bcachefs/sb-downgrade.c
> @@ -52,9 +52,15 @@
> BCH_FSCK_ERR_subvol_fs_path_parent_wrong) \
> x(btree_subvolume_children, \
> BIT_ULL(BCH_RECOVERY_PASS_check_subvols), \
> - BCH_FSCK_ERR_subvol_children_not_set)
> + BCH_FSCK_ERR_subvol_children_not_set) \
> + x(disk_accounting_v2, \
> + BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
> + BCH_FSCK_ERR_accounting_mismatch)
>
> -#define DOWNGRADE_TABLE()
> +#define DOWNGRADE_TABLE() \
> + x(disk_accounting_v2, \
> + BIT_ULL(BCH_RECOVERY_PASS_check_alloc_info), \
> + BCH_FSCK_ERR_dev_usage_buckets_wrong)
>
> struct upgrade_downgrade_entry {
> u64 recovery_passes;
> @@ -108,7 +114,7 @@ void bch2_sb_set_upgrade(struct bch_fs *c,
> }
> }
>
> -#define x(ver, passes, ...) static const u16 downgrade_ver_##errors[] = { __VA_ARGS__ };
> +#define x(ver, passes, ...) static const u16 downgrade_##ver##_errors[] = { __VA_ARGS__ };
> DOWNGRADE_TABLE()
> #undef x
>
> diff --git a/fs/bcachefs/sb-errors_types.h b/fs/bcachefs/sb-errors_types.h
> index 0df4b0e7071a..383e13711001 100644
> --- a/fs/bcachefs/sb-errors_types.h
> +++ b/fs/bcachefs/sb-errors_types.h
> @@ -264,7 +264,8 @@
> x(subvol_children_not_set, 256) \
> x(subvol_children_bad, 257) \
> x(subvol_loop, 258) \
> - x(subvol_unreachable, 259)
> + x(subvol_unreachable, 259) \
> + x(accounting_mismatch, 260)
>
> enum bch_sb_error_id {
> #define x(t, n) BCH_FSCK_ERR_##t = n,
> --
> 2.43.0
>


2024-02-27 15:49:12

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> Teach the btree write buffer how to accumulate accounting keys - instead
> of having the newer key overwrite the older key as we do with other
> updates, we need to add them together.
>
> Also, add a flag so that write buffer flush knows when journal replay is
> finished flushing accounting, and teach it to hold accounting keys until
> that flag is set.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> fs/bcachefs/bcachefs.h | 1 +
> fs/bcachefs/btree_write_buffer.c | 66 +++++++++++++++++++++++++++-----
> fs/bcachefs/recovery.c | 3 ++
> 3 files changed, 61 insertions(+), 9 deletions(-)
>
..
> diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
> index b77e7b382b66..002a0762fc85 100644
> --- a/fs/bcachefs/btree_write_buffer.c
> +++ b/fs/bcachefs/btree_write_buffer.c
> @@ -5,6 +5,7 @@
> #include "btree_update.h"
> #include "btree_update_interior.h"
> #include "btree_write_buffer.h"
> +#include "disk_accounting.h"
> #include "error.h"
> #include "journal.h"
> #include "journal_io.h"
> @@ -123,7 +124,9 @@ static noinline int wb_flush_one_slowpath(struct btree_trans *trans,
>
> static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *iter,
> struct btree_write_buffered_key *wb,
> - bool *write_locked, size_t *fast)
> + bool *write_locked,
> + bool *accounting_accumulated,
> + size_t *fast)
> {
> struct btree_path *path;
> int ret;
> @@ -136,6 +139,16 @@ static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *ite
> if (ret)
> return ret;
>
> + if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> + struct bkey u;
> + struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> +
> + if (k.k->type == KEY_TYPE_accounting)
> + bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> + bkey_s_c_to_accounting(k));

So it looks like we're accumulating from the btree key into the write
buffer key. Is this so the following code will basically insert a new
btree key based on the value of the write buffer key?
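
For context: accounting values are deltas until flushed, so the
accumulate step is a per-counter sum - roughly (a sketch; the real
helper in disk_accounting.h may additionally handle bkey versions):

  static inline void bch2_accounting_accumulate(struct bkey_i_accounting *dst,
                                                struct bkey_s_c_accounting src)
  {
          for (unsigned i = 0; i < bch2_accounting_counters(src.k); i++)
                  dst->v.d[i] += src.v->d[i];
  }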

> + }
> + *accounting_accumulated = true;
> +
> /*
> * We can't clone a path that has write locks: unshare it now, before
> * set_pos and traverse():
> @@ -248,8 +261,9 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> struct journal *j = &c->journal;
> struct btree_write_buffer *wb = &c->btree_write_buffer;
> struct btree_iter iter = { NULL };
> - size_t skipped = 0, fast = 0, slowpath = 0;
> + size_t overwritten = 0, fast = 0, slowpath = 0, could_not_insert = 0;
> bool write_locked = false;
> + bool accounting_replay_done = test_bit(BCH_FS_accounting_replay_done, &c->flags);
> int ret = 0;
>
> bch2_trans_unlock(trans);
> @@ -284,17 +298,29 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
>
> darray_for_each(wb->sorted, i) {
> struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> + bool accounting_accumulated = false;

Should this live within the interior flush loop?

>
> for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
> prefetch(&wb->flushing.keys.data[n->idx]);
>
> BUG_ON(!k->journal_seq);
>
> + if (!accounting_replay_done &&
> + k->k.k.type == KEY_TYPE_accounting) {
> + slowpath++;
> + continue;
> + }
> +
> if (i + 1 < &darray_top(wb->sorted) &&
> wb_key_eq(i, i + 1)) {
> struct btree_write_buffered_key *n = &wb->flushing.keys.data[i[1].idx];
>
> - skipped++;
> + if (k->k.k.type == KEY_TYPE_accounting &&
> + n->k.k.type == KEY_TYPE_accounting)
> + bch2_accounting_accumulate(bkey_i_to_accounting(&n->k),
> + bkey_i_to_s_c_accounting(&k->k));
> +
> + overwritten++;
> n->journal_seq = min_t(u64, n->journal_seq, k->journal_seq);
> k->journal_seq = 0;
> continue;
> @@ -325,7 +351,8 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> break;
> }
>
> - ret = wb_flush_one(trans, &iter, k, &write_locked, &fast);
> + ret = wb_flush_one(trans, &iter, k, &write_locked,
> + &accounting_accumulated, &fast);
> if (!write_locked)
> bch2_trans_begin(trans);
> } while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
> @@ -361,8 +388,15 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> if (!i->journal_seq)
> continue;
>
> - bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> - bch2_btree_write_buffer_journal_flush);
> + if (!accounting_replay_done &&
> + i->k.k.type == KEY_TYPE_accounting) {
> + could_not_insert++;
> + continue;
> + }
> +
> + if (!could_not_insert)
> + bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> + bch2_btree_write_buffer_journal_flush);

Hmm.. so this is sane because the slowpath runs in journal sorted order,
right?

>
> bch2_trans_begin(trans);
>
> @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> btree_write_buffered_insert(trans, i));
> if (ret)
> goto err;
> +
> + i->journal_seq = 0;
> + }
> +

/*
* Condense the remaining keys <reasons reasons>...??
*/

> + if (could_not_insert) {
> + struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> +
> + darray_for_each(wb->flushing.keys, i)
> + if (i->journal_seq)
> + *dst++ = *i;
> + wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> }
> }
> err:
> + if (ret || !could_not_insert) {
> + bch2_journal_pin_drop(j, &wb->flushing.pin);
> + wb->flushing.keys.nr = 0;
> + }
> +
> bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> - trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> - bch2_journal_pin_drop(j, &wb->flushing.pin);
> - wb->flushing.keys.nr = 0;
> + trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);

I feel like the last time I looked at the write buffer stuff the flush
wasn't reentrant in this way. I.e., the flush switched out the active
buffer and so had to process all entries in the current buffer (or
something like that). Has something changed or do I misunderstand?

> return ret;
> }
>
> diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> index 6829d80bd181..b8289af66c8e 100644
> --- a/fs/bcachefs/recovery.c
> +++ b/fs/bcachefs/recovery.c
> @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> goto err;
> }
>
> + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> +

I assume this ties into the question on the previous patch..

Related question.. if the write buffer can't flush during journal
replay, is there concern/risk of overflowing it?

Brian

> /*
> * First, attempt to replay keys in sorted order. This is more
> * efficient - better locality of btree access - but some might fail if
> @@ -1204,6 +1206,7 @@ int bch2_fs_initialize(struct bch_fs *c)
> * set up the journal.pin FIFO and journal.cur pointer:
> */
> bch2_fs_journal_start(&c->journal, 1);
> + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> bch2_journal_set_replay_done(&c->journal);
>
> ret = bch2_fs_read_write_early(c);
> --
> 2.43.0
>


2024-02-27 15:53:29

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

On Sat, Feb 24, 2024 at 09:38:06PM -0500, Kent Overstreet wrote:
> Main part of the disk accounting rewrite.
>
> This is a wholesale rewrite of the existing disk space accounting, which
> relies on percpu counters that are sharded by journal buffer, and
> rolled up and added to each journal write.
>
> With the new scheme, every set of counters is a distinct key in the
> accounting btree; this fixes scaling limitations of the old scheme,
> where counters took up space in each journal entry and required multiple
> percpu counters.
>
> Now, in memory accounting requires a single set of percpu counters - not
> multiple for each in flight journal buffer - and in the future we'll
> probably also have counters that don't use in memory percpu counters,
> they're not strictly required.
>
> An accounting update is now a normal btree update, using the btree write
> buffer path. At transaction commit time, we apply accounting updates to
> the in memory counters, which are percpu counters indexed in an
> eytzinger tree by the accounting key.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> fs/bcachefs/alloc_background.c | 68 +++++-
> fs/bcachefs/bcachefs.h | 6 +-
> fs/bcachefs/bcachefs_format.h | 1 -
> fs/bcachefs/bcachefs_ioctl.h | 7 +-
> fs/bcachefs/btree_gc.c | 3 +-
> fs/bcachefs/btree_iter.c | 9 -
> fs/bcachefs/btree_trans_commit.c | 62 ++++--
> fs/bcachefs/btree_types.h | 1 -
> fs/bcachefs/btree_update.h | 8 -
> fs/bcachefs/buckets.c | 289 +++++---------------------
> fs/bcachefs/buckets.h | 33 +--
> fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
> fs/bcachefs/disk_accounting.h | 126 ++++++++++++
> fs/bcachefs/disk_accounting_types.h | 20 ++
> fs/bcachefs/ec.c | 24 ++-
> fs/bcachefs/inode.c | 9 +-
> fs/bcachefs/recovery.c | 12 +-
> fs/bcachefs/recovery_types.h | 1 +
> fs/bcachefs/replicas.c | 42 ++--
> fs/bcachefs/replicas.h | 11 +-
> fs/bcachefs/replicas_types.h | 16 --
> fs/bcachefs/sb-errors_types.h | 3 +-
> fs/bcachefs/super.c | 49 +++--
> 23 files changed, 704 insertions(+), 404 deletions(-)
> create mode 100644 fs/bcachefs/disk_accounting_types.h
>
..
> diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> index 209f59e87b34..327c586ac661 100644
> --- a/fs/bcachefs/disk_accounting.c
> +++ b/fs/bcachefs/disk_accounting.c
..
> @@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
> NULL
> };
>

So I'm gonna need to stare at all this much more than I have so far, but
one initial thing that stands out to me is the lack of high level
function comments. IMO, something that helps tremendously in
reading/reviewing these sorts of systemic changes is having a couple
sentence or so comment at the top of the main/external interfaces just
to briefly explain what they do in plain english.

So here, something like "modify an accounting key in the btree based on
<whatever> ..." helps explain what it does and why it's used where it
is. The same goes for some of the other interface level functions, like
reading in accounting from disk, updating in-memory accounting (from
journal entries in committing transactions?), updating the superblock,
etc. I think I've started to put some of those pieces together, but
having to jump all through the implementation to piece together high
level behaviors is significantly more time consuming than having the
author guide one through the high level interactions.

IOW, I think if you minimally document the functions that are sufficient
to help understand how accounting works as a black box (somewhere
beneath the [nice] higher level big comment descriptions of the whole
thing and above the low level implementation details), that helps the
reviewer establish an understanding of the mechanism before having to
dig through the implementation details and also serves as a reference
going forward for the next person who is in a similar position and wants
to read/debug/tweak/whatever this code.

> +int bch2_disk_accounting_mod(struct btree_trans *trans,
> + struct disk_accounting_key *k,
> + s64 *d, unsigned nr)
> +{
> + /* Normalize: */
> + switch (k->type) {
> + case BCH_DISK_ACCOUNTING_replicas:
> + bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
> + break;
> + }
> +
> + BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
> +
> + struct {
> + __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
> + } k_i;
> + struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
> +
> + acc->k.p = disk_accounting_key_to_bpos(k);
> + set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
> +
> + memcpy_u64s_small(acc->v.d, d, nr);
> +
> + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
> +}
> +
..
> diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> index a7f9de220d90..685d54d0ddbb 100644
> --- a/fs/bcachefs/super.c
> +++ b/fs/bcachefs/super.c
..
> @@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
> if (ret)
> goto err;
>
> + /*
> + * We need to flush the entire journal to get rid of keys that reference
> + * the device being removed before removing the superblock entry
> + */
> + bch2_journal_flush_all_pins(&c->journal);

I thought this needed to occur between the device removal and superblock
update (according to the comment below). Is that not the case? Either
way, is it moved for reasons related to accounting?

Brian

> +
> + /*
> + * this is really just needed for the bch2_replicas_gc_(start|end)
> + * calls, and could be cleaned up:
> + */
> ret = bch2_journal_flush_device_pins(&c->journal, ca->dev_idx);
> bch_err_msg(ca, ret, "bch2_journal_flush_device_pins()");
> if (ret)
> @@ -1655,17 +1668,6 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
>
> bch2_dev_free(ca);
>
> - /*
> - * At this point the device object has been removed in-core, but the
> - * on-disk journal might still refer to the device index via sb device
> - * usage entries. Recovery fails if it sees usage information for an
> - * invalid device. Flush journal pins to push the back of the journal
> - * past now invalid device index references before we update the
> - * superblock, but after the device object has been removed so any
> - * further journal writes elide usage info for the device.
> - */
> - bch2_journal_flush_all_pins(&c->journal);
> -
> /*
> * Free this device's slot in the bch_member array - all pointers to
> * this device must be gone:
> @@ -1727,8 +1729,6 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
> goto err;
> }
>
> - bch2_dev_usage_init(ca);
> -
> ret = __bch2_dev_attach_bdev(ca, &sb);
> if (ret)
> goto err;
> @@ -1793,6 +1793,10 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
>
> bch2_dev_usage_journal_reserve(c);
>
> + ret = bch2_dev_usage_init(ca);
> + if (ret)
> + goto err_late;
> +
> ret = bch2_trans_mark_dev_sb(c, ca);
> bch_err_msg(ca, ret, "marking new superblock");
> if (ret)
> @@ -1956,15 +1960,18 @@ int bch2_dev_resize(struct bch_fs *c, struct bch_dev *ca, u64 nbuckets)
> mutex_unlock(&c->sb_lock);
>
> if (ca->mi.freespace_initialized) {
> - ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets);
> + struct disk_accounting_key acc = {
> + .type = BCH_DISK_ACCOUNTING_dev_data_type,
> + .dev_data_type.dev = ca->dev_idx,
> + .dev_data_type.data_type = BCH_DATA_free,
> + };
> + u64 v[3] = { nbuckets - old_nbuckets, 0, 0 };
> +
> + ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets) ?:
> + bch2_trans_do(ca->fs, NULL, NULL, 0,
> + bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
> if (ret)
> goto err;
> -
> - /*
> - * XXX: this is all wrong transactionally - we'll be able to do
> - * this correctly after the disk space accounting rewrite
> - */
> - ca->usage_base->d[BCH_DATA_free].buckets += nbuckets - old_nbuckets;
> }
>
> bch2_recalc_capacity(c);
> --
> 2.43.0
>


2024-02-27 16:34:14

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 02/21] bcachefs: Accumulate accounting keys in journal replay

On Sat, Feb 24, 2024 at 09:38:04PM -0500, Kent Overstreet wrote:
> Until accounting keys hit the btree, they are deltas, not new versions
> of the existing key; this means we have to teach journal replay to
> accumulate them.
>
> Additionally, the journal doesn't track precisely which entries have
> been flushed to the btree; it only tracks a range of entries that may
> possibly still need to be flushed.
>
> That means we need to compare accounting keys against the version in the
> btree and only flush updates that are newer.
>
> There's another wrinkle with the write buffer: if the write buffer
> starts flushing accounting keys before journal replay has finished
> flushing accounting keys, journal replay will see the version number
> from the new updates and updates from the journal will be lost.
>
> To avoid this, journal replay has to flush accounting keys first, and
> we'll be adding a flag so that write buffer flush knows to hold
> accounting keys until then.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> fs/bcachefs/btree_journal_iter.c | 23 +++-------
> fs/bcachefs/btree_journal_iter.h | 15 +++++++
> fs/bcachefs/btree_trans_commit.c | 9 +++-
> fs/bcachefs/btree_update.h | 14 +++++-
> fs/bcachefs/recovery.c | 76 +++++++++++++++++++++++++++++++-
> 5 files changed, 117 insertions(+), 20 deletions(-)
>
..
> diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> index 96e7a1ec7091..6829d80bd181 100644
> --- a/fs/bcachefs/recovery.c
> +++ b/fs/bcachefs/recovery.c
> @@ -11,6 +11,7 @@
> #include "btree_io.h"
> #include "buckets.h"
> #include "dirent.h"
> +#include "disk_accounting.h"
> #include "ec.h"
> #include "errcode.h"
> #include "error.h"
> @@ -87,6 +88,56 @@ static void replay_now_at(struct journal *j, u64 seq)
> bch2_journal_pin_put(j, j->replay_journal_seq++);
> }
>
> +static int bch2_journal_replay_accounting_key(struct btree_trans *trans,
> + struct journal_key *k)
> +{
> + struct journal_keys *keys = &trans->c->journal_keys;
> +
> + struct btree_iter iter;
> + bch2_trans_node_iter_init(trans, &iter, k->btree_id, k->k->k.p,
> + BTREE_MAX_DEPTH, k->level,
> + BTREE_ITER_INTENT);
> + int ret = bch2_btree_iter_traverse(&iter);
> + if (ret)
> + goto out;
> +
> + struct bkey u;
> + struct bkey_s_c old = bch2_btree_path_peek_slot(btree_iter_path(trans, &iter), &u);
> +
> + if (bversion_cmp(old.k->version, k->k->k.version) >= 0) {
> + ret = 0;
> + goto out;
> + }

So I assume this is what correlates back to the need to not flush the
write buffer until replay completes, otherwise we could unintentionally
skip subsequent key updates. Is that the case?

If so, it would be nice to have some comments here that explain this.
I.e., I don't quite have a big enough picture to see how the version
updates in the key accumulation helpers are kept from conflicting with
this particular check, so something that helps connect the dots for
somebody who doesn't already know how this is all supposed to work
would be useful.

Brian

> +
> + if (k + 1 < &darray_top(*keys) &&
> + !journal_key_cmp(k, k + 1)) {
> + BUG_ON(bversion_cmp(k[0].k->k.version, k[1].k->k.version) > 0);
> +
> + bch2_accounting_accumulate(bkey_i_to_accounting(k[1].k),
> + bkey_i_to_s_c_accounting(k[0].k));
> + ret = 0;
> + goto out;
> + }
> +
> + struct bkey_i *new = k->k;
> + if (old.k->type == KEY_TYPE_accounting) {
> + new = bch2_bkey_make_mut_noupdate(trans, bkey_i_to_s_c(k->k));
> + ret = PTR_ERR_OR_ZERO(new);
> + if (ret)
> + goto out;
> +
> + bch2_accounting_accumulate(bkey_i_to_accounting(new),
> + bkey_s_c_to_accounting(old));
> + }
> +
> + trans->journal_res.seq = k->journal_seq;
> +
> + ret = bch2_trans_update(trans, &iter, new, BTREE_TRIGGER_NORUN);
> +out:
> + bch2_trans_iter_exit(trans, &iter);
> + return ret;
> +}
> +
> static int bch2_journal_replay_key(struct btree_trans *trans,
> struct journal_key *k)
> {
> @@ -159,12 +210,33 @@ static int bch2_journal_replay(struct bch_fs *c)
>
> BUG_ON(!atomic_read(&keys->ref));
>
> + /*
> + * Replay accounting keys first: we can't allow the write buffer to
> + * flush accounting keys until we're done
> + */
> + darray_for_each(*keys, k) {
> + if (!(k->k->k.type == KEY_TYPE_accounting && !k->allocated))
> + continue;
> +
> + cond_resched();
> +
> + ret = commit_do(trans, NULL, NULL,
> + BCH_TRANS_COMMIT_no_enospc|
> + BCH_TRANS_COMMIT_no_journal_res,
> + bch2_journal_replay_accounting_key(trans, k));
> + if (bch2_fs_fatal_err_on(ret, c, "error replaying accounting; %s", bch2_err_str(ret)))
> + goto err;
> + }
> +
> /*
> * First, attempt to replay keys in sorted order. This is more
> * efficient - better locality of btree access - but some might fail if
> * that would cause a journal deadlock.
> */
> darray_for_each(*keys, k) {
> + if (k->k->k.type == KEY_TYPE_accounting && !k->allocated)
> + continue;
> +
> cond_resched();
>
> /* Skip fastpath if we're low on space in the journal */
> @@ -174,7 +246,7 @@ static int bch2_journal_replay(struct bch_fs *c)
> BCH_TRANS_COMMIT_journal_reclaim|
> (!k->allocated ? BCH_TRANS_COMMIT_no_journal_res : 0),
> bch2_journal_replay_key(trans, k));
> - BUG_ON(!ret && !k->overwritten);
> + BUG_ON(!ret && !k->overwritten && k->k->k.type != KEY_TYPE_accounting);
> if (ret) {
> ret = darray_push(&keys_sorted, k);
> if (ret)
> @@ -208,7 +280,7 @@ static int bch2_journal_replay(struct bch_fs *c)
> if (ret)
> goto err;
>
> - BUG_ON(!k->overwritten);
> + BUG_ON(k->btree_id != BTREE_ID_accounting && !k->overwritten);
> }
>
> /*
> --
> 2.43.0
>


2024-02-28 19:41:55

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Tue, Feb 27, 2024 at 10:49:19AM -0500, Brian Foster wrote:
> On Sat, Feb 24, 2024 at 09:38:03PM -0500, Kent Overstreet wrote:
> > New key type for the disk space accounting rewrite.
> >
> > - Holds a variable sized array of u64s (may be more than one for
> > accounting e.g. compressed and uncompressed size, or buckets and
> > sectors for a given data type)
> >
> > - Updates are deltas, not new versions of the key: this means updates
> > to accounting can happen via the btree write buffer, which we'll be
> > teaching to accumulate deltas.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > fs/bcachefs/Makefile | 3 +-
> > fs/bcachefs/bcachefs.h | 1 +
> > fs/bcachefs/bcachefs_format.h | 80 +++------------
> > fs/bcachefs/bkey_methods.c | 1 +
> > fs/bcachefs/disk_accounting.c | 70 ++++++++++++++
> > fs/bcachefs/disk_accounting.h | 52 ++++++++++
> > fs/bcachefs/disk_accounting_format.h | 139 +++++++++++++++++++++++++++
> > fs/bcachefs/replicas_format.h | 21 ++++
> > fs/bcachefs/sb-downgrade.c | 12 ++-
> > fs/bcachefs/sb-errors_types.h | 3 +-
> > 10 files changed, 311 insertions(+), 71 deletions(-)
> > create mode 100644 fs/bcachefs/disk_accounting.c
> > create mode 100644 fs/bcachefs/disk_accounting.h
> > create mode 100644 fs/bcachefs/disk_accounting_format.h
> > create mode 100644 fs/bcachefs/replicas_format.h
> >
> ...
> > diff --git a/fs/bcachefs/disk_accounting_format.h b/fs/bcachefs/disk_accounting_format.h
> > new file mode 100644
> > index 000000000000..e06a42f0d578
> > --- /dev/null
> > +++ b/fs/bcachefs/disk_accounting_format.h
> > @@ -0,0 +1,139 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> > +#define _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> > +
> > +#include "replicas_format.h"
> > +
> > +/*
> > + * Disk accounting - KEY_TYPE_accounting - on disk format:
> > + *
> > + * Here, the key has considerably more structure than a typical key (bpos); an
> > + * accounting key is 'struct disk_accounting_key', which is a union of bpos.
> > + *
>
> First impression.. I'm a little confused why the key type is a union of
> bpos. I'm possibly missing something fundamental/obvious, but could you
> elaborate more on why that is here?

How's this?

* More specifically: a key is just a multiword integer (where word endianness
* matches native byte order), so we're treating bpos as an opaque 20 byte
* integer and mapping bch_accounting_key to that.
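
To make that concrete, a rough sketch of the layout (simplified: the real
definitions live in disk_accounting_format.h and have more accounting
types than shown here; the devs[] sizing is illustrative only):

struct disk_accounting_key {
	union {
	struct {
		__u8		type;	/* BCH_DISK_ACCOUNTING_* */
		union {
			struct {
				__u8	nr_devs;
				__u8	devs[8];  /* really a bch_replicas_entry */
			}		replicas;
			struct {
				__u8	dev;
				__u8	data_type;
			}		dev_data_type;
		};
	};
	struct bpos		_pad;	/* the same 20 bytes, viewed as a position */
	};
};

static inline struct bpos disk_accounting_key_to_bpos(struct disk_accounting_key *k)
{
	return k->_pad;	/* no translation needed - just reinterpret the bytes */
}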

2024-02-28 20:16:31

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 02/21] bcachefs: Accumulate accounting keys in journal replay

On Tue, Feb 27, 2024 at 10:49:46AM -0500, Brian Foster wrote:
> On Sat, Feb 24, 2024 at 09:38:04PM -0500, Kent Overstreet wrote:
> > Until accounting keys hit the btree, they are deltas, not new versions
> > of the existing key; this means we have to teach journal replay to
> > accumulate them.
> >
> > Additionally, the journal doesn't track precisely which entries have
> > been flushed to the btree; it only tracks a range of entries that may
> > possibly still need to be flushed.
> >
> > That means we need to compare accounting keys against the version in the
> > btree and only flush updates that are newer.
> >
> > There's another wrinkle with the write buffer: if the write buffer
> > starts flushing accounting keys before journal replay has finished
> > flushing accounting keys, journal replay will see the version number
> > from the new updates and updates from the journal will be lost.
> >
> > To avoid this, journal replay has to flush accounting keys first, and
> > we'll be adding a flag so that write buffer flush knows to hold
> > accounting keys until then.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > fs/bcachefs/btree_journal_iter.c | 23 +++-------
> > fs/bcachefs/btree_journal_iter.h | 15 +++++++
> > fs/bcachefs/btree_trans_commit.c | 9 +++-
> > fs/bcachefs/btree_update.h | 14 +++++-
> > fs/bcachefs/recovery.c | 76 +++++++++++++++++++++++++++++++-
> > 5 files changed, 117 insertions(+), 20 deletions(-)
> >
> ...
> > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > index 96e7a1ec7091..6829d80bd181 100644
> > --- a/fs/bcachefs/recovery.c
> > +++ b/fs/bcachefs/recovery.c
> > @@ -11,6 +11,7 @@
> > #include "btree_io.h"
> > #include "buckets.h"
> > #include "dirent.h"
> > +#include "disk_accounting.h"
> > #include "ec.h"
> > #include "errcode.h"
> > #include "error.h"
> > @@ -87,6 +88,56 @@ static void replay_now_at(struct journal *j, u64 seq)
> > bch2_journal_pin_put(j, j->replay_journal_seq++);
> > }
> >
> > +static int bch2_journal_replay_accounting_key(struct btree_trans *trans,
> > + struct journal_key *k)
> > +{
> > + struct journal_keys *keys = &trans->c->journal_keys;
> > +
> > + struct btree_iter iter;
> > + bch2_trans_node_iter_init(trans, &iter, k->btree_id, k->k->k.p,
> > + BTREE_MAX_DEPTH, k->level,
> > + BTREE_ITER_INTENT);
> > + int ret = bch2_btree_iter_traverse(&iter);
> > + if (ret)
> > + goto out;
> > +
> > + struct bkey u;
> > + struct bkey_s_c old = bch2_btree_path_peek_slot(btree_iter_path(trans, &iter), &u);
> > +
> > + if (bversion_cmp(old.k->version, k->k->k.version) >= 0) {
> > + ret = 0;
> > + goto out;
> > + }
>
> So I assume this is what correlates back to the need to not flush the
> write buffer until replay completes, otherwise we could unintentionally
> skip subsequent key updates. Is that the case?

No, this is the "has this delta been applied to the btree key" check -
adding that as a comment.

Write buffer exclusion comes with a new filesystem bit that gets set once
accounting keys have all been replayed; that's in the next patch

2024-02-28 22:42:58

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > + if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > + struct bkey u;
> > + struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > +
> > + if (k.k->type == KEY_TYPE_accounting)
> > + bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > + bkey_s_c_to_accounting(k));
>
> So it looks like we're accumulating from the btree key into the write
> buffer key. Is this so the following code will basically insert a new
> btree key based on the value of the write buffer key?

Correct, this is where we go from "accounting keys is a delta" to
"accounting key is new version of the key".

> > darray_for_each(wb->sorted, i) {
> > struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > + bool accounting_accumulated = false;
>
> Should this live within the interior flush loop?

We can't define it within the loop because then we'd be setting it to
false on every loop iteration... but it does belong _with_ the loop, so
I'll move it to right before.

> > - bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > - bch2_btree_write_buffer_journal_flush);
> > + if (!accounting_replay_done &&
> > + i->k.k.type == KEY_TYPE_accounting) {
> > + could_not_insert++;
> > + continue;
> > + }
> > +
> > + if (!could_not_insert)
> > + bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > + bch2_btree_write_buffer_journal_flush);
>
> Hmm.. so this is sane because the slowpath runs in journal sorted order,
> right?

yup, which means as soon as we hit a key we can't insert we can't
release any more journal pins

>
> >
> > bch2_trans_begin(trans);
> >
> > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > btree_write_buffered_insert(trans, i));
> > if (ret)
> > goto err;
> > +
> > + i->journal_seq = 0;
> > + }
> > +
>
> /*
> * Condense the remaining keys <reasons reasons>...??
> */

yup, that's a good comment

> > + if (could_not_insert) {
> > + struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > +
> > + darray_for_each(wb->flushing.keys, i)
> > + if (i->journal_seq)
> > + *dst++ = *i;
> > + wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > }
> > }
> > err:
> > + if (ret || !could_not_insert) {
> > + bch2_journal_pin_drop(j, &wb->flushing.pin);
> > + wb->flushing.keys.nr = 0;
> > + }
> > +
> > bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > - trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > - bch2_journal_pin_drop(j, &wb->flushing.pin);
> > - wb->flushing.keys.nr = 0;
> > + trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
>
> I feel like the last time I looked at the write buffer stuff the flush
> wasn't reentrant in this way. I.e., the flush switched out the active
> buffer and so had to process all entries in the current buffer (or
> something like that). Has something changed or do I misunderstand?

Yeah, originally we were adding keys to the write buffer directly from
the transaction commit path, so that necessitated the super fast
lockless stuff where we'd toggle between buffers so one was always
available.

Now keys are pulled from the journal, so we can use (somewhat) simpler
locking and buffering; now the complication is that we can't predict in
advance how many keys are going to come out of the journal for the write
buffer.

>
> > return ret;
> > }
> >
> > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > index 6829d80bd181..b8289af66c8e 100644
> > --- a/fs/bcachefs/recovery.c
> > +++ b/fs/bcachefs/recovery.c
> > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > goto err;
> > }
> >
> > + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > +
>
> I assume this ties into the question on the previous patch..
>
> Related question.. if the write buffer can't flush during journal
> replay, is there concern/risk of overflowing it?

Shouldn't be any actual risk. It's just new accounting updates that the
write buffer can't flush, and those are only going to be generated by
interior btree node updates as journal replay has to split/rewrite nodes
to make room for its updates.

And for those new accounting updates, updates to the same counters get
accumulated as they're flushed from the journal to the write buffer -
see the patch for eytzinger tree accumulation. So we could only overflow
if the number of distinct counters touched was somehow very large.

And the number of distinct counters will be growing significantly, but
the new counters will all be for user data, not metadata.

(Except: that reminds me, we do want to add per-btree counters, so users
can see "I have x amount of extents, x amount of dirents", etc.)

2024-02-29 04:14:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

On Tue, Feb 27, 2024 at 10:55:02AM -0500, Brian Foster wrote:
> On Sat, Feb 24, 2024 at 09:38:06PM -0500, Kent Overstreet wrote:
> > Main part of the disk accounting rewrite.
> >
> > This is a wholesale rewrite of the existing disk space accounting, which
> > relies on percpu counters that are sharded by journal buffer, and
> > rolled up and added to each journal write.
> >
> > With the new scheme, every set of counters is a distinct key in the
> > accounting btree; this fixes scaling limitations of the old scheme,
> > where counters took up space in each journal entry and required multiple
> > percpu counters.
> >
> > Now, in memory accounting requires a single set of percpu counters - not
> > multiple for each in flight journal buffer - and in the future we'll
> > probably also have counters that don't use in memory percpu counters,
> > they're not strictly required.
> >
> > An accounting update is now a normal btree update, using the btree write
> > buffer path. At transaction commit time, we apply accounting updates to
> > the in memory counters, which are percpu counters indexed in an
> > eytzinger tree by the accounting key.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > fs/bcachefs/alloc_background.c | 68 +++++-
> > fs/bcachefs/bcachefs.h | 6 +-
> > fs/bcachefs/bcachefs_format.h | 1 -
> > fs/bcachefs/bcachefs_ioctl.h | 7 +-
> > fs/bcachefs/btree_gc.c | 3 +-
> > fs/bcachefs/btree_iter.c | 9 -
> > fs/bcachefs/btree_trans_commit.c | 62 ++++--
> > fs/bcachefs/btree_types.h | 1 -
> > fs/bcachefs/btree_update.h | 8 -
> > fs/bcachefs/buckets.c | 289 +++++---------------------
> > fs/bcachefs/buckets.h | 33 +--
> > fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
> > fs/bcachefs/disk_accounting.h | 126 ++++++++++++
> > fs/bcachefs/disk_accounting_types.h | 20 ++
> > fs/bcachefs/ec.c | 24 ++-
> > fs/bcachefs/inode.c | 9 +-
> > fs/bcachefs/recovery.c | 12 +-
> > fs/bcachefs/recovery_types.h | 1 +
> > fs/bcachefs/replicas.c | 42 ++--
> > fs/bcachefs/replicas.h | 11 +-
> > fs/bcachefs/replicas_types.h | 16 --
> > fs/bcachefs/sb-errors_types.h | 3 +-
> > fs/bcachefs/super.c | 49 +++--
> > 23 files changed, 704 insertions(+), 404 deletions(-)
> > create mode 100644 fs/bcachefs/disk_accounting_types.h
> >
> ...
> > diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> > index 209f59e87b34..327c586ac661 100644
> > --- a/fs/bcachefs/disk_accounting.c
> > +++ b/fs/bcachefs/disk_accounting.c
> ...
> > @@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
> > NULL
> > };
> >
>
> So I'm gonna need to stare at all this much more than I have so far, but
> one initial thing that stands out to me is the lack of high level
> function comments. IMO, something that helps tremendously in
> reading/reviewing these sorts of systemic changes is having a couple
> sentence or so comment at the top of the main/external interfaces just
> to briefly explain what they do in plain english.
>
> So here, something like "modify an accounting key in the btree based on
> <whatever> ..." helps explain what it does and why it's used where it
> is. The same goes for some of the other interface level functions, like
> reading in accounting from disk, updating in-memory accounting (from
> journal entries in committing transactions?), updating the superblock,
> etc. I think I've started to put some of those pieces together, but
> having to jump all through the implementation to piece together high
> level behaviors is significantly more time consuming than having the
> author guide one through the high level interactions.
>
> IOW, I think if you minimally document the functions that are sufficient
> to help understand how accounting works as a black box (somewhere
> beneath the [nice] higher level big comment descriptions of the whole
> thing and above the low level implementation details), that helps the
> reviewer establish an understanding of the mechanism before having to
> dig through the implementation details and also serves as a reference
> going forward for the next person who is in a similar position and wants
> to read/debug/tweak/whatever this code.
>
> > +int bch2_disk_accounting_mod(struct btree_trans *trans,
> > + struct disk_accounting_key *k,
> > + s64 *d, unsigned nr)
> > +{
> > + /* Normalize: */
> > + switch (k->type) {
> > + case BCH_DISK_ACCOUNTING_replicas:
> > + bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
> > + break;
> > + }
> > +
> > + BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
> > +
> > + struct {
> > + __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
> > + } k_i;
> > + struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
> > +
> > + acc->k.p = disk_accounting_key_to_bpos(k);
> > + set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
> > +
> > + memcpy_u64s_small(acc->v.d, d, nr);
> > +
> > + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
> > +}
> > +
> ...
> > diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> > index a7f9de220d90..685d54d0ddbb 100644
> > --- a/fs/bcachefs/super.c
> > +++ b/fs/bcachefs/super.c
> ...
> > @@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
> > if (ret)
> > goto err;
> >
> > + /*
> > + * We need to flush the entire journal to get rid of keys that reference
> > + * the device being removed before removing the superblock entry
> > + */
> > + bch2_journal_flush_all_pins(&c->journal);
>
> I thought this needed to occur between the device removal and superblock
> update (according to the comment below). Is that not the case? Either
> way, is it moved for reasons related to accounting?

I think it ended up not needing to be moved, and I just forgot to drop
it - originally I disallowed accounting entries that referenced
nonexistent devices, but that wasn't workable so now it's only nonzero
accounting keys that aren't allowed to reference nonexistent devices.

I'll see if I can delete it.

Applying the following fixup patch, renaming for consistency but mostly
adding documentation. Helpful?

From 2f2c088f5a4c374d6e7357398c5307425dc52140 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 28 Feb 2024 23:09:28 -0500
Subject: [PATCH] fixup! bcachefs: Disk space accounting rewrite


diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index b005e20039bb..3a5b815af8bc 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -697,7 +697,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
a->k.version = journal_pos_to_bversion(&trans->journal_res,
(u64 *) entry - (u64 *) trans->journal_entries);
BUG_ON(bversion_zero(a->k.version));
- ret = bch2_accounting_mem_add(trans, accounting_i_to_s_c(a));
+ ret = bch2_accounting_mem_mod(trans, accounting_i_to_s_c(a));
if (ret)
goto revert_fs_usage;
}
@@ -784,7 +784,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
struct bkey_s_accounting a = bkey_i_to_s_accounting(entry2->start);

bch2_accounting_neg(a);
- bch2_accounting_mem_add(trans, a.c);
+ bch2_accounting_mem_mod(trans, a.c);
bch2_accounting_neg(a);
}
percpu_up_read(&c->mark_lock);
diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
index df9791da1ab7..e8a6ff191acd 100644
--- a/fs/bcachefs/disk_accounting.c
+++ b/fs/bcachefs/disk_accounting.c
@@ -10,6 +10,45 @@
#include "journal_io.h"
#include "replicas.h"

+/*
+ * Notes on disk accounting:
+ *
+ * We have two parallel sets of counters to be concerned with, and both must be
+ * kept in sync.
+ *
+ * - Persistent/on disk accounting, stored in the accounting btree and updated
+ * via btree write buffer updates that treat new accounting keys as deltas to
+ * apply to existing values. But reading from a write buffer btree is
+ * expensive, so we also have
+ *
+ * - In memory accounting, where accounting is stored as an array of percpu
+ * counters, indexed by an eytzinger array of disk accounting keys/bpos (which
+ * are the same thing, excepting byte swabbing on big endian).
+ *
+ * Cheap to read, but non persistent.
+ *
+ * To do a disk accounting update:
+ * - initialize a disk_accounting_key, to specify which counter is being updated
+ * - initialize counter deltas, as an array of 1-3 s64s
+ * - call bch2_disk_accounting_mod()
+ *
+ * This queues up the accounting update to be done at transaction commit time.
+ * Underneath, it's a normal btree write buffer update.
+ *
+ * The transaction commit path is responsible for propagating updates to the in
+ * memory counters, with bch2_accounting_mem_mod().
+ *
+ * The commit path also assigns every disk accounting update a unique version
+ * number, based on the journal sequence number and offset within that journal
+ * buffer; this is used by journal replay to determine which updates have been
+ * done.
+ *
+ * The transaction commit path also ensures that replicas entry accounting
+ * updates are properly marked in the superblock (so that we know whether we can
+ * mount without data being unavailable); it will update the superblock if
+ * bch2_accounting_mem_mod() tells it to.
+ */
+
static const char * const disk_accounting_type_strs[] = {
#define x(t, n, ...) [n] = #t,
BCH_DISK_ACCOUNTING_TYPES()
@@ -133,6 +172,10 @@ static int bch2_accounting_update_sb_one(struct bch_fs *c, struct bpos p)
: 0;
}

+/*
+ * Ensure accounting keys being updated are present in the superblock, when
+ * applicable (i.e. replicas updates)
+ */
int bch2_accounting_update_sb(struct btree_trans *trans)
{
for (struct jset_entry *i = trans->journal_entries;
@@ -147,7 +190,7 @@ int bch2_accounting_update_sb(struct btree_trans *trans)
return 0;
}

-static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+static int __bch2_accounting_mem_mod_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
{
struct bch_replicas_padded r;

@@ -191,16 +234,24 @@ static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_
return 0;
}

-int bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
+int bch2_accounting_mem_mod_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
{
percpu_up_read(&c->mark_lock);
percpu_down_write(&c->mark_lock);
- int ret = __bch2_accounting_mem_add_slowpath(c, a);
+ int ret = __bch2_accounting_mem_mod_slowpath(c, a);
percpu_up_write(&c->mark_lock);
percpu_down_read(&c->mark_lock);
return ret;
}

+/*
+ * Read out accounting keys for replicas entries, as an array of
+ * bch_replicas_usage entries.
+ *
+ * Note: this may be deprecated/removed at some point in the future and replaced
+ * with something more general, it exists to support the ioctl used by the
+ * 'bcachefs fs usage' command.
+ */
int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
{
struct bch_accounting_mem *acc = &c->accounting;
@@ -234,15 +285,6 @@ int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
return ret;
}

-static bool accounting_key_is_zero(struct bkey_s_c_accounting a)
-{
-
- for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
- if (a.v->d[i])
- return false;
- return true;
-}
-
static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
{
struct printbuf buf = PRINTBUF;
@@ -251,10 +293,10 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
return 0;

percpu_down_read(&c->mark_lock);
- int ret = __bch2_accounting_mem_add(c, bkey_s_c_to_accounting(k));
+ int ret = __bch2_accounting_mem_mod(c, bkey_s_c_to_accounting(k));
percpu_up_read(&c->mark_lock);

- if (accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
+ if (bch2_accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
ret == -BCH_ERR_btree_insert_need_mark_replicas)
ret = 0;

@@ -272,6 +314,10 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
return ret;
}

+/*
+ * At startup time, initialize the in memory accounting from the btree (and
+ * journal)
+ */
int bch2_accounting_read(struct bch_fs *c)
{
struct bch_accounting_mem *acc = &c->accounting;
diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
index 5fd053a819df..d9f2ce327761 100644
--- a/fs/bcachefs/disk_accounting.h
+++ b/fs/bcachefs/disk_accounting.h
@@ -105,15 +105,15 @@ static inline int accounting_pos_cmp(const void *_l, const void *_r)
return bpos_cmp(*l, *r);
}

-int bch2_accounting_mem_add_slowpath(struct bch_fs *, struct bkey_s_c_accounting);
+int bch2_accounting_mem_mod_slowpath(struct bch_fs *, struct bkey_s_c_accounting);

-static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_accounting a)
+static inline int __bch2_accounting_mem_mod(struct bch_fs *c, struct bkey_s_c_accounting a)
{
struct bch_accounting_mem *acc = &c->accounting;
unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
accounting_pos_cmp, &a.k->p);
if (unlikely(idx >= acc->k.nr))
- return bch2_accounting_mem_add_slowpath(c, a);
+ return bch2_accounting_mem_mod_slowpath(c, a);

unsigned offset = acc->k.data[idx].offset;

@@ -124,7 +124,12 @@ static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_ac
return 0;
}

-static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a)
+/*
+ * Update in memory counters so they match the btree update we're doing; called
+ * from transaction commit path
+ */
+static inline int bch2_accounting_mem_mod(struct btree_trans *trans,
+ struct bkey_s_c_accounting a)
{
struct disk_accounting_key acc_k;
bpos_to_disk_accounting_key(&acc_k, a.k->p);
@@ -137,7 +142,7 @@ static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey
fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
break;
}
- return __bch2_accounting_mem_add(trans->c, a);
+ return __bch2_accounting_mem_mod(trans->c, a);
}

static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
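
With that in place, a counter update from a transaction ends up looking
like the bch2_dev_resize() hunk in the main patch - minimal usage sketch
(the wrapper function is hypothetical; the constants and helpers are from
this series):

static int mark_dev_free_buckets(struct bch_fs *c, unsigned dev_idx, s64 delta)
{
	struct disk_accounting_key acc = {
		.type			 = BCH_DISK_ACCOUNTING_dev_data_type,
		.dev_data_type.dev	 = dev_idx,
		.dev_data_type.data_type = BCH_DATA_free,
	};
	s64 v[3] = { delta, 0, 0 };	/* counter deltas; dev_data_type carries three */

	/* queues a write buffer update, applied at transaction commit time */
	return bch2_trans_do(c, NULL, NULL, 0,
			     bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
}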

2024-02-29 18:41:44

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Wed, Feb 28, 2024 at 02:39:38PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:49:19AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:03PM -0500, Kent Overstreet wrote:
> > > New key type for the disk space accounting rewrite.
> > >
> > > - Holds a variable sized array of u64s (may be more than one for
> > > accounting e.g. compressed and uncompressed size, or buckets and
> > > sectors for a given data type)
> > >
> > > - Updates are deltas, not new versions of the key: this means updates
> > > to accounting can happen via the btree write buffer, which we'll be
> > > teaching to accumulate deltas.
> > >
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > ---
> > > fs/bcachefs/Makefile | 3 +-
> > > fs/bcachefs/bcachefs.h | 1 +
> > > fs/bcachefs/bcachefs_format.h | 80 +++------------
> > > fs/bcachefs/bkey_methods.c | 1 +
> > > fs/bcachefs/disk_accounting.c | 70 ++++++++++++++
> > > fs/bcachefs/disk_accounting.h | 52 ++++++++++
> > > fs/bcachefs/disk_accounting_format.h | 139 +++++++++++++++++++++++++++
> > > fs/bcachefs/replicas_format.h | 21 ++++
> > > fs/bcachefs/sb-downgrade.c | 12 ++-
> > > fs/bcachefs/sb-errors_types.h | 3 +-
> > > 10 files changed, 311 insertions(+), 71 deletions(-)
> > > create mode 100644 fs/bcachefs/disk_accounting.c
> > > create mode 100644 fs/bcachefs/disk_accounting.h
> > > create mode 100644 fs/bcachefs/disk_accounting_format.h
> > > create mode 100644 fs/bcachefs/replicas_format.h
> > >
> > ...
> > > diff --git a/fs/bcachefs/disk_accounting_format.h b/fs/bcachefs/disk_accounting_format.h
> > > new file mode 100644
> > > index 000000000000..e06a42f0d578
> > > --- /dev/null
> > > +++ b/fs/bcachefs/disk_accounting_format.h
> > > @@ -0,0 +1,139 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> > > +#define _BCACHEFS_DISK_ACCOUNTING_FORMAT_H
> > > +
> > > +#include "replicas_format.h"
> > > +
> > > +/*
> > > + * Disk accounting - KEY_TYPE_accounting - on disk format:
> > > + *
> > > + * Here, the key has considerably more structure than a typical key (bpos); an
> > > + * accounting key is 'struct disk_accounting_key', which is a union of bpos.
> > > + *
> >
> > First impression.. I'm a little confused why the key type is a union of
> > bpos. I'm possibly missing something fundamental/obvious, but could you
> > elaborate more on why that is here?
>
> How's this?
>
> * More specifically: a key is just a multiword integer (where word endianness
> * matches native byte order), so we're treating bpos as an opaque 20 byte
> * integer and mapping bch_accounting_key to that.
>

Hmm.. I think the connection I missed on first look is basically
disk_accounting_key_to_bpos(). I think what is confusing is that calling
this a key makes me think of bkey, which I understand to contain a bpos,
so then overlaying it with a bpos didn't really make a lot of sense to
me conceptually.

So when I look at disk_accounting_key_to_bpos(), I see we are actually
using the bpos _pad field, and this structure basically _is_ the bpos
for a disk accounting btree bkey. So that kind of makes me wonder why
this isn't called something like disk_accounting_pos instead of _key,
but maybe that is wrong for other reasons.

Either way, what I'm trying to get at is that I think this documentation
would be better if it explained conceptually how disk_accounting_key
relates to bkey/bpos, and why it exists separately from bkey vs. other
key types, rather than (or at least before) getting into the lower level
side effects of a union with bpos.

Brian


2024-02-29 18:42:59

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

On Wed, Feb 28, 2024 at 11:10:12PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:55:02AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:06PM -0500, Kent Overstreet wrote:
> > > Main part of the disk accounting rewrite.
> > >
> > > This is a wholesale rewrite of the existing disk space accounting, which
> > > relies on percpu counters that are sharded by journal buffer, and
> > > rolled up and added to each journal write.
> > >
> > > With the new scheme, every set of counters is a distinct key in the
> > > accounting btree; this fixes scaling limitations of the old scheme,
> > > where counters took up space in each journal entry and required multiple
> > > percpu counters.
> > >
> > > Now, in memory accounting requires a single set of percpu counters - not
> > > multiple for each in flight journal buffer - and in the future we'll
> > > probably also have counters that don't use in memory percpu counters,
> > > they're not strictly required.
> > >
> > > An accounting update is now a normal btree update, using the btree write
> > > buffer path. At transaction commit time, we apply accounting updates to
> > > the in memory counters, which are percpu counters indexed in an
> > > eytzinger tree by the accounting key.
> > >
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > ---
> > > fs/bcachefs/alloc_background.c | 68 +++++-
> > > fs/bcachefs/bcachefs.h | 6 +-
> > > fs/bcachefs/bcachefs_format.h | 1 -
> > > fs/bcachefs/bcachefs_ioctl.h | 7 +-
> > > fs/bcachefs/btree_gc.c | 3 +-
> > > fs/bcachefs/btree_iter.c | 9 -
> > > fs/bcachefs/btree_trans_commit.c | 62 ++++--
> > > fs/bcachefs/btree_types.h | 1 -
> > > fs/bcachefs/btree_update.h | 8 -
> > > fs/bcachefs/buckets.c | 289 +++++---------------------
> > > fs/bcachefs/buckets.h | 33 +--
> > > fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
> > > fs/bcachefs/disk_accounting.h | 126 ++++++++++++
> > > fs/bcachefs/disk_accounting_types.h | 20 ++
> > > fs/bcachefs/ec.c | 24 ++-
> > > fs/bcachefs/inode.c | 9 +-
> > > fs/bcachefs/recovery.c | 12 +-
> > > fs/bcachefs/recovery_types.h | 1 +
> > > fs/bcachefs/replicas.c | 42 ++--
> > > fs/bcachefs/replicas.h | 11 +-
> > > fs/bcachefs/replicas_types.h | 16 --
> > > fs/bcachefs/sb-errors_types.h | 3 +-
> > > fs/bcachefs/super.c | 49 +++--
> > > 23 files changed, 704 insertions(+), 404 deletions(-)
> > > create mode 100644 fs/bcachefs/disk_accounting_types.h
> > >
> > ...
> > > diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> > > index 209f59e87b34..327c586ac661 100644
> > > --- a/fs/bcachefs/disk_accounting.c
> > > +++ b/fs/bcachefs/disk_accounting.c
> > ...
> > > @@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
> > > NULL
> > > };
> > >
> >
..
> > > +int bch2_disk_accounting_mod(struct btree_trans *trans,
> > > + struct disk_accounting_key *k,
> > > + s64 *d, unsigned nr)
> > > +{
> > > + /* Normalize: */
> > > + switch (k->type) {
> > > + case BCH_DISK_ACCOUNTING_replicas:
> > > + bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
> > > + break;
> > > + }
> > > +
> > > + BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
> > > +
> > > + struct {
> > > + __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
> > > + } k_i;
> > > + struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
> > > +
> > > + acc->k.p = disk_accounting_key_to_bpos(k);
> > > + set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
> > > +
> > > + memcpy_u64s_small(acc->v.d, d, nr);
> > > +
> > > + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
> > > +}
> > > +
> > ...
> > > diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> > > index a7f9de220d90..685d54d0ddbb 100644
> > > --- a/fs/bcachefs/super.c
> > > +++ b/fs/bcachefs/super.c
> > ...
> > > @@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
> > > if (ret)
> > > goto err;
> > >
> > > + /*
> > > + * We need to flush the entire journal to get rid of keys that reference
> > > + * the device being removed before removing the superblock entry
> > > + */
> > > + bch2_journal_flush_all_pins(&c->journal);
> >
> > I thought this needed to occur between the device removal and superblock
> > update (according to the comment below). Is that not the case? Either
> > way, is it moved for reasons related to accounting?
>
> I think it ended up not needing to be moved, and I just forgot to drop
> it - originally I disallowed accounting entries that referenced
> nonexistent devices, but that wasn't workable so now it's only nonzero
> accounting keys that aren't allowed to reference nonexistent devices.
>
> I'll see if I can delete it.
>

Do you mean to delete the change that moves the call, or the flush call
entirely?

> Applying the following fixup patch, renaming for consistency but mostly
> adding documentation. Helpful?
>

Yes, definitely. A few nitty suggestions below you can choose to take or
leave, but this is the sort of thing I was getting at above.

> From 2f2c088f5a4c374d6e7357398c5307425dc52140 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <[email protected]>
> Date: Wed, 28 Feb 2024 23:09:28 -0500
> Subject: [PATCH] fixup! bcachefs: Disk space accounting rewrite
>
>
> diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
> index b005e20039bb..3a5b815af8bc 100644
> --- a/fs/bcachefs/btree_trans_commit.c
> +++ b/fs/bcachefs/btree_trans_commit.c
> @@ -697,7 +697,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> a->k.version = journal_pos_to_bversion(&trans->journal_res,
> (u64 *) entry - (u64 *) trans->journal_entries);
> BUG_ON(bversion_zero(a->k.version));
> - ret = bch2_accounting_mem_add(trans, accounting_i_to_s_c(a));
> + ret = bch2_accounting_mem_mod(trans, accounting_i_to_s_c(a));
> if (ret)
> goto revert_fs_usage;
> }
> @@ -784,7 +784,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> struct bkey_s_accounting a = bkey_i_to_s_accounting(entry2->start);
>
> bch2_accounting_neg(a);
> - bch2_accounting_mem_add(trans, a.c);
> + bch2_accounting_mem_mod(trans, a.c);
> bch2_accounting_neg(a);
> }
> percpu_up_read(&c->mark_lock);
> diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> index df9791da1ab7..e8a6ff191acd 100644
> --- a/fs/bcachefs/disk_accounting.c
> +++ b/fs/bcachefs/disk_accounting.c
> @@ -10,6 +10,45 @@
> #include "journal_io.h"
> #include "replicas.h"
>
> +/*
> + * Notes on disk accounting:
> + *
> + * We have two parallel sets of counters to be concerned with, and both must be
> + * kept in sync.
> + *
> + * - Persistent/on disk accounting, stored in the accounting btree and updated
> + * via btree write buffer updates that treat new accounting keys as deltas to
> + * apply to existing values. But reading from a write buffer btree is
> + * expensive, so we also have
> + *

I find the wording a little odd here, and I also think it would be
helpful to explain how/from where the deltas originate. For example,
something along the lines of:

"Persistent/on disk accounting, stored in the accounting btree and
updated via btree write buffer updates. Accounting updates are
represented as deltas that originate from <somewhere? trans triggers?>.
Accounting keys represent these deltas through commit into the write
buffer. The accounting/delta keys in the write buffer are then
accumulated into the appropriate accounting btree key at write buffer
flush time."

> + * - In memory accounting, where accounting is stored as an array of percpu
> + * counters, indexed by an eytzinger array of disk accounting keys/bpos (which
> + * are the same thing, excepting byte swabbing on big endian).
> + *

Not really sure about the keys vs. bpos thing, kind of related to my
comments on the earlier patch. It might be more clear to just elide the
implementation details here, i.e.:

"In memory accounting, where accounting is stored as an array of percpu
counters that are cheap to read, but not persistent. Updates to in
memory accounting are propagated from the transaction commit path."

.. but NBD, and feel free to reword, drop and/or correct any of that
text.

> + * Cheap to read, but non persistent.
> + *
> + * To do a disk accounting update:
> + * - initialize a disk_accounting_key, to specify which counter is being updated
> + * - initialize counter deltas, as an array of 1-3 s64s
> + * - call bch2_disk_accounting_mod()
> + *
> + * This queues up the accounting update to be done at transaction commit time.
> + * Underneath, it's a normal btree write buffer update.
> + *
> + * The transaction commit path is responsible for propagating updates to the in
> + * memory counters, with bch2_accounting_mem_mod().
> + *
> + * The commit path also assigns every disk accounting update a unique version
> + * number, based on the journal sequence number and offset within that journal
> + * buffer; this is used by journal replay to determine which updates have been
> + * done.
> + *
> + * The transaction commit path also ensures that replicas entry accounting
> + * updates are properly marked in the superblock (so that we know whether we can
> + * mount without data being unavailable); it will update the superblock if
> + * bch2_accounting_mem_mod() tells it to.

I'm not really sure what this last paragraph is telling me, but granted
I've not got that far into the code yet either.

Brian

> + */
> +
> static const char * const disk_accounting_type_strs[] = {
> #define x(t, n, ...) [n] = #t,
> BCH_DISK_ACCOUNTING_TYPES()
> @@ -133,6 +172,10 @@ static int bch2_accounting_update_sb_one(struct bch_fs *c, struct bpos p)
> : 0;
> }
>
> +/*
> + * Ensure accounting keys being updated are present in the superblock, when
> + * applicable (i.e. replicas updates)
> + */
> int bch2_accounting_update_sb(struct btree_trans *trans)
> {
> for (struct jset_entry *i = trans->journal_entries;
> @@ -147,7 +190,7 @@ int bch2_accounting_update_sb(struct btree_trans *trans)
> return 0;
> }
>
> -static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
> +static int __bch2_accounting_mem_mod_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
> {
> struct bch_replicas_padded r;
>
> @@ -191,16 +234,24 @@ static int __bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_
> return 0;
> }
>
> -int bch2_accounting_mem_add_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
> +int bch2_accounting_mem_mod_slowpath(struct bch_fs *c, struct bkey_s_c_accounting a)
> {
> percpu_up_read(&c->mark_lock);
> percpu_down_write(&c->mark_lock);
> - int ret = __bch2_accounting_mem_add_slowpath(c, a);
> + int ret = __bch2_accounting_mem_mod_slowpath(c, a);
> percpu_up_write(&c->mark_lock);
> percpu_down_read(&c->mark_lock);
> return ret;
> }
>
> +/*
> + * Read out accounting keys for replicas entries, as an array of
> + * bch_replicas_usage entries.
> + *
> + * Note: this may be deprecated/removed at some point in the future and replaced
> + * with something more general, it exists to support the ioctl used by the
> + * 'bcachefs fs usage' command.
> + */
> int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
> {
> struct bch_accounting_mem *acc = &c->accounting;
> @@ -234,15 +285,6 @@ int bch2_fs_replicas_usage_read(struct bch_fs *c, darray_char *usage)
> return ret;
> }
>
> -static bool accounting_key_is_zero(struct bkey_s_c_accounting a)
> -{
> -
> - for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
> - if (a.v->d[i])
> - return false;
> - return true;
> -}
> -
> static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
> {
> struct printbuf buf = PRINTBUF;
> @@ -251,10 +293,10 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
> return 0;
>
> percpu_down_read(&c->mark_lock);
> - int ret = __bch2_accounting_mem_add(c, bkey_s_c_to_accounting(k));
> + int ret = __bch2_accounting_mem_mod(c, bkey_s_c_to_accounting(k));
> percpu_up_read(&c->mark_lock);
>
> - if (accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
> + if (bch2_accounting_key_is_zero(bkey_s_c_to_accounting(k)) &&
> ret == -BCH_ERR_btree_insert_need_mark_replicas)
> ret = 0;
>
> @@ -272,6 +314,10 @@ static int accounting_read_key(struct bch_fs *c, struct bkey_s_c k)
> return ret;
> }
>
> +/*
> + * At startup time, initialize the in memory accounting from the btree (and
> + * journal)
> + */
> int bch2_accounting_read(struct bch_fs *c)
> {
> struct bch_accounting_mem *acc = &c->accounting;
> diff --git a/fs/bcachefs/disk_accounting.h b/fs/bcachefs/disk_accounting.h
> index 5fd053a819df..d9f2ce327761 100644
> --- a/fs/bcachefs/disk_accounting.h
> +++ b/fs/bcachefs/disk_accounting.h
> @@ -105,15 +105,15 @@ static inline int accounting_pos_cmp(const void *_l, const void *_r)
> return bpos_cmp(*l, *r);
> }
>
> -int bch2_accounting_mem_add_slowpath(struct bch_fs *, struct bkey_s_c_accounting);
> +int bch2_accounting_mem_mod_slowpath(struct bch_fs *, struct bkey_s_c_accounting);
>
> -static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_accounting a)
> +static inline int __bch2_accounting_mem_mod(struct bch_fs *c, struct bkey_s_c_accounting a)
> {
> struct bch_accounting_mem *acc = &c->accounting;
> unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]),
> accounting_pos_cmp, &a.k->p);
> if (unlikely(idx >= acc->k.nr))
> - return bch2_accounting_mem_add_slowpath(c, a);
> + return bch2_accounting_mem_mod_slowpath(c, a);
>
> unsigned offset = acc->k.data[idx].offset;
>
> @@ -124,7 +124,12 @@ static inline int __bch2_accounting_mem_add(struct bch_fs *c, struct bkey_s_c_ac
> return 0;
> }
>
> -static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey_s_c_accounting a)
> +/*
> + * Update in memory counters so they match the btree update we're doing; called
> + * from transaction commit path
> + */
> +static inline int bch2_accounting_mem_mod(struct btree_trans *trans,
> +                                          struct bkey_s_c_accounting a)
> {
> struct disk_accounting_key acc_k;
> bpos_to_disk_accounting_key(&acc_k, a.k->p);
> @@ -137,7 +142,7 @@ static inline int bch2_accounting_mem_add(struct btree_trans *trans, struct bkey
> fs_usage_data_type_to_base(&trans->fs_usage_delta, acc_k.replicas.data_type, a.v->d[0]);
> break;
> }
> - return __bch2_accounting_mem_add(trans->c, a);
> + return __bch2_accounting_mem_mod(trans->c, a);
> }
>
> static inline void bch2_accounting_mem_read_counters(struct bch_fs *c,
>
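
For reference, the body elided from __bch2_accounting_mem_mod() in the
hunk above is just the percpu add; roughly - a sketch, with the percpu
array field name assumed:

    /* fast path: eytzinger lookup hit, bump the matching percpu counters */
    u64 __percpu *v = acc->v.data + offset;

    for (unsigned i = 0; i < bch2_accounting_counters(a.k); i++)
        this_cpu_add(v[i], a.v->d[i]);

which is why the in-memory counters are cheap to read but non
persistent: a read just sums percpu vars, nothing is written out.
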


2024-02-29 18:55:00

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > + if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > + struct bkey u;
> > > + struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > + if (k.k->type == KEY_TYPE_accounting)
> > > + bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > + bkey_s_c_to_accounting(k));
> >
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
>
> Correct, this is where we go from "accounting key is a delta" to
> "accounting key is new version of the key".
>
> > > darray_for_each(wb->sorted, i) {
> > > struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > + bool accounting_accumulated = false;
> >
> > Should this live within the interior flush loop?
>
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.
>

Ah, right.

> > > - bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > - bch2_btree_write_buffer_journal_flush);
> > > + if (!accounting_replay_done &&
> > > + i->k.k.type == KEY_TYPE_accounting) {
> > > + could_not_insert++;
> > > + continue;
> > > + }
> > > +
> > > + if (!could_not_insert)
> > > + bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > + bch2_btree_write_buffer_journal_flush);
> >
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
>
> yup, which means as soon as we hit a key we can't insert we can't
> release any more journal pins
>
> >
> > >
> > > bch2_trans_begin(trans);
> > >
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > > btree_write_buffered_insert(trans, i));
> > > if (ret)
> > > goto err;
> > > +
> > > + i->journal_seq = 0;
> > > + }
> > > +
> >
> > /*
> > * Condense the remaining keys <reasons reasons>...??
> > */
>
> yup, that's a good comment
>
> > > + if (could_not_insert) {
> > > + struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > + darray_for_each(wb->flushing.keys, i)
> > > + if (i->journal_seq)
> > > + *dst++ = *i;
> > > + wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > > }
> > > }
> > > err:
> > > + if (ret || !could_not_insert) {
> > > + bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > + wb->flushing.keys.nr = 0;
> > > + }
> > > +
> > > bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > > - trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > - bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > - wb->flushing.keys.nr = 0;
> > > + trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> >
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
>
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
>
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; now the complication is that we can't predict in
> advance how many keys are going to come out of the journal for the write
> buffer.
>

Ok.

> >
> > > return ret;
> > > }
> > >
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > > goto err;
> > > }
> > >
> > > + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> >
> > I assume this ties into the question on the previous patch..
> >
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
>
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates as journal replay has to split/rewrite nodes
> to make room for its updates.
>
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the eytzinger tree accumulation patch. So we could only overflow
> if the number of distinct counters touched somehow was very large.
>
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
>
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x amount of extents, x amount of dirents, etc.).
>

Heh, Ok. This all does sound a little open ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a
bunch of new counters, what would the expected side effect be in the
recovery scenario where the write buffer can't be flushed?

If write buffer updates now basically just journal a special entry,
would that basically mean we'd deadlock during recovery due to no longer
being able to insert journal entries due to a pinned write buffer? If
so, that actually seems reasonable to me in the sense that in theory it
at least doesn't break the filesystem on-disk, but obviously it would
require some kind of enhancement in order to complete the recovery (even
if what that is is currently unknown). Hm?

Brian
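
The comment agreed on above would sit on the compaction loop quoted
earlier; a sketch of the wording, attached to that code:

    /*
     * Journal replay hasn't finished replaying accounting keys, so
     * those couldn't be flushed: condense them to the front of the
     * buffer, keeping their journal pins held, so they're retried on
     * the next flush.
     */
    if (could_not_insert) {
        struct btree_write_buffered_key *dst = wb->flushing.keys.data;

        darray_for_each(wb->flushing.keys, i)
            if (i->journal_seq)
                *dst++ = *i;
        wb->flushing.keys.nr = dst - wb->flushing.keys.data;
    }
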


2024-02-29 20:28:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

On Thu, Feb 29, 2024 at 01:44:07PM -0500, Brian Foster wrote:
> On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> > Shouldn't be any actual risk. It's just new accounting updates that the
> > write buffer can't flush, and those are only going to be generated by
> > interior btree node updates as journal replay has to split/rewrite nodes
> > to make room for its updates.
> >
> > And for those new accounting updates, updates to the same counters get
> > accumulated as they're flushed from the journal to the write buffer -
> > see the eytzinger tree accumulation patch. So we could only overflow
> > if the number of distinct counters touched somehow was very large.
> >
> > And the number of distinct counters will be growing significantly, but
> > the new counters will all be for user data, not metadata.
> >
> > (Except: that reminds me, we do want to add per-btree counters, so users
> > can see "I have x amount of extents, x amount of dirents, etc.).
> >
>
> Heh, Ok. This all does sound a little open ended to me. Maybe the better
> question is: suppose this hypothetically does happen after adding a
> bunch of new counters, what would the expected side effect be in the
> recovery scenario where the write buffer can't be flushed?

The btree write buffer's key buffer is allowed to grow - we try to keep it
bounded in normal operation, but that's one of the ways we deal with the
unpredictability of the amount of write buffer keys in the journal.

So it'll grow until that kvrealloc fails. It won't show up as a
deadlock, it'll show up as an allocation failure; and for that to
happen, that would mean the number of accounting keys being updated - not
the number of accounting updates, just the number of distinct keys being
updated - no longer fits in the write buffer.
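
A sketch of that growth pattern - not the actual bcachefs code;
darray_make_room() stands in here for however the buffer really grows:

    /* make room for the keys coming out of the journal; failure
     * surfaces as an allocation error, not a deadlock */
    int ret = darray_make_room(&wb->flushing.keys, nr_journal_keys);
    if (ret)
        return ret;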

2024-02-29 21:16:22

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

On Thu, Feb 29, 2024 at 01:44:27PM -0500, Brian Foster wrote:
> On Wed, Feb 28, 2024 at 11:10:12PM -0500, Kent Overstreet wrote:
> > I think it ended up not needing to be moved, and I just forgot to drop
> > it - originally I disallowed accounting entries that referenced
> > nonexistent devices, but that wasn't workable so now it's only nonzero
> > accounting keys that aren't allowed to reference nonexistent devices.
> >
> > I'll see if I can delete it.
> >
>
> Do you mean to delete the change that moves the call, or the flush call
> entirely?

Delete the change, I think there's further cleanup (& probably bugs to
fix) possible with that flush call but I'm not going to get into it
right now.

> > +/*
> > + * Notes on disk accounting:
> > + *
> > + * We have two parallel sets of counters to be concerned with, and both must be
> > + * kept in sync.
> > + *
> > + * - Persistent/on disk accounting, stored in the accounting btree and updated
> > + * via btree write buffer updates that treat new accounting keys as deltas to
> > + * apply to existing values. But reading from a write buffer btree is
> > + * expensive, so we also have
> > + *
>
> I find the wording a little odd here, and I also think it would be
> helpful to explain how/from where the deltas originate. For example,
> something along the lines of:
>
> "Persistent/on disk accounting, stored in the accounting btree and
> updated via btree write buffer updates. Accounting updates are
> represented as deltas that originate from <somewhere? trans triggers?>.
> Accounting keys represent these deltas through commit into the write
> buffer. The accounting/delta keys in the write buffer are then
> accumulated into the appropriate accounting btree key at write buffer
> flush time."

yeah, that's worth including.

There's an interesting point that you're touching on; btree write buffer
updates are always dependent state changes from some other (non write
buffer) btree; we never look at a write buffer btree and generate an update
there - we can't, reading from a write buffer btree doesn't get you
anything consistent or up to date.

So in normal operation it really only makes sense to do write buffer
updates from a transactional trigger - that's the only way to use them
and have them be consistent with the rest of the filesystem.

And since triggers work by comparing old and new, they naturally
generate updates that are deltas.
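
A sketch of that shape - the sectors helper is hypothetical, and the
bch2_disk_accounting_mod() arguments are assumptions, per the earlier
sketch:

    /* triggers compare old and new, so the update falls out as a delta */
    static int sketch_trigger(struct btree_trans *trans,
                              struct bkey_s_c old, struct bkey_s_c new)
    {
        s64 delta = (s64) sketch_sectors(new) - (s64) sketch_sectors(old);
        if (!delta)
            return 0;

        struct disk_accounting_key acc = {
            .type = BCH_DISK_ACCOUNTING_replicas,   /* which counter */
        };
        s64 d[1] = { delta };

        return bch2_disk_accounting_mod(trans, &acc, d, ARRAY_SIZE(d));
    }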

> > + * - In memory accounting, where accounting is stored as an array of percpu
> > + * counters, indexed by an eytzinger array of disk accounting keys/bpos (which
> > + * are the same thing, excepting byte swabbing on big endian).
> > + *
>
> Not really sure about the keys vs. bpos thing, kind of related to my
> comments on the earlier patch. It might be more clear to just elide the
> implementation details here, i.e.:
>
> "In memory accounting, where accounting is stored as an array of percpu
> counters that are cheap to read, but not persistent. Updates to in
> memory accounting are propagated from the transaction commit path."
>
> ... but NBD, and feel free to reword, drop and/or correct any of that
> text.

It's there because bch2_accounting_mem_read() takes a bpos when it
should be a disk_accounting_key. I'll fix that if I can...

> > + * Cheap to read, but non persistent.
> > + *
> > + * To do a disk accounting update:
> > + * - initialize a disk_accounting_key, to specify which counter is being updated
> > + * - initialize counter deltas, as an array of 1-3 s64s
> > + * - call bch2_disk_accounting_mod()
> > + *
> > + * This queues up the accounting update to be done at transaction commit time.
> > + * Underneath, it's a normal btree write buffer update.
> > + *
> > + * The transaction commit path is responsible for propagating updates to the in
> > + * memory counters, with bch2_accounting_mem_mod().
> > + *
> > + * The commit path also assigns every disk accounting update a unique version
> > + * number, based on the journal sequence number and offset within that journal
> > + * buffer; this is used by journal replay to determine which updates have been
> > + * done.
> > + *
> > + * The transaction commit path also ensures that replicas entry accounting
> > + * updates are properly marked in the superblock (so that we know whether we can
> > + * mount without data being unavailable); it will update the superblock if
> > + * bch2_accounting_mem_mod() tells it to.
>
> I'm not really sure what this last paragraph is telling me, but granted
> I've not got that far into the code yet either.

yeah that's for a whole different subsystem that happens to be slaved to
the accounting - replicas.c, which also used to help out quite a bit
with the accounting but now it's pretty much just for managing the
superblock replicas section.

The superblock replicas section is just a list of entries, where each
entry is a list of devices - "there is replicated data present on this
set of devices". We also have full counters of how much data is present
replicated across each set of devices, so the superblock section is just
a truncated version of the accounting - "data exists on these devices",
instead of saying how much.
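
Conceptually, each superblock replicas entry is little more than this -
an illustrative sketch, not the actual on-disk layout:

    /* "replicated data of this type exists on this set of devices":
     * just membership, no counters */
    struct replicas_entry_sketch {
        u8  data_type;   /* journal, btree, user, ... */
        u8  nr_devs;
        u8  devs[];      /* device indexes */
    };

with the full how-much counters living only in the accounting proper.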

2024-02-29 21:25:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Thu, Feb 29, 2024 at 01:43:15PM -0500, Brian Foster wrote:
> Hmm.. I think the connection I missed on first look is basically
> disk_accounting_key_to_bpos(). I think what is confusing is that calling
> this a key makes me think of bkey, which I understand to contain a bpos,
> so then overlaying it with a bpos didn't really make a lot of sense to
> me conceptually.
>
> So when I look at disk_accounting_key_to_bpos(), I see we are actually
> using the bpos _pad field, and this structure basically _is_ the bpos
> for a disk accounting btree bkey. So that kind of makes me wonder why
> this isn't called something like disk_accounting_pos instead of _key,
> but maybe that is wrong for other reasons.

hmm, I didn't consider calling it disk_accounting_pos. I'll let that
roll around in my brain.

'key' is more standard terminology to me outside bcachefs, but 'pos'
does make more sense within bcachefs.

> Either way, what I'm trying to get at is that I think this documentation
> would be better if it explained conceptually how disk_accounting_key
> relates to bkey/bpos, and why it exists separately from bkey vs. other
> key types, rather than (or at least before) getting into the lower level
> side effects of a union with bpos.

Well, that gets into some fun territory - ideally bpos would not be a
fixed thing that every btree was forced to use, we'd be able to define
different types per btree.

And we're actually going to need to be able to do that in order to do
configurationless autotiering - i.e. tracking how hot/cold data is on an
inode:offset basis, because LRU btree backreferences need to go in the
key (bpos), not the value, in order to avoid collisions, and bpos isn't
big enough for that.

disk_accounting_(key|pos) is an even trickier situation, because of
endianness issues. The trick we do with bpos of defining the field order
differently based on endianness so that byte order matches word order -
that really wouldn't work here, so there is at present no practical way
that I know of to avoid the byte swabbing when going back and forth
between bpos and disk_accounting_pos on big endian.

But gcc does have an attribute now that lets you specify that an integer
struct member is big or little endian... if we could get them to go
one step further and give us an attribute to control whether members are
laid out in ascending or descending order...
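
For reference, the overlay under discussion is essentially this - a
sketch, with the big endian swab placed here only for illustration:

    /* disk_accounting_key and bpos occupy the same bytes */
    static inline struct bpos
    disk_accounting_key_to_bpos(struct disk_accounting_key *k)
    {
        union {
            struct disk_accounting_key k;
            struct bpos                p;
        } u = { .k = *k };
    #ifdef __BIG_ENDIAN
        /* bpos wants byte order to match word order, which this
         * layout can't provide, hence the swab */
        bch2_bpos_swab(&u.p);
    #endif
        return u.p;
    }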

2024-03-01 15:02:24

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

On Thu, Feb 29, 2024 at 04:16:00PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 01:44:27PM -0500, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 11:10:12PM -0500, Kent Overstreet wrote:
> > > I think it ended up not needing to be moved, and I just forgot to drop
> > > it - originally I disallowed accounting entries that referenced
> > > nonexistent devices, but that wasn't workable so now it's only nonzero
> > > accounting keys that aren't allowed to reference nonexistent devices.
> > >
> > > I'll see if I can delete it.
> > >
> >
> > Do you mean to delete the change that moves the call, or the flush call
> > entirely?
>
> Delte the change, I think there's further cleanup (& probably bugs to
> fix) possible with that flush call but I'm not going to get into it
> right now.
>

Ok, just trying to determine whether I need to look back and make sure
this doesn't regress the problem this originally fixed.

> > > +/*
> > > + * Notes on disk accounting:
> > > + *
> > > + * We have two parallel sets of counters to be concerned with, and both must be
> > > + * kept in sync.
> > > + *
> > > + * - Persistent/on disk accounting, stored in the accounting btree and updated
> > > + * via btree write buffer updates that treat new accounting keys as deltas to
> > > + * apply to existing values. But reading from a write buffer btree is
> > > + * expensive, so we also have
> > > + *
> >
> > I find the wording a little odd here, and I also think it would be
> > helpful to explain how/from where the deltas originate. For example,
> > something along the lines of:
> >
> > "Persistent/on disk accounting, stored in the accounting btree and
> > updated via btree write buffer updates. Accounting updates are
> > represented as deltas that originate from <somewhere? trans triggers?>.
> > Accounting keys represent these deltas through commit into the write
> > buffer. The accounting/delta keys in the write buffer are then
> > accumulated into the appropriate accounting btree key at write buffer
> > flush time."
>
> yeah, that's worth including.
>
> There's an interesting point that you're touching on; btree write buffer
> updates are always dependent state changes from some other (non write
> buffer) btree; we never look at a write buffer btree and generate an update
> there - we can't, reading from a write buffer btree doesn't get you
> anything consistent or up to date.
>
> So in normal operation it really only makes sense to do write buffer
> updates from a transactional trigger - that's the only way to use them
> and have them be consistent with the rest of the filesystem.
>
> And since triggers work by comparing old and new, they naturally
> generate updates that are deltas.
>

Hm that is interesting, I hadn't made that connection. Thanks.

Brian

> > > + * - In memory accounting, where accounting is stored as an array of percpu
> > > + * counters, indexed by an eytzinger array of disk accounting keys/bpos (which
> > > + * are the same thing, excepting byte swabbing on big endian).
> > > + *
> >
> > Not really sure about the keys vs. bpos thing, kind of related to my
> > comments on the earlier patch. It might be more clear to just elide the
> > implementation details here, i.e.:
> >
> > "In memory accounting, where accounting is stored as an array of percpu
> > counters that are cheap to read, but not persistent. Updates to in
> > memory accounting are propagated from the transaction commit path."
> >
> > ... but NBD, and feel free to reword, drop and/or correct any of that
> > text.
>
> It's there because bch2_accounting_mem_read() takes a bpos when it
> should be a disk_accounting_key. I'll fix that if I can...
>
> > > + * Cheap to read, but non persistent.
> > > + *
> > > + * To do a disk accounting update:
> > > + * - initialize a disk_accounting_key, to specify which counter is being updated
> > > + * - initialize counter deltas, as an array of 1-3 s64s
> > > + * - call bch2_disk_accounting_mod()
> > > + *
> > > + * This queues up the accounting update to be done at transaction commit time.
> > > + * Underneath, it's a normal btree write buffer update.
> > > + *
> > > + * The transaction commit path is responsible for propagating updates to the in
> > > + * memory counters, with bch2_accounting_mem_mod().
> > > + *
> > > + * The commit path also assigns every disk accounting update a unique version
> > > + * number, based on the journal sequence number and offset within that journal
> > > + * buffer; this is used by journal replay to determine which updates have been
> > > + * done.
> > > + *
> > > + * The transaction commit path also ensures that replicas entry accounting
> > > + * updates are properly marked in the superblock (so that we know whether we can
> > > + * mount without data being unavailable); it will update the superblock if
> > > + * bch2_accounting_mem_mod() tells it to.
> >
> > I'm not really sure what this last paragraph is telling me, but granted
> > I've not got that far into the code yet either.
>
> yeah that's for a whole different subsystem that happens to be slaved to
> the accounting - replicas.c, which also used to help out quite a bit
> with the accounting but now it's pretty much just for managing the
> superblock replicas section.
>
> The superblock replicas section is just a list of entries, where each
> entry is a list of devices - "there is replicated data present on this
> set of devices". We also have full counters of how much data is present
> replicated across each set of devices, so the superblock section is just
> a truncated version of the accounting - "data exists on these devices",
> instead of saying how much.
>


2024-03-01 15:03:57

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Thu, Feb 29, 2024 at 04:24:37PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 01:43:15PM -0500, Brian Foster wrote:
> > Hmm.. I think the connection I missed on first look is basically
> > disk_accounting_key_to_bpos(). I think what is confusing is that calling
> > this a key makes me think of bkey, which I understand to contain a bpos,
> > so then overlaying it with a bpos didn't really make a lot of sense to
> > me conceptually.
> >
> > So when I look at disk_accounting_key_to_bpos(), I see we are actually
> > using the bpos _pad field, and this structure basically _is_ the bpos
> > for a disk accounting btree bkey. So that kind of makes me wonder why
> > this isn't called something like disk_accounting_pos instead of _key,
> > but maybe that is wrong for other reasons.
>
> hmm, I didn't consider calling it disk_accounting_pos. I'll let that
> roll around in my brain.
>
> 'key' is more standard terminology to me outside bcachefs, but 'pos'
> does make more sense within bcachefs.
>

Ok, so I'm not totally crazy at least. :)

Note again that wasn't an explicit suggestion, just that it seems more
logical to me based on my current understanding. I'm just trying to put
down my initial thoughts/confusions in hopes that at least some of this
triggers ideas for improvements...

> > Either way, what I'm trying to get at is that I think this documentation
> > would be better if it explained conceptually how disk_accounting_key
> > relates to bkey/bpos, and why it exists separately from bkey vs. other
> > key types, rather than (or at least before) getting into the lower level
> > side effects of a union with bpos.
>
> Well, that gets into some fun territory - ideally bpos would not be a
> fixed thing that every btree was forced to use, we'd be able to define
> different types per btree.
>

Ok, but this starts to sound orthogonal to the accounting bits. Since I
don't really grok why this is called a key, here's how I would add to
the existing documentation:

"Here, the key has considerably more structure than a typical key
(bpos); an accounting key is 'struct disk_accounting_key', which is a
union of bpos. We do this because disk_accounting_key actually is the bpos for
the related bkey that ends up in the accounting btree.

This btree uses nontraditional bpos semantics because accounting btree
keys are indexed differently <reasons based on the counter
structures..?>. Yadda yadda..

Unlike with other key types, <continued existing comment> ...
"

Hm?

Brian

> And we're actually going to need to be able to do that in order to do
> configurationless autotiering - i.e. tracking how hot/cold data is on an
> inode:offset basis, because LRU btree backreferences need to go in the
> key (bpos), not the value, in order to avoid collisions, and bpos isn't
> big enough for that.
>
> disk_accounting_(key|pos) is an even trickier situation, because of
> endianness issues. The trick we do with bpos of defining the field order
> differently based on endianness so that byte order matches word order -
> that really wouldn't work here, so there is at present no practical way
> that I know of to avoid the byte swabbing when going back and forth
> between bpos and disk_accounting_pos on big endian.
>
> But gcc does have an attribute now that lets you specify that an integer
> struct member is big or little endian... if we could get them to go
> one step further and give us an attribute to control whether members are
> laid out in ascending or descending order...
>


2024-03-01 20:07:53

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 01/21] bcachefs: KEY_TYPE_accounting

On Fri, Mar 01, 2024 at 10:03:06AM -0500, Brian Foster wrote:
> On Thu, Feb 29, 2024 at 04:24:37PM -0500, Kent Overstreet wrote:
> > On Thu, Feb 29, 2024 at 01:43:15PM -0500, Brian Foster wrote:
> > > Hmm.. I think the connection I missed on first look is basically
> > > disk_accounting_key_to_bpos(). I think what is confusing is that calling
> > > this a key makes me think of bkey, which I understand to contain a bpos,
> > > so then overlaying it with a bpos didn't really make a lot of sense to
> > > me conceptually.
> > >
> > > So when I look at disk_accounting_key_to_bpos(), I see we are actually
> > > using the bpos _pad field, and this structure basically _is_ the bpos
> > > for a disk accounting btree bkey. So that kind of makes me wonder why
> > > this isn't called something like disk_accounting_pos instead of _key,
> > > but maybe that is wrong for other reasons.
> >
> > hmm, I didn't consider calling it disk_accounting_pos. I'll let that
> > roll around in my brain.
> >
> > 'key' is more standard terminology to me outside bcachefs, but 'pos'
> > does make more sense within bcachefs.
> >
>
> Ok, so I'm not totally crazy at least. :)
>
> Note again that wasn't an explicit suggestion, just that it seems more
> logical to me based on my current understanding. I'm just trying to put
> down my initial thoughts/confusions in hopes that at least some of this
> triggers ideas for improvements...

I liked it because it makes the relationship between disk_accounting_pos
and bpos more explicit - they're both the same kind of thing.

> > > Either way, what I'm trying to get at is that I think this documentation
> > > would be better if it explained conceptually how disk_accounting_key
> > > relates to bkey/bpos, and why it exists separately from bkey vs. other
> > > key types, rather than (or at least before) getting into the lower level
> > > side effects of a union with bpos.
> >
> > Well, that gets into some fun territory - ideally bpos would not be a
> > fixed thing that every btree was forced to use, we'd be able to define
> > different types per btree.
> >
>
> Ok, but this starts to sound orthogonal to the accounting bits. Since I
> don't really grok why this is called a key, here's how I would add to
> the existing documentation:
>
> "Here, the key has considerably more structure than a typical key
> (bpos); an accounting key is 'struct disk_accounting_key', which is a
> union of bpos. We do this because disk_accounting_key actually is the bpos for
> the related bkey that ends up in the accounting btree.
>
> This btree uses nontraditional bpos semantics because accounting btree
> keys are indexed differently <reasons based on the counter
> structures..?>. Yadda yadda..
>
> Unlike with other key types, <continued existing comment> ...

I'm just going to go with my latest revision for now, I think it's a
reasonable balance between terse and explanatory:

* More specifically: a key is just a multiword integer (where word endianness
* matches native byte order), so we're treating bpos as an opaque 20 byte
* integer and mapping disk_accounting_pos to that.
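
Read that way, comparing keys is just multiword integer comparison,
most significant word first - a sketch, using cmp_int() as in the
bcachefs tree:

    static inline int bpos_cmp_sketch(struct bpos l, struct bpos r)
    {
        /* high word first, like comparing one wide integer */
        return  cmp_int(l.inode,    r.inode) ?:
                cmp_int(l.offset,   r.offset) ?:
                cmp_int(l.snapshot, r.snapshot);
    }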