2023-07-17 11:48:45

by Alexander Potapenko

Subject: [PATCH v3 0/5] Implement MTE tag compression for swapped pages

Currently, when MTE pages are swapped out, the tags are kept in the
memory, occupying 128 bytes per page. This is especially problematic
for devices that use zram-backed in-memory swap, because tags stored
uncompressed in the heap effectively reduce the available amount of
swap memory.

The RLE-based EA0 algorithm suggested by Evgenii Stepanov and
implemented in this patch series is able to efficiently compress
128-byte tag buffers, resulting in a practical compression ratio between
2.5x and 20x. In most cases it is possible to store the compressed data
in 63-bit Xarray values, so no extra memory allocation is needed.
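
For example, a page whose tags form only a handful of ranges fits
entirely into the 8-byte inline handle, a 16x saving over the raw
128-byte buffer.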

Our measurements show that EA0 provides better compression than existing
kernel compression algorithms (LZ4, LZO, LZ4HC, ZSTD) can offer, because
EA0 specifically targets 128-byte buffers.

To implement compression/decompression, we also extend <linux/bitmap.h>
with functions to set/get bit values at arbitrary positions in the map.

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed case. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.

v3:
- as suggested by Andy Shevchenko, use
bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
- switched to unsigned long to reduce typecasts
- simplified the compression code

v2:
- as suggested by Yury Norov, replace the poorly implemented struct
bitq with <linux/bitmap.h>

Alexander Potapenko (5):
lib/bitmap: add bitmap_{set,get}_value()
lib/test_bitmap: add tests for bitmap_{set,get}_value()
arm64: mte: implement CONFIG_ARM64_MTE_COMP
arm64: mte: add a test for MTE tags compression
arm64: mte: add compression support to mteswap.c

arch/arm64/Kconfig | 20 ++
arch/arm64/include/asm/mtecomp.h | 60 +++++
arch/arm64/mm/Makefile | 7 +
arch/arm64/mm/mtecomp.c | 406 +++++++++++++++++++++++++++++++
arch/arm64/mm/mteswap.c | 20 +-
arch/arm64/mm/mteswap.h | 12 +
arch/arm64/mm/mteswap_comp.c | 52 ++++
arch/arm64/mm/mteswap_nocomp.c | 38 +++
arch/arm64/mm/test_mtecomp.c | 177 ++++++++++++++
include/linux/bitmap.h | 57 +++++
lib/test_bitmap.c | 33 +++
11 files changed, 871 insertions(+), 11 deletions(-)
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c
create mode 100644 arch/arm64/mm/mteswap.h
create mode 100644 arch/arm64/mm/mteswap_comp.c
create mode 100644 arch/arm64/mm/mteswap_nocomp.c
create mode 100644 arch/arm64/mm/test_mtecomp.c

--
2.41.0.255.g8b1d071c50-goog



2023-07-17 11:53:17

by Alexander Potapenko

Subject: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
- instead of using roundup(), which adds an unnecessary dependency
on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
- indentation is reduced by not using else-clauses (suggested by
checkpatch for bitmap_get_value())
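
A minimal usage sketch (illustrative only, not part of this patch),
packing two 7-bit fields into a single bitmap word and reading them
back:

	unsigned long map[1] = { 0 };

	bitmap_set_value(map, 0x5a, 0, 7);	/* occupies bits 6..0 */
	bitmap_set_value(map, 0x33, 7, 7);	/* occupies bits 13..7 */
	/* bitmap_get_value(map, 0, 7) == 0x5a */
	/* bitmap_get_value(map, 7, 7) == 0x33 */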

Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Syed Nayyar Waris <[email protected]>
Signed-off-by: William Breathitt Gray <[email protected]>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
---
include/linux/bitmap.h | 57 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..4559366084988 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
* bitmap_to_arr32(buf, src, nbits) Copy nbits from buf to u32[] dst
* bitmap_to_arr64(buf, src, nbits) Copy nbits from buf to u64[] dst
* bitmap_get_value8(map, start) Get 8bit value from map at start
+ * bitmap_get_value(map, start, nbits) Get bit value of size 'nbits'
+ * from map at start
* bitmap_set_value8(map, value, start) Set 8bit value to map at start
+ * bitmap_set_value(map, value, start, nbits) Set bit value of size 'nbits'
+ * of map at start
*
* Note, bitmap_zero() and bitmap_fill() operate over the region of
* unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,31 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
return (map[index] >> offset) & 0xFF;
}

+/**
+ * bitmap_get_value - get a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits
+ *
+ * Returns value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_get_value(const unsigned long *map,
+ unsigned long start,
+ unsigned long nbits)
+{
+ const size_t index = BIT_WORD(start);
+ const unsigned long offset = start % BITS_PER_LONG;
+ const unsigned long space = BITS_PER_LONG - offset;
+ unsigned long value_low, value_high;
+
+ if (space >= nbits)
+ return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+ value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+ value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+ return (value_low >> offset) | (value_high << space);
+}
+
/**
* bitmap_set_value8 - set an 8-bit value within a memory region
* @map: address to the bitmap memory region
@@ -599,6 +628,34 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
map[index] |= value << offset;
}

+/**
+ * bitmap_set_value - set n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits
+ */
+static inline void bitmap_set_value(unsigned long *map,
+ unsigned long value,
+ unsigned long start, unsigned long nbits)
+{
+ const size_t index = BIT_WORD(start);
+ const unsigned long offset = start % BITS_PER_LONG;
+ const unsigned long space = BITS_PER_LONG - offset;
+
+ value &= GENMASK(nbits - 1, 0);
+
+ if (space >= nbits) {
+ map[index] &= ~(GENMASK(nbits + offset - 1, offset));
+ map[index] |= value << offset;
+ return;
+ }
+ map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+ map[index] |= value << offset;
+ map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+ map[index + 1] |= (value >> space);
+}
+
#endif /* __ASSEMBLY__ */

#endif /* __LINUX_BITMAP_H */
--
2.41.0.255.g8b1d071c50-goog


2023-07-17 12:04:50

by Alexander Potapenko

Subject: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

The config implements the EA0 algorithm suggested by Evgenii Stepanov
to compress the memory tags for ARM MTE during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.
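
As an illustration, a page whose 256 tags are all zero except for a
single granule tagged 4 splits into just three tag ranges, so it
comfortably fits the inline case.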

Suggested-by: Evgenii Stepanov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>

---
v3:
- Addressed comments by Andy Shevchenko:
- use bitmap_{set,get}_value() writte by Syed Nayyar Waris
- switched to unsigned long everywhere (fewer casts)
- simplified the code, removed redundant checks
- dropped ea0_compress_inline()
- added bit size constants and helpers to access the bitmap
- explicitly initialize all compressed sizes in ea0_compress_to_buf()
- initialize all handle bits

v2:
- as suggested by Yury Norov, switched from struct bitq (which is
not needed anymore) to <linux/bitmap.h>
- add missing symbol exports
---
arch/arm64/Kconfig | 10 +
arch/arm64/include/asm/mtecomp.h | 60 +++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/mtecomp.c | 406 +++++++++++++++++++++++++++++++
4 files changed, 477 insertions(+)
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..52cdc7603cf7c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,16 @@ config ARM64_EPAN
if the cpu does not implement the feature.
endmenu # "ARMv8.7 architectural features"

+config ARM64_MTE_COMP
+ bool "Tag compression for ARM64 MTE"
+ default y
+ depends on ARM64_MTE
+ help
+ Enable tag compression support for ARM64 MTE.
+
+ 128-byte tag buffers corresponding to 4K pages can be compressed using
+ the EA0 algorithm to save heap memory.
+
config ARM64_SVE
bool "ARM Scalable Vector Extension support"
default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..0c444c0d4ac04
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+/*
+ * ea0_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @ea0_release_handle().
+ */
+unsigned long ea0_compress(u8 *tags);
+
+/*
+ * ea0_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by @ea0_compress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns true on success, false on error.
+ */
+bool ea0_decompress(unsigned long handle, u8 *tags);
+
+/*
+ * ea0_release_handle() - release the handle returned by ea0_compress().
+ * @handle: handle returned by ea0_compress().
+ */
+void ea0_release_handle(unsigned long handle);
+
+/* Functions below are exported for testing purposes. */
+
+/*
+ * ea0_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by ea0_compress.
+ */
+int ea0_storage_size(unsigned long handle);
+
+/*
+ * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: u16 array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ * equal to lengths of out_tags[] and out_sizes[]).
+ */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
+
+/*
+ * ea0_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[256] containing the tag of every range.
+ * @r_sizes: u16[256] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE) += trans_pgd.o
obj-$(CONFIG_TRANS_TABLE) += trans_pgd-asm.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_ARM64_MTE) += mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP) += mtecomp.o
KASAN_SANITIZE_physaddr.o += n

obj-$(CONFIG_KASAN) += kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..50a379c035aee
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <[email protected]>
+ */
+
+/*
+ * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
+ * compression algorithms.
+ *
+ * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
+ * array of tags into a smaller byte sequence that can be stored in a
+ * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+ * an 8-byte pointer.
+ *
+ * We encapsulate tag storage memory management in this module, because it is
+ * tightly coupled with the pointer representation.
+ * ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
+ * that can be stored in Xarray
+ * ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
+ * the provided 128-byte buffer.
+ *
+ * The compression algorithm works as follows.
+ *
+ * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+ * @r_tags containing tag values and @r_sizes containing range lengths) by
+ * ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
+ *
+ * 2. Depending on the number N of ranges, the following storage class is picked:
+ * N <= 6: 8 bytes (inline case, no allocation required);
+ * 6 < N <= 11: 16 bytes
+ * 11 < N <= 23: 32 bytes
+ * 23 < N <= 46: 64 bytes
+ * 46 < N: 128 bytes (no compression will be performed)
+ *
+ * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
+ * The element itself is thrown away from @r_sizes, because it can be
+ * reconstructed from the sum of the remaining elements. Note that now none
+ * of the remaining @r_sizes elements is greater than 127.
+ *
+ * 4. For the inline case, the following values are stored in the 8-byte handle:
+ * largest_idx : i4
+ * r_tags[0..5] : i4 x 6
+ * r_sizes[0..4] : i7 x 5
+ * (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
+ *
+ * Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
+ * be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
+ * distinguished from a kernel pointer.
+ *
+ * 5. For the out-of-line case, the storage is allocated from one of the
+ * "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
+ * on 8 bytes, so its bits 2..0 can be used to store the size class:
+ * - 0 for 128 bytes
+ * - 1 for 16
+ * - 2 for 32
+ * - 4 for 64.
+ * Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
+ *
+ * 6. The data layout in the allocated storage is as follows:
+ * largest_idx : i6
+ * r_tags[0..N] : i4 x N
+ * r_sizes[0..N-1] : i7 x (N-1)
+ *
+ * The decompression algorithm performs the steps below.
+ *
+ * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
+ * out-of line.
+ *
+ * 2. For the inline case, treat the handle itself as the input buffer.
+ *
+ * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
+ * the input buffer length. To obtain the pointer to the input buffer, unset
+ * bits 2..0 of the handle and set bit 63.
+ *
+ * 4. If the input buffer is 128 byte long, copy its contents to the output
+ * buffer.
+ *
+ * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
+ * Calculate the removed largest element of @r_sizes:
+ * largest = 256 - sum(r_sizes)
+ * and insert it into @r_sizes at position @largest_idx.
+ *
+ * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
+ * @r_sizes[i] times.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/swab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an Xarray value. */
+#define HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ * 0: mte-tags-128
+ * 1: mte-tags-16
+ * 2: mte-tags-32
+ * 3: mte-tags-64
+ */
+#define NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[NUM_CACHES];
+
+/*
+ * Sizes of compressed values.
+ */
+#define BITS_PER_TAG 4
+#define BITS_PER_SIZE 7
+#define BITS_PER_LARGEST_IDX_INLINE 4
+#define BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static int ea0_size_to_cache_id(int len)
+{
+ if (len < 128)
+ return fls(len >> 4);
+ return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static int ea0_cache_id_to_size(int id)
+{
+ if (id == 0)
+ return 128;
+ return 8 << id;
+}
+
+/* Transform tags into tag ranges. */
+void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
+{
+ u8 prev_tag = U8_MAX;
+ int cur_idx = -1;
+ u8 cur_tag;
+ int i, j;
+
+ memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+ memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+ for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+ for (j = 0; j < 2; j++) {
+ cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+ if (cur_tag == prev_tag) {
+ out_sizes[cur_idx]++;
+ } else {
+ cur_idx++;
+ prev_tag = cur_tag;
+ out_tags[cur_idx] = prev_tag;
+ out_sizes[cur_idx] = 1;
+ }
+ }
+ }
+ *out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
+
+/* Transform tag ranges back into tags. */
+void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
+{
+ int i, j, pos = 0;
+ u8 prev;
+
+ for (i = 0; i < r_len; i++) {
+ for (j = 0; j < r_sizes[i]; j++) {
+ if (pos % 2)
+ tags[pos / 2] = (prev << 4) | r_tags[i];
+ else
+ prev = r_tags[i];
+ pos++;
+ }
+ }
+}
+EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static int ea0_alloc_size(int num_ranges)
+{
+ if (num_ranges <= 6)
+ return 8;
+ if (num_ranges <= 11)
+ return 16;
+ if (num_ranges <= 23)
+ return 32;
+ if (num_ranges <= 46)
+ return 64;
+ return 128;
+}
+
+/* Translate allocation size into maximum number of ranges that it can hold. */
+static int ea0_size_to_ranges(int size)
+{
+ switch (size) {
+ case 8:
+ return 6;
+ case 16:
+ return 11;
+ case 32:
+ return 23;
+ case 64:
+ return 46;
+ default:
+ return 0;
+ }
+}
+#define RANGES_INLINE ea0_size_to_ranges(8)
+
+/* Is the data stored inline in the handle itself? */
+static bool ea0_is_inline(unsigned long handle)
+{
+ return (handle & NOINLINE_MASK) != NOINLINE_MASK;
+}
+
+/* Get the size of the buffer backing @handle. */
+int ea0_storage_size(unsigned long handle)
+{
+ if (ea0_is_inline(handle))
+ return 8;
+ return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
+
+static void bitmap_write(unsigned long *bitmap, unsigned long value,
+ unsigned long *pos, unsigned long bits)
+{
+ bitmap_set_value(bitmap, value, *pos, bits);
+ *pos += bits;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
+ unsigned long *bitmap, int max_ranges)
+{
+ unsigned long bit_pos = 0, l_bits;
+ int largest_idx = -1, i;
+ short largest = 0;
+
+ for (i = 0; i < len; i++) {
+ if (sizes[i] > largest) {
+ largest = sizes[i];
+ largest_idx = i;
+ }
+ }
+ l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+ BITS_PER_LARGEST_IDX;
+ bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+ for (i = 0; i < len; i++)
+ bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
+ for (i = len; i < max_ranges; i++)
+ bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
+ for (i = 0; i < len; i++) {
+ if (i == largest_idx)
+ continue;
+ bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
+ }
+ for (i = len; i < max_ranges; i++)
+ bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
+}
+
+/* Compress @tags and return a handle. */
+unsigned long ea0_compress(u8 *tags)
+{
+ int alloc_size, cache_id;
+ struct kmem_cache *cache;
+ short r_sizes[256];
+ u8 r_tags[256];
+ int r_len = ARRAY_SIZE(r_tags);
+ unsigned long *storage;
+ /*
+ * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
+ * will read. But when the tags are stored in the handle itself, it must
+ * have all its bits initialized.
+ */
+ unsigned long result = 0;
+
+ ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ alloc_size = ea0_alloc_size(r_len);
+ if (alloc_size == 8) {
+ ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
+ RANGES_INLINE);
+ return result;
+ }
+ cache_id = ea0_size_to_cache_id(alloc_size);
+ cache = mtecomp_caches[cache_id];
+ storage = kmem_cache_alloc(cache, GFP_KERNEL);
+ if (alloc_size < 128) {
+ /* alloc_size is always a multiple of sizeof(unsigned long). */
+ ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
+ ea0_size_to_ranges(alloc_size));
+ return ((unsigned long)storage | cache_id) & HANDLE_MASK;
+ }
+ memcpy(storage, tags, alloc_size);
+ return (unsigned long)storage & HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
+
+static unsigned long bitmap_read(const unsigned long *bitmap,
+ unsigned long *pos, unsigned long bits)
+{
+ unsigned long result;
+
+ result = bitmap_get_value(bitmap, *pos, bits);
+ *pos += bits;
+ return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
+ u8 *tags)
+{
+ int largest_idx, i;
+ short r_sizes[46], sum = 0;
+ u8 r_tags[46];
+ unsigned long bit_pos = 0, l_bits;
+
+ l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
+ BITS_PER_LARGEST_IDX;
+ largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
+ for (i = 0; i < max_ranges; i++)
+ r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
+ for (i = 0; i < max_ranges; i++) {
+ if (i == largest_idx)
+ continue;
+ r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
+ if (!r_sizes[i]) {
+ max_ranges = i;
+ break;
+ }
+ sum += r_sizes[i];
+ }
+ if (sum >= 256)
+ return false;
+ r_sizes[largest_idx] = 256 - sum;
+ ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+ return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *ea0_storage(unsigned long handle)
+{
+ if (ea0_is_inline(handle))
+ return NULL;
+ return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/* Decompress tags from the buffer referenced by @handle. */
+bool ea0_decompress(unsigned long handle, u8 *tags)
+{
+ unsigned long *storage = ea0_storage(handle);
+ int size = ea0_storage_size(handle);
+
+ if (size == 128) {
+ memcpy(tags, storage, size);
+ return true;
+ }
+ if (size == 8)
+ return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
+ return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
+}
+EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
+
+/* Release the memory referenced by @handle. */
+void ea0_release_handle(unsigned long handle)
+{
+ void *storage = ea0_storage(handle);
+ int size = ea0_storage_size(handle);
+ struct kmem_cache *c;
+
+ if (!storage)
+ return;
+
+ c = mtecomp_caches[ea0_size_to_cache_id(size)];
+ kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+ char name[16];
+ int size;
+ int i;
+
+ BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
+ for (i = 0; i < NUM_CACHES; i++) {
+ size = ea0_cache_id_to_size(i);
+ snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
+ mtecomp_caches[i] =
+ kmem_cache_create(name, size, size, 0, NULL);
+ }
+ return 0;
+}
+module_init(mtecomp_init);
--
2.41.0.255.g8b1d071c50-goog


2023-07-17 13:17:03

by Andy Shevchenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> The two new functions allow setting/getting values of length up to
> BITS_PER_LONG bits at an arbitrary position in the bitmap.
>
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a couple of minor changes:

Since changes are minor, please make sure that the authorship is kept
untouched.

> - instead of using roundup(), which adds an unnecessary dependency
> on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> - indentation is reduced by not using else-clauses (suggested by
> checkpatch for bitmap_get_value())

> Cc: Arnd Bergmann <[email protected]>

You can use --cc to `git send-email` instead of polluting the commit message.

> Signed-off-by: Syed Nayyar Waris <[email protected]>
> Signed-off-by: William Breathitt Gray <[email protected]>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <[email protected]>
> Signed-off-by: Alexander Potapenko <[email protected]>

With above, I think you can also add Co-developed-by (as the changes were
made).

...

> +static inline void bitmap_set_value(unsigned long *map,
> + unsigned long value,
> + unsigned long start, unsigned long nbits)
> +{
> + const size_t index = BIT_WORD(start);
> + const unsigned long offset = start % BITS_PER_LONG;
> + const unsigned long space = BITS_PER_LONG - offset;
> +
> + value &= GENMASK(nbits - 1, 0);
> +
> + if (space >= nbits) {

> + map[index] &= ~(GENMASK(nbits + offset - 1, offset));

I remember that this construction may bring horrible code on some architectures
with some version(s) of the compiler (*).

To fix that I found an easy refactoring:

map[index] &= ~(GENMASK(nbits, 0) << offset);


(*) don't remember the actual versions, though, but anyway...

> + map[index] |= value << offset;
> + return;
> + }
> + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> + map[index] |= value << offset;
> + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> + map[index + 1] |= (value >> space);
> +}

--
With Best Regards,
Andy Shevchenko



2023-07-17 14:07:27

by Andy Shevchenko

Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
>
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +config ARM64_MTE_COMP
> + bool "Tag compression for ARM64 MTE"

At least here, make sure everybody understands what you are talking about.
WTF is MTE?

> + default y
> + depends on ARM64_MTE
> + help
> + Enable tag compression support for ARM64 MTE.
> +
> + 128-byte tag buffers corresponding to 4K pages can be compressed using
> + the EA0 algorithm to save heap memory.

> config ARM64_SVE
> bool "ARM Scalable Vector Extension support"

You see the difference?

...

> +/*

Did you deliberately make it NON-kernel-doc? If so, why? And why does it
have so many similarities with the above mentioned format?

> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_compress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.

In case you are going to make them real kernel-doc:s, make sure kernel-doc
validator doesn't warn. Here, for example, return section is missing. The easy
fix is to add : after Returns. Same for the rest of the function descriptions.
Also, why did you put the descriptions into the header file? It's a bit unusual
for the exported ones.

> + */

...

> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.

u16? I don't see it.

> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + * equal to lengths of out_tags[] and out_sizes[]).
> + */

> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.

Ditto.

> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);

In both cases signed integer may be promoted with a sign. Is it a problem here?

...

> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.
> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.

> + * ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value

ea0_compress() is the usual way we refer to the functions. Let the tools make
the necessary references.

> + * that can be stored in Xarray
> + * ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into

Ditto. And so on.

> + * the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + * @r_tags containing tag values and @r_sizes containing range lengths) by
> + * ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + * N <= 6: 8 bytes (inline case, no allocation required);
> + * 6 < N <= 11: 16 bytes
> + * 11 < N <= 23: 32 bytes
> + * 23 < N <= 46: 64 bytes
> + * 46 < N: 128 bytes (no compression will be performed)
> + *
> + * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
> + * The element itself is thrown away from @r_sizes, because it can be
> + * reconstructed from the sum of the remaining elements. Note that now none
> + * of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + * largest_idx : i4
> + * r_tags[0..5] : i4 x 6
> + * r_sizes[0..4] : i7 x 5
> + * (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + * Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + * be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + * distinguished from a kernel pointer.
> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + * "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + * on 8 bytes, so its bits 2..0 can be used to store the size class:

> + * - 0 for 128 bytes
> + * - 1 for 16
> + * - 2 for 32
> + * - 4 for 64.

Is this chosen deliberately (for performance?)? Otherwise, why not put them in
naturally growing exponential order?

> + * Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + * largest_idx : i6
> + * r_tags[0..N] : i4 x N
> + * r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + * out-of line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + * the input buffer length. To obtain the pointer to the input buffer, unset
> + * bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + * buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + * Calculate the removed largest element of @r_sizes:
> + * largest = 256 - sum(r_sizes)
> + * and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + * @r_sizes[i] times.
> + */

...

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>

bitmap guarantees that bits.h will be included.

> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

...

> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> + u8 prev_tag = U8_MAX;

> + int cur_idx = -1;

Under which circumstances does this assignment make sense?

> + u8 cur_tag;
> + int i, j;
> +
> + memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> + memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> + for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> + for (j = 0; j < 2; j++) {
> + cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> + if (cur_tag == prev_tag) {
> + out_sizes[cur_idx]++;

Who guarantees this one is not [-1]?

> + } else {

> + cur_idx++;

Aha, the above seems a bit prone to out-of-bounds errors. Can you make it
unsigned and start from 0?

> + prev_tag = cur_tag;
> + out_tags[cur_idx] = prev_tag;
> + out_sizes[cur_idx] = 1;
> + }
> + }
> + }
> + *out_len = cur_idx + 1;
> +}

...

> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> + int i, j, pos = 0;

Wouldn't it be more correct to have this assignment inside the first for-loop?

> + u8 prev;
> +
> + for (i = 0; i < r_len; i++) {
> + for (j = 0; j < r_sizes[i]; j++) {
> + if (pos % 2)
> + tags[pos / 2] = (prev << 4) | r_tags[i];
> + else
> + prev = r_tags[i];
> + pos++;
> + }
> + }
> +}

...

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Don't forget to undef it when not needed.

...

> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> + unsigned long *pos, unsigned long bits)

Please, don't use the reserved namespace. Yours is ea0, so use it:
ea0_bitmap_write()! Same for other similarly named functions.

...

> + unsigned long bit_pos = 0, l_bits;
> + int largest_idx = -1, i;
> + short largest = 0;

Here and elsewhere, please, double check the correctness and/or necessity of
signedness and assignments of local variables.

...

> + for (i = 0; i < len; i++) {
> + if (sizes[i] > largest) {

Here

if (largest >= sizes[i])
continue;
makes sense, but...

> + largest = sizes[i];
> + largest_idx = i;
> + }
> + }

...

> + for (i = 0; i < len; i++) {
> + if (i == largest_idx)
> + continue;
> + bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);

...here I would do the opposite since it's a one-liner.

> + }

...

> + u8 r_tags[256];
> + int r_len = ARRAY_SIZE(r_tags);

sizeof() ?

...

> + l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> + BITS_PER_LARGEST_IDX;

Is it a dup? Perhaps a helper for this?

Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
EA0_BITS_...

...

> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> + unsigned long *storage = ea0_storage(handle);
> + int size = ea0_storage_size(handle);
> +
> + if (size == 128) {
> + memcpy(tags, storage, size);
> + return true;
> + }
> + if (size == 8)
> + return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);

Maybe

switch (ea0_storage_size(handle)) {
...
default:
}

?

> + return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}

...

> +void ea0_release_handle(unsigned long handle)
> +{
> + void *storage = ea0_storage(handle);
> + int size = ea0_storage_size(handle);
> + struct kmem_cache *c;

> + if (!storage)
> + return;

I find it slightly better for maintenance in the following form:

struct kmem_cache *c;
void *storage;
int size;

storage = ea0_storage(handle);
if (!storage)
return;

size = ea0_storage_size(handle);

> + c = mtecomp_caches[ea0_size_to_cache_id(size)];
> + kmem_cache_free(c, storage);
> +}

...

> +static int mtecomp_init(void)
> +{
> + char name[16];
> + int size;
> + int i;
> +
> + BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);

Why not static_assert()?

> + for (i = 0; i < NUM_CACHES; i++) {
> + size = ea0_cache_id_to_size(i);
> + snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

sizeof() will work the same way without the need to include kernel.h.

> + mtecomp_caches[i] =
> + kmem_cache_create(name, size, size, 0, NULL);
> + }
> + return 0;
> +}

--
With Best Regards,
Andy Shevchenko



2023-07-17 14:32:39

by Alexander Potapenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

>
> > Cc: Arnd Bergmann <[email protected]>
>
> You can use --cc to `git send-email` instead of polluting the commit message.

Right. But as far as I can tell, certain kernel devs prefer to be CCed
on the whole series, whereas others do not want to see anything but
the actual patch they were interested in.
I am not sure about Arnd's preferences, so I just decided to keep the
tag from the original patch by Syed Nayyar Waris (which I also
consider to be an indication of the fact "that potentially interested
parties have been included in the discussion" per
https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by)

> > Signed-off-by: Syed Nayyar Waris <[email protected]>
> > Signed-off-by: William Breathitt Gray <[email protected]>
> > Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> > Suggested-by: Yury Norov <[email protected]>
> > Signed-off-by: Alexander Potapenko <[email protected]>
>
> With above, I think you can also add Co-developed-by (as the changes were
> made).

Ok, will do.

> ...
>
> > +static inline void bitmap_set_value(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + const size_t index = BIT_WORD(start);
> > + const unsigned long offset = start % BITS_PER_LONG;
> > + const unsigned long space = BITS_PER_LONG - offset;
> > +
> > + value &= GENMASK(nbits - 1, 0);
> > +
> > + if (space >= nbits) {
>
> > + map[index] &= ~(GENMASK(nbits + offset - 1, offset));
>
> I remember that this construction may bring horrible code on some architectures
> with some version(s) of the compiler (*).

Wow, even the trunk Clang and GCC seem to generate better code for
your version of this line: https://godbolt.org/z/36Kqxhe6j

> To fix that I found an easy refactoring:
>
> map[index] &= ~(GENMASK(nbits, 0) << offset);
>

I'll take this one.

> (*) don't remember the actual versions, though, but anyway...
>
> > + map[index] |= value << offset;
> > + return;
> > + }
> > + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
> > + map[index] |= value << offset;
> > + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > + map[index + 1] |= (value >> space);
> > +}
>
> --
> With Best Regards,
> Andy Shevchenko
>
>


--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-07-17 14:41:33

by Andy Shevchenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:

+Cc: Nathan (on code generation question below)

...

> > > Cc: Arnd Bergmann <[email protected]>
> >
> > You can use --cc to `git send-email` instead of polluting the commit message.
>
> Right. But as far as I can tell, certain kernel devs prefer to be CCed
> on the whole series, whereas others do not want to see anything but
> the actual patch they were interested in.
> I am not sure about Arnd's preferences, so I just decided to keep the
> tag from the original patch by Syed Nayyar Waris (which I also
> consider to be an indication of the fact "that potentially interested
> parties have been included in the discussion" per
> https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by)

My personal statistics from the field is that more than 90% of maintainers would
like to receive 100% of the series. Of course it depends on the series (if it's
treewide, I will agree with you). Another point in favor of my suggestion is that
Arnd is the SoC tree maintainer, where ARM is one of the biggest players, so I
think he would like to see arm*: prefixed patches anyway.

...

> > > + map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> >
> > I remember that this construction may bring horrible code on some architectures
> > with some version(s) of the compiler (*).
>
> Wow, even the trunk Clang and GCC seem to generate better code for
> your version of this line: https://godbolt.org/z/36Kqxhe6j

Wow, indeed! Perhaps time to report to clang and GCC people. I believe the root
cause is that in the original version compiler can't prove that l is constant
for GENMASK().

> > To fix that I found an easy refactoring:
> >
> > map[index] &= ~(GENMASK(nbits, 0) << offset);
>
> I'll take this one.
>
> > (*) don't remember the actual versions, though, but anyway...

--
With Best Regards,
Andy Shevchenko



2023-07-17 14:57:22

by Andy Shevchenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 05:29:12PM +0300, Andy Shevchenko wrote:
> On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:

...

> > > > + map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> > >
> > > I remember that this construction may bring horrible code on some architectures
> > > with some version(s) of the compiler (*).
> >
> > Wow, even the trunk Clang and GCC seem to generate better code for
> > your version of this line: https://godbolt.org/z/36Kqxhe6j
>
> Wow, indeed! Perhaps time to report to clang and GCC people. I believe the root
> cause is that in the original version compiler can't prove that l is constant
> for GENMASK().
>
> > > To fix that I found an easy refactoring:
> > >
> > > map[index] &= ~(GENMASK(nbits, 0) << offset);

nbits - 1 it should be, btw. In any case it seems the code is still better.

> > I'll take this one.
> >
> > > (*) don't remember the actual versions, though, but anyway...

--
With Best Regards,
Andy Shevchenko



2023-07-17 15:23:25

by Alexander Potapenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

> >
> > I remember that this construction may bring horrible code on some architectures
> > with some version(s) of the compiler (*).
>
> Wow, even the trunk Clang and GCC seem to generate better code for
> your version of this line: https://godbolt.org/z/36Kqxhe6j

Oh wait.
First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.

>
> > To fix that I found an easy refactoring:
> >
> > map[index] &= ~(GENMASK(nbits, 0) << offset);
> >

Second, the line above should probably be:
map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

, right?

2023-07-17 15:31:12

by Andy Shevchenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 04:53:51PM +0200, Alexander Potapenko wrote:

...

> > > I remember that this construction may bring horrible code on some architectures
> > > with some version(s) of the compiler (*).
> >
> > Wow, even the trunk Clang and GCC seem to generate better code for
> > your version of this line: https://godbolt.org/z/36Kqxhe6j
>
> Oh wait.
> First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
> long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.

Still slightly better. And note that the same GENMASK() is used at the
beginning of the function. The compiler might actually cache it.

> > > To fix that I found an easy refactoring:
> > >
> > > map[index] &= ~(GENMASK(nbits, 0) << offset);
> > >
>
> Second, the line above should probably be:
> map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> , right?

Yes.

--
With Best Regards,
Andy Shevchenko



2023-07-17 16:16:48

by Yury Norov

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> The two new functions allow setting/getting values of length up to
> BITS_PER_LONG bits at an arbitrary position in the bitmap.
>
> The code was taken from "bitops: Introduce the for_each_set_clump macro"
> by Syed Nayyar Waris with a couple of minor changes:
> - instead of using roundup(), which adds an unnecessary dependency
> on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> - indentation is reduced by not using else-clauses (suggested by
> checkpatch for bitmap_get_value())

Please preserve Syed's authorship ('From' field in git log).

> Cc: Arnd Bergmann <[email protected]>
> Signed-off-by: Syed Nayyar Waris <[email protected]>
> Signed-off-by: William Breathitt Gray <[email protected]>
> Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
> Suggested-by: Yury Norov <[email protected]>
> Signed-off-by: Alexander Potapenko <[email protected]>
> ---
> include/linux/bitmap.h | 57 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 57 insertions(+)
>
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 03644237e1efb..4559366084988 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -76,7 +76,11 @@ struct device;
> * bitmap_to_arr32(buf, src, nbits) Copy nbits from buf to u32[] dst
> * bitmap_to_arr64(buf, src, nbits) Copy nbits from buf to u64[] dst
> * bitmap_get_value8(map, start) Get 8bit value from map at start
> + * bitmap_get_value(map, start, nbits) Get bit value of size 'nbits'
> + * from map at start
> * bitmap_set_value8(map, value, start) Set 8bit value to map at start
> + * bitmap_set_value(map, value, start, nbits) Set bit value of size 'nbits'
> + * of map at start

The 'bit value of size' sounds more confusing than it should. The size
of bit is actually a bit... Can you rephrase? Moreover, 'set bits' has
a meaning of actually setting them, i.e. switching to '1'. Maybe:
"Copy 'nbits' to bitmap starting at 'start'"?

> *
> * Note, bitmap_zero() and bitmap_fill() operate over the region of
> * unsigned longs, that is, bits behind bitmap till the unsigned long
> @@ -583,6 +587,31 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
> return (map[index] >> offset) & 0xFF;
> }
>
> +/**
> + * bitmap_get_value - get a value of n-bits from the memory region
> + * @map: address to the bitmap memory region
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits


* @nbits: size of value in bits, up to BITS_PER_LONG

> + *
> + * Returns value of nbits located at the @start bit offset within the @map
> + * memory region.
> + */
> +static inline unsigned long bitmap_get_value(const unsigned long *map,
> + unsigned long start,
> + unsigned long nbits)
> +{
> + const size_t index = BIT_WORD(start);
> + const unsigned long offset = start % BITS_PER_LONG;
> + const unsigned long space = BITS_PER_LONG - offset;
> + unsigned long value_low, value_high;
> +
> + if (space >= nbits)
> + return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> + value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> + value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> + return (value_low >> offset) | (value_high << space);
> +}

When nbits == 0, copy-like functions shouldn't touch any memory. See how
other bitmap and find_bit functions hold it.

Thanks,
Yury

2023-07-17 16:48:09

by Alexander Potapenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 5:03 PM Andy Shevchenko
<[email protected]> wrote:
>
> On Mon, Jul 17, 2023 at 04:53:51PM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > > I remember that this construction may bring horrible code on some architectures
> > > > with some version(s) of the compiler (*).
> > >
> > > Wow, even the trunk Clang and GCC seem to generate better code for
> > > your version of this line: https://godbolt.org/z/36Kqxhe6j
> >
> > Oh wait.
> > First, my Godbolt reproducer is incorrect, it is using sizeof(unsigned
> > long) instead of 8 * sizeof(unsigned long) for BITS_PER_LONG.
>
> Still slightly better. And note that the same GENMASK() is used at the
> beginning of the function. The compiler might actually cache it.

I think that the compiler might consider this transformation invalid
because "GENMASK(nbits - 1, 0) << offset" and "GENMASK(nbits + offset
- 1, offset)" have different values in the case "nbits + offset"
exceed 64.
Restricting the domain of bitmap_set_value() might result in better
code, but it is easier to stick to your version, which prevents the
overflow.
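
(Concretely, with offset == 60 and nbits == 8, GENMASK(67, 60) relies on
an out-of-range shift, while GENMASK(7, 0) << 60 merely truncates the
mask, so the two expressions are not interchangeable.)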

> > > > To fix that I found an easy refactoring:
> > > >
> > > > map[index] &= ~(GENMASK(nbits, 0) << offset);
> > > >
> >
> > Second, the line above should probably be:
> > map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > , right?
>
> Yes.

This "nbits vs. nbits - 1" was not detected by test_set_get_value(),
so I'll add an extra test case for it.
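
A possible shape for such a test (just a sketch, not the actual patch;
with an off-by-one mask like GENMASK(nbits, 0) the guard bit would be
clobbered):

	DECLARE_BITMAP(map, BITS_PER_LONG);

	bitmap_fill(map, BITS_PER_LONG);
	/* Write zeroes into bits 3..0 only. */
	bitmap_set_value(map, 0, 0, 4);
	/* Bit 4 must survive the write. */
	WARN_ON(!test_bit(4, map));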

2023-07-17 16:52:19

by Yury Norov

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 05:31:44PM +0300, Andy Shevchenko wrote:
> On Mon, Jul 17, 2023 at 05:29:12PM +0300, Andy Shevchenko wrote:
> > On Mon, Jul 17, 2023 at 04:14:57PM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > > > + map[index] &= ~(GENMASK(nbits + offset - 1, offset));
> > > >
> > > > I remember that this construction may bring horrible code on some architectures
> > > > with some version(s) of the compiler (*).
> > >
> > > Wow, even the trunk Clang and GCC seem to generate better code for
> > > your version of this line: https://godbolt.org/z/36Kqxhe6j
> >
> > Wow, indeed! Perhaps time to report to clang and GCC people. I believe the root
> > cause is that in the original version compiler can't prove that l is constant
> > for GENMASK().
> >
> > > > To fix that I found an easy refactoring:
> > > >
> > > > map[index] &= ~(GENMASK(nbits, 0) << offset);
>
> nbits - 1 it should be, btw. In any case it seems the code is still better.

Yep.

Alexander, for the next round can you please show disassembly for the
functions in case of compile-time and run-time defined start and nbits?


Thanks,
Yury

2023-07-18 09:39:38

by Alexander Potapenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <[email protected]> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
> > The two new functions allow setting/getting values of length up to
> > BITS_PER_LONG bits at an arbitrary position in the bitmap.
> >
> > The code was taken from "bitops: Introduce the for_each_set_clump macro"
> > by Syed Nayyar Waris with a couple of minor changes:
> > - instead of using roundup(), which adds an unnecessary dependency
> > on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
> > - indentation is reduced by not using else-clauses (suggested by
> > checkpatch for bitmap_get_value())
>
> Please preserve Syed's authorship ('From' field in git log).
Done

> > * bitmap_set_value8(map, value, start) Set 8bit value to map at start
> > + * bitmap_set_value(map, value, start, nbits) Set bit value of size 'nbits'
> > + * of map at start
>
> The 'bit value of size' sounds more confusing than it should. The size
> of bit is actually a bit... Can you rephrase?
How about "Get an nbits-sized value from map at start" and "Set an
nbits-sized value to map at start"?

> Moreover, 'set bits' has
> a meaning of actually setting them, i.e. switching to '1'. Maybe:
> "Copy 'nbits' to bitmap starting at 'start'"?

Right now it is in line with the comment for bitmap_set_value8 (and
the names of the functions also have _set_ in them).
Shall I also change that comment?
WDYT about "Put an nbits-sized value to map at start"?


> > +/**
> > + * bitmap_get_value - get a value of n-bits from the memory region
> > + * @map: address to the bitmap memory region
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits
>
>
> * @nbits: size of value in bits, up to BITS_PER_LONG
Ok

> > + *
> > + * Returns value of nbits located at the @start bit offset within the @map
> > + * memory region.
> > + */
> > +static inline unsigned long bitmap_get_value(const unsigned long *map,
> > + unsigned long start,
> > + unsigned long nbits)
> > +{
> > + const size_t index = BIT_WORD(start);
> > + const unsigned long offset = start % BITS_PER_LONG;
> > + const unsigned long space = BITS_PER_LONG - offset;
> > + unsigned long value_low, value_high;
> > +
> > + if (space >= nbits)
> > + return (map[index] >> offset) & GENMASK(nbits - 1, 0);
> > + value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
> > + value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
> > + return (value_low >> offset) | (value_high << space);
> > +}
>
> When nbits == 0, copy-like functions shouldn't touch any memory. See how
> other bitmap and find_bit functions hold it.

I think this is different from what other bitmap functions do, but it
should be enough to bail out on !nbits, i.e.:

if (!nbits)
return 0;

You probably meant adding a __builtin_constant_p() (which is used all
over the place in bitmap.h), but:
- the compiler won't have problem optimizing away the code for a
constant nbits=0;
- we anyway need a dynamic check for the case nbits is not constant
(for both bitmap_get_value() and bitmap_set_value(), I assume).

What do you think?

2023-07-18 14:34:35

by Andy Shevchenko

Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:
> On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <[email protected]> wrote:
> > On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:

...

> > When nbits == 0, copy-like functions shouldn't touch any memory. See how
> > other bitmap and find_bit functions hold it.
>
> I think this is different from what other bitmap functions do, but it
> should be enough to bail out on !nbits, i.e.:
>
> if (!nbits)
> return 0;
>
> You probably meant adding a __builtin_constant_p() (which is used all
> over the place in bitmap.h), but:
> - the compiler won't have problem optimizing away the code for a
> constant nbits=0;
> - we anyway need a dynamic check for the case nbits is not constant
> (for both bitmap_get_value() and bitmap_set_value(), I assume).
>
> What do you think?

The idea behind it is to eliminate the code completely for the cases where nbits != 0.
In your case the dynamic check will be there. That's what we want to avoid.

--
With Best Regards,
Andy Shevchenko



2023-07-18 16:10:41

by Alexander Potapenko

Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
<[email protected]> wrote:
>
> On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> > The config implements the EA0 algorithm suggested by Evgenii Stepanov
> > to compress the memory tags for ARM MTE during swapping.
> >
> > The algorithm is based on RLE and specifically targets 128-byte buffers
> > of tags corresponding to a single page. In the common case a buffer
> > can be compressed into 63 bits, making it possible to store it without
> > additional memory allocation.
>
> ...
>
> > +config ARM64_MTE_COMP
> > + bool "Tag compression for ARM64 MTE"
>
> At least here, make sure everybody understands what you are talking about.
> WTF is MTE?
Replaced with "Memory Tagging Extension"


> ...
>
> > +/*
>
> Did you deliberately make it NON-kernel-doc? If so, why? And why does it
> have so many similarities with the above mentioned format?

No, just lack of habit.

> > + * ea0_compress() - compress the given tag array.
> > + * @tags: 128-byte array to read the tags from.
> > + *
> > + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> > + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> > + */
> > +unsigned long ea0_compress(u8 *tags);
> > +
> > +/*
> > + * ea0_decompress() - decompress the tag array addressed by the handle.
> > + * @handle: handle returned by @ea0_compress()
> > + * @tags: 128-byte array to write the tags to.
> > + *
> > + * Reads the compressed data and writes it into the user-supplied tag array.
> > + * Returns true on success, false on error.
>
> In case you are going to make them real kernel-doc:s, make sure kernel-doc
> validator doesn't warn.

I'll try to.
However it doesn't seem to be very picky.

$ scripts/kernel-doc -v -none arch/arm64/include/asm/mtecomp.h

warns about e.g. parameter name mismatch, but does not care about the
missing return section.

> Here, for example, return section is missing. The easy
> fix is to add : after Returns. Same to the rest of function descriptions.
Done

> Also
> why you put the descriptions in to the header file? It's a bit unusual for the
> exported ones.

https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
this, and I thought anyone wanting to understand how an interface
works would prefer reading the header rather than the implementation.
I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
better place for them.

> > +/*
> > + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> > + * @tags: 128-byte array containing 256 MTE tags.
> > + * @out_tags: u8 array to store the tag of every range.
> > + * @out_sizes: u16 array to store the size of every range.
>
> u16? I don't see it.
Fixed.


> > +/*
> > + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> > + * @r_tags: u8[256] containing the tag of every range.
> > + * @r_sizes: u16[256] containing the size of every range.
>
> Ditto.
Fixed.

>
> > + * @r_len: length of @r_tags and @r_sizes.
> > + * @tags: 128-byte array to write the tags to.
> > + */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
>
> In both cases signed integer may be promoted with a sign. Is it a problem here?
Not sure if you mean r_len or r_sizes, but all those values are >= 0
and <= 256, so there should be no problems.
(shorts could have been ints as well, we are just saving some stack
space in ea0_compress()/ea0_decompress()).

> > + *
> > + * We encapsulate tag storage memory management in this module, because it is
> > + * tightly coupled with the pointer representation.
>
> > + * ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
>
> ea0_compress() is usual way how we refer to the functions. Let tools to make
> the necessary references.

Done.

>
> > + * that can be stored in Xarray
> > + * ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
>
> Ditto. And so on.
Done.


> > + * 5. For the out-of-line case, the storage is allocated from one of the
> > + * "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> > + * on 8 bytes, so its bits 2..0 can be used to store the size class:
>
> > + * - 0 for 128 bytes
> > + * - 1 for 16
> > + * - 2 for 32
> > + * - 3 for 64.
>
> Is this chosen deliberately (for performance?)? Otherwise, why not put them in
> the natural, exponentially growing order?

This way pointers to uncompressed buffers do not have extra data stored in them.
This property does not have any immediate use though.
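
For reference, this is the mapping implemented by the patch's own helpers
(quoted in full later in the thread); note how an uncompressed 128-byte
buffer gets id 0, so "storage | cache_id" leaves its pointer bits intact:

/* id -> size: 0 -> 128 bytes, otherwise 8 << id (1/2/3 -> 16/32/64). */
static int ea0_cache_id_to_size(int id)
{
        if (id == 0)
                return 128;
        return 8 << id;
}

/* size -> id: fls(len >> 4) maps 16/32/64 to 1/2/3; 128 maps to 0. */
static int ea0_size_to_cache_id(int len)
{
        if (len < 128)
                return fls(len >> 4);
        return 0;
}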


> ...
>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
>
> bitmap guarantees that bits.h will be included.

I am following the IWYU principle here, and I believe it's a good thing to do.
I've seen cases where these transitive dependencies rotted after some
refactoring, but the fact was only noticed in certain configurations.
Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
by optimizing headers
(https://lwn.net/ml/linux-kernel/[email protected]/), and it
relies on explicit dependencies which are easier to untangle.

>
> ...
>
> > +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> > +{
> > + u8 prev_tag = U8_MAX;
>
> > + int cur_idx = -1;
>
> At which circumstances does this assignment make sense?
This value is never used to index the array; it is incremented on the
first loop iteration.

> > + for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> > + for (j = 0; j < 2; j++) {
> > + cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> > + if (cur_tag == prev_tag) {
> > + out_sizes[cur_idx]++;
>
> Who guarantees this one is not [-1]?
The fact that prev_tag is initialized to U8_MAX, whereas cur_tag <= 15.


> > + } else {
>
> > + cur_idx++;
>
> Aha, the above seems a bit prone to out-of-bounds errors.
Not really, but since you stumbled upon it, I'd better make it more readable.

> Can you make it
> unsigned and start from 0?

Changed to start with 0, but I am a bit hesitant about making it
unsigned: it is now no more special than a loop variable.

> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > + int i, j, pos = 0;
>
> Wouldn't it be more correct to have this assignment inside the first for-loop?

Do you mean setting it back to 0 on every iteration of the outer loop?
We sure don't want that, since pos is the location in the tags[] array
where the next tag is written.
If what you meant is doing "for (i = 0, pos = 0; ...)" this is a
question of preference, but I can do that.


>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Don't forget to undef it when not needed.

Ok, will do.
Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
The intuitive answer is "no", but then there should be some difference
between those and RANGES_INLINE?

>
> ...
>
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > + unsigned long *pos, unsigned long bits)
>
> Please, don't use reserved namespace. Yours is ea0, use it:
> ea0_bitmap_write()! Same to other similarly named functions.

Done.
However there are two parallel namespaces now: "ea0" for the
compression algorithm, and "mtecomp" for the module initialization and
data structures.
Dunno if it makes sense to merge them (and rename the .c file accordingly).

> ...
>
> > + unsigned long bit_pos = 0, l_bits;
> > + int largest_idx = -1, i;
> > + short largest = 0;
>
> Here and elsewhere, please, double check the correctness and/or necessity of
> signdness and assignments of local variables.

I replaced a bunch of ints with size_t where it made sense (basically
all array indices remain ints, but all sizes are size_t now).
Also removed the assignment to largest_idx above.


> Here
>
> if (largest >= sizes[i])
>         continue;
> makes sense, but...
>
> > + largest = sizes[i];
> > + largest_idx = i;
> > + }
> > + }
>
> ...
>
> > + for (i = 0; i < len; i++) {
> > + if (i == largest_idx)
> > + continue;
> > + bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
>
> ...here I would do the opposite since it's one liner.

Agreed.

>
> > + }
>
> ...
>
> > + u8 r_tags[256];
> > + int r_len = ARRAY_SIZE(r_tags);
>
No, it is the length of the arrays (both r_tags and r_sizes).
Below you make a good point that it will spare us a kernel.h dependency,
but for the sake of keeping the code less error-prone we probably shouldn't
assume r_tags is a byte array.


> > + l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> > + BITS_PER_LARGEST_IDX;
>
> Is it a dup? Perhaps a helper for this?
Do you mean it is the same in ea0_compress_to_buf() and
ea0_decompress_from_buf()?
Added a helper.
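
For illustration, the consolidated helper can be as small as this (a
sketch; ea0_largest_idx_bits() is a hypothetical name, the posted one
may differ):

/* Width of the largest_idx field for a given storage class. */
static unsigned long ea0_largest_idx_bits(int max_ranges)
{
        if (max_ranges == RANGES_INLINE)
                return BITS_PER_LARGEST_IDX_INLINE;
        return BITS_PER_LARGEST_IDX;
}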

> Seems BITS_PER_TAG, BITS_PER_SIZE and the rest should also be namespaced,
> EA0_BITS_...

Done.

> ...
>
> > +bool ea0_decompress(unsigned long handle, u8 *tags)
> > +{
> > + unsigned long *storage = ea0_storage(handle);
> > + int size = ea0_storage_size(handle);
> > +
> > + if (size == 128) {
> > + memcpy(tags, storage, size);
> > + return true;
> > + }
> > + if (size == 8)
> > + return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
>
> Maybe
>
> switch (ea0_storage_size(handle)) {
> ...
> default:
> }
>
> ?

Sounds reasonable, done.
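
For illustration, the switch-based version might look roughly like this
(a sketch of the suggestion, not the code that was actually posted):

bool ea0_decompress(unsigned long handle, u8 *tags)
{
        unsigned long *storage = ea0_storage(handle);
        int size = ea0_storage_size(handle);

        switch (size) {
        case 8:
                /* Inline case: the handle itself is the buffer. */
                return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
        case 128:
                /* Uncompressed case. */
                memcpy(tags, storage, size);
                return true;
        default:
                return ea0_decompress_from_buf(storage,
                                               ea0_size_to_ranges(size),
                                               tags);
        }
}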

> > +void ea0_release_handle(unsigned long handle)
> > +{
> > + void *storage = ea0_storage(handle);
> > + int size = ea0_storage_size(handle);
> > + struct kmem_cache *c;
>
> > + if (!storage)
> > + return;
>
> I find slightly better for maintaining in the form as
>
> struct kmem_cache *c;
> void *storage;
> int size;
>
> storage = ea0_storage(handle);
> if (!storage)
>         return;
>
> size = ea0_storage_size(handle);

Done.

> > +static int mtecomp_init(void)
> > +{
> > + char name[16];
> > + int size;
> > + int i;
> > +
> > + BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
>
> Why not static_assert()?
>
> > + for (i = 0; i < NUM_CACHES; i++) {
> > + size = ea0_cache_id_to_size(i);
> > + snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> sizeof() will work the same way without need of having kernel.h be included.

Done.
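
For illustration, with both suggestions applied (and assuming the
static_assert() one is taken as well), the init function could read:

static int mtecomp_init(void)
{
        char name[16];
        int size;
        int i;

        static_assert(MTE_PAGE_TAG_STORAGE == 128);
        for (i = 0; i < NUM_CACHES; i++) {
                size = ea0_cache_id_to_size(i);
                snprintf(name, sizeof(name), "mte-tags-%d", size);
                mtecomp_caches[i] =
                        kmem_cache_create(name, size, size, 0, NULL);
        }
        return 0;
}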

2023-07-18 17:44:16

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Tue, Jul 18, 2023 at 10:03:25AM -0700, Yury Norov wrote:
> On Tue, Jul 18, 2023 at 05:01:28PM +0300, Andy Shevchenko wrote:
> > On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:

...

> > The idea behind is to eliminate the code completely for the cases nbits != 0.
> > In your case the dynamic check will be there. That's what we want to avoid.
>
> Alexander is right - we can't avoid testing against 0 if we need to
> test for 0... In case of other functions we have inline and outline
> implementations, controlled by small_const_nbits().
>
> As you can see, the small_const_nbits() tests against 0 explicitly,
> although it's free at compile time. But if nbits == 0, we pick
> outline version of a function regardless.
>
> On their turn, outline versions again do their test against nbits == 0,
> but most of the time implicitly.
>
> In case of bitmap_set_val, we are touching at max 2 words, and there's
> no reason for outline version, so we have to test nbits against 0
> inside inline code.
>
> Having all that in mind, and because nbits == 0 is most likely an
> error, we should follow these rules:
> - no memory must be touched, as we're potentially in an error condition
> and the pointer may be corrupted;
> - the cost of the check must be as low as possible.
>
> So I suggest:
>
> if (unlikely(nbits == 0))
>         return;
>
> For readers that would literally mean: we don't expect that, and we find
> it suspicious, but we'll handle that as correct as we can.

Okay, thank you for elaborated answer.

--
With Best Regards,
Andy Shevchenko



2023-07-18 17:47:08

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Tue, Jul 18, 2023 at 05:01:28PM +0300, Andy Shevchenko wrote:
> On Tue, Jul 18, 2023 at 11:30:00AM +0200, Alexander Potapenko wrote:
> > On Mon, Jul 17, 2023 at 5:51 PM Yury Norov <[email protected]> wrote:
> > > On Mon, Jul 17, 2023 at 01:37:04PM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > When nbits == 0, copy-like functions shouldn't touch any memory. See how
> > > other bitmap and find_bit functions handle it.
> >
> > I think this is different from what other bitmap functions do, but it
> > should be enough to bail out on !nbits, i.e.:
> >
> > if (!nbits)
> >         return 0;
> >
> > You probably meant adding a __builtin_constant_p() (which is used all
> > over the place in bitmap.h), but:

No, I didn't mean that.

> > - the compiler won't have problem optimizing away the code for a
> > constant nbits=0;

Look at your code, having nbits == 0 in mind:

        const size_t index = BIT_WORD(start);
        const unsigned long offset = start % BITS_PER_LONG;
        const unsigned long space = BITS_PER_LONG - offset;
        unsigned long value_low, value_high;

        if (space >= nbits)     // This is always the case
                return (map[index] >> offset) & GENMASK(nbits - 1, 0);
                        ^^                      ^^
             Unconditional fetch       Wshift-count-overflow

Thanks to the GENMASK() implementation, you'll be warned by GENMASK_INPUT_CHECK()
if nbits is a compile-time constant. In the runtime case, it's pure undefined
behavior, not to mention the useless, expensive and dangerous fetch.

> > - we anyway need a dynamic check for the case nbits is not constant
> > (for both bitmap_get_value() and bitmap_set_value(), I assume).
> >
> > What do you think?

I think that instead of speculations, it's better to cover nbits == 0
with the explicit tests for run- and compile-time. That way you're
always on a safe side.

bitmap_get_val(NULL, 0, 0) shouldn't crash the kernel.

> The idea behind is to eliminate the code completely for the cases nbits != 0.
> In your case the dynamic check will be there. That's what we want to avoid.

Alexander is right - we can't avoid testing against 0 if we need to
test for 0... In case of other functions we have inline and outline
implementations, controlled by small_const_nbits().

As you can see, the small_const_nbits() tests against 0 explicitly,
although it's free at compile time. But if nbits == 0, we pick
outline version of a function regardless.

On their turn, outline versions again do their test against nbits == 0,
but most of the time implicitly.

In case of bitmap_set_val, we are touching at max 2 words, and there's
no reason for outline version, so we have to test nbits against 0
inside inline code.

Having all that in mind, and because nbits == 0 is most likely an
error, we should follow these rules:
- no memory must be touched, as we're potentially in an error condition
and the pointer may be corrupted;
- the cost of the check must be as low as possible.

So I suggest:

if (unlikely(nbits == 0))
        return;

For readers that would literally mean: we don't expect that, and we find
it suspicious, but we'll handle that as correct as we can.

By the way, Alexander, please drop that 'const' things. Those are for
pointers or some global variables, not for inline functions with 4
lines of code. (If you think it helps the code to be safe than no - it's
unsafe even with consts.)

Thanks,
Yury

2023-07-18 18:09:14

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Tue, Jul 18, 2023 at 05:33:37PM +0200, Alexander Potapenko wrote:
> On Mon, Jul 17, 2023 at 3:49 PM Andy Shevchenko
> <[email protected]> wrote:
> > On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:

...

> However it doesn't seem to be very picky.
>
> $ scripts/kernel-doc -v -none arch/arm64/include/asm/mtecomp.h
>
> warns about e.g. parameter name mismatch, but does not care about the
> missing return section.

-Wreturn is missing

...

> > Also
> > why you put the descriptions in to the header file? It's a bit unusual for the
> > exported ones.
>
> https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> this, and I thought anyone wanting to understand how an interface
> works would prefer reading the header rather than the implementation.
> I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> better place for them.

With the kernel doc in the C file you may also comment the internal ones and
generate documentation only for exported ones. This is not possible with h.

...

> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> > In both cases signed integer may be promoted with a sign. Is it a problem here?
> Not sure if you mean r_len or r_sizes,

Mostly about the latter.

> but all those values are >= 0
> and <= 256, so there should be no problems.
> (shorts could have been ints as well, we are just saving some stack
> space in ea0_compress()/ea0_decompress()).

Then why are they signed? Please justify that.
Signedness is prone to subtle and hard-to-hunt errors due to integer promotions.

...

> > > +#include <linux/bits.h>
> > > +#include <linux/bitmap.h>
> >
> > bitmap guarantees that bits.h will be included.
>
> I am following the IWYU principle here, and I believe it's a good thing to do.
> I've seen cases where these transitive dependencies rotted after some
> refactoring, but the fact was only noticed in certain configurations.
> Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> by optimizing headers
> (https://lwn.net/ml/linux-kernel/[email protected]/), and it
> relies on explicit dependencies which are easier to untangle.

Yes, but we have some guarantees. E.g., we don't include compiler*.h
when types.h is included, because of the guarantees.

Otherwise your code misses _a lot_ of headers.

...

> > Can you make it unsigned and start from 0?
>
> Changed to start with 0, but I am a bit hesitant about making it
> unsigned: it is now no more special than a loop variable.

Signedness is a beast in C; it always needs additional justification.

...

> > > + int i, j, pos = 0;
> >
> > Wouldn't it be more correct to have this assignment inside the first for-loop?
>
> Do you mean setting it back to 0 on every iteration of the outer loop?

Yes.

> We sure don't want that, since pos is the location in the tags[] array
> where the next tag is written.

OK!

...

> > > +#define RANGES_INLINE ea0_size_to_ranges(8)
> >
> > Don't forget to undef it when not needed.
>
> Ok, will do.

> Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> The intuitive answer is "no",

Correct.

> but then there should be some difference between those and RANGES_INLINE?

Yes, one is a widely used constant and the other is a _localized_ helper.

...

> > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > + unsigned long *pos, unsigned long bits)
> >
> > Please, don't use reserved namespace. Yours is ea0, use it:
> > ea0_bitmap_write()! Same to other similarly named functions.
>
> Done.
> However there are two parallel namespaces now: "ea0" for the
> compression algorithm, and "memcomp" for the module initialization and
> data structures.
> Dunno if it makes sense to merge them (and rename the .c file accordingly).

Your choice. My point: just prefix it with something meaningful.

...

> > > + u8 r_tags[256];
> > > + int r_len = ARRAY_SIZE(r_tags);
> >
> No, it is the length of the arrays (both r_tags and r_sizes).
> Below you make a good point that it will spare us a kernel.h dependency,
> but for the sake of keeping the code less error-prone we probably shouldn't
> assume r_tags is a byte array.

It's a common practice even outside of the Linux kernel to use sizeof() against
char arrays. I don't see the point of having the ARRAY_SIZE() complication here.

...

> > > + snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);

You see, if you grep for similar calls, I'm pretty sure the difference between
sizeof() and ARRAY_SIZE() will be about two orders of magnitude, if the latter
even occurs at least once.


--
With Best Regards,
Andy Shevchenko



2023-07-19 06:24:48

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Mon, Jul 17, 2023 at 01:37:06PM +0200, Alexander Potapenko wrote:
> The config implements the EA0 algorithm suggested by Evgenii Stepanov
> to compress the memory tags for ARM MTE during swapping.
>
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.
>
> Suggested-by: Evgenii Stepanov <[email protected]>
> Signed-off-by: Alexander Potapenko <[email protected]>
>
> ---
> v3:
> - Addressed comments by Andy Shevchenko:
> - use bitmap_{set,get}_value() written by Syed Nayyar Waris
> - switched to unsigned long everywhere (fewer casts)
> - simplified the code, removed redundant checks
> - dropped ea0_compress_inline()
> - added bit size constants and helpers to access the bitmap
> - explicitly initialize all compressed sizes in ea0_compress_to_buf()
> - initialize all handle bits
>
> v2:
> - as suggested by Yuri Norov, switched from struct bitq (which is
> not needed anymore) to <linux/bitmap.h>
> - add missing symbol exports
> ---
> arch/arm64/Kconfig | 10 +
> arch/arm64/include/asm/mtecomp.h | 60 +++++
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/mtecomp.c | 406 +++++++++++++++++++++++++++++++
> 4 files changed, 477 insertions(+)
> create mode 100644 arch/arm64/include/asm/mtecomp.h
> create mode 100644 arch/arm64/mm/mtecomp.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2511b30d0f67..52cdc7603cf7c 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2093,6 +2093,16 @@ config ARM64_EPAN
> if the cpu does not implement the feature.
> endmenu # "ARMv8.7 architectural features"
>
> +config ARM64_MTE_COMP
> + bool "Tag compression for ARM64 MTE"
> + default y
> + depends on ARM64_MTE
> + help
> + Enable tag compression support for ARM64 MTE.
> +
> + 128-byte tag buffers corresponding to 4K pages can be compressed using
> + the EA0 algorithm to save heap memory.
> +
> config ARM64_SVE
> bool "ARM Scalable Vector Extension support"
> default y
> diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
> new file mode 100644
> index 0000000000000..0c444c0d4ac04
> --- /dev/null
> +++ b/arch/arm64/include/asm/mtecomp.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_MTECOMP_H
> +#define __ASM_MTECOMP_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * ea0_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @ea0_release_handle().
> + */
> +unsigned long ea0_compress(u8 *tags);
> +
> +/*
> + * ea0_decompress() - decompress the tag array addressed by the handle.
> + * @handle: handle returned by @ea0_compress()
> + * @tags: 128-byte array to write the tags to.
> + *
> + * Reads the compressed data and writes it into the user-supplied tag array.
> + * Returns true on success, false on error.
> + */
> +bool ea0_decompress(unsigned long handle, u8 *tags);
> +
> +/*
> + * ea0_release_handle() - release the handle returned by ea0_compress().
> + * @handle: handle returned by ea0_compress().
> + */
> +void ea0_release_handle(unsigned long handle);
> +
> +/* Functions below are exported for testing purposes. */

Then declare them in a separate local header or simply in the test, but
please not in a public header.

> +
> +/*
> + * ea0_storage_size() - calculate the memory occupied by compressed tags.
> + * @handle: storage handle returned by ea0_compress.
> + */
> +int ea0_storage_size(unsigned long handle);
> +
> +/*
> + * ea0_tags_to_ranges() - break @tags into arrays of tag ranges.
> + * @tags: 128-byte array containing 256 MTE tags.
> + * @out_tags: u8 array to store the tag of every range.
> + * @out_sizes: u16 array to store the size of every range.
> + * @out_len: length of @out_tags and @out_sizes (output parameter, initially
> + * equal to lengths of out_tags[] and out_sizes[]).
> + */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len);
> +
> +/*
> + * ea0_ranges_to_tags() - fill @tags using given tag ranges.
> + * @r_tags: u8[256] containing the tag of every range.
> + * @r_sizes: u16[256] containing the size of every range.
> + * @r_len: length of @r_tags and @r_sizes.
> + * @tags: 128-byte array to write the tags to.
> + */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags);
> +
> +#endif // __ASM_MTECOMP_H
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d0..46778f6dd83c2 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE) += trans_pgd.o
> obj-$(CONFIG_TRANS_TABLE) += trans_pgd-asm.o
> obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
> obj-$(CONFIG_ARM64_MTE) += mteswap.o
> +obj-$(CONFIG_ARM64_MTE_COMP) += mtecomp.o
> KASAN_SANITIZE_physaddr.o += n
>
> obj-$(CONFIG_KASAN) += kasan_init.o
> diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
> new file mode 100644
> index 0000000000000..50a379c035aee
> --- /dev/null
> +++ b/arch/arm64/mm/mtecomp.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * MTE tag compression algorithm.
> + * Proposed by Evgenii Stepanov <[email protected]>
> + */
> +
> +/*
> + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> + * compression algorithms.

This is the 4th time I see mr. Stepanov's credentials in the patch.
I've no doubts he's a worthy gentleman but please avoid mentioning
people in source code. Suggested-by is enough. IIRC, the rule for
that has existed for about a decade.

For the purpose of namespacing, the mte_compress/mte_decompress would
sound better.

> + *
> + * The algorithm attempts to compress a 128-byte (MTE_GRANULES_PER_PAGE / 2)
> + * array of tags into a smaller byte sequence that can be stored in a
> + * 16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
> + * an 8-byte pointer.
> + *
> + * We encapsulate tag storage memory management in this module, because it is
> + * tightly coupled with the pointer representation.
> + * ea0_compress(*tags) takes a 128-byte buffer and returns an opaque value
> + * that can be stored in Xarray
> + * ea0_decompress(*ptr, *tags) takes the opaque value and loads the tags into
> + * the provided 128-byte buffer.
> + *
> + * The compression algorithm works as follows.
> + *
> + * 1. The input array of 128 bytes is transformed into tag ranges (two arrays:
> + * @r_tags containing tag values and @r_sizes containing range lengths) by
> + * ea0_tags_to_ranges(). Note that @r_sizes sums up to 256.
> + *
> + * 2. Depending on the number N of ranges, the following storage class is picked:
> + * N <= 6: 8 bytes (inline case, no allocation required);
> + * 6 < N <= 11: 16 bytes
> + * 11 < N <= 23: 32 bytes
> + * 23 < N <= 46: 64 bytes
> + * 46 < N: 128 bytes (no compression will be performed)
> + *
> + * 3. The number of the largest element of @r_sizes is stored in @largest_idx.
> + * The element itself is thrown away from @r_sizes, because it can be
> + * reconstructed from the sum of the remaining elements. Note that now none
> + * of the remaining @r_sizes elements is greater than 127.
> + *
> + * 4. For the inline case, the following values are stored in the 8-byte handle:
> + * largest_idx : i4
> + * r_tags[0..5] : i4 x 6
> + * r_sizes[0..4] : i7 x 5
> + * (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> + *
> + * Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> + * be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> + * distinguished from a kernel pointer.

I honestly tried to understand... For example, what does the
'r_sizes[0..4] : i7 x 5'
mean? An array of 5 elements, 17 bits each? But that alone is greater
than the size of a pointer... Oh gosh...

Moreover, MTE tags are all 4-bits.

It seems like text without pictures is above my mental abilities. Can you
please illustrate it? For example, from the #4, I (hopefully correctly)
realized that:

Inline frame format:
0                                                          60     62  63
+----------------------------------------------------------+------+---+
|                No idea what happens here                 | Lidx | X |
+----------------------------------------------------------+------+---+
63    : X    : RAZ  : Reserved for Xarray.
60-62 : Lidx : 0..5 : Largest element index.
               6    : Reserved
               7    : Invalid handle

> + *
> + * 5. For the out-of-line case, the storage is allocated from one of the
> + * "mte-tags-{16,32,64,128}" kmem caches. The resulting pointer is aligned
> + * on 8 bytes, so its bits 2..0 can be used to store the size class:
> + * - 0 for 128 bytes
> + * - 1 for 16
> + * - 2 for 32
> + * - 3 for 64.
> + * Bit 63 of the pointer is zeroed out, so that it can be stored in Xarray.
> + *
> + * 6. The data layout in the allocated storage is as follows:
> + * largest_idx : i6
> + * r_tags[0..N] : i4 x N
> + * r_sizes[0..N-1] : i7 x (N-1)
> + *
> + * The decompression algorithm performs the steps below.
> + *
> + * 1. Decide if data is stored inline (bits 62..60 of the handle != 0b111) or
> + * out-of line.
> + *
> + * 2. For the inline case, treat the handle itself as the input buffer.
> + *
> + * 3. For the out-of-line case, look at bits 2..0 of the handle to understand
> + * the input buffer length. To obtain the pointer to the input buffer, unset
> + * bits 2..0 of the handle and set bit 63.
> + *
> + * 4. If the input buffer is 128 byte long, copy its contents to the output
> + * buffer.
> + *
> + * 5. Otherwise, read @largest_idx, @r_tags and @r_sizes from the input buffer.
> + * Calculate the removed largest element of @r_sizes:
> + * largest = 256 - sum(r_sizes)
> + * and insert it into @r_sizes at position @largest_idx.
> + *
> + * 6. While @r_sizes[i] > 0, add a 4-bit value @r_tags[i] to the output buffer
> + * @r_sizes[i] times.
> + */
> +

Because of the size, I believe this comment is worth putting into the Docs;
moreover, we already have "Documentation/arch/arm64/memory-tagging-extension.rst".
Why don't you add an 'MTE Compression' section in there?

> +#include <linux/bits.h>
> +#include <linux/bitmap.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/swab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

Nit: Alphabetic order?

Andy is right, bitmap.h includes bits.h, no need to include both. And if
your code gets broken one day, it's the bitmap maintainers' job to fix it.

> +
> +#include <asm/mtecomp.h>
> +
> +/* The handle must fit into an Xarray value. */
> +#define HANDLE_MASK GENMASK_ULL(62, 0)
> +
> +/* Out-of-line handles have 0b111 in bits 62..60. */
> +#define NOINLINE_MASK GENMASK_ULL(62, 60)
> +
> +/* Cache index is stored in the lowest pointer bits. */
> +#define CACHE_ID_MASK GENMASK_ULL(2, 0)
> +
> +/*
> + * Four separate caches to store out-of-line data:
> + * 0: mte-tags-128
> + * 1: mte-tags-16
> + * 2: mte-tags-32
> + * 3: mte-tags-64
> + */
> +#define NUM_CACHES 4
> +static struct kmem_cache *mtecomp_caches[NUM_CACHES];
> +
> +/*
> + * Sizes of compressed values.
> + */
> +#define BITS_PER_TAG 4
> +#define BITS_PER_SIZE 7
> +#define BITS_PER_LARGEST_IDX_INLINE 4
> +#define BITS_PER_LARGEST_IDX 6

But in the comment you say that the largest index in the inline frame is 3 bits long.

> +
> +/* Translate allocation size into mtecomp_caches[] index. */
> +static int ea0_size_to_cache_id(int len)

Here and everywhere, do you need signed values? If not, unsigned int.

> +{
> + if (len < 128)
> + return fls(len >> 4);
> + return 0;
> +}
> +
> +/* Translate mtecomp_caches[] index into allocation size. */
> +static int ea0_cache_id_to_size(int id)
> +{
> + if (id == 0)
> + return 128;
> + return 8 << id;
> +}
> +
> +/* Transform tags into tag ranges. */
> +void ea0_tags_to_ranges(u8 *tags, u8 *out_tags, short *out_sizes, int *out_len)
> +{
> + u8 prev_tag = U8_MAX;
> + int cur_idx = -1;
> + u8 cur_tag;
> + int i, j;
> +
> + memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
> + memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
> +
> + for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> + for (j = 0; j < 2; j++) {
> + cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> + if (cur_tag == prev_tag) {
> + out_sizes[cur_idx]++;
> + } else {
> + cur_idx++;
> + prev_tag = cur_tag;
> + out_tags[cur_idx] = prev_tag;
> + out_sizes[cur_idx] = 1;
> + }
> + }
> + }
> + *out_len = cur_idx + 1;
> +}
> +EXPORT_SYMBOL_NS(ea0_tags_to_ranges, MTECOMP);
> +
> +/* Transform tag ranges back into tags. */
> +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> +{
> + int i, j, pos = 0;
> + u8 prev;
> +
> + for (i = 0; i < r_len; i++) {
> + for (j = 0; j < r_sizes[i]; j++) {
> + if (pos % 2)
> + tags[pos / 2] = (prev << 4) | r_tags[i];
> + else
> + prev = r_tags[i];
> + pos++;
> + }
> + }
> +}
> +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);

Because I didn't understand the compressed frame format, not sure I
can understand this logic...

> +
> +/* Translate @num_ranges into the allocation size needed to hold them. */
> +static int ea0_alloc_size(int num_ranges)
> +{
> + if (num_ranges <= 6)
> + return 8;
> + if (num_ranges <= 11)
> + return 16;
> + if (num_ranges <= 23)
> + return 32;
> + if (num_ranges <= 46)
> + return 64;
> + return 128;
> +}
> +
> +/* Translate allocation size into maximum number of ranges that it can hold. */
> +static int ea0_size_to_ranges(int size)
> +{
> + switch (size) {
> + case 8:
> + return 6;
> + case 16:
> + return 11;
> + case 32:
> + return 23;
> + case 64:
> + return 46;
> + default:
> + return 0;
> + }
> +}

I wonder if there's a math formula here? Can you explain where those
numbers come from?

> +#define RANGES_INLINE ea0_size_to_ranges(8)

Maybe

#define RANGES_INLINE (6)

> +
> +/* Is the data stored inline in the handle itself? */
> +static bool ea0_is_inline(unsigned long handle)
> +{
> + return (handle & NOINLINE_MASK) != NOINLINE_MASK;
> +}
> +
> +/* Get the size of the buffer backing @handle. */
> +int ea0_storage_size(unsigned long handle)
> +{
> + if (ea0_is_inline(handle))
> + return 8;
> + return ea0_cache_id_to_size(handle & CACHE_ID_MASK);
> +}
> +EXPORT_SYMBOL_NS(ea0_storage_size, MTECOMP);
> +
> +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> + unsigned long *pos, unsigned long bits)

Please don't steal prefixes. But the idea is good. For the next
iteration, let's rename bitmap_set_value() to bitmap_write()?
So that your function will be an mte_bitmap_write().

Thanks,
Yury

> +{
> + bitmap_set_value(bitmap, value, *pos, bits);
> + *pos += bits;
> +}
> +
> +/* Compress ranges into the buffer that can accommodate up to max_ranges. */
> +static void ea0_compress_to_buf(int len, u8 *tags, short *sizes,
> + unsigned long *bitmap, int max_ranges)
> +{
> + unsigned long bit_pos = 0, l_bits;
> + int largest_idx = -1, i;
> + short largest = 0;
> +
> + for (i = 0; i < len; i++) {
> + if (sizes[i] > largest) {
> + largest = sizes[i];
> + largest_idx = i;
> + }
> + }
> + l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> + BITS_PER_LARGEST_IDX;
> + bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
> + for (i = 0; i < len; i++)
> + bitmap_write(bitmap, tags[i], &bit_pos, BITS_PER_TAG);
> + for (i = len; i < max_ranges; i++)
> + bitmap_write(bitmap, 0, &bit_pos, BITS_PER_TAG);
> + for (i = 0; i < len; i++) {
> + if (i == largest_idx)
> + continue;
> + bitmap_write(bitmap, sizes[i], &bit_pos, BITS_PER_SIZE);
> + }
> + for (i = len; i < max_ranges; i++)
> + bitmap_write(bitmap, 0, &bit_pos, BITS_PER_SIZE);
> +}
> +
> +/* Compress @tags and return a handle. */
> +unsigned long ea0_compress(u8 *tags)
> +{
> + int alloc_size, cache_id;
> + struct kmem_cache *cache;
> + short r_sizes[256];
> + u8 r_tags[256];
> + int r_len = ARRAY_SIZE(r_tags);
> + unsigned long *storage;
> + /*
> + * ea0_compress_to_buf() only initializes the bits that ea0_decompress()
> + * will read. But when the tags are stored in the handle itself, it must
> + * have all its bits initialized.
> + */
> + unsigned long result = 0;
> +
> + ea0_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
> + alloc_size = ea0_alloc_size(r_len);
> + if (alloc_size == 8) {
> + ea0_compress_to_buf(r_len, r_tags, r_sizes, &result,
> + RANGES_INLINE);
> + return result;
> + }
> + cache_id = ea0_size_to_cache_id(alloc_size);
> + cache = mtecomp_caches[cache_id];
> + storage = kmem_cache_alloc(cache, GFP_KERNEL);
> + if (alloc_size < 128) {
> + /* alloc_size is always a multiple of sizeof(unsigned long). */
> + ea0_compress_to_buf(r_len, r_tags, r_sizes, storage,
> + ea0_size_to_ranges(alloc_size));
> + return ((unsigned long)storage | cache_id) & HANDLE_MASK;
> + }
> + memcpy(storage, tags, alloc_size);
> + return (unsigned long)storage & HANDLE_MASK;
> +}
> +EXPORT_SYMBOL_NS(ea0_compress, MTECOMP);
> +
> +static unsigned long bitmap_read(const unsigned long *bitmap,
> + unsigned long *pos, unsigned long bits)
> +{
> + unsigned long result;
> +
> + result = bitmap_get_value(bitmap, *pos, bits);
> + *pos += bits;
> + return result;
> +}
> +
> +/* Decompress the contents of the given buffer into @tags. */
> +static bool ea0_decompress_from_buf(const unsigned long *bitmap, int max_ranges,
> + u8 *tags)
> +{
> + int largest_idx, i;
> + short r_sizes[46], sum = 0;
> + u8 r_tags[46];
> + unsigned long bit_pos = 0, l_bits;
> +
> + l_bits = (max_ranges == RANGES_INLINE) ? BITS_PER_LARGEST_IDX_INLINE :
> + BITS_PER_LARGEST_IDX;
> + largest_idx = bitmap_read(bitmap, &bit_pos, l_bits);
> + for (i = 0; i < max_ranges; i++)
> + r_tags[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_TAG);
> + for (i = 0; i < max_ranges; i++) {
> + if (i == largest_idx)
> + continue;
> + r_sizes[i] = bitmap_read(bitmap, &bit_pos, BITS_PER_SIZE);
> + if (!r_sizes[i]) {
> + max_ranges = i;
> + break;
> + }
> + sum += r_sizes[i];
> + }
> + if (sum >= 256)
> + return false;
> + r_sizes[largest_idx] = 256 - sum;
> + ea0_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
> + return true;
> +}
> +
> +/* Get pointer to the out-of-line storage from a handle. */
> +static void *ea0_storage(unsigned long handle)
> +{
> + if (ea0_is_inline(handle))
> + return NULL;
> + return (void *)((handle & (~CACHE_ID_MASK)) | BIT_ULL(63));
> +}
> +
> +/* Decompress tags from the buffer referenced by @handle. */
> +bool ea0_decompress(unsigned long handle, u8 *tags)
> +{
> + unsigned long *storage = ea0_storage(handle);
> + int size = ea0_storage_size(handle);
> +
> + if (size == 128) {
> + memcpy(tags, storage, size);
> + return true;
> + }
> + if (size == 8)
> + return ea0_decompress_from_buf(&handle, RANGES_INLINE, tags);
> + return ea0_decompress_from_buf(storage, ea0_size_to_ranges(size), tags);
> +}
> +EXPORT_SYMBOL_NS(ea0_decompress, MTECOMP);
> +
> +/* Release the memory referenced by @handle. */
> +void ea0_release_handle(unsigned long handle)
> +{
> + void *storage = ea0_storage(handle);
> + int size = ea0_storage_size(handle);
> + struct kmem_cache *c;
> +
> + if (!storage)
> + return;
> +
> + c = mtecomp_caches[ea0_size_to_cache_id(size)];
> + kmem_cache_free(c, storage);
> +}
> +EXPORT_SYMBOL_NS(ea0_release_handle, MTECOMP);
> +
> +/* Set up mtecomp_caches[]. */
> +static int mtecomp_init(void)
> +{
> + char name[16];
> + int size;
> + int i;
> +
> + BUILD_BUG_ON(MTE_PAGE_TAG_STORAGE != 128);
> + for (i = 0; i < NUM_CACHES; i++) {
> + size = ea0_cache_id_to_size(i);
> + snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
> + mtecomp_caches[i] =
> + kmem_cache_create(name, size, size, 0, NULL);
> + }
> + return 0;
> +}
> +module_init(mtecomp_init);
> --
> 2.41.0.255.g8b1d071c50-goog

2023-07-19 09:05:25

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] lib/bitmap: add bitmap_{set,get}_value()

>
> Thanks to the GENMASK() implementation, you'll be warned by GENMASK_INPUT_CHECK()
> if nbits is a compile-time constant. In the runtime case, it's pure undefined
> behavior, not to mention the useless, expensive and dangerous fetch.
>
> > > - we anyway need a dynamic check for the case nbits is not constant
> > > (for both bitmap_get_value() and bitmap_set_value(), I assume).
> > >
> > > What do you think?
>
> I think that instead of speculations, it's better to cover nbits == 0
> with the explicit tests for run- and compile-time. That way you're
> always on a safe side.

You are right. I added tests for these cases.

> bitmap_get_val(NULL, 0, 0) shouldn't crash the kernel.

Haha, the compiler is smart enough to not crash the kernel in this case.
But passing zero via a volatile variable did the trick.
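
For illustration, the volatile trick could take roughly this shape (a
sketch only; the actual tests live in lib/test_bitmap.c and are shaped
differently):

DECLARE_BITMAP(bitmap, BITS_PER_LONG);
/* Defeat constant folding so the runtime nbits check is exercised. */
volatile unsigned long nbits = 0;

bitmap_fill(bitmap, BITS_PER_LONG);
/* Neither call may touch the bitmap, and the getter must return 0. */
WARN_ON(bitmap_get_value(bitmap, 0, nbits) != 0);
bitmap_set_value(bitmap, 0xdeadUL, 0, nbits);
WARN_ON(!bitmap_full(bitmap, BITS_PER_LONG));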

>
> > The idea behind is to eliminate the code completely for the cases nbits != 0.
> > In your case the dynamic check will be there. That's what we want to avoid.
>
> Alexander is right - we can't avoid testing against 0 if we need to
> test for 0... In case of other functions we have inline and outline
> implementations, controlled by small_const_nbits().
>
> As you can see, the small_const_nbits() tests against 0 explicitly,
> although it's free at compile time. But if nbits == 0, we pick
> outline version of a function regardless.
>
> On their turn, outline versions again do their test against nbits == 0,
> but most of the time implicitly.
>
> In case of bitmap_set_val, we are touching at max 2 words, and there's
> no reason for outline version, so we have to test nbits against 0
> inside inline code.
>
> Having all that in mind, and because nbits == 0 is most likely an
> error we'd follow the following rules:
> - no memory must be touched as we're potentially in error condition,
> and pointer may be corrupted;
> - the cost of the check must be as minimal as possible.
>
> So I suggest:
>
> if (unlikely(nbits == 0))
>         return;

Sounds good, I'll add unlikely() around the check.
Thanks for the explanation!

>
> For readers that would literally mean: we don't expect that, and we find
> it suspicious, but we'll handle that as correct as we can.
>
> By the way, Alexander, please drop that 'const' things. Those are for
> pointers or some global variables, not for inline functions with 4
> lines of code. (If you think it helps the code to be safe than no - it's
> unsafe even with consts.)

These consts are from the original Syed's patch and were probably
added for consistency with bitmap_{set,get}_value8().
But, okay, I'll remove them.




--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-07-19 12:40:48

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Tue, Jul 18, 2023 at 7:18 PM Andy Shevchenko
<[email protected]> wrote:

> > However it doesn't seem to be very picky.
> >
> > $ scripts/kernel-doc -v -none arch/arm64/include/asm/mtecomp.h
> >
> > warns about e.g. parameter name mismatch, but does not care about the
> > missing return section.
>
> -Wreturn is missing

Nice, adding it (or -Wall) uncovered more problems.
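
For reference, the stricter invocation then becomes:

$ scripts/kernel-doc -v -none -Wreturn arch/arm64/include/asm/mtecomp.h

(or -Wall to enable all of the optional kernel-doc warnings at once).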

> ...
>
> > > Also
> > > why you put the descriptions in to the header file? It's a bit unusual for the
> > > exported ones.
> >
> > https://docs.kernel.org/doc-guide/kernel-doc.html was not specific on
> > this, and I thought anyone wanting to understand how an interface
> > works would prefer reading the header rather than the implementation.
> > I can move the comments to arch/arm64/mm/mtecomp.c if you think it's a
> > better place for them.
>
> With the kernel doc in the C file you may also comment the internal ones and
> generate documentation only for exported ones. This is not possible with h.

I see.
After some hesitation and looking at the existing practices in the
kernel, I moved the kernel doc comments to mtecomp.c



> > but all those values are >= 0
> > and <= 256, so there should be no problems.
> > (shorts could have been ints as well, we are just saving some stack
> > space in ea0_compress()/ea0_decompress()).
>
> Then why are they signed? Please justify that.
> Signedness is prone to subtle and hard-to-hunt errors due to integer promotions.

Switched to unsigned shorts.

> ...
>
> > > > +#include <linux/bits.h>
> > > > +#include <linux/bitmap.h>
> > >
> > > bitmap guarantees that bits.h will be included.
> >
> > I am following the IWYU principle here, and I believe it's a good thing to do.
> > I've seen cases where these transitive dependencies rotted after some
> > refactoring, but the fact was only noticed in certain configurations.
> > Also, there's an ongoing work by Ingo Molnar to speed up kernel builds
> > by optimizing headers
> > (https://lwn.net/ml/linux-kernel/[email protected]/), and it
> > relies on explicit dependencies which are easier to untangle.
>
> Yes, but we have some guarantees. E.g., we don't include compiler*.h
> when types.h is included, because of the guarantees.
Ok, I removed bits.h

> Otherwise your code misses _a lot_ of headers.

... and tried to expand the list of headers where applicable.

>
> ...
>
> > > Can you make it unsigned and start from 0?
> >
> > Changed to start with 0, but I am a bit hesitant about making it
> > unsigned: it is now no more special than a loop variable.
>
> Signedness is a beast in C; it always needs additional justification.

Changed all ints to unsigned, because, well, why not :)


> > Shall I undef the constants above as well (e.g. BITS_PER_TAG)?
> > The intuitive answer is "no",
>
> Correct.
>
> > but then there should be some difference between those and RANGES_INLINE?
>
> Yes, one is a widely used constant and the other is a _localized_ helper.

Ack

>
> ...
>
> > > > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > > > + unsigned long *pos, unsigned long bits)
> > >
> > > Please, don't use reserved namespace. Yours is ea0, use it:
> > > ea0_bitmap_write()! Same to other similarly named functions.
> >
> > Done.
> > However there are two parallel namespaces now: "ea0" for the
> > compression algorithm, and "memcomp" for the module initialization and
> > data structures.
> > Dunno if it makes sense to merge them (and rename the .c file accordingly).
>
> Your choice. My point: just prefix it with something meaningful.
Fine, I'll go with "ea0" for the algorithm and "mtecomp" for the
module-specific stuff.
Let's also see what ARM maintainers say.

> ...
>
> > > > + u8 r_tags[256];
> > > > + int r_len = ARRAY_SIZE(r_tags);
> > >
> > No, it is the length of the arrays (both r_tags and r_sizes).
> > Below you make a good point that it will spare us a kernel.h dependency,
> > but for the sake of keeping the code less error-prone we probably shouldn't
> > assume r_tags is a byte array.
>
> It's a common practice even outside of the Linux kernel to use sizeof() against
> char arrays. I don't see the point of having the ARRAY_SIZE() complication here.

Fixed.

> ...
>
> > > > + snprintf(name, ARRAY_SIZE(name), "mte-tags-%d", size);
>
> You see, if you grep for similar calls, I'm pretty sure the difference between
> sizeof() and ARRAY_SIZE() will be about two orders of magnitude, if the latter
> even occurs at least once.

You are right, it is 3700+ vs. 66 :)
Changed ARRAY_SIZE() to sizeof() here as well.

2023-07-19 14:39:20

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Wed, Jul 19, 2023 at 8:09 AM Yury Norov <[email protected]> wrote:
...
> > +
> > +/* Functions below are exported for testing purposes. */
>
> Then declare them in a separate local header or simply in the test, but
> please not in a public header.

Fair enough, moved them to the test.

> > +
> > +/*
> > + * EA0 stands for "Evgenii's Algorithm 0", as the initial proposal contained two
> > + * compression algorithms.
>
> This is the 4th time I see mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that has existed for about a decade.
>
> For the purpose of namespacing, the mte_compress/mte_decompress would
> sound better.

This indeed makes sense; "EA0" is not widely recognizable, and we are
unlikely to have more than one MTE compression algorithm anyway.
I changed "ea0" to "mte" in the patch series.



> > + *
> > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > + * largest_idx : i4
> > + * r_tags[0..5] : i4 x 6
> > + * r_sizes[0..4] : i7 x 5
> > + * (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > + *
> > + * Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > + * be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > + * distinguished from a kernel pointer.
>
> I honestly tried to understand... For example, what does the
> 'r_sizes[0..4] : i7 x 5'
> mean? An array of 5 elements, 17 bits each? But that alone is greater
> than the size of a pointer... Oh gosh...

iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
integer type.
There's no such thing in C syntax, and describing the data layout
using bitfields would be painful.
Would it help if I add a comment explaining that iN is an N-bit
integer? Or do you prefer something like

r_sizes[0..4] : 5 x 7 bits

?

>
> Moreover, MTE tags are all 4-bits.
>
> It seems like text without pictures is above my mental abilities. Can you
> please illustrate it? For example, from the #4, I (hopefully correctly)
> realized that:
>
> Inline frame format:
> 0 60 62 63
> +---------------------------------------------------------------------+
I think it's more natural to number bits from 63 to 0

> | No idea what happens here | Lidx | X |
> +---------------------------------------------------------------------+
> 63 : X : RAZ : Reserved for Xarray.
> 60-62 : Lidx : 0..5 : Largest element index.

There's some mismatch now between this picture, where Lidx is i3, and
the implementation that treats it as an i4, merging bit 63 into it.
Maybe we can avoid this by not splitting off bit 63?

> 6 : Reserved
> 7 : Invalid handle

No, 7 means "treat this handle as an out-of-line one". It is still
valid, but instead of tag data it contains a pointer.

>
> Because of the size, I believe this comment is worth putting into the Docs;
> moreover, we already have "Documentation/arch/arm64/memory-tagging-extension.rst".
> Why don't you add an 'MTE Compression' section in there?

Good idea, will do!

>
> > +#include <linux/bits.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/swab.h>
> > +#include <linux/string.h>
> > +#include <linux/types.h>
>
> Nit: Alphabetic order?
Ack.

>
> Andy is right, bitmap.h includes bits.h, no need to include both. And if
> your code gets broken one day, it's the bitmap maintainers' job to fix it.
Ack.


> > +
> > +/*
> > + * Sizes of compressed values.
> > + */
> > +#define BITS_PER_TAG 4
> > +#define BITS_PER_SIZE 7
> > +#define BITS_PER_LARGEST_IDX_INLINE 4
> > +#define BITS_PER_LARGEST_IDX 6
>
> But in the comment you say that the largest index in the inline frame is 3 bits long.

The comment says "i4" (see my comments above)

> > +
> > +/* Translate allocation size into mtecomp_caches[] index. */
> > +static int ea0_size_to_cache_id(int len)
>
> Here and everywhere, do you need signed values? If not, unsigned int.

Done as part of fixing Andy's comments

> > +
> > +/* Transform tag ranges back into tags. */
> > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > +{
> > + int i, j, pos = 0;
> > + u8 prev;
> > +
> > + for (i = 0; i < r_len; i++) {
> > + for (j = 0; j < r_sizes[i]; j++) {
> > + if (pos % 2)
> > + tags[pos / 2] = (prev << 4) | r_tags[i];
> > + else
> > + prev = r_tags[i];
> > + pos++;
> > + }
> > + }
> > +}
> > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
>
> Because I didn't understand the compressed frame format, not sure I
> can understand this logic...
Hope the above comments will help, if not - please let me know.

> > +
> > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > +static int ea0_alloc_size(int num_ranges)
> > +{
> > + if (num_ranges <= 6)
> > + return 8;
> > + if (num_ranges <= 11)
> > + return 16;
> > + if (num_ranges <= 23)
> > + return 32;
> > + if (num_ranges <= 46)
> > + return 64;
> > + return 128;
> > +}
> > +
> > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > +static int ea0_size_to_ranges(int size)
> > +{
> > + switch (size) {
> > + case 8:
> > + return 6;
> > + case 16:
> > + return 11;
> > + case 32:
> > + return 23;
> > + case 64:
> > + return 46;
> > + default:
> > + return 0;
> > + }
> > +}
>
> > I wonder if there's a math formula here? Can you explain where those
> > numbers come from?

I'll add a comment there.
Basically, for the inline case it is the biggest N such that 4 + 4*N +
7*(N-1) <= 63
(not 64, because Xarray steals one bit from us)

For the out-of-line case it is the biggest N such that 6 + 4*N + 7*(N-1)
<= array size in bits (128, 256, or 512).

>
> > +#define RANGES_INLINE ea0_size_to_ranges(8)
>
> Maybe
>
> #define RANGES_INLINE (6)

Fair enough.


> > +
> > +static void bitmap_write(unsigned long *bitmap, unsigned long value,
> > + unsigned long *pos, unsigned long bits)
>
> Please don't steal prefixes. But the idea is good. For the next
> iteration, let's rename bitmap_set_value() to bitmap_write()?
> So that your function will be an mte_bitmap_write().

I don't object, but it diverges from bitmap_set_value8() now.
Will do.

2023-07-19 20:50:29

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Tue, Jul 18, 2023 at 11:09 PM Yury Norov <[email protected]> wrote:
> This is the 4th time I see mr. Stepanov's credentials in the patch.
> I've no doubts he's a worthy gentleman but please avoid mentioning
> people in source code. Suggested-by is enough. IIRC, the rule for
> that has existed for about a decade.

Thank you for your kind words :)
Indeed, Suggested-by is more than enough.

2023-07-19 21:57:59

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Wed, Jul 19, 2023 at 04:00:14PM +0200, Alexander Potapenko wrote:
> > > + * 4. For the inline case, the following values are stored in the 8-byte handle:
> > > + * largest_idx : i4
> > > + * r_tags[0..5] : i4 x 6
> > > + * r_sizes[0..4] : i7 x 5
> > > + * (if N is less than 6, @r_tags and @r_sizes are padded up with zero values)
> > > + *
> > > + * Because @largest_idx is <= 5, bit 63 of the handle is always 0 (so it can
> > > + * be stored in the Xarray), and bits 62..60 cannot all be 1, so it can be
> > > + * distinguished from a kernel pointer.
> >
> > I honestly tried to understand... For example, what does the
> > 'r_sizes[0..4] : i7 x 5'
> > mean? An array of 5 elements, 17 bits each? But that alone is greater
> > than the size of a pointer... Oh gosh...
>
> iN (note that it is a small i, not a 1) is LLVM notation for an N-bit
> integer type.
> There's no such thing in C syntax, and describing the data layout
> using bitfields would be painful.
> Would it help if I add a comment explaining that iN is an N-bit
> integer? Or do you prefer something like
>
> r_sizes[0..4] : 5 x 7 bits
>
> ?

Yes, that would help.

> >
> > Moreover, MTE tags are all 4-bits.
> >
> > It seems like text without pictures is above my mental abilities. Can you
> > please illustrate it? For example, from the #4, I (hopefully correctly)
> > realized that:
> >
> > Inline frame format:
> > 0 60 62 63
> > +---------------------------------------------------------------------+
> I think it's more natural to number bits from 63 to 0
>
> > | No idea what happens here | Lidx | X |
> > +---------------------------------------------------------------------+
> > 63 : X : RAZ : Reserved for Xarray.
> > 60-62 : Lidx : 0..5 : Largest element index.
>
> There's some mismatch now between this picture, where Lidx is i3, and
> the implementation that treats it as an i4, merging bit 63 into it.
> Maybe we can avoid this by not splitting off bit 63?
>
> > 6 : Reserved
> > 7 : Invalid handle
>
> No, 7 means "treat this handle as an out-of-line one". It is still
> valid, but instead of tag data it contains a pointer.

So, it's an invalid combination for an _inline_ handle, right? Anyways, I'm
waiting for the updated docs; hopefully they will bring some light.

> > > +
> > > +/* Transform tag ranges back into tags. */
> > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > +{
> > > + int i, j, pos = 0;
> > > + u8 prev;
> > > +
> > > + for (i = 0; i < r_len; i++) {
> > > + for (j = 0; j < r_sizes[i]; j++) {
> > > + if (pos % 2)
> > > + tags[pos / 2] = (prev << 4) | r_tags[i];
> > > + else
> > > + prev = r_tags[i];
> > > + pos++;

This code flushes tags at every 2nd iteration. Is it true that
there's always an even number of iterations, i.e. that the r_sizes
entries are always even, assuming r_len can be 1?

If not, it's possible to lose a tag: consider r_len == 1 and
r_sizes[0] == 1.

If yes, you can simplify:

for (i = 0; i < r_len; i++)
        for (j = 0; j < r_sizes[i]; j++)
                tags[pos++] = (r_tags[i] << 4) | r_tags[i];

Anyways, in the test can you run all possible combinations?

> > > + }
> > > + }
> > > +}
> > > +EXPORT_SYMBOL_NS(ea0_ranges_to_tags, MTECOMP);
> >
> > Because I didn't understand the compressed frame format, not sure I
> > can understand this logic...
> Hope the above comments will help, if not - please let me know.
>
> > > +
> > > +/* Translate @num_ranges into the allocation size needed to hold them. */
> > > +static int ea0_alloc_size(int num_ranges)
> > > +{
> > > + if (num_ranges <= 6)
> > > + return 8;
> > > + if (num_ranges <= 11)
> > > + return 16;
> > > + if (num_ranges <= 23)
> > > + return 32;
> > > + if (num_ranges <= 46)
> > > + return 64;
> > > + return 128;
> > > +}
> > > +
> > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > +static int ea0_size_to_ranges(int size)
> > > +{
> > > + switch (size) {
> > > + case 8:
> > > + return 6;
> > > + case 16:
> > > + return 11;
> > > + case 32:
> > > + return 23;
> > > + case 64:
> > > + return 46;
> > > + default:
> > > + return 0;
> > > + }
> > > +}
> >
> > I wonder if there's a math formula here? Can you explain where
> > those numbers come from?
>
> I'll add a comment there.
> Basically, for the inline case it is the biggest N such that
> 4 + 4*N + 7*(N-1) <= 63
>
> (not 64, because Xarray steals one bit from us)
>
> For the out-of-line case it is the biggest N such that
> 6 + 4*N + 7*(N-1) <= array size in bits (128, 256, or 512).

So it's: N <= (BIT_CAPACITY - HDRSZ + SZ) / (TAGSZ + SZ)

Doesn't look like rocket science...

Why don't you code the formula instead of the results? Are there any
difficulties? Things may change: for example, the next spec may bring
3- or 5-bit tags, and your compressor will crack loudly with
hardcoded numbers.

2023-07-20 12:29:34

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Wed, Jul 19, 2023 at 11:06 PM Yury Norov <[email protected]> wrote:
>
> > Would it help if I add a comment explaining that iN is an N-bit
> > integer? Or do you prefer something like
> >
> > r_sizes[0..4] : 5 x 7 bits
> >
> > ?
>
> Yes, that would help.
Ok.


> > > 7 : Invalid handle
> >
> > No, 7 means "treat this handle as an out-of-line one". It is still
> > valid, but instead of tag data it contains a pointer.
>
> So, it's an invalid combination for an _inline_ handle, right?
Yes, that's right.
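
In code, the inline/out-of-line dispatch could look roughly like this
(just a sketch to illustrate the format; the function name is made up,
and it assumes the out-of-line handle stores the kernel pointer as-is):

	static bool ea0_is_inline(unsigned long handle)
	{
		/*
		 * Inline handles have bit 63 set to 0 and keep
		 * largest_idx <= 5 in bits 62..60. Out-of-line handles
		 * store a kernel pointer, whose bits 63..60 are all 1.
		 */
		return (handle >> 60) != 0xf;
	}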

> Anyway, I'm
> waiting for the updated docs; hopefully they will shed some light.
It's on the way.

> > > > +
> > > > +/* Transform tag ranges back into tags. */
> > > > +void ea0_ranges_to_tags(u8 *r_tags, short *r_sizes, int r_len, u8 *tags)
> > > > +{
> > > > + int i, j, pos = 0;
> > > > + u8 prev;
> > > > +
> > > > + for (i = 0; i < r_len; i++) {
> > > > + for (j = 0; j < r_sizes[i]; j++) {
> > > > + if (pos % 2)
> > > > + tags[pos / 2] = (prev << 4) | r_tags[i];
> > > > + else
> > > > + prev = r_tags[i];
> > > > + pos++;
>
> This code flushes tags at every 2nd iteration. Is it true that
> there's always an even number of iterations, i.e. r_sizes[i] is always
> even, assuming r_len can be 1?
>
> If not, it's possible to lose a tag: consider r_len == 1 and
> r_sizes[0] == 1.

By design, ea0_ranges_to_tags() only accepts ranges that sum up to 256
(as produced by ea0_tags_to_ranges()).
We could add a check for that, but given that it is an internal
helper, a comment should suffice.
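
If we did add the check, it would be something along these lines
(a sketch, not part of the patch):

	int i, total = 0;

	for (i = 0; i < r_len; i++)
		total += r_sizes[i];
	WARN_ON_ONCE(total != 256);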

But r_sizes[i] does not have to be even (otherwise we could've used
this fact for a more efficient compression).
For example, if the array of tags is {0xab, 0x0, 0x0, 0x0...}, then
r_len = 3, r_sizes[] = {1, 1, 254}, r_tags = {0x0a, 0x0b, 0x0}.
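
For reference, the opposite conversion is roughly the following (a
simplified sketch of what ea0_tags_to_ranges() does, not the exact
patch code):

	static void tags_to_ranges_sketch(const u8 *tags, u8 *r_tags,
					  short *r_sizes, int *r_len)
	{
		u8 prev = tags[0] >> 4; /* tag of the first 16-byte granule */
		int pos, len = 0;

		r_tags[0] = prev;
		r_sizes[0] = 0;
		for (pos = 0; pos < 256; pos++) {
			/* Tags are packed two per byte, high nibble first. */
			u8 cur = (pos % 2) ? (tags[pos / 2] & 0xf) :
					     (tags[pos / 2] >> 4);

			if (cur != prev) {
				len++;
				r_tags[len] = cur;
				r_sizes[len] = 0;
				prev = cur;
			}
			r_sizes[len]++;
		}
		*r_len = len + 1;
	}

Feeding the {0xab, 0x0, ...} buffer above into it yields exactly
r_len = 3, r_sizes[] = {1, 1, 254}, r_tags[] = {0xa, 0xb, 0x0}.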

>
> If yes, you can simplify:
>
> for (i = 0; i < r_len; i++)
>         for (j = 0; j < r_sizes[i]; j += 2)
>                 tags[pos++] = (r_tags[i] << 4) | r_tags[i];
>
> Anyway, can you run all possible combinations in the test?

I will extract code from compress_range_helper() to generate different
numbers of ranges and test against those.
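
Something like the following to generate the inputs (a sketch of the
test logic; names are not final):

	/* Build @num_ranges ranges that together cover all 256 tags. */
	static void gen_ranges(u8 *r_tags, short *r_sizes, int num_ranges)
	{
		int i;

		for (i = 0; i < num_ranges; i++) {
			r_tags[i] = i % 2; /* adjacent ranges must differ */
			r_sizes[i] = 1;
		}
		r_sizes[num_ranges - 1] = 256 - (num_ranges - 1);
	}

and then, for every num_ranges between 1 and 256, convert the ranges
to tags, compress, decompress and compare the result to the original
buffer.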


> > > > +/* Translate allocation size into maximum number of ranges that it can hold. */
> > > > +static int ea0_size_to_ranges(int size)
> > > > +{
> > > > + switch (size) {
> > > > + case 8:
> > > > + return 6;
> > > > + case 16:
> > > > + return 11;
> > > > + case 32:
> > > > + return 23;
> > > > + case 64:
> > > > + return 46;
> > > > + default:
> > > > + return 0;
> > > > + }
> > > > +}
> > >
> > > I wonder if there's a math formula here? Can you explain where
> > > those numbers come from?
> >
> > I'll add a comment there.
> > Basically, for the inline case it is the biggest N such that
> > 4 + 4*N + 7*(N-1) <= 63
> >
> > (not 64, because Xarray steals one bit from us)
> >
> > For the out-of-line case it is the biggest N such that
> > 6 + 4*N + 7*(N-1) <= array size in bits (128, 256, or 512).
>
> So it's: N <= (BIT_CAPACITY - HDRSZ + SZ) / (TAGSZ + SZ)
>
> Doesn't look like rocket science...

Note that the size of largest_idx is picked based on the largest N for
64-byte buffers.
And the width of the r_sizes[] entries depends on the number of tags
that fit into 128 bytes: if we throw away the largest range, each of
the remaining sizes will fit into 7 bits.
But this will not be the case for e.g. 3-bit tags (we'll need 8-bit sizes then).

Changing mte_size_to_ranges() to "(size * 8 + MTE_BITS_PER_SIZE -
largest_bits) / (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE)" indeed looks
better than hardcoded numbers, but we should keep in mind that the
parameters depend on each other and cannot be easily changed.
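
I.e. something like this (a sketch; the constant names are tentative,
and it only covers the out-of-line sizes - the 8-byte inline case is
special because only 63 bits are usable and the header is 4 bits wide):

	#define MTE_BITS_PER_TAG	4
	#define MTE_BITS_PER_SIZE	7
	#define MTE_BITS_PER_LARGEST_IDX 6

	/* 16-, 32- and 64-byte buffers hold 11, 23 and 46 ranges. */
	static int mte_size_to_ranges(int size)
	{
		return (size * 8 + MTE_BITS_PER_SIZE - MTE_BITS_PER_LARGEST_IDX) /
		       (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
	}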

>
> Why don't you code the formula instead of the results? Are there any
> difficulties? Things may change: for example, the next spec may bring
> 3- or 5-bit tags, and your compressor will crack loudly with
> hardcoded numbers.

The main reason for the compressor to crack loudly would be the
hardcoded assumption that the tags consist of 4 bits.
I don't think it is practical to plan for that number changing (and
e.g. use bitmap_read()/bitmap_write() to convert tags into ranges) - we'd
rather have a static assert for that.
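
E.g. (a sketch; MTE_BITS_PER_TAG would be defined elsewhere in the
series):

	static_assert(MTE_BITS_PER_TAG == 4,
		      "EA0 compression assumes 4-bit MTE tags");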


--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg