2023-07-20 17:56:31

by Alexander Potapenko

Subject: [PATCH v4 0/5] Implement MTE tag compression for swapped pages

Currently, when MTE pages are swapped out, their tags are kept in
memory, occupying 128 bytes per page. This is especially problematic
for devices that use zram-backed in-memory swap, because tags stored
uncompressed in the heap effectively reduce the amount of available
swap memory.

The RLE-based EA0 algorithm suggested by Evgenii Stepanov and
implemented in this patch series can efficiently compress 128-byte tag
buffers, achieving a practical compression ratio between 2.5x and 20x.
In most cases the compressed data fits into a 63-bit XArray value, so
no extra memory allocation is needed.

Our measurements show that EA0 provides better compression than existing
kernel compression algorithms (LZ4, LZO, LZ4HC, ZSTD) can offer, because
EA0 specifically targets 128-byte buffers.

To implement compression/decompression, we also extend <linux/bitmap.h>
with methods to set/get bit values at arbitrary places in the map.
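
For illustration, a minimal (untested) sketch of how the new helpers, named
bitmap_read()/bitmap_write() as of v4, are meant to be used, assuming a
64-bit build:

	unsigned long map[2] = { 0 };

	/* Write the 4-bit value 0xa at bit offset 5, then read it back. */
	bitmap_write(map, 0xa, 5, 4);
	WARN_ON(bitmap_read(map, 5, 4) != 0xa);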

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed case. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.
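
As a rough usage sketch (error handling omitted; the actual call sites live
in the new mteswap_comp.c), the compression API is driven as follows:

	u8 tags[128], dtags[128];
	unsigned long handle;
	bool ok;

	/* Swap-out: compress the 128-byte tag buffer of a page. */
	handle = mte_compress(tags);	/* may allocate from a kmem cache */
	/* ... the 63-bit handle is what ends up in the tags XArray ... */

	/* Swap-in: recover the tags into dtags, then release the storage. */
	ok = mte_decompress(handle, dtags);
	mte_release_handle(handle);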

v4:
- fixed a bunch of comments by Andy Shevchenko and Yury Norov
- added Documentation/arch/arm64/mte-tag-compression.rst

v3:
- as suggested by Andy Shevchenko, use
bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
- switched to unsigned long to reduce typecasts
- simplified the compression code

v2:
- as suggested by Yury Norov, replace the poorly implemented struct
bitq with <linux/bitmap.h>



Alexander Potapenko (4):
lib/test_bitmap: add tests for bitmap_{set,get}_value()
arm64: mte: implement CONFIG_ARM64_MTE_COMP
arm64: mte: add a test for MTE tags compression
arm64: mte: add compression support to mteswap.c

Syed Nayyar Waris (1):
lib/bitmap: add bitmap_{set,get}_value()

Documentation/arch/arm64/index.rst | 1 +
.../arch/arm64/mte-tag-compression.rst | 212 ++++++++++
arch/arm64/Kconfig | 19 +
arch/arm64/include/asm/mtecomp.h | 13 +
arch/arm64/mm/Makefile | 7 +
arch/arm64/mm/mtecomp.c | 382 ++++++++++++++++++
arch/arm64/mm/mteswap.c | 20 +-
arch/arm64/mm/mteswap.h | 12 +
arch/arm64/mm/mteswap_comp.c | 54 +++
arch/arm64/mm/mteswap_nocomp.c | 38 ++
arch/arm64/mm/test_mtecomp.c | 217 ++++++++++
include/linux/bitmap.h | 60 +++
lib/test_bitmap.c | 81 ++++
13 files changed, 1105 insertions(+), 11 deletions(-)
create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c
create mode 100644 arch/arm64/mm/mteswap.h
create mode 100644 arch/arm64/mm/mteswap_comp.c
create mode 100644 arch/arm64/mm/mteswap_nocomp.c
create mode 100644 arch/arm64/mm/test_mtecomp.c

--
2.41.0.487.g6d72f3e995-goog



2023-07-20 18:05:27

by Alexander Potapenko

Subject: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

From: Syed Nayyar Waris <[email protected]>

The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
- instead of using roundup(), which adds an unnecessary dependency
on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
- indentation is reduced by not using else-clauses (suggested by
checkpatch for bitmap_get_value())
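
For illustration (not part of the change itself), a value whose bit range
crosses a word boundary is split between two adjacent words; assuming
BITS_PER_LONG == 64:

	unsigned long map[2] = { 0, 0 };

	/* 0xab occupies bits 60..67: the low nibble goes to map[0], the high one to map[1]. */
	bitmap_write(map, 0xab, 60, 8);
	/* Now map[0] == 0xb000000000000000 and map[1] == 0xa. */
	WARN_ON(bitmap_read(map, 60, 8) != 0xab);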

Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Syed Nayyar Waris <[email protected]>
Signed-off-by: William Breathitt Gray <[email protected]>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <[email protected]>
Co-developed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>

---
v4:
- Address comments by Andy Shevchenko and Yury Norov:
- prevent passing values >= 64 to GENMASK()
- fix commit authorship
- change comments
- check for unlikely(nbits==0)
- drop unnecessary const declarations
- fix kernel-doc comments
- rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
* bitmap_to_arr32(buf, src, nbits) Copy nbits from buf to u32[] dst
* bitmap_to_arr64(buf, src, nbits) Copy nbits from buf to u64[] dst
* bitmap_get_value8(map, start) Get 8bit value from map at start
+ * bitmap_read(map, start, nbits) Read an nbits-sized value from
+ * map at start
* bitmap_set_value8(map, value, start) Set 8bit value to map at start
+ * bitmap_write(map, value, start, nbits) Write an nbits-sized value to
+ * map at start
*
* Note, bitmap_zero() and bitmap_fill() operate over the region of
* unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
return (map[index] >> offset) & 0xFF;
}

+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+ unsigned long start,
+ unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+ unsigned long value_low, value_high;
+
+ if (unlikely(!nbits))
+ return 0;
+ if (space >= nbits)
+ return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+ value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+ value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+ return (value_low >> offset) | (value_high << space);
+}
+
/**
* bitmap_set_value8 - set an 8-bit value within a memory region
* @map: address to the bitmap memory region
@@ -599,6 +630,35 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
map[index] |= value << offset;
}

+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+ unsigned long value,
+ unsigned long start, unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+
+ if (unlikely(!nbits))
+ return;
+ value &= GENMASK(nbits - 1, 0);
+ if (space >= nbits) {
+ map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+ map[index] |= value << offset;
+ return;
+ }
+ map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+ map[index] |= value << offset;
+ map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+ map[index + 1] |= (value >> space);
+}
+
#endif /* __ASSEMBLY__ */

#endif /* __LINUX_BITMAP_H */
--
2.41.0.487.g6d72f3e995-goog


2023-07-20 18:09:59

by Alexander Potapenko

Subject: [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression

Ensure that tag sequences containing alternating values are compressed
to buffers of expected size and correctly decompressed afterwards.

Signed-off-by: Alexander Potapenko <[email protected]>

---
v4:
- addressed comments by Andy Shevchenko:
- expanded MTE to "Memory Tagging Extension" in Kconfig
- changed signed variables to unsigned where applicable
- added missing header dependencies

- addressed comments by Yury Norov:
- moved test-only declarations from mtecomp.h into this test
- switched to the new "mte"-prefixed function names, dropped the
mentions of "EA0"
- added test_tag_to_ranges_n()

v3:
- addressed comments by Andy Shevchenko in another patch:
- switched from u64 to unsigned long
- added MODULE_IMPORT_NS(MTECOMP)
- fixed includes order
---
arch/arm64/Kconfig | 10 ++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/test_mtecomp.c | 217 +++++++++++++++++++++++++++++++++++
3 files changed, 228 insertions(+)
create mode 100644 arch/arm64/mm/test_mtecomp.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3ac6e302b1509..2432475003054 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2102,6 +2102,16 @@ config ARM64_MTE_COMP

128-byte tag buffers corresponding to 4K pages can be compressed to save heap memory.

+config ARM64_MTE_COMP_KUNIT_TEST
+ tristate "Test tag compression for ARM64 Memory Tagging Extension" if !KUNIT_ALL_TESTS
+ default KUNIT_ALL_TESTS
+ depends on KUNIT && ARM64_MTE_COMP
+ help
+ Test MTE compression algorithm enabled by CONFIG_ARM64_MTE_COMP.
+
+ Ensure that tag sequences containing alternating values are compressed
+ to buffers of expected size and correctly decompressed afterwards.
+
config ARM64_SVE
bool "ARM Scalable Vector Extension support"
default y
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 46778f6dd83c2..170dc62b010b9 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_TRANS_TABLE) += trans_pgd-asm.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_ARM64_MTE) += mteswap.o
obj-$(CONFIG_ARM64_MTE_COMP) += mtecomp.o
+obj-$(CONFIG_ARM64_MTE_COMP_KUNIT_TEST) += test_mtecomp.o
KASAN_SANITIZE_physaddr.o += n

obj-$(CONFIG_KASAN) += kasan_init.o
diff --git a/arch/arm64/mm/test_mtecomp.c b/arch/arm64/mm/test_mtecomp.c
new file mode 100644
index 0000000000000..185eb6cb73650
--- /dev/null
+++ b/arch/arm64/mm/test_mtecomp.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test cases for MTE tags compression algorithm.
+ */
+
+#include <linux/bits.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <kunit/test.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* Functions exported from mtecomp.c for this test. */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+ size_t *out_len);
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+ u8 *tags);
+
+/*
+ * Test that mte_tags_to_ranges() produces a single range for a zero-filled tag
+ * buffer.
+ */
+static void test_tags_to_ranges_zero(struct kunit *test)
+{
+ u8 tags[128], dtags[128];
+ unsigned short r_sizes[256];
+ size_t r_len = 256;
+ u8 r_tags[256];
+
+ memset(tags, 0, 128);
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ KUNIT_EXPECT_EQ(test, r_len, 1);
+ KUNIT_EXPECT_EQ(test, r_tags[0], 0);
+ KUNIT_EXPECT_EQ(test, r_sizes[0], 256);
+ mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a small number of different tags is correctly transformed into
+ * ranges.
+ */
+static void test_tags_to_ranges_simple(struct kunit *test)
+{
+ u8 tags[128], dtags[128];
+ const u8 ex_tags[] = { 0xa, 0x0, 0xa, 0xb, 0x0 };
+ const unsigned short ex_sizes[] = { 1, 2, 2, 1, 250 };
+ unsigned short r_sizes[256];
+ size_t r_len = 256;
+ u8 r_tags[256];
+
+ memset(tags, 0, 128);
+ tags[0] = 0xa0;
+ tags[1] = 0x0a;
+ tags[2] = 0xab;
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ KUNIT_EXPECT_EQ(test, r_len, 5);
+ KUNIT_EXPECT_EQ(test, memcmp(r_tags, ex_tags, sizeof(ex_tags)), 0);
+ KUNIT_EXPECT_EQ(test, memcmp(r_sizes, ex_sizes, sizeof(ex_sizes)), 0);
+ mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Test that repeated 0xa0 byte produces 256 ranges of length 1. */
+static void test_tags_to_ranges_repeated(struct kunit *test)
+{
+ u8 tags[128], dtags[128];
+ unsigned short r_sizes[256];
+ size_t r_len = 256;
+ u8 r_tags[256];
+
+ memset(tags, 0xa0, 128);
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ KUNIT_EXPECT_EQ(test, r_len, 256);
+ mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/* Generate a buffer that will contain @nranges of tag ranges. */
+static void gen_tag_range_helper(u8 *tags, int nranges)
+{
+ unsigned int i;
+
+ memset(tags, 0, 128);
+ if (nranges > 1) {
+ nranges--;
+ for (i = 0; i < nranges / 2; i++)
+ tags[i] = 0xab;
+ if (nranges % 2)
+ tags[nranges / 2] = 0xa0;
+ }
+}
+
+/*
+ * Test that mte_tags_to_ranges()/mte_ranges_to_tags() work for various
+ * r_len values.
+ */
+static void test_tag_to_ranges_n(struct kunit *test)
+{
+ unsigned short r_sizes[256];
+ u8 tags[128], dtags[128];
+ unsigned int i, j, sum;
+ size_t r_len = 256;
+ u8 r_tags[256];
+
+ for (i = 1; i <= 256; i++) {
+ gen_tag_range_helper(tags, i);
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ sum = 0;
+ for (j = 0; j < r_len; j++)
+ sum += r_sizes[j];
+ KUNIT_EXPECT_EQ(test, sum, 256);
+ mte_ranges_to_tags(r_tags, r_sizes, r_len, dtags);
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+ }
+}
+
+/* Test that a zero-filled array is compressed into inline storage. */
+static void test_compress_zero(struct kunit *test)
+{
+ u8 tags[128], dtags[128];
+ unsigned long handle;
+
+ memset(tags, 0, 128);
+ handle = mte_compress(tags);
+ KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+ /* Tags are stored inline. */
+ KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+ KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a very small number of tag ranges ends up compressed into 8 bytes.
+ */
+static void test_compress_simple(struct kunit *test)
+{
+ u8 tags[128], dtags[128];
+ unsigned long handle;
+
+ memset(tags, 0, 128);
+ tags[0] = 0xa0;
+ tags[1] = 0x0a;
+ tags[2] = 0xab;
+
+ handle = mte_compress(tags);
+ KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+ /* Tags are stored inline. */
+ KUNIT_EXPECT_EQ(test, mte_storage_size(handle), 8);
+ KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that a buffer containing @nranges ranges compresses into @exp_size
+ * bytes and decompresses into the original tag sequence.
+ */
+static void compress_range_helper(struct kunit *test, int nranges,
+ size_t exp_size)
+{
+ u8 tags[128], dtags[128];
+ unsigned long handle;
+
+ gen_tag_range_helper(tags, nranges);
+ handle = mte_compress(tags);
+ KUNIT_EXPECT_EQ(test, handle & BIT_ULL(63), 0);
+ KUNIT_EXPECT_EQ(test, mte_storage_size(handle), exp_size);
+ KUNIT_EXPECT_TRUE(test, mte_decompress(handle, dtags));
+ KUNIT_EXPECT_EQ(test, memcmp(tags, dtags, 128), 0);
+}
+
+/*
+ * Test that every number of tag ranges is correctly compressed and
+ * decompressed.
+ */
+static void test_compress_ranges(struct kunit *test)
+{
+ size_t exp_size;
+ unsigned int i;
+
+ for (i = 1; i <= 256; i++) {
+ if (i < 7)
+ exp_size = 8;
+ else if (i < 12)
+ exp_size = 16;
+ else if (i < 24)
+ exp_size = 32;
+ else if (i < 47)
+ exp_size = 64;
+ else
+ exp_size = 128;
+ compress_range_helper(test, i, exp_size);
+ }
+}
+
+static struct kunit_case mtecomp_test_cases[] = {
+ KUNIT_CASE(test_tags_to_ranges_zero),
+ KUNIT_CASE(test_tags_to_ranges_simple),
+ KUNIT_CASE(test_tags_to_ranges_repeated),
+ KUNIT_CASE(test_tag_to_ranges_n),
+ KUNIT_CASE(test_compress_zero),
+ KUNIT_CASE(test_compress_simple),
+ KUNIT_CASE(test_compress_ranges),
+ {}
+};
+
+static struct kunit_suite mtecomp_test_suite = {
+ .name = "mtecomp",
+ .test_cases = mtecomp_test_cases,
+};
+kunit_test_suites(&mtecomp_test_suite);
+
+MODULE_IMPORT_NS(MTECOMP);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Alexander Potapenko <[email protected]>");
--
2.41.0.487.g6d72f3e995-goog


2023-07-20 18:14:49

by Alexander Potapenko

Subject: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

The config implements an algorithm that compresses memory tags for ARM MTE
during swapping.

The algorithm is based on RLE and specifically targets 128-byte buffers
of tags corresponding to a single page. In the common case a buffer
can be compressed into 63 bits, making it possible to store it without
additional memory allocation.
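
For example (mirroring one of the KUnit test cases added later in this
series), a mostly-zero tag buffer with tags[0] = 0xa0, tags[1] = 0x0a and
tags[2] = 0xab is first split into five tag ranges:

	r_tags[]  = { 0xa, 0x0, 0xa, 0xb, 0x0 };
	r_sizes[] = {   1,   2,   2,   1, 250 };	/* sums up to 256 */

Five ranges fit into the inline case (up to six ranges), so the tags of the
whole page are encoded in the 63-bit handle itself, with no allocation.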

Suggested-by: Evgenii Stepanov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>

---
v4:
- Addressed comments by Andy Shevchenko:
- expanded "MTE" to "Memory Tagging Extension" in Kconfig
- fixed kernel-doc comments, moved them to C source
- changed variables to unsigned where applicable
- some code simplifications, fewer unnecessary assignments
- added the mte_largest_idx_bits() helper
- added namespace prefixes to all functions
- added missing headers (but removed bits.h)
- Addressed comments by Yury Norov:
- removed test-only functions from mtecomp.h
- dropped the algorithm name (all functions are now prefixed with
"mte")
- added more comments
- got rid of MTE_RANGES_INLINE
- renamed bitmap_{get,set}_value() to bitmap_{read,write}()
- moved the big comment explaining the algorithm to
Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
add a link to it from Documentation/arch/arm64/index.rst
- removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

v3:
- Addressed comments by Andy Shevchenko:
- use bitmap_{set,get}_value() written by Syed Nayyar Waris
- switched to unsigned long everywhere (fewer casts)
- simplified the code, removed redundant checks
- dropped ea0_compress_inline()
- added bit size constants and helpers to access the bitmap
- explicitly initialize all compressed sizes in ea0_compress_to_buf()
- initialize all handle bits

v2:
- as suggested by Yury Norov, switched from struct bitq (which is
not needed anymore) to <linux/bitmap.h>
- add missing symbol exports
---
Documentation/arch/arm64/index.rst | 1 +
.../arch/arm64/mte-tag-compression.rst | 212 ++++++++++
arch/arm64/Kconfig | 9 +
arch/arm64/include/asm/mtecomp.h | 13 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/mtecomp.c | 382 ++++++++++++++++++
6 files changed, 618 insertions(+)
create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
legacy_instructions
memory
memory-tagging-extension
+ mte-tag-compression
perf
pointer-authentication
ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..af6716d53c1a8
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress memory tags for the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
+carves out 3.125% (1/16) of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in the kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory, which in the case of a zram-backed swap may
+even exceed the memory used to store the swapped pages themselves. To reduce
+memory waste, ``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in
+compressed form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress a 128-byte (``MTE_GRANULES_PER_PAGE / 2``)
+array of tags into a smaller byte sequence that can be stored in a
+16-, 32-, or 64-byte buffer. A special case is storing the tags inline in
+an 8-byte pointer.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to 128-byte buffers
+allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given 128-byte ``tags`` buffer, allocates
+ storage for it, and returns an opaque handle addressing that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+ and fills the 128-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+ mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+ by ``handle``.
+
+Depending on the size of compressed data, ``mte_compress()`` stores it in one of
+the size classes backed by kmem caches: ``mte-tags-16``, ``mte-tags-32``,
+``mte-tags-64``, or ``mte-tags-128`` (for the data that cannot be compressed
+into 64 bytes and is stored uncompressed).
+In a practical common case the tags can be compressed into 8 bytes; they are
+then stored in the handle itself.
+
+Handle format
+-------------
+
+The handle returned by ``mte_compress()`` is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+ 63 62 60 ... 2 0
+ +---+--------+-----+------------+
+ | 0 | INLINE | ... | SIZE_CLASS |
+ +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1, the
+data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size class for out-of-line allocations:
+
+- ``0b001`` for ``mte-tags-16``;
+- ``0b010`` for ``mte-tags-32``;
+- ``0b100`` for ``mte-tags-64``;
+- ``0b000`` for ``mte-tags-128``.
+
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows.
+
+1. The input array of 128 bytes is transformed into tag ranges (two arrays:
+ ``r_tags[]`` containing tag values and ``r_sizes[]`` containing range
+ lengths) by ``mte_tags_to_ranges()``. Note that ``r_sizes[]`` sums up to 256.
+
+2. The index of the largest element of ``r_sizes[]`` is stored in
+ ``largest_idx``. The element itself is removed from ``r_sizes[]``,
+ because it can be reconstructed from the sum of the remaining elements. Note
+ that now none of the remaining ``r_sizes[]`` elements exceeds 127.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+ N <= 6: 8 bytes (inline case, no allocation required);
+ 6 < N <= 11: 16 bytes
+ 11 < N <= 23: 32 bytes
+ 23 < N <= 46: 64 bytes
+ 46 < N: 128 bytes (no compression will be performed)
+
+(See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+ handle (``i<size>`` means a ``<size>``-bit unsigned integer)::
+
+ largest_idx : i4
+ r_tags[0..5] : i4 x 6
+ r_sizes[0..4] : i7 x 5
+
+ (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded up with zero
+ values)
+
+ Because ``largest_idx`` is <= 5, bit 63 of the handle is always 0 (so it
+ can be stored in the XArray), and bits 62..60 cannot all be 1, so it can be
+ distinguished from a kernel pointer.
+
+5. For the out-of-line case, the storage is allocated from one of the
+ ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+ on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+ Bit 63 of the pointer is zeroed out, so that it can be stored in XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+ largest_idx : i6
+ r_tags[0..N] : i4 x N
+ r_sizes[0..N-1] : i7 x (N-1)
+
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+ or out-of-line.
+
+2. For the inline case, treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to understand
+ the input buffer length. To obtain the pointer to the input buffer, unset
+ bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+ buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+ input buffer. Calculate the removed largest element of ``r_sizes[]`` as
+ ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+ position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+ buffer ``r_sizes[i]`` times.
+
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed 127, those can be packed into 7 bits, whereas a single tag
+occupies 4 bits, and ``largest_idx`` cannot take more than 8 bits.
+
+Now, for each ``S``-byte size class it is possible to find the maximal number
+``M`` such that ``8 + 4 * M + 7 * (M - 1) <= 8 * S``,
+i.e. ``M = (8 * S - 1) / 11``::
+
+ +-------------+----+--------------+
+ | Buffer size | M | Storage bits |
+ +-------------+----+--------------+
+ | 8 | 5 | 56 |
+ | 16 | 11 | 122 |
+ | 32 | 23 | 254 |
+ | 64 | 46 | 507 |
+ +-------------+----+--------------+
+
+We can notice that ``M`` (and therefore ``largest_idx``) actually always fits
+into 6 bits. For the inline case it is even guaranteed to fit into 3 bits, which
+lets us squeeze an extra range into an 8-byte buffer. Because the inline case
+requires bit 63 of the handle to be zero, we add that bit to ``largest_idx``,
+knowing it will not be used.
+
+For the revised ``largest_idx`` sizes, we now pick the maximal number ``N``
+such that ``L + 4 * N + 7 * (N - 1) <= 8 * S``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * S + 7 - L) / 11``, therefore::
+
+ +-------------+----+--------------+
+ | Buffer size | N | Storage bits |
+ +-------------+----+--------------+
+ | 8 | 6 | 63 |
+ | 16 | 11 | 120 |
+ | 32 | 23 | 252 |
+ | 64 | 46 | 505 |
+ +-------------+----+--------------+
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+ .. kernel-doc:: arch/arm64/mm/mtecomp.c
+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..3ac6e302b1509 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,15 @@ config ARM64_EPAN
if the cpu does not implement the feature.
endmenu # "ARMv8.7 architectural features"

+config ARM64_MTE_COMP
+ bool "Tag compression for ARM64 Memory Tagging Extension"
+ default y
+ depends on ARM64_MTE
+ help
+ Enable tag compression support for ARM64 Memory Tagging Extension.
+
+ 128-byte tag buffers corresponding to 4K pages can be compressed to save heap memory.
+
config ARM64_SVE
bool "ARM Scalable Vector Extension support"
default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE) += trans_pgd.o
obj-$(CONFIG_TRANS_TABLE) += trans_pgd-asm.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_ARM64_MTE) += mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP) += mtecomp.o
KASAN_SANITIZE_physaddr.o += n

obj-$(CONFIG_KASAN) += kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..6d27f69f9521f
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * Proposed by Evgenii Stepanov <[email protected]>
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+/* The handle must fit into an Xarray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/*
+ * Four separate caches to store out-of-line data:
+ * 0: mte-tags-128
+ * 1: mte-tags-16
+ * 2: mte-tags-32
+ * 3: mte-tags-64
+ */
+#define MTECOMP_NUM_CACHES 4
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+
+/*
+ * Sizes of compressed values. These depend on MTE_TAG_SIZE and
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_TAG 4
+#define MTE_BITS_PER_SIZE 7
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+#define MTE_BITS_PER_LARGEST_IDX 6
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+ if (len < 128)
+ return fls(len >> 4);
+ return 0;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+ if (id == 0)
+ return 128;
+ return 8 << id;
+}
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: 128-byte array containing 256 MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ * equal to lengths of out_tags[] and out_sizes[]).
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+ size_t *out_len)
+{
+ u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+ unsigned int cur_idx = 0, i, j;
+ u8 cur_tag;
+
+ memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+ memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+ out_tags[0] = prev_tag;
+ for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
+ for (j = 0; j < 2; j++) {
+ cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
+ if (cur_tag == prev_tag) {
+ out_sizes[cur_idx]++;
+ } else {
+ cur_idx++;
+ prev_tag = cur_tag;
+ out_tags[cur_idx] = prev_tag;
+ out_sizes[cur_idx] = 1;
+ }
+ }
+ }
+ *out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: 128-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+ u8 *tags)
+{
+ unsigned int i, j, pos = 0;
+ u8 prev;
+
+ for (i = 0; i < r_len; i++) {
+ for (j = 0; j < r_sizes[i]; j++) {
+ if (pos % 2)
+ tags[pos / 2] = (prev << 4) | r_tags[i];
+ else
+ prev = r_tags[i];
+ pos++;
+ }
+ }
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such that:
+ * 4 + 4*N + 7*(N-1) <= 63 bits, for the inline case,
+ * or
+ * 6 + 4*N + 7*(N-1) <= array size in bits (128, 256, or 512),
+ * for the out-of-line case.
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+ size_t largest_bits;
+ size_t ret = 0;
+
+ largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+ MTE_BITS_PER_LARGEST_IDX;
+ ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+ (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
+ return ret;
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+ size_t sizes[4] = { 8, 16, 32, 64 };
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(sizes); i++) {
+ if (num_ranges <= mte_size_to_ranges(sizes[i]))
+ return sizes[i];
+ }
+ return 128;
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+ return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress.
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+ if (mte_is_inline(handle))
+ return 8;
+ return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+ unsigned long *pos, unsigned long bits)
+{
+ bitmap_write(bitmap, value, *pos, bits);
+ *pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+ if (size == 8)
+ return MTE_BITS_PER_LARGEST_IDX_INLINE;
+ return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+ unsigned long *bitmap, size_t size)
+{
+ unsigned long bit_pos = 0, l_bits;
+ unsigned int largest_idx, i;
+ unsigned short largest = 0;
+ size_t max_ranges;
+
+ for (i = 0; i < len; i++) {
+ if (sizes[i] > largest) {
+ largest = sizes[i];
+ largest_idx = i;
+ }
+ }
+ l_bits = mte_largest_idx_bits(size);
+ max_ranges = mte_size_to_ranges(size);
+ mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+ for (i = 0; i < len; i++)
+ mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_BITS_PER_TAG);
+ for (i = len; i < max_ranges; i++)
+ mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_TAG);
+ for (i = 0; i < len; i++) {
+ if (i != largest_idx)
+ mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+ MTE_BITS_PER_SIZE);
+ }
+ for (i = len; i < max_ranges; i++)
+ mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: 128-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @mte_release_handle().
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+ unsigned short r_sizes[256];
+ struct kmem_cache *cache;
+ unsigned long *storage;
+ unsigned int cache_id;
+ size_t alloc_size;
+ u8 r_tags[256];
+ size_t r_len;
+ /*
+ * mte_compress_to_buf() only initializes the bits that mte_decompress()
+ * will read. But when the tags are stored in the handle itself, it must
+ * have all its bits initialized.
+ */
+ unsigned long result = 0;
+
+ r_len = sizeof(r_tags);
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ alloc_size = mte_alloc_size(r_len);
+ if (alloc_size == 8) {
+ mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+ alloc_size);
+ return result;
+ }
+ cache_id = mte_size_to_cache_id(alloc_size);
+ cache = mtecomp_caches[cache_id];
+ storage = kmem_cache_alloc(cache, GFP_KERNEL);
+ if (alloc_size < 128) {
+ /* alloc_size is always a multiple of sizeof(unsigned long). */
+ mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+ alloc_size);
+ return ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+ }
+ memcpy(storage, tags, alloc_size);
+ return (unsigned long)storage & MTE_HANDLE_MASK;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+ unsigned long *pos, unsigned long bits)
+{
+ unsigned long result;
+
+ result = bitmap_read(bitmap, *pos, bits);
+ *pos += bits;
+ return result;
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+ u8 *tags)
+{
+ unsigned short r_sizes[46], sum = 0;
+ unsigned long bit_pos = 0, l_bits;
+ unsigned int largest_idx, i;
+ size_t max_ranges;
+ u8 r_tags[46];
+
+ max_ranges = mte_size_to_ranges(size);
+ l_bits = mte_largest_idx_bits(size);
+ largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+ for (i = 0; i < max_ranges; i++)
+ r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_TAG);
+ for (i = 0; i < max_ranges; i++) {
+ if (i == largest_idx)
+ continue;
+ r_sizes[i] =
+ mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+ if (!r_sizes[i]) {
+ max_ranges = i;
+ break;
+ }
+ sum += r_sizes[i];
+ }
+ if (sum >= 256)
+ return false;
+ r_sizes[largest_idx] = 256 - sum;
+ mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+ return true;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+ if (mte_is_inline(handle))
+ return NULL;
+ return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by @mte_compress()
+ * @tags: 128-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+ unsigned long *storage = mte_storage(handle);
+ size_t size = mte_storage_size(handle);
+
+ switch (size) {
+ case 8:
+ return mte_decompress_from_buf(&handle, size, tags);
+ case 128:
+ memcpy(tags, storage, size);
+ return true;
+ default:
+ return mte_decompress_from_buf(storage, size, tags);
+ }
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+ struct kmem_cache *c;
+ void *storage;
+ size_t size;
+
+ storage = mte_storage(handle);
+ if (!storage)
+ return;
+
+ size = mte_storage_size(handle);
+ c = mtecomp_caches[mte_size_to_cache_id(size)];
+ kmem_cache_free(c, storage);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+ unsigned int i;
+ char name[16];
+ size_t size;
+
+ static_assert(MTE_PAGE_TAG_STORAGE == 128);
+ static_assert(MTE_TAG_SIZE == MTE_BITS_PER_TAG);
+ for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+ size = mte_cache_id_to_size(i);
+ snprintf(name, sizeof(name), "mte-tags-%ld", size);
+ mtecomp_caches[i] =
+ kmem_cache_create(name, size, size, 0, NULL);
+ }
+ return 0;
+}
+module_init(mtecomp_init);
--
2.41.0.487.g6d72f3e995-goog


2023-07-21 11:56:00

by Andy Shevchenko

Subject: Re: [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression

On Thu, Jul 20, 2023 at 07:39:55PM +0200, Alexander Potapenko wrote:
> Ensure that tag sequences containing alternating values are compressed
> to buffers of expected size and correctly decompressed afterwards.

...

> +#include <linux/bits.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>

> +#include <kunit/test.h>

Keep this in a separate group outside of linux/*.

> +#include <linux/types.h>
> +
> +#include <asm/mtecomp.h>

...

> +/* Functions exported from mtecomp.c for this test. */
> +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> + size_t *out_len);
> +void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
> + u8 *tags);

This is interesting. Have you run `make W=1` on this code?

--
With Best Regards,
Andy Shevchenko



2023-07-21 11:59:30

by Andy Shevchenko

Subject: Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Thu, Jul 20, 2023 at 07:39:54PM +0200, Alexander Potapenko wrote:
> The config implements an algorithm that compresses memory tags for ARM MTE
> during swapping.
>
> The algorithm is based on RLE and specifically targets 128-byte buffers
> of tags corresponding to a single page. In the common case a buffer
> can be compressed into 63 bits, making it possible to store it without
> additional memory allocation.

...

> +Programming Interface
> +=====================
> +
> + .. kernel-doc:: arch/arm64/mm/mtecomp.c

:export:

> +

Is it a dangling trailing blank line? Drop it.

...

> +#include <linux/bitmap.h>

> +#include <linux/bitops.h>

This is guaranteed to be included by bitmap.h.

> +#include <linux/export.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/types.h>

...

> +/*
> + * Sizes of compressed values. These depend on MTE_TAG_SIZE and

of the

> + * MTE_GRANULES_PER_PAGE.
> + */

...

> + u8 prev_tag = tags[0] / 16; /* First tag in the array. */
> + unsigned int cur_idx = 0, i, j;
> + u8 cur_tag;


> + out_tags[0] = prev_tag;

out_tags[cur_idx] ?

> + for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> + for (j = 0; j < 2; j++) {
> + cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> + if (cur_tag == prev_tag) {
> + out_sizes[cur_idx]++;

> + } else {
> + cur_idx++;
> + prev_tag = cur_tag;
> + out_tags[cur_idx] = prev_tag;
> + out_sizes[cur_idx] = 1;

Looking more at this, I think there is still room for improvement. I can't
come up with a proposal right now (lunch time :-), but I would look into

do {
...
} while (i < MTE_...);

approach.

> + }
> + }
> + }

...

> +static size_t mte_size_to_ranges(size_t size)
> +{
> + size_t largest_bits;

> + size_t ret = 0;

Redundant assignment. Please, check again all of them.

> +
> + largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
> + MTE_BITS_PER_LARGEST_IDX;
> + ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /

Hmm... I thought that we moved BYTES_TO_BITS() to the generic header...
Okay, never mind.

> + (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
> + return ret;

return (...) / ...;

> +}

...

> +static size_t mte_alloc_size(unsigned int num_ranges)
> +{
> + size_t sizes[4] = { 8, 16, 32, 64 };

Hooray! And now it's not needed anymore...

> + unsigned int i;
> +
> + for (i = 0; i < ARRAY_SIZE(sizes); i++) {

...as sizes[i] is equivalent to (8 << i).

> + if (num_ranges <= mte_size_to_ranges(sizes[i]))
> + return sizes[i];
> + }
> + return 128;
> +}

> +}
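
I.e. a sketch of that (untested, same behaviour), dropping the array:

static size_t mte_alloc_size(unsigned int num_ranges)
{
	unsigned int i;

	for (i = 0; i < 4; i++) {
		if (num_ranges <= mte_size_to_ranges(8 << i))
			return 8 << i;
	}
	return 128;
}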

...

> +/**
> + * mte_compress() - compress the given tag array.
> + * @tags: 128-byte array to read the tags from.
> + *
> + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> + * tag storage. May allocate memory, which is freed by @mte_release_handle().

+ blank line here.

> + * Returns: 64-bit tag storage handle.
> + */

...

> + /*
> + * mte_compress_to_buf() only initializes the bits that mte_decompress()
> + * will read. But when the tags are stored in the handle itself, it must
> + * have all its bits initialized.
> + */
> + unsigned long result = 0;

// Actually it's interesting how it's supposed to work on 32-bit
// builds...
DECLARE_BITMAP(result, BITS_PER_LONG);

and then

bitmap_zero();

?

...

> +static unsigned long mte_bitmap_read(const unsigned long *bitmap,
> + unsigned long *pos, unsigned long bits)
> +{
> + unsigned long result;
> +
> + result = bitmap_read(bitmap, *pos, bits);
> + *pos += bits;
> + return result;

unsigned long start = *pos;

*pos += bits;
return bitmap_read(bitmap, start, bits);

> +}

...

> + unsigned short r_sizes[46], sum = 0;

See below.

...

It's cleaner and more robust to have

sum = 0;

here.

> + for (i = 0; i < max_ranges; i++) {
> + if (i == largest_idx)
> + continue;
> + r_sizes[i] =
> + mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
> + if (!r_sizes[i]) {
> + max_ranges = i;
> + break;
> + }
> + sum += r_sizes[i];
> + }
> + if (sum >= 256)
> + return false;

--
With Best Regards,
Andy Shevchenko



2023-07-23 02:08:55

by Yury Norov

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> + unsigned long value,
> + unsigned long start, unsigned long nbits)
> +{
> + size_t index = BIT_WORD(start);
> + unsigned long offset = start % BITS_PER_LONG;
> + unsigned long space = BITS_PER_LONG - offset;
> +
> + if (unlikely(!nbits))
> + return;
> + value &= GENMASK(nbits - 1, 0);

Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
because otherwise it's an out-of-bounds type of error.

This is kind of gray zone to me, because it's a caller's responsibility
to provide correct input. But on the other hand, it would be a great
headache debugging corrupted bitmaps.

Now that we've got a single user of the bitmap_write, and even more,
it's wrapped with a helper, I think it would be reasonable to trim a
'value' in the helper, if needed.

Anyways, the comment must warn about that loudly...

> + if (space >= nbits) {
> + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

'GENMASK(nbits - 1, 0) << offset' looks really silly.

> + map[index] |= value << offset;

Here you commit 2 reads and 2 writes for a word instead of one.

> + return;
> + }
> + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);

~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
~25 bytes of .text on x86_64.

> + map[index] |= value << offset;
> + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> + map[index + 1] |= (value >> space);
> +}

With all that I think the implementation should look something like
this:

unsigned long w, mask;

if (unlikely(nbits == 0))
return 0;

map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;

w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);

if (end < BITS_PER_LONG)
return;

w = *++map & BITMAP_FIRST_WORD_MASK(end);
*map = w | (value >> BITS_PER_LONG * 2 - end);

It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.

Alexander, can you please try the above and compare?

In previous iteration, I asked you to share disassembly listings for the
functions. Can you please do that now?

Regarding the rest of the series:
- I still see Evgenii's name in mtecomp.c, and EA0 references;
- git-am throws warning about trailing line;
- checkpatch warns 7 times;

Can you fix all the above before sending the new version?

Have you tested the generic part against BE32, BE64 and LE32 architectures,
and the arch part against BE64? If not, please do.

You're mentioning that the compression ratio is 2 to 20x. Can you
share the absolute numbers? If it's 1k vs 2k, I think most people
just don't care...

Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?

Thanks,
Yury

2023-07-23 16:09:35

by Yury Norov

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Sat, Jul 22, 2023 at 06:57:26PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.
>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...

OK. I spent a night with that, and I'm still not sure. Pseudo code
that implements it looks like this:

for (bit = 0; bit < nbits; bit++)
assign_bit(start + bit, bitmap, val & BIT(bit));

And it ignores trailing bits. So if we're optimizing this pattern,
we'd ignore these bits just as well...

Either way, whatever we decide, let's be clear about our intentions
and mention explicitly that tail bits either must be zero or are
ignored.

Alexander, can you add the snippet above to the comments for the
bitmap_write() and bitmap_read(), as well as to the test? Also, if we
decide to clear the tail of the input value, would BITMAP_LAST_WORD_MASK()
generate better code than GENMASK(nbits - 1, 0) does?

Comments are very much appreciated.

Thanks,
Yury

2023-07-24 08:46:25

by Andy Shevchenko

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

> > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

But you followed the thread to get a clue why it's written in this form, right?

...

> With all that I think the implementation should look something like
> this:

I would go this way if and only if the code generation on main architectures
with both GCC and clang is better.

And maybe even some performance tests need to be provided.

...

> Alexander, can you please try the above and compare?

> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Exactly what we need before going with your suggestion(s).

--
With Best Regards,
Andy Shevchenko



2023-07-25 05:31:14

by Yury Norov

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
>
> But you followed the thread to get a clue why it's written in this form, right?

Yes, I did. But I don't expect everyone looking at kernel code would spend
time recovering discussions that explain why that happened. So, at least it
would be fine to drop a comment.

> ...
>
> > With all that I think the implementation should look something like
> > this:
>
> I would go this way if and only if the code generation on main architectures
> with both GCC and clang is better.
>
> And maybe even some performance tests need to be provided.

For the following implementation:

void my_bitmap_write(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long w, end;

if (unlikely(nbits == 0))
return;

value &= GENMASK(nbits - 1, 0);

map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;

w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);

if (end < BITS_PER_LONG)
return;

w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
*map = w | (value >> (BITS_PER_LONG - start));
}

This is the bloat-o-meter output:

$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function old new delta
test_bitmap_init 3846 5484 +1638
test_bitmap_write_perf - 401 +401
bitmap_write - 271 +271
my_bitmap_write - 248 +248
bitmap_read - 229 +229
__pfx_test_bitmap_write_perf - 16 +16
__pfx_my_bitmap_write - 16 +16
__pfx_bitmap_write - 16 +16
__pfx_bitmap_read - 16 +16
Total: Before=36964, After=39815, chg +7.71%

And this is the performance test:

for (cnt = 0; cnt < 5; cnt++) {
time = ktime_get();
for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
for (i = 0; i < 1000; i++) {
if (i + nbits > 1000)
break;
bitmap_write(bmap, val, i, nbits);
}
}
time = ktime_get() - time;
pr_err("bitmap_write:\t%llu\t", time);

time = ktime_get();
for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
for (i = 0; i < 1000; i++) {
if (i + nbits > 1000)
break;
my_bitmap_write(bmap, val, i, nbits);
}
}
time = ktime_get() - time;
pr_cont("%llu\n", time);
}

Which on x86_64/kvm with GCC gives:
Orig My
[ 1.630731] test_bitmap: bitmap_write: 299092 252764
[ 1.631584] test_bitmap: bitmap_write: 299522 252554
[ 1.632429] test_bitmap: bitmap_write: 299171 258665
[ 1.633280] test_bitmap: bitmap_write: 299241 252794
[ 1.634133] test_bitmap: bitmap_write: 306716 252934

So, it's ~15% difference in performance and 8% in size.

I don't insist on my implementation, but I think we should experiment more
with code generation.

Thanks,
Yury

2023-07-25 09:23:44

by Andy Shevchenko

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Mon, Jul 24, 2023 at 10:04:34PM -0700, Yury Norov wrote:
> On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> > On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

...

> > > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> >
> > But you followed the thread to get a clue why it's written in this form, right?
>
> Yes, I did. But I don't expect everyone looking at kernel code would spend
> time recovering discussions that explain why that happened. So, at least it
> would be fine to drop a comment.

See also below.

...

> w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));

This GENMASK() may generate worse code, as the compiler needs more
instructions than for the simpler approach above.

...

> bitmap_write - 271 +271
> my_bitmap_write - 248 +248
> bitmap_read - 229 +229

my_ -- means your proposal? Do you mean yours is better in size than the original one?

--
With Best Regards,
Andy Shevchenko



2023-07-26 09:01:13

by Alexander Potapenko

Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <[email protected]> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.

I can easily imagine someone passing -1 (or ~0) as a value, but
wanting to only write n bits of it.

>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > + if (space >= nbits) {
> > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

As noted in the other patch discussion, pulling offset into GENMASK is
actually not an identity transformation, because nbits + offset may
exceed 64, producing an invalid mask.

>
> > + map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.

There won't be two reads and two writes.
The compiler will read map[index] into a register, apply the mask, OR
in value << offset, and then perform a single write.
Unless map[] is volatile, the redundant reads/writes will be optimized
away by any decent compiler.
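
Roughly speaking (a sketch of the expected codegen using the same locals as
bitmap_write() above, not actual compiler output), the two statements
collapse into a single read-modify-write:

	unsigned long w = map[index];              /* one read */

	w &= ~(GENMASK(nbits - 1, 0) << offset);   /* clear the field */
	w |= value << offset;                      /* insert the new value */
	map[index] = w;                            /* one write */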

>
> > + return;
> > + }
> > + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
> ~25 bytes of .text on x86_64.
>
> > + map[index] |= value << offset;
> > + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > + map[index + 1] |= (value >> space);
> > +}
>
> With all that I think the implementation should look something like
> this:
>
> unsigned long w, mask;
>
> if (unlikely(nbits == 0))
> return 0;
>
> map += BIT_WORD(start);
> start %= BITS_PER_LONG;
> end = start + nbits - 1;
>
> w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
> *map = w | (value << start);
>
> if (end < BITS_PER_LONG)
> return;
>
> w = *++map & BITMAP_FIRST_WORD_MASK(end);
> *map = w | (value >> BITS_PER_LONG * 2 - end);
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?

I like the idea of sharing the first write between the branches, and
it can be made even shorter:

===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
		       ~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
===========================================================

According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
GCC (which on the other hand does a poor job optimizing both).

Overall, given that there's currently a single user of these
functions, isn't it premature to optimize them without knowing
anything about their performance?

> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Will godbolt work for you (see above)?


>
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;
Will fix, thanks!

> - git-am throws warning about trailing line;
Interesting, I generate the patches using `git format-patch`, and I
thought `git am` was supposed to be its inverse. I'll check.

> - checkpatch warns 7 times;

Indeed, there were warnings that I ignored (e.g. one of them was
caused by me adding extern symbols to the test module, as requested
during the review process). I think I can fix all of them.

>
> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

BE64 works; I still need to check the 32-bit versions.
Note that MTE is an ARM64 feature (yet we still need to ensure
bitops.h works on 32 bits).

>
> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

For out-of-line allocations the data can be derived from
/proc/slabinfo, but we don't calculate inline allocations.
Agreed, a debugfs interface won't hurt.

2023-07-27 01:18:00

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <[email protected]> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > > +/**
> > > + * bitmap_write - write n-bit value within a memory region
> > > + * @map: address to the bitmap memory region
> > > + * @value: value of nbits
> > > + * @start: bit offset of the n-bit value
> > > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > > + */
> > > +static inline void bitmap_write(unsigned long *map,
> > > + unsigned long value,
> > > + unsigned long start, unsigned long nbits)
> > > +{
> > > + size_t index = BIT_WORD(start);
> > > + unsigned long offset = start % BITS_PER_LONG;
> > > + unsigned long space = BITS_PER_LONG - offset;
> > > +
> > > + if (unlikely(!nbits))
> > > + return;
> > > + value &= GENMASK(nbits - 1, 0);
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> > because otherwise it's an out-of-bounds type of error.
>
> I can easily imagine someone passing -1 (or ~0) as a value, but
> wanting to write only n bits of it.

This is an abuse of the new API because we've got bitmap_set(). But
whatever, let's keep that masking.

...

> I like the idea of sharing the first write between the branches, and
> it can be made even shorter:
>
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> unsigned long start, unsigned long nbits)
> {
> unsigned long offset;
> unsigned long space;
> size_t index;
> bool fit;
>
> if (unlikely(!nbits))
> return;
>
> value &= GENMASK(nbits - 1, 0);
> offset = start % BITS_PER_LONG;
> space = BITS_PER_LONG - offset;
> index = BIT_WORD(start);
> fit = space >= nbits;

space >= nbits <=>
BITS_PER_LONG - offset >= nbits <=>
offset + nbits <= BITS_PER_LONG

> map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :

So here GENMASK(nbits + offset - 1, offset) is at max:
GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
point. Does it make sense?
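
(Concretely: the largest in-word case is e.g. offset == 60 with nbits == 4,
where the mask is GENMASK(63, 60) - still within a single word.)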

> ~BITMAP_FIRST_WORD_MASK(start));

As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
and vice versa.
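
(For reference, include/linux/bitmap.h defines them roughly as:

	#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
	#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

so for a non-zero in-word offset one is the complement of the other.)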

> map[index] |= value << offset;
> if (fit)
> return;
>
> map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> map[index + 1] |= (value >> space);
> }
> ===========================================================
>
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
> is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
> GCC (which on the other hand does a poor job optimizing both).
>
> Overall, given that there's currently a single user of these
> functions, isn't it premature to optimize them without knowing
> anything about their performance?
>
> > In previous iteration, I asked you to share disassembly listings for the
> > functions. Can you please do that now?
>
> Will godbolt work for you (see above)?

I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.

...

> > You're mentioning that the compression ratio is 2 to 20x. Can you
> > share the absolute numbers? If it's 1k vs 2k, I think most people
> > just don't care...
>
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.

That's impressive. Fruitful idea. It would be important for embedded guys
who may disable MTE because of the memory overhead. I think it's worth
mentioning that in Kconfig together with the associated performance
overhead, if it is ever measurable.

> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
>
> For out-of-line allocations the data can be derived from
> /proc/slabinfo, but we don't calculate inline allocations.
> Agreed, a debugfs interface won't hurt.

2023-08-04 16:34:16

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

> space >= nbits <=>
> BITS_PER_LONG - offset >= nbits <=>
> offset + nbits <= BITS_PER_LONG
>
> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max:
> GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> point. Does it make sense?

It indeed does. Perhaps pulling offset inside GENMASK is not a bug
after all (a simple test does not show any difference in their
behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.
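
To illustrate the guess (a sketch with a hypothetical local `mask`, not
actual codegen):

	/* GENMASK(nbits - 1, 0) variant: the mask is computed once... */
	mask = GENMASK(nbits - 1, 0);
	value &= mask;
	...
	map[index] &= ~(mask << offset);	/* ...and reused here. */

	/* GENMASK(nbits - 1 + offset, offset) variant: a second, independent
	 * mask computation that the compiler apparently cannot derive from
	 * the first one. */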

>
> > ~BITMAP_FIRST_WORD_MASK(start));
>
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > and vice versa.

Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().

>
> > map[index] |= value << offset;
> > if (fit)
> > return;
> >
> > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);

OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with BITMAP_FIRST_WORD_MASK here.

> > map[index + 1] |= (value >> space);
> > }

I'll post the implementations together with the disassembly below.
I used a Clang 17.0.0 build that is a couple of months behind
upstream, but that still produces consistently shorter code (~48 bytes
less) than the trunk GCC on Godbolt.

1. Original implementation of bitmap_write() from this patch - 164
bytes (interestingly, it's 157 bytes with Clang 14.0.6)

==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
		  unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (unlikely(!nbits))
		return;
	value &= GENMASK(nbits - 1, 0);
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
1200: 41 57 push %r15
1202: 41 56 push %r14
1204: 53 push %rbx
1205: 48 85 c9 test %rcx,%rcx
1208: 74 7b je 1285 <bitmap_write+0x85>
120a: 48 89 c8 mov %rcx,%rax
120d: 49 89 d2 mov %rdx,%r10
1210: 49 c1 ea 06 shr $0x6,%r10
1214: 41 89 d1 mov %edx,%r9d
1217: f6 d9 neg %cl
1219: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1220: 49 d3 ef shr %cl,%r15
1223: 41 83 e1 3f and $0x3f,%r9d
1227: 41 b8 40 00 00 00 mov $0x40,%r8d
122d: 4c 21 fe and %r15,%rsi
1230: 48 89 f3 mov %rsi,%rbx
1233: 44 89 c9 mov %r9d,%ecx
1236: 48 d3 e3 shl %cl,%rbx
1239: 4d 29 c8 sub %r9,%r8
123c: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
1243: 4e 8b 34 d7 mov (%rdi,%r10,8),%r14
1247: 49 39 c0 cmp %rax,%r8
124a: 73 3f jae 128b <bitmap_write+0x8b>
124c: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1253: 44 89 c9 mov %r9d,%ecx
1256: 49 d3 e7 shl %cl,%r15
1259: 49 f7 d7 not %r15
125c: 4d 21 fe and %r15,%r14
125f: 49 09 de or %rbx,%r14
1262: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
1266: 01 c2 add %eax,%edx
1268: f6 da neg %dl
126a: 89 d1 mov %edx,%ecx
126c: 49 d3 eb shr %cl,%r11
126f: 49 f7 d3 not %r11
1272: 44 89 c1 mov %r8d,%ecx
1275: 48 d3 ee shr %cl,%rsi
1278: 4e 23 5c d7 08 and 0x8(%rdi,%r10,8),%r11
127d: 4c 09 de or %r11,%rsi
1280: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
1285: 5b pop %rbx
1286: 41 5e pop %r14
1288: 41 5f pop %r15
128a: c3 ret
128b: 44 89 c9 mov %r9d,%ecx
128e: 49 d3 e7 shl %cl,%r15
1291: 49 f7 d7 not %r15
1294: 4d 21 fe and %r15,%r14
1297: 49 09 de or %rbx,%r14
129a: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
129e: 5b pop %rbx
129f: 41 5e pop %r14
12a1: 41 5f pop %r15
12a3: c3 ret
==================================================================

2. Your implementation, my_bitmap_write() - 152 bytes (used to be
slightly worse with Clang 14.0.6 - 159 bytes):

==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
		    BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
1160: 48 85 c9 test %rcx,%rcx
1163: 0f 84 8e 00 00 00 je 11f7 <my_bitmap_write+0x97>
1169: 49 89 c9 mov %rcx,%r9
116c: f6 d9 neg %cl
116e: 48 d3 e6 shl %cl,%rsi
1171: 48 d3 ee shr %cl,%rsi
1174: 49 89 d2 mov %rdx,%r10
1177: 49 c1 ea 06 shr $0x6,%r10
117b: 89 d0 mov %edx,%eax
117d: 83 e0 3f and $0x3f,%eax
1180: 4e 8d 04 08 lea (%rax,%r9,1),%r8
1184: 4a 8d 0c 08 lea (%rax,%r9,1),%rcx
1188: 48 ff c9 dec %rcx
118b: 4e 8b 0c d7 mov (%rdi,%r10,8),%r9
118f: 48 83 f9 3f cmp $0x3f,%rcx
1193: 77 29 ja 11be <my_bitmap_write+0x5e>
1195: 41 f6 d8 neg %r8b
1198: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
119f: 44 89 c1 mov %r8d,%ecx
11a2: 48 d3 ea shr %cl,%rdx
11a5: 89 c1 mov %eax,%ecx
11a7: 48 d3 ea shr %cl,%rdx
11aa: 48 d3 e2 shl %cl,%rdx
11ad: 48 d3 e6 shl %cl,%rsi
11b0: 48 f7 d2 not %rdx
11b3: 49 21 d1 and %rdx,%r9
11b6: 4c 09 ce or %r9,%rsi
11b9: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
11bd: c3 ret
11be: f6 da neg %dl
11c0: 89 d1 mov %edx,%ecx
11c2: 49 d3 e1 shl %cl,%r9
11c5: 49 d3 e9 shr %cl,%r9
11c8: 48 89 f2 mov %rsi,%rdx
11cb: 89 c1 mov %eax,%ecx
11cd: 48 d3 e2 shl %cl,%rdx
11d0: 4c 09 ca or %r9,%rdx
11d3: 4a 89 14 d7 mov %rdx,(%rdi,%r10,8)
11d7: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
11dc: 41 f6 d8 neg %r8b
11df: 44 89 c1 mov %r8d,%ecx
11e2: 48 d3 e2 shl %cl,%rdx
11e5: 48 d3 ea shr %cl,%rdx
11e8: f6 d8 neg %al
11ea: 89 c1 mov %eax,%ecx
11ec: 48 d3 ee shr %cl,%rsi
11ef: 48 09 d6 or %rdx,%rsi
11f2: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
11f7: c3 ret
==================================================================

3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:

==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
		       ~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
12b0: 48 85 c9 test %rcx,%rcx
12b3: 74 6e je 1323 <bitmap_write_new+0x73>
12b5: 48 89 c8 mov %rcx,%rax
12b8: f6 d9 neg %cl
12ba: 49 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%r10
12c1: 49 d3 ea shr %cl,%r10
12c4: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
12cb: 4c 21 d6 and %r10,%rsi
12ce: 89 d1 mov %edx,%ecx
12d0: 83 e1 3f and $0x3f,%ecx
12d3: 41 b8 40 00 00 00 mov $0x40,%r8d
12d9: 49 29 c8 sub %rcx,%r8
12dc: 49 89 d1 mov %rdx,%r9
12df: 49 c1 e9 06 shr $0x6,%r9
12e3: 49 39 c0 cmp %rax,%r8
12e6: 4d 0f 42 d3 cmovb %r11,%r10
12ea: 49 d3 e2 shl %cl,%r10
12ed: 49 f7 d2 not %r10
12f0: 4e 23 14 cf and (%rdi,%r9,8),%r10
12f4: 49 89 f3 mov %rsi,%r11
12f7: 49 d3 e3 shl %cl,%r11
12fa: 4d 09 d3 or %r10,%r11
12fd: 4e 89 1c cf mov %r11,(%rdi,%r9,8)
1301: 49 39 c0 cmp %rax,%r8
1304: 73 1d jae 1323 <bitmap_write_new+0x73>
1306: 01 d0 add %edx,%eax
1308: 4a 8b 54 cf 08 mov 0x8(%rdi,%r9,8),%rdx
130d: 89 c1 mov %eax,%ecx
130f: 48 d3 ea shr %cl,%rdx
1312: 48 d3 e2 shl %cl,%rdx
1315: 44 89 c1 mov %r8d,%ecx
1318: 48 d3 ee shr %cl,%rsi
131b: 48 09 d6 or %rdx,%rsi
131e: 4a 89 74 cf 08 mov %rsi,0x8(%rdi,%r9,8)
1323: c3 ret
==================================================================

4. The version of bitmap_write_new() that uses `(GENMASK(nbits - 1 +
offset, offset)` - 146 bytes:

==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
			    unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
		       ~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
1330: 48 85 c9 test %rcx,%rcx
1333: 74 6a je 139f <bitmap_write_new_shift+0x6f>
1335: 48 89 c8 mov %rcx,%rax
1338: f6 d9 neg %cl
133a: 48 d3 e6 shl %cl,%rsi
133d: 48 d3 ee shr %cl,%rsi
1340: 41 89 d0 mov %edx,%r8d
1343: 41 83 e0 3f and $0x3f,%r8d
1347: 41 b9 40 00 00 00 mov $0x40,%r9d
134d: 4d 29 c1 sub %r8,%r9
1350: 49 89 d2 mov %rdx,%r10
1353: 49 c1 ea 06 shr $0x6,%r10
1357: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
135e: 44 89 c1 mov %r8d,%ecx
1361: 49 d3 e3 shl %cl,%r11
1364: 49 39 c1 cmp %rax,%r9
1367: 73 37 jae 13a0 <bitmap_write_new_shift+0x70>
1369: 53 push %rbx
136a: 49 f7 d3 not %r11
136d: 4e 23 1c d7 and (%rdi,%r10,8),%r11
1371: 48 89 f3 mov %rsi,%rbx
1374: 44 89 c1 mov %r8d,%ecx
1377: 48 d3 e3 shl %cl,%rbx
137a: 4c 09 db or %r11,%rbx
137d: 4a 89 1c d7 mov %rbx,(%rdi,%r10,8)
1381: 01 d0 add %edx,%eax
1383: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
1388: 89 c1 mov %eax,%ecx
138a: 48 d3 ea shr %cl,%rdx
138d: 48 d3 e2 shl %cl,%rdx
1390: 44 89 c9 mov %r9d,%ecx
1393: 48 d3 ee shr %cl,%rsi
1396: 48 09 d6 or %rdx,%rsi
1399: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
139e: 5b pop %rbx
139f: c3 ret
13a0: 44 01 c0 add %r8d,%eax
13a3: f6 d8 neg %al
13a5: 89 c1 mov %eax,%ecx
13a7: 49 d3 e3 shl %cl,%r11
13aa: 49 d3 eb shr %cl,%r11
13ad: 49 f7 d3 not %r11
13b0: 44 89 c1 mov %r8d,%ecx
13b3: 48 d3 e6 shl %cl,%rsi
13b6: 4e 23 1c d7 and (%rdi,%r10,8),%r11
13ba: 4c 09 de or %r11,%rsi
13bd: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
13c1: c3 ret
==================================================================

For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.

PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this letter until then.


Attachments:
bitmap_write.c (3.94 kB)

2023-08-04 20:25:31

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > space >= nbits <=>
> > BITS_PER_LONG - offset >= nbits <=>
> > offset + nbits <= BITS_PER_LONG
> >
> > > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max:
> > GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> > point. Does it make sense?
>
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug
> after all (a simple test does not show any difference in their
> behavior).
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

OK. Can you put a comment explaining this? Or maybe it would be even
better to use BITMAP_LAST_WORD_MASK() here:

mask = BITMAP_LAST_WORD_MASK(nbits);
value &= mask;
...
map[index] &= (fit ? ~(mask << offset) :

> > > ~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > and vice versa.
>
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().

Wow... If that's consistent across different compilers/arches, we'd
just drop the latter. Thanks for pointing that out. I'll check.

> > > map[index] |= value << offset;
> > > if (fit)
> > > return;
> > >
> > > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with BITMAP_FIRST_WORD_MASK here.
>
> > > map[index + 1] |= (value >> space);
> > > }
>
> I'll post the implementations together with the disassembly below.
> I used a Clang 17.0.0 build that is a couple of months behind
> upstream, but that still produces consistently shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
>
> 1. Original implementation of bitmap_write() from this patch - 164
> bytes (interestingly, it's 157 bytes with Clang 14.0.6)

I spotted that too in some other case. Newer compilers tend to
generate bigger code, but the result usually works faster. One
particular reason in my case was loop unrolling.

[...]

> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:

30% better in size - that's impressive!

> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> unsigned long start, unsigned long nbits)
> {
> unsigned long offset;
> unsigned long space;
> size_t index;
> bool fit;
>
> if (unlikely(!nbits))
> return;
>
> value &= GENMASK(nbits - 1, 0);
> offset = start % BITS_PER_LONG;
> space = BITS_PER_LONG - offset;
> index = BIT_WORD(start);
> fit = space >= nbits;
>
> map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> ~BITMAP_FIRST_WORD_MASK(start));
> map[index] |= value << offset;
> if (fit)
> return;
>
> map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
> map[index + 1] |= (value >> space);
> }

Thanks,
Yury

2023-08-04 20:34:50

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:

> > > > ~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > > and vice versa.
> >
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>
> Wow... If that's consistent across different compilers/arches, we'd
> just drop the latter. Thanks for pointing that. I'll check.

It might be...

...

> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > ~BITMAP_FIRST_WORD_MASK(start));

...that both parts of the ternary use negation, so the compiler may apply the
negation only once. (Haven't looked at the assembly.)

--
With Best Regards,
Andy Shevchenko



2023-09-22 09:30:30

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

> Either way, whatever we decide, let's stay clear with our intentions
> and mention explicitly that the tail bits either must be zero, or are
> ignored.
>
> Alexander, can you add the snippet above to the comments for the
> bitmap_write() and bitmap_read(), as well as in the test? Also, if we
> decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
> generate better code than GENMASK(nbits - 1, 0) does?

Added the snippet above to bitmap_write(). I think, however, that it is
obvious that bitmap_read() reads only nbits?

> Comments are very appreciated.
>
> Thanks,
> Yury



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-09-22 10:08:56

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

>
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked there are none in v5 (only the Suggested-by: tag)

> - git-am throws warning about trailing line;

I checked locally that `git am` does not warn about v5 patches. But
given that the patches are generated with `git format-patch`, I suspect
they get garbled when you download them - could that be the case?

> - checkpatch warns 7 times;

It now warns 4 times: three warnings are about updating MAINTAINERS (I
don't think there's a need for this), and the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?

> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

I did now.


> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description and added the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5

> Thanks,
> Yury



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-09-22 10:10:49

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

> > +MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
> > +(256 4-bit tags each corresponding to a 16-byte MTE granule). By default, MTE
> > +carves out 3.125% (1/16) of the available physical memory to store the tags.
> > +
> > +When MTE pages are saved to swap, their tags need to be stored in the kernel
> > +memory. If the system swap is used heavily, these tags may take a substantial
> > +portion of the physical memory, which in the case of a zram-backed swap may
> > +even exceed the memory used to store the swapped pages themselves.
>
> Hmm, I'm not sure about this claim ;). Is zram so good that it
> manages a 32x compression ratio (4096/128)?

For trivial data, like zero pages, this is possible in theory, but I
am not sure whether zram takes advantage of this. I removed the claim
:)

>
> How much would we save if we only do the compression when it can fit in
> 63 bits?

Good question. What I've observed is that soon after boot roughly 50%
of tag buffers can be compressed into 63 bits.


> > +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> > + size_t *out_len)
> > +{
...
> > +}
> > +EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
>
> What's with the exports here? Are we expecting these functions to be
> called from loadable modules?

Yes, this is for the KUnit test module
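
(For context, a minimal sketch of the consumer side, assuming the MTECOMP
namespace used in EXPORT_SYMBOL_NS() above: the test module only needs

	MODULE_IMPORT_NS(MTECOMP);

otherwise modpost would warn about using the namespaced symbols.)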

>
> --
> Catalin



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-09-22 11:25:07

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] arm64: mte: add a test for MTE tags compression

On Fri, Jul 21, 2023 at 1:25 PM Andy Shevchenko
<[email protected]> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:55PM +0200, Alexander Potapenko wrote:
> > Ensure that tag sequences containing alternating values are compressed
> > to buffers of expected size and correctly decompressed afterwards.
>
> ...
>
> > +#include <linux/bits.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
>
> > +#include <kunit/test.h>
>
> Keep this in a separate group outside of linux/*.

Done.

>
> > +#include <linux/types.h>
> > +
> > +#include <asm/mtecomp.h>
>
> ...
>
> > +/* Functions exported from mtecomp.c for this test. */
> > +void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
> > + size_t *out_len);
> > +void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
> > + u8 *tags);
>
> This is interesting. Have you run `make W=1` on this code?

You are right, this is fishy. I added mtecomp.h and moved the
declarations there instead.

2023-09-22 15:08:48

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

On Fri, Jul 21, 2023 at 1:22 PM Andy Shevchenko
<[email protected]> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:54PM +0200, Alexander Potapenko wrote:
> > The config implements the algorithm compressing memory tags for ARM MTE
> > during swapping.
> >
> > The algorithm is based on RLE and specifically targets 128-byte buffers
> > of tags corresponding to a single page. In the common case a buffer
> > can be compressed into 63 bits, making it possible to store it without
> > additional memory allocation.
>
> ...
>
> > +Programming Interface
> > +=====================
> > +
> > + .. kernel-doc:: arch/arm64/mm/mtecomp.c
>
> :export:

Done

> > +
>
> Is it dangling trailing blank line? Drop it.

Sorry, it's hard to attribute this comment. I am assuming it is
related to Documentation/arch/arm64/mte-tag-compression.rst - done.

> ...
>
> > +#include <linux/bitmap.h>
>
> > +#include <linux/bitops.h>
>
> This is guaranteed to be included by bitmap.h.

I think we'd better stick to IWYU here.
Ingo's patch: https://git.kernel.org/pub/scm/linux/kernel/git/mingo/tip.git/commit/?id=32b1e9e4f5774951a3a80604a39fa1f0674c1833
specifically adds bitmap.h where bits.h is already present, without
removing the latter.
Although there might not be general consensus on this in the kernel
right now, I think Ingo's "Fast Kernel Headers" work sets a good
direction.

>
> > +/*
> > + * Sizes of compressed values. These depend on MTE_TAG_SIZE and
>
> of the

This comment is gone now

>
> > + out_tags[0] = prev_tag;
>
> out_tags[cur_idx] ?

Yeah, looks more readable. Done.

>
> > + for (i = 0; i < MTE_PAGE_TAG_STORAGE; i++) {
> > + for (j = 0; j < 2; j++) {
> > + cur_tag = j ? (tags[i] % 16) : (tags[i] / 16);
> > + if (cur_tag == prev_tag) {
> > + out_sizes[cur_idx]++;
>
> > + } else {
> > + cur_idx++;
> > + prev_tag = cur_tag;
> > + out_tags[cur_idx] = prev_tag;
> > + out_sizes[cur_idx] = 1;
>
> Looking more at this I think there is still a room for improvement. I can't
> come up right now with a proposal (lunch time :-), but I would look into
>
> do {
> ...
> } while (i < MTE_...);
>
> approach.

We can e.g. get rid of the nested loop and iterate over tags instead
of bytes (see v5)
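
Roughly along these lines (a sketch of the idea, not the exact v5 code):

	/* Each byte of tags[] holds two 4-bit tags, so walk the tags directly. */
	for (i = 0; i < MTE_PAGE_TAG_STORAGE * 2; i++) {
		cur_tag = i % 2 ? (tags[i / 2] % 16) : (tags[i / 2] / 16);
		if (cur_tag == prev_tag) {
			out_sizes[cur_idx]++;
		} else {
			cur_idx++;
			prev_tag = cur_tag;
			out_tags[cur_idx] = prev_tag;
			out_sizes[cur_idx] = 1;
		}
	}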


> > +static size_t mte_size_to_ranges(size_t size)
> > +{
> > + size_t largest_bits;
>
> > + size_t ret = 0;
>
> Redundant assignment. Please, check again all of them.

Done.

> > +
> > + largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
> > + MTE_BITS_PER_LARGEST_IDX;
> > + ret = (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
>
> Hmm... I thought that we moved BYTES_TO_BITS() to the generic header...
> Okay, never mind.

Ack

> > + (MTE_BITS_PER_TAG + MTE_BITS_PER_SIZE);
> > + return ret;
>
> return (...) / ...;

Done

> > +}
>
> ...
>
> > +static size_t mte_alloc_size(unsigned int num_ranges)
> > +{
> > + size_t sizes[4] = { 8, 16, 32, 64 };
>
> Hooray! And now it's not needed anymore...
>
> > + unsigned int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(sizes); i++) {
>
> ...as sizes[i] is equivalent of (8 << i).

It's gone now.

> ...
>
> > +/**
> > + * mte_compress() - compress the given tag array.
> > + * @tags: 128-byte array to read the tags from.
> > + *
> > + * Compresses the tags and returns a 64-bit opaque handle pointing to the
> > + * tag storage. May allocate memory, which is freed by @mte_release_handle().
>
> + blank line here.

Done (here and in other places in the file), but I'm wondering why
https://docs.kernel.org/doc-guide/kernel-doc.html does not mandate it.

>
> > + * Returns: 64-bit tag storage handle.
> > + */
>
> ...
>
> > + /*
> > + * mte_compress_to_buf() only initializes the bits that mte_decompress()
> > + * will read. But when the tags are stored in the handle itself, it must
> > + * have all its bits initialized.
> > + */
> > + unsigned long result = 0;
>
> // Actually it's interesting how it's supposed to work on 32-bit
> // builds...

It is not supposed to work on 32 bit.
First, the code is in arch/arm64 :)
Second, 32-bit CPUs do not support MTE (which reserves the four upper
bits of the address)


>
> > +static unsigned long mte_bitmap_read(const unsigned long *bitmap,
> > + unsigned long *pos, unsigned long bits)
> > +{
> > + unsigned long result;
> > +
> > + result = bitmap_read(bitmap, *pos, bits);
> > + *pos += bits;
> > + return result;
>
> unsigned long start = *pos;
>
> *pos += bits;
> return bitmap_read(bitmap, start, bits);

Done, thanks!

> > +}
>
> ...
>
> > + unsigned short r_sizes[46], sum = 0;
>
> See below.
>
> ...
>
> It's cleaner and more robust to have
>
> sum = 0;
>
> here.

Moved it inside the loop init statement
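
I.e. something like this (a sketch; r_sizes is the array from the quoted
code, r_len is a hypothetical loop bound):

	for (i = 0, sum = 0; i < r_len; i++)
		sum += r_sizes[i];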


> --
> With Best Regards,
> Andy Shevchenko
>

Thank you!

2023-09-23 02:30:38

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

> OK. Can you put a comment explaining this? Or maybe would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
> mask = BITMAP_LAST_WORD_MASK(nbits);
> value &= mask;
> ...
> map[index] &= (fit ? (~mask << offset)) :

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?
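
For completeness, the helper now reads roughly like this (a sketch of the
v5 shape after the changes discussed above, not a verbatim quote):

	static inline void bitmap_write(unsigned long *map, unsigned long value,
					unsigned long start, unsigned long nbits)
	{
		size_t index;
		unsigned long offset;
		unsigned long space;
		unsigned long mask;
		bool fit;

		if (unlikely(!nbits))
			return;

		/* BITMAP_LAST_WORD_MASK(nbits) == GENMASK(nbits - 1, 0). */
		mask = BITMAP_LAST_WORD_MASK(nbits);
		value &= mask;
		offset = start % BITS_PER_LONG;
		space = BITS_PER_LONG - offset;
		index = BIT_WORD(start);
		fit = space >= nbits;

		map[index] &= (fit ? (~(mask << offset)) : ~BITMAP_FIRST_WORD_MASK(start));
		map[index] |= value << offset;
		if (fit)
			return;

		map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
		map[index + 1] |= (value >> space);
	}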