2023-09-22 13:37:57

by Alexander Potapenko

Subject: [PATCH v5 0/5] Implement MTE tag compression for swapped pages

Currently, when MTE pages are swapped out, the tags are kept in
memory, occupying PAGE_SIZE/32 bytes per page. This is especially
problematic for devices that use zram-backed in-memory swap, because
tags stored uncompressed in the heap effectively reduce the available
amount of swap memory.
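For example, with 4K pages every swapped-out MTE page costs 4096/32 = 128
bytes of tag storage, so 2 GiB worth of swapped MTE pages keep 64 MiB of
uncompressed tags resident in the heap.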

The RLE-based algorithm suggested by Evgenii Stepanov and implemented in
this patch series efficiently compresses these fixed-size tag buffers,
achieving a practical compression ratio between 2.5x and 4x. In most
cases the compressed data fits into a 63-bit XArray value, so no extra
memory allocation is needed.
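
As a rough illustration (a sketch with made-up identifiers, not the actual
mteswap.c code), keeping such a handle in an XArray amounts to:

/*
 * Illustrative sketch only: a 63-bit compression handle fits into an
 * XArray value entry, so no separate tag buffer has to be allocated.
 * "tag_storage" and "swp_index" are made-up names.
 */
#include <linux/gfp.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(tag_storage);

static int save_handle(unsigned long swp_index, unsigned long handle)
{
        /* xa_mk_value() accepts values that fit into 63 bits. */
        return xa_err(xa_store(&tag_storage, swp_index, xa_mk_value(handle),
                               GFP_KERNEL));
}

static unsigned long load_handle(unsigned long swp_index)
{
        void *entry = xa_load(&tag_storage, swp_index);

        return entry ? xa_to_value(entry) : 0;
}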

Our measurements show that the proposed algorithm provides better
compression than existing kernel compression algorithms (LZ4, LZO,
LZ4HC, ZSTD) can offer.

To implement compression/decompression, we also extend <linux/bitmap.h>
with helpers that read/write bit values at arbitrary positions in the map.
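
Their behavior is roughly as in the sketch below (a minimal illustration;
the signatures match the ones exercised by the tests in patch 2/5):

#include <linux/bitmap.h>
#include <linux/bug.h>

/*
 * Minimal sketch: bitmap_write() stores the low @nbits of @value at bit
 * offset @start, bitmap_read() retrieves them. The accessed bit range may
 * cross word boundaries.
 */
static void bitmap_rw_example(void)
{
        DECLARE_BITMAP(map, 128);

        bitmap_zero(map, 128);
        /* Bits 60..67 span two unsigned longs on a 64-bit machine. */
        bitmap_write(map, 0x5a, 60, 8);
        WARN_ON(bitmap_read(map, 60, 8) != 0x5a);
}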

We refactor arch/arm64/mm/mteswap.c to support both the compressed
(CONFIG_ARM64_MTE_COMP) and non-compressed case. For the former, in
addition to tag compression, we move tag allocation from kmalloc() to
separate kmem caches, providing greater locality and relaxing the
alignment requirements.

v5:
- addressed comments from Andy Shevchenko, Catalin Marinas, and Yury Norov
- added support for 16K and 64K pages
- more efficient bitmap_write() implementation

v4:
- addressed a bunch of comments from Andy Shevchenko and Yury Norov
- added Documentation/arch/arm64/mte-tag-compression.rst

v3:
- as suggested by Andy Shevchenko, use
  bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
- switched to unsigned long to reduce typecasts
- simplified the compression code

v2:
- as suggested by Yury Norov, replaced the poorly implemented struct
  bitq with <linux/bitmap.h>



Alexander Potapenko (4):
lib/test_bitmap: add tests for bitmap_{read,write}()
arm64: mte: implement CONFIG_ARM64_MTE_COMP
arm64: mte: add a test for MTE tags compression
arm64: mte: add compression support to mteswap.c

Syed Nayyar Waris (1):
lib/bitmap: add bitmap_{read,write}()

Documentation/arch/arm64/index.rst | 1 +
.../arch/arm64/mte-tag-compression.rst | 245 +++++++++
arch/arm64/Kconfig | 21 +
arch/arm64/include/asm/mtecomp.h | 13 +
arch/arm64/mm/Makefile | 7 +
arch/arm64/mm/mtecomp.c | 467 ++++++++++++++++++
arch/arm64/mm/mtecomp.h | 12 +
arch/arm64/mm/mteswap.c | 20 +-
arch/arm64/mm/mteswap.h | 12 +
arch/arm64/mm/mteswap_comp.c | 60 +++
arch/arm64/mm/mteswap_nocomp.c | 38 ++
arch/arm64/mm/test_mtecomp.c | 287 +++++++++++
include/linux/bitmap.h | 68 +++
lib/test_bitmap.c | 115 +++++
14 files changed, 1355 insertions(+), 11 deletions(-)
create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c
create mode 100644 arch/arm64/mm/mtecomp.h
create mode 100644 arch/arm64/mm/mteswap.h
create mode 100644 arch/arm64/mm/mteswap_comp.c
create mode 100644 arch/arm64/mm/mteswap_nocomp.c
create mode 100644 arch/arm64/mm/test_mtecomp.c

--
2.42.0.515.g380fc7ccd1-goog


2023-09-22 16:04:08

by Alexander Potapenko

Subject: [PATCH v5 3/5] arm64: mte: implement CONFIG_ARM64_MTE_COMP

This config option implements an algorithm that compresses memory tags
for ARM MTE during swapping.

The algorithm is based on RLE and specifically targets buffers of tags
corresponding to a single page. In the common case a buffer can be
compressed into 63 bits, making it possible to store it without
additional memory allocation.
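
From the caller's point of view the interface reduces to a round trip like
the sketch below (an illustration only, assuming 4K pages; the real users
live in mteswap.c):

#include <linux/string.h>
#include <linux/bug.h>
#include <asm/mte-def.h>
#include <asm/mtecomp.h>

/* Sketch of a compress/decompress round trip; not actual mteswap.c code. */
static void mtecomp_roundtrip(u8 *tags)  /* MTE_PAGE_TAG_STORAGE bytes */
{
        u8 restored[MTE_PAGE_TAG_STORAGE];
        unsigned long handle;

        handle = mte_compress(tags);
        if (!handle)
                return; /* Allocation failed, tags were not stored. */

        /* In the common case the tags fit into the 8-byte handle itself. */
        if (mte_decompress(handle, restored))
                WARN_ON(memcmp(tags, restored, MTE_PAGE_TAG_STORAGE));

        mte_release_handle(handle);
}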

Suggested-by: Evgenii Stepanov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>

---
v5:
- made the code PAGE_SIZE-agnostic, removed hardcoded constants, updated
  the docs
- implemented a debugfs interface
- addressed comments by Andy Shevchenko:
  - updated the description of mtecomp.c
  - removed redundant assignments, simplified mte_tags_to_ranges()
  - various code simplifications
  - introduced mtecomp.h
  - added :export: to Documentation/arch/arm64/mte-tag-compression.rst

v4:
- addressed comments by Andy Shevchenko:
  - expanded "MTE" to "Memory Tagging Extension" in Kconfig
  - fixed kernel-doc comments, moved them to the C source
  - changed variables to unsigned where applicable
  - some code simplifications, fewer unnecessary assignments
  - added the mte_largest_idx_bits() helper
  - added namespace prefixes to all functions
  - added missing headers (but removed bits.h)
- addressed comments by Yury Norov:
  - removed test-only functions from mtecomp.h
  - dropped the algorithm name (all functions are now prefixed with
    "mte")
  - added more comments
  - got rid of MTE_RANGES_INLINE
  - renamed bitmap_{get,set}_value() to bitmap_{read,write}()
  - moved the big comment explaining the algorithm to
    Documentation/arch/arm64/mte-tag-compression.rst, expanded it,
    added a link to it from Documentation/arch/arm64/index.rst
  - removed hardcoded ranges from mte_alloc_size()/mte_size_to_ranges()

v3:
- addressed comments by Andy Shevchenko:
  - used bitmap_{set,get}_value() written by Syed Nayyar Waris
  - switched to unsigned long everywhere (fewer casts)
  - simplified the code, removed redundant checks
  - dropped ea0_compress_inline()
  - added bit size constants and helpers to access the bitmap
  - explicitly initialized all compressed sizes in ea0_compress_to_buf()
  - initialized all handle bits

v2:
- as suggested by Yury Norov, switched from struct bitq (which is
  not needed anymore) to <linux/bitmap.h>
- added missing symbol exports
---
Documentation/arch/arm64/index.rst | 1 +
.../arch/arm64/mte-tag-compression.rst | 245 +++++++++
arch/arm64/Kconfig | 11 +
arch/arm64/include/asm/mtecomp.h | 13 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/mtecomp.c | 467 ++++++++++++++++++
arch/arm64/mm/mtecomp.h | 12 +
7 files changed, 750 insertions(+)
create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
create mode 100644 arch/arm64/include/asm/mtecomp.h
create mode 100644 arch/arm64/mm/mtecomp.c
create mode 100644 arch/arm64/mm/mtecomp.h

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf1..bf6c1583233a9 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -19,6 +19,7 @@ ARM64 Architecture
legacy_instructions
memory
memory-tagging-extension
+ mte-tag-compression
perf
pointer-authentication
ptdump
diff --git a/Documentation/arch/arm64/mte-tag-compression.rst b/Documentation/arch/arm64/mte-tag-compression.rst
new file mode 100644
index 0000000000000..c5170155473aa
--- /dev/null
+++ b/Documentation/arch/arm64/mte-tag-compression.rst
@@ -0,0 +1,245 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Tag Compression for Memory Tagging Extension (MTE)
+==================================================
+
+This document describes the algorithm used to compress memory tags for the
+ARM Memory Tagging Extension (MTE).
+
+Introduction
+============
+
+MTE assigns tags to memory pages: for 4K pages those tags occupy 128 bytes
+(256 4-bit tags each corresponding to a 16-byte MTE granule), for 16K pages -
+512 bytes, for 64K pages - 2048 bytes. By default, MTE carves out 3.125% (1/16)
+of the available physical memory to store the tags.
+
+When MTE pages are saved to swap, their tags need to be stored in the kernel
+memory. If the system swap is used heavily, these tags may take a substantial
+portion of the physical memory. To reduce memory waste,
+``CONFIG_ARM64_MTE_COMP`` allows the kernel to store the tags in compressed
+form.
+
+Implementation details
+======================
+
+The algorithm attempts to compress an array of ``MTE_PAGE_TAG_STORAGE``
+tag bytes into a byte sequence that can be stored in one of the smaller size
+class allocations (for 4K pages those are 16-, 32-, or 64-byte allocations).
+A special case is storing the tags inline in an 8-byte pointer.
+
+Tag manipulation and storage
+----------------------------
+
+Tags for swapped pages are stored in an XArray that maps swap entries to 63-bit
+values (see ``arch/arm64/mm/mteswap.c``). In the case when
+``CONFIG_ARM64_MTE_COMP=n``, these values contain pointers to uncompressed
+buffers allocated with kmalloc(). Otherwise, they are 63-bit handles used by the
+functions declared in ``arch/arm64/include/asm/mtecomp.h``:
+
+- mte_compress() compresses the given ``MTE_PAGE_TAG_STORAGE``-byte ``tags``
+ buffer, allocates storage for it, and returns an opaque handle addressing
+ that storage;
+- mte_decompress() decompresses the tags addressed by ``handle``
+ and fills the ``MTE_PAGE_TAG_STORAGE``-byte ``tags`` buffer;
+- mte_release_handle() releases the storage handle returned by
+ mte_compress() (so that this handle cannot be used anymore);
+- mte_storage_size() calculates the size occupied by the tags addressed
+ by ``handle``.
+
+Depending on the size of compressed data, ``mte_compress()`` stores it in one of
+the size classes backed by kmem caches: ``mte-tags-{16,32,64,128}`` for the
+4K-page case (``mte-tags-128`` being used for the data that cannot be compressed
+into 64 bytes and is stored uncompressed).
+A practical common case allows the tags to be compressed into 8 bytes - then
+they are stored in the handle itself.
+
+Handle format
+-------------
+
+The handle returned by ``mte_compress()`` is an ``unsigned long`` that has its
+bit 63 set to 0 (XArray entries must not exceed ``LONG_MAX``)::
+
+ 63 62 60 ... 2 0
+ +---+--------+-----+------------+
+ | 0 | INLINE | ... | SIZE_LOG |
+ +---+--------+-----+------------+
+
+Bits ``62..60`` are the inline/out-of-line marker: if they are all set to 1, the
+data is stored out-of-line in the buffer pointed to by
+``(handle | BIT(63)) & ~7UL``. Otherwise, the data is stored inline in the
+handle itself.
+
+Bits ``2..0`` denote the size for out-of-line allocations::
+
+ size = 16 << (handle & 0b111)
+
+
+Tag compression
+---------------
+
+The compression algorithm is a variation of RLE (run-length encoding) and works
+as follows (we will consider 4K pages and 128-byte tag buffers, but the same
+approach scales to 16K and 64K pages):
+
+1. The input array of 128 (``MTE_PAGE_TAG_STORAGE``) bytes is transformed into
+ tag ranges (two arrays: ``r_tags[]`` containing tag values and ``r_sizes[]``
+ containing range lengths) by ``mte_tags_to_ranges()``. Note that
+ ``r_sizes[]`` sums up to 256 (``MTE_GRANULES_PER_PAGE``).
+
+2. The index of the largest element of ``r_sizes[]`` is stored in
+ ``largest_idx``. The element itself is thrown away from ``r_sizes[]``,
+ because it can be reconstructed from the sum of the remaining elements. Note
+ that now none of the remaining ``r_sizes[]`` elements exceeds
+ ``MTE_PAGE_TAG_STORAGE - 1``.
+
+3. Depending on the number ``N`` of ranges, a storage class is picked::
+
+ N <= 6: 8 bytes (inline case, no allocation required);
+ 6 < N <= 11: 16 bytes
+ 11 < N <= 23: 32 bytes
+ 23 < N <= 46: 64 bytes
+ 46 < N: 128 bytes (no compression will be performed)
+
+(See `Why these numbers?`_ below).
+
+4. For the inline case, the following values are stored packed in the 8-byte
+ handle (``i<size>`` means a ``<size>``-bit unsigned integer)::
+
+ largest_idx : i4
+ r_tags[0..5] : i4 x 6
+ r_sizes[0..4] : i7 x 5
+
+ (if N is less than 6, ``r_tags`` and ``r_sizes`` are padded up with zero
+ values)
+
+ Because ``largest_idx`` is <= 5, bit 63 of the handle is always 0 (so the
+ handle can be stored in an Xarray), and bits 62..60 cannot all be 1 (so the
+ handle can be distinguished from a kernel pointer).
+
+5. For the out-of-line case, the storage is allocated from one of the
+ ``mte-tags-{16,32,64,128}`` kmem caches. The resulting pointer is aligned
+ on 8 bytes, so its bits 2..0 can be used to store the size class (see above).
+
+ Bit 63 of the pointer is zeroed out, so that it can be stored in XArray.
+
+6. The data layout in the allocated storage is as follows::
+
+ largest_idx : i6
+ r_tags[0..N] : i4 x N
+ r_sizes[0..N-1] : i7 x (N-1)
+
+Tag decompression
+-----------------
+
+The decompression algorithm performs the steps below.
+
+1. Decide if data is stored inline (bits ``62..60`` of the handle ``!= 0b111``)
+ or out-of-line.
+
+2. For the inline case, treat the handle itself as the input buffer.
+
+3. For the out-of-line case, look at bits ``2..0`` of the handle to understand
+ the input buffer length. To obtain the pointer to the input buffer, unset
+ bits ``2..0`` of the handle and set bit ``63``.
+
+4. If the input buffer is 128 bytes long, copy its contents to the output
+ buffer.
+
+5. Otherwise, read ``largest_idx``, ``r_tags[]`` and ``r_sizes[]`` from the
+ input buffer. Calculate the removed largest element of ``r_sizes[]`` as
+ ``largest = 256 - sum(r_sizes)`` and insert it into ``r_sizes`` at
+ position ``largest_idx``.
+
+6. For each ``r_sizes[i] > 0``, add a 4-bit value ``r_tags[i]`` to the output
+ buffer ``r_sizes[i]`` times.
+
+
+Why these numbers?
+------------------
+
+To be able to reconstruct N tag ranges from the compressed data, we need to
+store ``largest_idx``, ``r_tags[N]``, and ``r_sizes[N-1]``. Knowing that the
+sizes do not exceed ``MTE_PAGE_TAG_STORAGE``, those can be packed into
+``S = ilog2(MTE_PAGE_TAG_STORAGE)`` bits, whereas a single tag occupies
+4 bits, and ``largest_idx`` cannot take more than
+``Lmax = ilog2(MTE_GRANULES_PER_PAGE)`` bits.
+
+Now, for each ``B``-byte size class it is possible to find the maximal number
+``M`` such that ``Lmax + 4 * M + S * (M - 1) <= 8 * B``,
+i.e. for 4K pages ``M = (8 * B - 1) / 11``::
+
+ 4K pages: S = 7
+ +-------------+----+--------------+
+ | Buffer size | M | Storage bits |
+ +-------------+----+--------------+
+ | 8 | 5 | 56 |
+ | 16 | 11 | 122 |
+ | 32 | 23 | 254 |
+ | 64 | 46 | 507 |
+ +-------------+----+--------------+
+
+Note that ``M`` (and therefore ``largest_idx``) always fits into 6 bits. For
+the inline case it is even guaranteed to fit into 3 bits, which lets us squeeze
+an extra range into an 8-byte buffer. Because the inline case requires bit 63
+of the handle to be zero, we add that bit to ``largest_idx``, knowing it will
+not be used.
+
+With these revised ``largest_idx`` widths, we now pick the maximal number ``N``
+such that ``L + 4 * N + S * (N - 1) <= 8 * B``, where ``L = 4`` in the inline
+case and ``L = 6`` otherwise.
+In other words, ``N = (8 * B + S - L) / (4 + S)``, therefore::
+
+ 4K pages: S = 7, L_i = 4, L_o = 6
+ +-------------+----+--------------+
+ | Buffer size | N | Storage bits |
+ +-------------+----+--------------+
+ | 8 | 6 | 63 |
+ | 16 | 11 | 120 |
+ | 32 | 23 | 252 |
+ | 64 | 46 | 505 |
+ +-------------+----+--------------+
+
+Similarly, for other page sizes::
+
+ 16K pages: S = 9, L_i = 4, L_o = 8
+ +-------------+-----+--------------+
+ | Buffer size | N | Storage bits |
+ +-------------+-----+--------------+
+ | 8 | 5 | 60 |
+ | 16 | 9 | 116 |
+ | 32 | 19 | 246 |
+ | 64 | 39 | 506 |
+ | 128 | 78 | 1013 |
+ | 256 | 157 | 2040 |
+ +-------------+-----+--------------+
+
+ 64K pages: S = 11, L_i = 4, L_o = 10
+ +-------------+-----+--------------+
+ | Buffer size | N | Storage bits |
+ +-------------+-----+--------------+
+ | 8 | 4 | 53 |
+ | 16 | 8 | 119 |
+ | 32 | 17 | 254 |
+ | 64 | 34 | 509 |
+ | 128 | 68 | 1019 |
+ | 256 | 136 | 2039 |
+ | 512 | 273 | 4094 |
+ | 1024 | 546 | 8189 |
+ +-------------+-----+--------------+
+
+
+Note
+----
+
+Tag compression and decompression implicitly rely on the fixed MTE tag size
+(4 bits) and number of tags per page. Should these values change, the algorithm
+may need to be revised.
+
+
+Programming Interface
+=====================
+
+ .. kernel-doc:: arch/arm64/mm/mtecomp.c
+ :export:
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2511b30d0f67..d4fb3b8d11d77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2093,6 +2093,17 @@ config ARM64_EPAN
if the cpu does not implement the feature.
endmenu # "ARMv8.7 architectural features"

+config ARM64_MTE_COMP
+ bool "Tag compression for ARM64 Memory Tagging Extension"
+ default y
+ depends on ARM64_MTE
+ help
+ Enable tag compression support for ARM64 Memory Tagging Extension.
+
+ Tag buffers corresponding to swapped RAM pages are compressed using
+ RLE to conserve heap memory. In the common case compressed tags
+ occupy 2.5x less memory.
+
config ARM64_SVE
bool "ARM Scalable Vector Extension support"
default y
diff --git a/arch/arm64/include/asm/mtecomp.h b/arch/arm64/include/asm/mtecomp.h
new file mode 100644
index 0000000000000..71552bc429882
--- /dev/null
+++ b/arch/arm64/include/asm/mtecomp.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_MTECOMP_H
+#define __ASM_MTECOMP_H
+
+#include <linux/types.h>
+
+unsigned long mte_compress(u8 *tags);
+bool mte_decompress(unsigned long handle, u8 *tags);
+void mte_release_handle(unsigned long handle);
+size_t mte_storage_size(unsigned long handle);
+
+#endif // __ASM_MTECOMP_H
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d0..46778f6dd83c2 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_TRANS_TABLE) += trans_pgd.o
obj-$(CONFIG_TRANS_TABLE) += trans_pgd-asm.o
obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_ARM64_MTE) += mteswap.o
+obj-$(CONFIG_ARM64_MTE_COMP) += mtecomp.o
KASAN_SANITIZE_physaddr.o += n

obj-$(CONFIG_KASAN) += kasan_init.o
diff --git a/arch/arm64/mm/mtecomp.c b/arch/arm64/mm/mtecomp.c
new file mode 100644
index 0000000000000..e2c8f9ef30ca0
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.c
@@ -0,0 +1,467 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * MTE tag compression algorithm.
+ * See Documentation/arch/arm64/mte-tag-compression.rst for more details.
+ */
+
+#include <linux/bits.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/mtecomp.h>
+
+#include "mtecomp.h"
+
+/* The handle must fit into an Xarray value. */
+#define MTE_HANDLE_MASK GENMASK_ULL(62, 0)
+
+/* Out-of-line handles have 0b111 in bits 62..60. */
+#define MTE_NOINLINE_MASK GENMASK_ULL(62, 60)
+
+/* Cache index is stored in the lowest pointer bits. */
+#define MTE_CACHE_ID_MASK GENMASK_ULL(2, 0)
+
+/* Caches start at mte-tags-16 and go up to mte-tags-MTE_PAGE_TAG_STORAGE. */
+#define MTECOMP_NUM_CACHES ilog2(MTE_PAGE_TAG_STORAGE / 8)
+static struct kmem_cache *mtecomp_caches[MTECOMP_NUM_CACHES];
+/*
+ * [0] - store the numbers of created/released inline handles;
+ * [1..MTECOMP_NUM_CACHES] - store the number of allocations/deallocations from
+ * mtecomp_caches.
+ */
+static atomic_long_t alloc_counters[MTECOMP_NUM_CACHES + 1];
+static atomic_long_t dealloc_counters[MTECOMP_NUM_CACHES + 1];
+
+/*
+ * Largest number of ranges, for which compressed data fits into 63 bits, can
+ * be encoded with 4 bits.
+ */
+#define MTE_BITS_PER_LARGEST_IDX_INLINE 4
+/*
+ * In the worst case every tag is different, then largest index can be up to
+ * MTE_GRANULES_PER_PAGE.
+ */
+#define MTE_BITS_PER_LARGEST_IDX ilog2(MTE_GRANULES_PER_PAGE)
+/* Range size cannot exceed MTE_GRANULES_PER_PAGE / 2. */
+#define MTE_BITS_PER_SIZE (MTE_BITS_PER_LARGEST_IDX - 1)
+
+/* Translate allocation size into mtecomp_caches[] index. */
+static unsigned int mte_size_to_cache_id(size_t len)
+{
+ return fls(len) - 5;
+}
+
+/* Translate mtecomp_caches[] index into allocation size. */
+static size_t mte_cache_id_to_size(unsigned int id)
+{
+ return 16 << id;
+}
+
+/**
+ * mte_tags_to_ranges() - break @tags into arrays of tag ranges.
+ * @tags: MTE_PAGE_TAG_STORAGE-byte array containing the packed MTE tags.
+ * @out_tags: u8 array to store the tag of every range.
+ * @out_sizes: unsigned short array to store the size of every range.
+ * @out_len: length of @out_tags and @out_sizes (output parameter, initially
+ * equal to lengths of out_tags[] and out_sizes[]).
+ */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+ size_t *out_len)
+{
+ u8 prev_tag = tags[0] / 16; /* First tag in the array. */
+ unsigned int cur_idx = 0, i, j;
+ u8 cur_tag;
+
+ memset(out_tags, 0, array_size(*out_len, sizeof(*out_tags)));
+ memset(out_sizes, 0, array_size(*out_len, sizeof(*out_sizes)));
+
+ out_tags[cur_idx] = prev_tag;
+ for (i = 0; i < MTE_GRANULES_PER_PAGE; i++) {
+ j = i % 2;
+ cur_tag = j ? (tags[i / 2] % 16) : (tags[i / 2] / 16);
+ if (cur_tag == prev_tag) {
+ out_sizes[cur_idx]++;
+ } else {
+ cur_idx++;
+ prev_tag = cur_tag;
+ out_tags[cur_idx] = prev_tag;
+ out_sizes[cur_idx] = 1;
+ }
+ }
+ *out_len = cur_idx + 1;
+}
+EXPORT_SYMBOL_NS(mte_tags_to_ranges, MTECOMP);
+
+/**
+ * mte_ranges_to_tags() - fill @tags using given tag ranges.
+ * @r_tags: u8[] containing the tag of every range.
+ * @r_sizes: unsigned short[] containing the size of every range.
+ * @r_len: length of @r_tags and @r_sizes.
+ * @tags: MTE_PAGE_TAG_STORAGE-byte array to write the tags to.
+ */
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+ u8 *tags)
+{
+ unsigned int i, j, pos = 0;
+ u8 prev;
+
+ for (i = 0; i < r_len; i++) {
+ for (j = 0; j < r_sizes[i]; j++) {
+ if (pos % 2)
+ tags[pos / 2] = (prev << 4) | r_tags[i];
+ else
+ prev = r_tags[i];
+ pos++;
+ }
+ }
+}
+EXPORT_SYMBOL_NS(mte_ranges_to_tags, MTECOMP);
+
+/*
+ * Translate allocation size into maximum number of ranges that it can hold.
+ *
+ * It is the biggest number N such as:
+ * MTE_BITS_PER_LARGEST_IDX_INLINE + MTE_TAG_SIZE * N
+ * + MTE_BITS_PER_SIZE * (N-1) <= 63 bits,
+ * for the inline case, or
+ * MTE_BITS_PER_LARGEST_IDX
+ * + MTE_TAG_SIZE * N + MTE_BITS_PER_SIZE * (N-1) <= tag array size in bits,
+ * for the out-of line case.
+ */
+static size_t mte_size_to_ranges(size_t size)
+{
+ size_t largest_bits;
+
+ largest_bits = (size == 8) ? MTE_BITS_PER_LARGEST_IDX_INLINE :
+ MTE_BITS_PER_LARGEST_IDX;
+ return (size * 8 + MTE_BITS_PER_SIZE - largest_bits) /
+ (MTE_TAG_SIZE + MTE_BITS_PER_SIZE);
+}
+
+/* Translate @num_ranges into the allocation size needed to hold them. */
+static size_t mte_alloc_size(unsigned int num_ranges)
+{
+ size_t size = 8;
+
+ while (size < (1 << MTE_BITS_PER_SIZE)) {
+ if (num_ranges <= mte_size_to_ranges(size))
+ return size;
+ size <<= 1;
+ }
+ return size;
+}
+
+/* Is the data stored inline in the handle itself? */
+static bool mte_is_inline(unsigned long handle)
+{
+ return (handle & MTE_NOINLINE_MASK) != MTE_NOINLINE_MASK;
+}
+
+/**
+ * mte_storage_size() - calculate the memory occupied by compressed tags.
+ * @handle: storage handle returned by mte_compress.
+ *
+ * Returns: size of the storage used for @handle.
+ */
+size_t mte_storage_size(unsigned long handle)
+{
+ if (mte_is_inline(handle))
+ return 8;
+ return mte_cache_id_to_size(handle & MTE_CACHE_ID_MASK);
+}
+EXPORT_SYMBOL_NS(mte_storage_size, MTECOMP);
+
+static void mte_bitmap_write(unsigned long *bitmap, unsigned long value,
+ unsigned long *pos, unsigned long bits)
+{
+ bitmap_write(bitmap, value, *pos, bits);
+ *pos += bits;
+}
+
+static inline unsigned long mte_largest_idx_bits(size_t size)
+{
+ if (size == 8)
+ return MTE_BITS_PER_LARGEST_IDX_INLINE;
+ return MTE_BITS_PER_LARGEST_IDX;
+}
+
+/* Compress ranges into the buffer that can accommodate up to max_ranges. */
+static void mte_compress_to_buf(size_t len, u8 *tags, unsigned short *sizes,
+ unsigned long *bitmap, size_t size)
+{
+ unsigned long bit_pos = 0, l_bits;
+ unsigned int largest_idx, i;
+ unsigned short largest = 0;
+ size_t max_ranges;
+
+ for (i = 0; i < len; i++) {
+ if (sizes[i] > largest) {
+ largest = sizes[i];
+ largest_idx = i;
+ }
+ }
+ l_bits = mte_largest_idx_bits(size);
+ max_ranges = mte_size_to_ranges(size);
+ mte_bitmap_write(bitmap, largest_idx, &bit_pos, l_bits);
+ for (i = 0; i < len; i++)
+ mte_bitmap_write(bitmap, tags[i], &bit_pos, MTE_TAG_SIZE);
+ for (i = len; i < max_ranges; i++)
+ mte_bitmap_write(bitmap, 0, &bit_pos, MTE_TAG_SIZE);
+ for (i = 0; i < len; i++) {
+ if (i != largest_idx)
+ mte_bitmap_write(bitmap, sizes[i], &bit_pos,
+ MTE_BITS_PER_SIZE);
+ }
+ for (i = len; i < max_ranges; i++)
+ mte_bitmap_write(bitmap, 0, &bit_pos, MTE_BITS_PER_SIZE);
+}
+
+/**
+ * mte_compress() - compress the given tag array.
+ * @tags: MTE_PAGE_TAG_STORAGE-byte array to read the tags from.
+ *
+ * Compresses the tags and returns a 64-bit opaque handle pointing to the
+ * tag storage. May allocate memory, which is freed by @mte_release_handle().
+ *
+ * Returns: 64-bit tag storage handle.
+ */
+unsigned long mte_compress(u8 *tags)
+{
+ struct kmem_cache *cache;
+ unsigned short *r_sizes;
+ unsigned long *storage;
+ unsigned int cache_id;
+ size_t alloc_size;
+ u8 *r_tags;
+ size_t r_len;
+ /*
+ * mte_compress_to_buf() only initializes the bits that mte_decompress()
+ * will read. But when the tags are stored in the handle itself, it must
+ * have all its bits initialized.
+ */
+ unsigned long result = 0;
+
+ r_sizes = kmalloc_array(MTE_GRANULES_PER_PAGE, sizeof(unsigned short),
+ GFP_KERNEL);
+ r_tags = kmalloc(MTE_GRANULES_PER_PAGE, GFP_KERNEL);
+ if (!r_sizes || !r_tags)
+ goto ret;
+ r_len = MTE_GRANULES_PER_PAGE;
+ mte_tags_to_ranges(tags, r_tags, r_sizes, &r_len);
+ alloc_size = mte_alloc_size(r_len);
+ if (alloc_size == 8) {
+ mte_compress_to_buf(r_len, r_tags, r_sizes, &result,
+ alloc_size);
+ atomic_long_inc(&alloc_counters[0]);
+ goto ret;
+ }
+ cache_id = mte_size_to_cache_id(alloc_size);
+ cache = mtecomp_caches[cache_id];
+ storage = kmem_cache_alloc(cache, GFP_KERNEL);
+ atomic_long_inc(&alloc_counters[cache_id + 1]);
+ if (!storage) {
+ result = 0;
+ goto ret;
+ }
+ if (alloc_size < MTE_PAGE_TAG_STORAGE) {
+ /* alloc_size is always a multiple of sizeof(unsigned long). */
+ mte_compress_to_buf(r_len, r_tags, r_sizes, storage,
+ alloc_size);
+ result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+ goto ret;
+ }
+ memcpy(storage, tags, alloc_size);
+ result = ((unsigned long)storage | cache_id) & MTE_HANDLE_MASK;
+ret:
+ kfree(r_tags);
+ kfree(r_sizes);
+ return result;
+}
+EXPORT_SYMBOL_NS(mte_compress, MTECOMP);
+
+static unsigned long mte_bitmap_read(const unsigned long *bitmap,
+ unsigned long *pos, unsigned long bits)
+{
+ unsigned long start = *pos;
+
+ *pos += bits;
+ return bitmap_read(bitmap, start, bits);
+}
+
+/* Decompress the contents of the given buffer into @tags. */
+static bool mte_decompress_from_buf(const unsigned long *bitmap, size_t size,
+ u8 *tags)
+{
+ unsigned long bit_pos = 0, l_bits;
+ unsigned short *r_sizes, sum;
+ unsigned int largest_idx, i;
+ bool result = true;
+ size_t max_ranges;
+ u8 *r_tags;
+
+ max_ranges = mte_size_to_ranges(size);
+ l_bits = mte_largest_idx_bits(size);
+ largest_idx = mte_bitmap_read(bitmap, &bit_pos, l_bits);
+ r_sizes = kmalloc_array(max_ranges, sizeof(unsigned short), GFP_KERNEL);
+ r_tags = kmalloc(max_ranges, GFP_KERNEL);
+ if (!r_sizes || !r_tags) {
+ result = false;
+ goto ret;
+ }
+
+ for (i = 0; i < max_ranges; i++)
+ r_tags[i] = mte_bitmap_read(bitmap, &bit_pos, MTE_TAG_SIZE);
+ for (i = 0, sum = 0; i < max_ranges; i++) {
+ if (i == largest_idx)
+ continue;
+ r_sizes[i] =
+ mte_bitmap_read(bitmap, &bit_pos, MTE_BITS_PER_SIZE);
+ if (!r_sizes[i]) {
+ max_ranges = i;
+ break;
+ }
+ sum += r_sizes[i];
+ }
+ if (sum >= MTE_GRANULES_PER_PAGE) {
+ result = false;
+ goto ret;
+ }
+ r_sizes[largest_idx] = MTE_GRANULES_PER_PAGE - sum;
+ mte_ranges_to_tags(r_tags, r_sizes, max_ranges, tags);
+ result = true;
+ret:
+ kfree(r_sizes);
+ kfree(r_tags);
+ return result;
+}
+
+/* Get pointer to the out-of-line storage from a handle. */
+static void *mte_storage(unsigned long handle)
+{
+ if (mte_is_inline(handle))
+ return NULL;
+ return (void *)((handle & (~MTE_CACHE_ID_MASK)) | BIT_ULL(63));
+}
+
+/**
+ * mte_decompress() - decompress the tag array addressed by the handle.
+ * @handle: handle returned by mte_compress().
+ * @tags: MTE_PAGE_TAG_STORAGE-byte array to write the tags to.
+ *
+ * Reads the compressed data and writes it into the user-supplied tag array.
+ *
+ * Returns: true on success, false on error.
+ */
+bool mte_decompress(unsigned long handle, u8 *tags)
+{
+ unsigned long *storage = mte_storage(handle);
+ size_t size = mte_storage_size(handle);
+
+ switch (size) {
+ case 8:
+ return mte_decompress_from_buf(&handle, size, tags);
+ case MTE_PAGE_TAG_STORAGE:
+ memcpy(tags, storage, size);
+ return true;
+ default:
+ return mte_decompress_from_buf(storage, size, tags);
+ }
+}
+EXPORT_SYMBOL_NS(mte_decompress, MTECOMP);
+
+/**
+ * mte_release_handle() - release the handle returned by mte_compress().
+ * @handle: handle returned by mte_compress().
+ */
+void mte_release_handle(unsigned long handle)
+{
+ unsigned int cache_id;
+ struct kmem_cache *c;
+ void *storage;
+ size_t size;
+
+ storage = mte_storage(handle);
+ if (!storage) {
+ atomic_long_inc(&dealloc_counters[0]);
+ return;
+ }
+
+ size = mte_storage_size(handle);
+ cache_id = mte_size_to_cache_id(size);
+ c = mtecomp_caches[cache_id];
+ kmem_cache_free(c, storage);
+ atomic_long_inc(&dealloc_counters[cache_id + 1]);
+}
+EXPORT_SYMBOL_NS(mte_release_handle, MTECOMP);
+
+/* DebugFS interface. */
+static int stats_show(struct seq_file *seq, void *v)
+{
+ unsigned long total_mem_alloc = 0, total_mem_dealloc = 0;
+ unsigned long total_num_alloc = 0, total_num_dealloc = 0;
+ unsigned long size = 8;
+ long alloc, dealloc;
+ int i;
+
+ for (i = 0; i <= MTECOMP_NUM_CACHES; i++) {
+ alloc = atomic_long_read(&alloc_counters[i]);
+ dealloc = atomic_long_read(&dealloc_counters[i]);
+ total_num_alloc += alloc;
+ total_num_dealloc += dealloc;
+ /*
+ * Do not count 8-byte buffers towards compressed tag storage
+ * size.
+ */
+ if (i) {
+ total_mem_alloc += (size * alloc);
+ total_mem_dealloc += (size * dealloc);
+ }
+ seq_printf(seq,
+ "%lu bytes: %lu allocations, %lu deallocations\n",
+ size, alloc, dealloc);
+ size <<= 1;
+ }
+ seq_printf(seq, "uncompressed tag storage size: %lu\n",
+ (total_num_alloc - total_num_dealloc) *
+ MTE_PAGE_TAG_STORAGE);
+ seq_printf(seq, "compressed tag storage size: %lu\n",
+ total_mem_alloc - total_mem_dealloc);
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(stats);
+
+static int mtecomp_debugfs_init(void)
+{
+ struct dentry *mtecomp_dir;
+
+ mtecomp_dir = debugfs_create_dir("mtecomp", NULL);
+ debugfs_create_file("stats", 0444, mtecomp_dir, NULL, &stats_fops);
+ return 0;
+}
+
+/* Set up mtecomp_caches[]. */
+static int mtecomp_init(void)
+{
+ unsigned int i;
+ char name[16];
+ size_t size;
+
+ static_assert(MTECOMP_NUM_CACHES <= (MTE_CACHE_ID_MASK + 1));
+ for (i = 0; i < MTECOMP_NUM_CACHES; i++) {
+ size = mte_cache_id_to_size(i);
+ snprintf(name, sizeof(name), "mte-tags-%zu", size);
+ mtecomp_caches[i] =
+ kmem_cache_create(name, size, size, 0, NULL);
+ }
+ mtecomp_debugfs_init();
+ return 0;
+}
+module_init(mtecomp_init);
diff --git a/arch/arm64/mm/mtecomp.h b/arch/arm64/mm/mtecomp.h
new file mode 100644
index 0000000000000..b94cf0384f2af
--- /dev/null
+++ b/arch/arm64/mm/mtecomp.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef ARCH_ARM64_MM_MTECOMP_H_
+#define ARCH_ARM64_MM_MTECOMP_H_
+
+/* Functions exported from mtecomp.c for test_mtecomp.c. */
+void mte_tags_to_ranges(u8 *tags, u8 *out_tags, unsigned short *out_sizes,
+ size_t *out_len);
+void mte_ranges_to_tags(u8 *r_tags, unsigned short *r_sizes, size_t r_len,
+ u8 *tags);
+
+#endif // ARCH_ARM64_MM_TEST_MTECOMP_H_
--
2.42.0.515.g380fc7ccd1-goog

2023-09-22 16:09:13

by Alexander Potapenko

Subject: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

Add basic tests ensuring that values can be added at arbitrary positions
of the bitmap, including those spanning into the adjacent unsigned
longs.

Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>

---
This patch was previously called
"lib/test_bitmap: add tests for bitmap_{set,get}_value()"
(https://lore.kernel.org/lkml/[email protected]/)
and
"lib/test_bitmap: add tests for bitmap_{set,get}_value_unaligned"
(https://lore.kernel.org/lkml/[email protected]/)

v5:
- updated the patch title
- addressed Yury Norov's comments:
  - renamed the test cases
  - factored out test_bitmap_write_helper() to test writing over
    different background patterns
  - added a test case copying a nontrivial value bit-by-bit
  - dropped volatile

v4:
- addressed comments by Andy Shevchenko: added Reviewed-by: and a link to
  the previous discussion
- addressed comments by Yury Norov:
  - expanded the bitmap to catch more corner cases
  - added code testing that bitmap_set_value() does not touch adjacent
    bits
  - added code testing the nbits==0 case
  - renamed bitmap_{get,set}_value() to bitmap_{read,write}()

v3:
- switched to using bitmap_{set,get}_value()
- changed the expected bit pattern in test_set_get_value(),
  as the test was incorrectly assuming 0 is the LSB
---
lib/test_bitmap.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 115 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 187f5b2db4cf1..ea66e39b29a4e 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -71,6 +71,17 @@ __check_eq_uint(const char *srcfile, unsigned int line,
return true;
}

+static bool __init
+__check_eq_ulong(const char *srcfile, unsigned int line,
+ const unsigned long exp_ulong, unsigned long x)
+{
+ if (exp_ulong != x) {
+ pr_err("[%s:%u] expected %lu, got %lu\n",
+ srcfile, line, exp_ulong, x);
+ return false;
+ }
+ return true;
+}

static bool __init
__check_eq_bitmap(const char *srcfile, unsigned int line,
@@ -186,6 +197,7 @@ __check_eq_str(const char *srcfile, unsigned int line,
})

#define expect_eq_uint(...) __expect_eq(uint, ##__VA_ARGS__)
+#define expect_eq_ulong(...) __expect_eq(ulong, ##__VA_ARGS__)
#define expect_eq_bitmap(...) __expect_eq(bitmap, ##__VA_ARGS__)
#define expect_eq_pbl(...) __expect_eq(pbl, ##__VA_ARGS__)
#define expect_eq_u32_array(...) __expect_eq(u32_array, ##__VA_ARGS__)
@@ -1222,6 +1234,108 @@ static void __init test_bitmap_const_eval(void)
BUILD_BUG_ON(~var != ~BIT(25));
}

+/*
+ * Test bitmap should be big enough to include the cases when start is not in
+ * the first word, and start+nbits lands in the following word.
+ */
+#define TEST_BIT_LEN (1000)
+#define TEST_BYTE_LEN (BITS_TO_BYTES(TEST_BIT_LEN))
+
+/*
+ * Helper function to test bitmap_write() overwriting the chosen byte pattern.
+ */
+static void __init test_bitmap_write_helper(unsigned char pattern)
+{
+ DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
+ DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
+ unsigned long w, r, bit;
+ int i, n, nbits;
+
+ /*
+ * Check that setting a single bit does not accidentally touch the
+ * adjacent bits.
+ */
+ for (i = 0; i < TEST_BIT_LEN; i++) {
+ /*
+ * A 0b10101010 pattern to catch both 0s replaced to 1s
+ * and vice versa.
+ */
+ memset(bitmap, pattern, TEST_BYTE_LEN);
+ memset(exp_bitmap, pattern, TEST_BYTE_LEN);
+ for (bit = 0; bit <= 1; bit++) {
+ bitmap_write(bitmap, bit, i, 1);
+ __assign_bit(i, exp_bitmap, bit);
+ expect_eq_bitmap(exp_bitmap, bitmap,
+ TEST_BIT_LEN);
+ }
+ }
+
+ /* Ensure setting 0 bits does not change anything. */
+ memset(bitmap, pattern, TEST_BYTE_LEN);
+ memset(exp_bitmap, pattern, TEST_BYTE_LEN);
+ for (i = 0; i < TEST_BIT_LEN; i++) {
+ bitmap_write(bitmap, ~0UL, i, 0);
+ expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+ }
+
+ for (nbits = BITS_PER_LONG; nbits >= 1; nbits--) {
+ w = 0xdeadbeefdeadbeefUL >> (BITS_PER_LONG - nbits);
+ for (i = 0; i <= TEST_BIT_LEN - nbits; i++) {
+ memset(bitmap, pattern, TEST_BYTE_LEN);
+ memset(exp_bitmap, pattern, TEST_BYTE_LEN);
+ for (n = 0; n < nbits; n++)
+ __assign_bit(i + n, exp_bitmap, w & BIT(n));
+ bitmap_write(bitmap, w, i, nbits);
+ expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
+ r = bitmap_read(bitmap, i, nbits);
+ expect_eq_ulong(r, w);
+ }
+ }
+}
+
+static void __init test_bitmap_read_write(void)
+{
+ unsigned char pattern[3] = {0x00, 0xaa, 0xff};
+ DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
+ unsigned long zero_bits = 0;
+ unsigned long val;
+ int i, pi;
+
+ /*
+ * Setting/getting zero bytes should not crash the kernel.
+ * READ_ONCE() prevents constant folding.
+ */
+ bitmap_write(NULL, 0, 0, READ_ONCE(zero_bits));
+ val = bitmap_read(NULL, 0, READ_ONCE(zero_bits));
+ expect_eq_ulong(0, val);
+
+ /*
+ * Ensure that bitmap_read() reads the same value that was previously
+ * written, and two consequent values are correctly merged.
+ * The resulting bit pattern is asymmetric to rule out possible issues
+ * with bit numeration order.
+ */
+ for (i = 0; i < TEST_BIT_LEN - 7; i++) {
+ bitmap_zero(bitmap, TEST_BIT_LEN);
+
+ bitmap_write(bitmap, 0b10101UL, i, 5);
+ val = bitmap_read(bitmap, i, 5);
+ expect_eq_ulong(0b10101UL, val);
+
+ bitmap_write(bitmap, 0b101UL, i + 5, 3);
+ val = bitmap_read(bitmap, i + 5, 3);
+ expect_eq_ulong(0b101UL, val);
+
+ val = bitmap_read(bitmap, i, 8);
+ expect_eq_ulong(0b10110101UL, val);
+ }
+
+ for (pi = 0; pi < sizeof(pattern); pi++)
+ test_bitmap_write_helper(pattern[pi]);
+}
+#undef TEST_BYTE_LEN
+#undef TEST_BIT_LEN
+
static void __init selftest(void)
{
test_zero_clear();
@@ -1237,6 +1351,7 @@ static void __init selftest(void)
test_bitmap_cut();
test_bitmap_print_buf();
test_bitmap_const_eval();
+ test_bitmap_read_write();

test_find_nth_bit();
test_for_each_set_bit();
--
2.42.0.515.g380fc7ccd1-goog

2023-09-22 20:57:04

by Alexander Lobakin

Subject: Re: [PATCH v5 0/5] Implement MTE tag compression for swapped pages

From: Andy Shevchenko <[email protected]>
Date: Fri, 22 Sep 2023 17:35:19 +0300

> +Cc: Olek, who internally is being developed something similar to your first
> patch here.

Oh, thanks.
The patch you mentioned properly implements cross-boundary accesses,
mine does not :D
But I guess we want to keep them both to keep the latter as optimized as
the current bitmap_{get,set}_value8()?
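
For context, the existing 8-bit helpers vs. the generic ones added by this
series look roughly like the sketch below (illustration only):

#include <linux/bitmap.h>

static void compare_helpers(void)
{
        DECLARE_BITMAP(map, 128);

        bitmap_zero(map, 128);

        /* Existing helpers: 8 bits only, start must be a multiple of 8. */
        bitmap_set_value8(map, 0xab, 64);
        (void)bitmap_get_value8(map, 64);

        /* New helpers from this series: arbitrary start and width. */
        bitmap_write(map, 0xab, 61, 8);
        (void)bitmap_read(map, 61, 8);
}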

>
> On Fri, Sep 22, 2023 at 10:08:42AM +0200, Alexander Potapenko wrote:
>> Currently, when MTE pages are swapped out, the tags are kept in the
>> memory, occupying PAGE_SIZE/32 bytes per page. This is especially
>> problematic for devices that use zram-backed in-memory swap, because
>> tags stored uncompressed in the heap effectively reduce the available
>> amount of swap memory.
>>
>> The RLE-based algorithm suggested by Evgenii Stepanov and implemented in
>> this patch series is able to efficiently compress fixed-size tag buffers,
>> resulting in practical compression ratio between 2.5x and 4x. In most
>> cases it is possible to store the compressed data in 63-bit Xarray values,
>> resulting in no extra memory allocations.
>>
>> Our measurements show that the proposed algorithm provides better
>> compression than existing kernel compression algorithms (LZ4, LZO,
>> LZ4HC, ZSTD) can offer.

[...]

Thanks,
Olek

2023-09-23 00:50:35

by Andy Shevchenko

Subject: Re: [PATCH v5 0/5] Implement MTE tag compression for swapped pages

+Cc: Olek, who is internally developing something similar to your first
patch here.

On Fri, Sep 22, 2023 at 10:08:42AM +0200, Alexander Potapenko wrote:
> Currently, when MTE pages are swapped out, the tags are kept in the
> memory, occupying PAGE_SIZE/32 bytes per page. This is especially
> problematic for devices that use zram-backed in-memory swap, because
> tags stored uncompressed in the heap effectively reduce the available
> amount of swap memory.
>
> The RLE-based algorithm suggested by Evgenii Stepanov and implemented in
> this patch series is able to efficiently compress fixed-size tag buffers,
> resulting in practical compression ratio between 2.5x and 4x. In most
> cases it is possible to store the compressed data in 63-bit Xarray values,
> resulting in no extra memory allocations.
>
> Our measurements show that the proposed algorithm provides better
> compression than existing kernel compression algorithms (LZ4, LZO,
> LZ4HC, ZSTD) can offer.
>
> To implement compression/decompression, we also extend <linux/bitmap.h>
> with methods to read/write bit values at arbitrary places in the map.
>
> We refactor arch/arm64/mm/mteswap.c to support both the compressed
> (CONFIG_ARM64_MTE_COMP) and non-compressed case. For the former, in
> addition to tag compression, we move tag allocation from kmalloc() to
> separate kmem caches, providing greater locality and relaxing the
> alignment requirements.
>
> v5:
> - fixed comments by Andy Shevchenko, Catalin Marinas, and Yury Norov
> - added support for 16K- and 64K pages
> - more efficient bitmap_write() implementation
>
> v4:
> - fixed a bunch of comments by Andy Shevchenko and Yury Norov
> - added Documentation/arch/arm64/mte-tag-compression.rst
>
> v3:
> - as suggested by Andy Shevchenko, use
> bitmap_get_value()/bitmap_set_value() written by Syed Nayyar Waris
> - switched to unsigned long to reduce typecasts
> - simplified the compression code
>
> v2:
> - as suggested by Yuri Norov, replace the poorly implemented struct
> bitq with <linux/bitmap.h>
>
>
>
> Alexander Potapenko (4):
> lib/test_bitmap: add tests for bitmap_{read,write}()
> arm64: mte: implement CONFIG_ARM64_MTE_COMP
> arm64: mte: add a test for MTE tags compression
> arm64: mte: add compression support to mteswap.c
>
> Syed Nayyar Waris (1):
> lib/bitmap: add bitmap_{read,write}()
>
> Documentation/arch/arm64/index.rst | 1 +
> .../arch/arm64/mte-tag-compression.rst | 245 +++++++++
> arch/arm64/Kconfig | 21 +
> arch/arm64/include/asm/mtecomp.h | 13 +
> arch/arm64/mm/Makefile | 7 +
> arch/arm64/mm/mtecomp.c | 467 ++++++++++++++++++
> arch/arm64/mm/mtecomp.h | 12 +
> arch/arm64/mm/mteswap.c | 20 +-
> arch/arm64/mm/mteswap.h | 12 +
> arch/arm64/mm/mteswap_comp.c | 60 +++
> arch/arm64/mm/mteswap_nocomp.c | 38 ++
> arch/arm64/mm/test_mtecomp.c | 287 +++++++++++
> include/linux/bitmap.h | 68 +++
> lib/test_bitmap.c | 115 +++++
> 14 files changed, 1355 insertions(+), 11 deletions(-)
> create mode 100644 Documentation/arch/arm64/mte-tag-compression.rst
> create mode 100644 arch/arm64/include/asm/mtecomp.h
> create mode 100644 arch/arm64/mm/mtecomp.c
> create mode 100644 arch/arm64/mm/mtecomp.h
> create mode 100644 arch/arm64/mm/mteswap.h
> create mode 100644 arch/arm64/mm/mteswap_comp.c
> create mode 100644 arch/arm64/mm/mteswap_nocomp.c
> create mode 100644 arch/arm64/mm/test_mtecomp.c
>
> --
> 2.42.0.515.g380fc7ccd1-goog
>

--
With Best Regards,
Andy Shevchenko


2023-09-25 12:23:29

by Andy Shevchenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Sep 25, 2023 at 02:16:37PM +0200, Alexander Potapenko wrote:

...

> > +/*
> > + * Test bitmap should be big enough to include the cases when start is not in
> > + * the first word, and start+nbits lands in the following word.
> > + */
> > +#define TEST_BIT_LEN (1000)
>
> Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
> kernel reports mismatches here, presumably because the last quad word
> ends up partially initialized.

Hmm... But if designed and used correctly it shouldn't be an issue,
and 1000, I believe, is carefully chosen specifically to not be divisible
by a power-of-2 value.

> The easiest fix is to make TEST_BIT_LEN divisible by 64 (e.g. 1024)

Wouldn't it hide the real issue somewhere?

--
With Best Regards,
Andy Shevchenko


2023-09-25 13:32:27

by Alexander Potapenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

>
> +/*
> + * Test bitmap should be big enough to include the cases when start is not in
> + * the first word, and start+nbits lands in the following word.
> + */
> +#define TEST_BIT_LEN (1000)

Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
kernel reports mismatches here, presumably because the last quad word
ends up partially initialized.
The easiest fix is to make TEST_BIT_LEN divisible by 64 (e.g. 1024)

2023-09-25 16:14:03

by Alexander Potapenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Sep 25, 2023 at 3:09 PM Alexander Potapenko <[email protected]> wrote:
>
> On Mon, Sep 25, 2023 at 2:23 PM Andy Shevchenko
> <[email protected]> wrote:
> >
> > On Mon, Sep 25, 2023 at 02:16:37PM +0200, Alexander Potapenko wrote:
> >
> > ...
> >
> > > > +/*
> > > > + * Test bitmap should be big enough to include the cases when start is not in
> > > > + * the first word, and start+nbits lands in the following word.
> > > > + */
> > > > +#define TEST_BIT_LEN (1000)
> > >
> > > Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
> > > kernel reports mismatches here, presumably because the last quad word
> > > ends up partially initialized.
> >
> > Hmm... But if designed and used correctly it shouldn't be the issue,
> > and 1000, I believe, is carefully chosen to be specifically not dividable
> > by pow-of-2 value.
> >
>
> The problem manifests already right after initialization:
>
> static void __init test_bit_len_1000(void)
> {
> DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
> DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
> memset(bitmap, 0x00, TEST_BYTE_LEN);
> memset(exp_bitmap, 0x00, TEST_BYTE_LEN);
> expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
> }

The problem is that there's no direct analog of memset() that can be
used to initialize bitmaps on both BE and LE systems.
bitmap_zero() and bitmap_set() work by rounding up the bitmap size to
BITS_TO_LONGS(nbits), but there's no bitmap_memset() that would do the
same for an arbitrary byte pattern.
We could call memset(..., ..., BITS_TO_LONGS(TEST_BIT_LEN)), but that
would be similar to declaring a bigger bitmap and not testing the last
24 bits.

Overall, unless allocating and initializing bitmaps with size
divisible by sizeof(long), most of bitmap.c is undefined behavior, so
I don't think it makes much sense to specifically test this case here
(given that we do not extend bitmap_equal() in the patch set).

2023-09-25 17:40:47

by Yury Norov

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Sep 25, 2023 at 04:54:00PM +0200, Alexander Potapenko wrote:
> On Mon, Sep 25, 2023 at 3:09 PM Alexander Potapenko <[email protected]> wrote:
> >
> > On Mon, Sep 25, 2023 at 2:23 PM Andy Shevchenko
> > <[email protected]> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 02:16:37PM +0200, Alexander Potapenko wrote:
> > >
> > > ...
> > >
> > > > > +/*
> > > > > + * Test bitmap should be big enough to include the cases when start is not in
> > > > > + * the first word, and start+nbits lands in the following word.
> > > > > + */
> > > > > +#define TEST_BIT_LEN (1000)
> > > >
> > > > Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
> > > > kernel reports mismatches here, presumably because the last quad word
> > > > ends up partially initialized.
> > >
> > > Hmm... But if designed and used correctly it shouldn't be the issue,
> > > and 1000, I believe, is carefully chosen to be specifically not dividable
> > > by pow-of-2 value.
> > >
> >
> > The problem manifests already right after initialization:
> >
> > static void __init test_bit_len_1000(void)
> > {
> > DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
> > DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
> > memset(bitmap, 0x00, TEST_BYTE_LEN);
> > memset(exp_bitmap, 0x00, TEST_BYTE_LEN);
> > expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
> > }
>
> The problem is that there's no direct analog of memset() that can be
> used to initialize bitmaps on both BE and LE systems.

memset() fills an array of chars with the same value. In the bitmap world we
operate on arrays of bits, and there are only 2 possible values: '0' and '1'.
For those we've got bitmap_zero() and bitmap_fill().

> bitmap_zero() and bitmap_set() work by rounding up the bitmap size to
> BITS_TO_LONGS(nbits), but there's no bitmap_memset() that would do the
> same for an arbitrary byte pattern.
> We could call memset(..., ..., BITS_TO_LONGS(TEST_BIT_LEN)), but that
> would be similar to declaring a bigger bitmap and not testing the last
> 24 bits.

No, you couldn't. On the test level, bitmap should be considered as a
black box. memset()'ing it may (and did) damage internal structure.

If you have some pattern in mind, you can use bitmap_parselist(). For example,
you can set every 2nd bit in your bitmap like this:

bitmap_parselist("all:1/2", bitmap, 1000);

Check for almost 100 examples of bitmap_parselist usage in the test for
bitmap_parselist in the same file.

> Overall, unless allocating and initializing bitmaps with size
> divisible by sizeof(long), most of bitmap.c is undefined behavior, so
> I don't think it makes much sense to specifically test this case here
> (given that we do not extend bitmap_equal() in the patch set).

This is a wrong statement. People, including Andy and me, spent a huge amount
of time making the bitmap API work well for all combinations of lengths and
endiannesses.

NAK for this and for ignoring my other comment to v4.

2023-09-25 18:49:44

by Alexander Potapenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Sep 25, 2023 at 6:06 PM Yury Norov <[email protected]> wrote:
>
> On Mon, Sep 25, 2023 at 04:54:00PM +0200, Alexander Potapenko wrote:
> > On Mon, Sep 25, 2023 at 3:09 PM Alexander Potapenko <[email protected]> wrote:
> > >
> > > On Mon, Sep 25, 2023 at 2:23 PM Andy Shevchenko
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Sep 25, 2023 at 02:16:37PM +0200, Alexander Potapenko wrote:
> > > >
> > > > ...
> > > >
> > > > > > +/*
> > > > > > + * Test bitmap should be big enough to include the cases when start is not in
> > > > > > + * the first word, and start+nbits lands in the following word.
> > > > > > + */
> > > > > > +#define TEST_BIT_LEN (1000)
> > > > >
> > > > > Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
> > > > > kernel reports mismatches here, presumably because the last quad word
> > > > > ends up partially initialized.
> > > >
> > > > Hmm... But if designed and used correctly it shouldn't be the issue,
> > > > and 1000, I believe, is carefully chosen to be specifically not dividable
> > > > by pow-of-2 value.
> > > >
> > >
> > > The problem manifests already right after initialization:
> > >
> > > static void __init test_bit_len_1000(void)
> > > {
> > > DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
> > > DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
> > > memset(bitmap, 0x00, TEST_BYTE_LEN);
> > > memset(exp_bitmap, 0x00, TEST_BYTE_LEN);
> > > expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
> > > }
> >
> > The problem is that there's no direct analog of memset() that can be
> > used to initialize bitmaps on both BE and LE systems.
>
> memset fills an array of chars with the same value. In bitmap world we operate
> on array of bits, and there are only 2 possible values: '0' and '1'. For those
> we've got bitmap_zero() and bitmap_fill().
>
> > bitmap_zero() and bitmap_set() work by rounding up the bitmap size to
> > BITS_TO_LONGS(nbits), but there's no bitmap_memset() that would do the
> > same for an arbitrary byte pattern.
> > We could call memset(..., ..., BITS_TO_LONGS(TEST_BIT_LEN)), but that
> > would be similar to declaring a bigger bitmap and not testing the last
> > 24 bits.
>
> No, you couldn't. On the test level, bitmap should be considered as a
> black box. memset()'ing it may (and did) damage internal structure.

You are right about this. bitmap_zero() and bitmap_fill() are calling
memset() under the hood, but I shouldn't have assumed doing raw memset
is safe.
Unfortunately lib/test_bitmap.c does a bunch of memsets already, which
probably led to the confusion.


> If you have some pattern in mind, you can use bitmap_parselist(). For example,
> you can set every 2nd bit in your bitmap like this:
>
> bitmap_parselist("all:1/2", bitmap, 1000);
>
> Check for almost 100 examples of bitmap_parselist usage in the test for
> bitmap_parselist in the same file.

Thanks! This solves my problem.
I am planning to use an intermediate bitmap to avoid parsing the
pattern multiple times.
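
Roughly along these lines (just a sketch to show what I mean, not the final
test code):

static void __init test_bitmap_write_perpattern(void)
{
        DECLARE_BITMAP(pattern, TEST_BIT_LEN);
        DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
        static const char * const specs[] = {"all:1/2", "0-999", "100-199"};
        int i;

        for (i = 0; i < ARRAY_SIZE(specs); i++) {
                bitmap_zero(pattern, TEST_BIT_LEN);
                /* Parse the pattern string only once per spec. */
                if (bitmap_parselist(specs[i], pattern, TEST_BIT_LEN))
                        continue;
                /* Restore the template before each bitmap_write() check. */
                bitmap_copy(bitmap, pattern, TEST_BIT_LEN);
                /* ... run the bitmap_write()/bitmap_read() checks here ... */
        }
}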

>
> > Overall, unless allocating and initializing bitmaps with size
> > divisible by sizeof(long), most of bitmap.c is undefined behavior, so
> > I don't think it makes much sense to specifically test this case here
> > (given that we do not extend bitmap_equal() in the patch set).
>
> This is wrong statement. People spent huge amount of time making bitmap
> API working well for all combinations of lengths and endiannesses,
> including Andy and me.

Please accept my apologies for that, I didn't mean to insult you,
Andy, or anyone else.
My understanding of the comment at the top of lib/bitmap.c was that
the bitmap API is expected to work in the presence of uninitialized
values, which is actually not what the comment says.
bitmap_zero() and such do ensure that none of the tail bytes remain
uninitialized, so we are safe as long as those functions are used.


>
> NAK for this and for ignoring my other comment to v4.

If you are talking about this comment:
https://lore.kernel.org/lkml/ZQ2WboApqYyEkjjG@yury-ThinkPad/, I was
going to incorporate it in v6 (as it was sent after I published v5).
I am fine with not mandating the return value for reading/writing zero bytes.

2023-09-25 19:35:27

by Alexander Potapenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Sep 25, 2023 at 2:23 PM Andy Shevchenko
<[email protected]> wrote:
>
> On Mon, Sep 25, 2023 at 02:16:37PM +0200, Alexander Potapenko wrote:
>
> ...
>
> > > +/*
> > > + * Test bitmap should be big enough to include the cases when start is not in
> > > + * the first word, and start+nbits lands in the following word.
> > > + */
> > > +#define TEST_BIT_LEN (1000)
> >
> > Dunno why this didn't fire previously, but CONFIG_CPU_BIG_ENDIAN=y
> > kernel reports mismatches here, presumably because the last quad word
> > ends up partially initialized.
>
> Hmm... But if designed and used correctly it shouldn't be the issue,
> and 1000, I believe, is carefully chosen to be specifically not dividable
> by pow-of-2 value.
>

The problem manifests already right after initialization:

static void __init test_bit_len_1000(void)
{
        DECLARE_BITMAP(bitmap, TEST_BIT_LEN);
        DECLARE_BITMAP(exp_bitmap, TEST_BIT_LEN);
        memset(bitmap, 0x00, TEST_BYTE_LEN);
        memset(exp_bitmap, 0x00, TEST_BYTE_LEN);
        expect_eq_bitmap(exp_bitmap, bitmap, TEST_BIT_LEN);
}

...
[ 29.601614][ T1] test_bitmap: [lib/test_bitmap.c:1250] bitmaps
contents differ: expected
"960-963,966-967,969,971-973,976,978-979,981", got "963"
...

So it's probably expect_eq_bitmap() that is incorrectly rounding up
the bitmap length somewhere (or maybe it is not supposed to be used
for non-aligned bitmaps?)
Looking further...

2023-09-27 07:59:35

by David Laight

Subject: RE: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

...
> Overall, unless allocating and initializing bitmaps with size
> divisible by sizeof(long), most of bitmap.c is undefined behavior, so
> I don't think it makes much sense to specifically test this case here
> (given that we do not extend bitmap_equal() in the patch set).

Bitmaps are arrays of unsigned long.
Using any of the APIs on anything else is a bug.
So it is always wrong to try to initialise 'a number of bytes'.
The size used in the definition need not be a multiple of 8 (on 64bit)
but the allocated data is always a multiple of 8.
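
Concretely (on 64-bit):

/*
 * DECLARE_BITMAP() rounds the storage up to whole unsigned longs:
 * BITS_TO_LONGS(1000) == 16, i.e. 128 bytes / 1024 bits of storage,
 * even though only the first 1000 bits are meaningful.
 */
DECLARE_BITMAP(map, 1000);      /* expands to: unsigned long map[16]; */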

Any calls to the functions that have a cast of the bitmap
parameter are likely to be buggy.
And yes, there are loads of them, and many are buggy.

On LE you mostly get away with shorter memory allocations.
But still get errors when trying to do locked operations
on misaligned addresses.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-28 15:35:38

by Alexander Potapenko

Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Wed, Sep 27, 2023 at 9:51 AM David Laight <[email protected]> wrote:
>
> ...
> > Overall, unless allocating and initializing bitmaps with size
> > divisible by sizeof(long), most of bitmap.c is undefined behavior, so
> > I don't think it makes much sense to specifically test this case here
> > (given that we do not extend bitmap_equal() in the patch set).
>
> Bitmaps are arrays of unsigned long.
> Using any of the APIs on anything else is a bug.
> So it is always wrong to try to initialise 'a number of bytes'.
> The size used in the definition need not be a multiple of 8 (on 64bit)
> but the allocated data is always a multiple of 8.
>
> Any calls to the functions that have a cast of the bitmap
> parameter are likely to be buggy.
> And yes, there are loads of them, and many are buggy.

I got rid of the casts in the bitmap test, but they remain in
mtecomp.c, where 16-, 32-, 64-byte buffers allocated by
kmem_cache_alloc() are treated as bitmaps:
https://lore.kernel.org/linux-arm-kernel/[email protected]/T/#mdb0d636d2d357f8ffe6ac79cef1145df3440f659

Having them allocated by bitmap_alloc() won't work, because on Android
bitmap_alloc() will allocate the buffers from the kmalloc-64 cache,
defeating the purpose of the compression.

Would it be better to extend the bitmap.h API so that it is possible
to allocate from a kmem cache (which would in turn require
bitmap_kmem_cache_create() to ensure the alignment requirements)?
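
(Roughly, such helpers might look like the sketch below;
bitmap_kmem_cache_create() and bitmap_kmem_alloc() are hypothetical names
used only to illustrate the idea, they are not existing bitmap.h APIs:)

/* Hypothetical helpers, sketched for illustration only. */
static inline struct kmem_cache *bitmap_kmem_cache_create(const char *name,
                                                          unsigned int nbits)
{
        /* Object size rounded up to whole longs, aligned on a long. */
        return kmem_cache_create(name,
                                 BITS_TO_LONGS(nbits) * sizeof(unsigned long),
                                 sizeof(unsigned long), 0, NULL);
}

static inline unsigned long *bitmap_kmem_alloc(struct kmem_cache *cache,
                                               gfp_t flags)
{
        return kmem_cache_alloc(cache, flags);
}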

> On LE you mostly get away with shorter memory allocations.
> But still get errors when trying to do locked operations
> on misaligned addresses.
>
> David
>




2023-09-28 21:05:38

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Thu, Sep 28, 2023 at 05:14:55PM +0200, Alexander Potapenko wrote:
> On Thu, Sep 28, 2023 at 4:43 PM Yury Norov <[email protected]> wrote:
> >
> >
> >
> > On Thu, Sep 28, 2023, 10:20 AM Alexander Potapenko <[email protected]> wrote:
> >>
> >> On Wed, Sep 27, 2023 at 9:51 AM David Laight <[email protected]> wrote:
> >> >
> >> > ...
> >> > > Overall, unless allocating and initializing bitmaps with size
> >> > > divisible by sizeof(long), most of bitmap.c is undefined behavior, so
> >> > > I don't think it makes much sense to specifically test this case here
> >> > > (given that we do not extend bitmap_equal() in the patch set).
> >> >
> >> > Bitmaps are arrays of unsigned long.
> >> > Using any of the APIs on anything else is a bug.
> >> > So it is always wrong to try to initialise 'a number of bytes'.
> >> > The size used in the definition need not be a multiple of 8 (on 64bit)
> >> > but the allocated data is always a multiple of 8.
> >> >
> >> > Any calls to the functions that have a cast of the bitmap
> >> > parameter are likely to be buggy.
> >> > And yes, there are loads of them, and many are buggy.
> >>
> >> I got rid of the casts in the bitmap test, but they remain in
> >> mtecomp.c, where 16-, 32-, 64-byte buffers allocated by
> >> kmem_cache_alloc() are treated as bitmaps:
> >> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/#mdb0d636d2d357f8ffe6ac79cef1145df3440f659
> >>
> >> Having them allocated by bitmap_alloc() won't work, because on Android
> >> bitmap_alloc() will allocate the buffers from the kmalloc-64 cache,
> >> defeating the purpose of the compression.
> >>
> >> Would it be better to extend the bitmap.h API so that it is possible
> >> to allocate from a kmem cache (which would in turn require
> >> bitmap_kmem_cache_create() to ensure the alignment requirements)?
> >
> >
> > So all that is wrong then. My bad, I should have spent more time looking into your driver code...
> >
> > We already have bitmap_(from,to)_u(64,32),
> > and you can use them. For 16-bit you have to add helpers yourself. But it's not rocket science.
> >
>
> So e.g. for compressing something into a 16-byte buffer using bitmaps
> I'd need to:
>
> 1) Allocate the buffer: buf = kmem_cache_alloc(...)
> 2) Allocate the bitmap: bitmap = bitmap_alloc(16*8, ...)
> 3) Fill the bitmap: mte_compress_to_buf(..., bitmap, 16)
> 4) Copy the bitmap contents to the buffer: bitmap_to_arr64(buf, bitmap, 16*8)
> 5) Deallocate the bitmap: bitmap_free(bitmap)
>
> instead of:
>
> buf = kmem_cache_alloc(...)
> mte_compress_to_buf(..., (unsigned long *)buf, 16)
>
> , correct?
>
> Given that the buffer contents are opaque and its size is aligned on 8
> bytes, could it be possible to somehow adopt the `buf` pointer
> instead?

I didn't find an explicit typecasting where you're using
mte_compress_to_buf(), but now after hard 2nd look I see...

Firstly, now that in the documentation you are explicitly describing the
return value of mte_compress() as 64-bit frame, the right way to go would
be declaring the function as: u64 mte_compress(u8 *tags).

And the general pattern should be like this:

unsigned long mte_compress(u8 *tags)
{
        DECLARE_BITMAP(tmp, MTECOMP_CACHES_MAXBITS);
        void *storage;
        ...
        if (alloc_size < MTE_PAGE_TAG_STORAGE) {
                storage = kmem_cache_alloc(cache, GFP_KERNEL);
                mte_compress_to_buf(r_len, r_tags, r_sizes, tmp, alloc_size);

                switch (alloc_size) {
                case 16:
                        bitmap_to_arr16(storage, tmp, 16);
                        break;
                case 32:
                        bitmap_to_arr32(storage, tmp, 32);
                        break;
                case 64:
                        bitmap_to_arr64(storage, tmp, 64);
                        break;
                default:
                        pr_err("error\n");
                }
                result = ((u64)storage | cache_id) & MTE_HANDLE_MASK;
                goto ret;
        }
        ...
}

Yeah, it looks cumbersome, but this is the right way to go if you need a
reliable BE-compatible driver. I think it will be less scary if you wrap
the switch with a helper, and/or move it inside mte_compress_to_buf(),
so that the mte_compress will stay unchanged.

Anyways, hope the above helped.

Thanks,
Yury

2023-09-28 23:31:33

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Thu, Sep 28, 2023 at 4:43 PM Yury Norov <[email protected]> wrote:
>
>
>
> On Thu, Sep 28, 2023, 10:20 AM Alexander Potapenko <[email protected]> wrote:
>>
>> On Wed, Sep 27, 2023 at 9:51 AM David Laight <[email protected]> wrote:
>> >
>> > ...
>> > > Overall, unless allocating and initializing bitmaps with size
>> > > divisible by sizeof(long), most of bitmap.c is undefined behavior, so
>> > > I don't think it makes much sense to specifically test this case here
>> > > (given that we do not extend bitmap_equal() in the patch set).
>> >
>> > Bitmaps are arrays of unsigned long.
>> > Using any of the APIs on anything else is a bug.
>> > So it is always wrong to try to initialise 'a number of bytes'.
>> > The size used in the definition need not be a multiple of 8 (on 64bit)
>> > but the allocated data is always a multiple of 8.
>> >
>> > Any calls to the functions that have a cast of the bitmap
>> > parameter are likely to be buggy.
>> > And yes, there are loads of them, and many are buggy.
>>
>> I got rid of the casts in the bitmap test, but they remain in
>> mtecomp.c, where 16-, 32-, 64-byte buffers allocated by
>> kmem_cache_alloc() are treated as bitmaps:
>> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/#mdb0d636d2d357f8ffe6ac79cef1145df3440f659
>>
>> Having them allocated by bitmap_alloc() won't work, because on Android
>> bitmap_alloc() will allocate the buffers from the kmalloc-64 cache,
>> defeating the purpose of the compression.
>>
>> Would it be better to extend the bitmap.h API so that it is possible
>> to allocate from a kmem cache (which would in turn require
>> bitmap_kmem_cache_create() to ensure the alignment requirements)?
>
>
> So all that is wrong then. My bad, I should have spent more time looking into your driver code...
>
> We already have bitmap_(from,to)_u(64,32),
> and you can use them. For 16-bit you have to add helpers yourself. But it's not rocket science.
>

So e.g. for compressing something into a 16-byte buffer using bitmaps
I'd need to:

1) Allocate the buffer: buf = kmem_cache_alloc(...)
2) Allocate the bitmap: bitmap = bitmap_alloc(16*8, ...)
3) Fill the bitmap: mte_compress_to_buf(..., bitmap, 16)
4) Copy the bitmap contents to the buffer: bitmap_to_arr64(buf, bitmap, 16*8)
5) Deallocate the bitmap: bitmap_free(bitmap)

instead of:

buf = kmem_cache_alloc(...)
mte_compress_to_buf(..., (unsigned long *)buf, 16)

, correct?

Given that the buffer contents are opaque and its size is aligned on 8
bytes, could it be possible to somehow adopt the `buf` pointer
instead?

> I'm AFK at the moment. I'll take a close look at your machinery at the weekend.

Thanks and take your time!
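
(A sketch of the two call sequences being compared above, written as a
fragment that would sit inside mte_compress(); the r_len/r_tags/r_sizes/cache
variables follow the pattern quoted earlier, and everything here only
illustrates the shape of the code, not the actual mtecomp.c implementation:)

        /* Variant with an intermediate bitmap: */
        u8 *buf = kmem_cache_alloc(cache, GFP_KERNEL);
        unsigned long *bitmap = bitmap_alloc(16 * 8, GFP_KERNEL);

        mte_compress_to_buf(r_len, r_tags, r_sizes, bitmap, 16);
        bitmap_to_arr64((u64 *)buf, bitmap, 16 * 8);
        bitmap_free(bitmap);

        /* Direct variant: the 8-byte-aligned cache object is the bitmap. */
        u8 *buf2 = kmem_cache_alloc(cache, GFP_KERNEL);

        mte_compress_to_buf(r_len, r_tags, r_sizes, (unsigned long *)buf2, 16);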

2023-09-29 17:51:59

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Thu, Sep 28, 2023 at 10:02 PM Yury Norov <[email protected]> wrote:
>
> On Thu, Sep 28, 2023 at 05:14:55PM +0200, Alexander Potapenko wrote:
> >
> > So e.g. for compressing something into a 16-byte buffer using bitmaps
> > I'd need to:
> >
> > 1) Allocate the buffer: buf = kmem_cache_alloc(...)
> > 2) Allocate the bitmap: bitmap = bitmap_alloc(16*8, ...)
> > 3) Fill the bitmap: mte_compress_to_buf(..., bitmap, 16)
> > 4) Copy the bitmap contents to the buffer: bitmap_to_arr64(buf, bitmap, 16*8)
> > 5) Deallocate the bitmap: bitmap_free(bitmap)
> >
> > instead of:
> >
> > buf = kmem_cache_alloc(...)
> > mte_compress_to_buf(..., (unsigned long *)buf, 16)
> >
> > , correct?
> >
> > Given that the buffer contents are opaque and its size is aligned on 8
> > bytes, could it be possible to somehow adopt the `buf` pointer
> > instead?
>
> I didn't find an explicit typecasting where you're using
> mte_compress_to_buf(), but now after hard 2nd look I see...
>
> Firstly, now that in the documentation you are explicitly describing the
> return value of mte_compress() as 64-bit frame, the right way to go would
> be declaring the function as: u64 mte_compress(u8 *tags).

Ack.

> And the general pattern should be like this:
>
> unsigned long mte_compress(u8 *tags)
> {
> DECLARE_BITMAP(tmp, MTECOMP_CACHES_MAXBITS);
> void *storage;
> ...
> if (alloc_size < MTE_PAGE_TAG_STORAGE) {
> storage = kmem_cache_alloc(cache, GFP_KERNEL);
> mte_compress_to_buf(r_len, r_tags, r_sizes, tmp, alloc_size);
>
> switch (alloc_size) {
> case 16:
> bitmap_to_arr16(storage, tmp, 16);

I might be missing something, but why do we need the switch at all?
The buffers we are allocating always contain a whole number of u64's -
cannot we just always call bitmap_to_arr64()?

Note that for cases where alloc_size is > 8 we never make any
assumptions about the contents of @storage, and don't care much about
the byte order as long as swap decompression is done with the same
endianness (which is always the case).
(The case where alloc_size==8 is somewhat special, and needs more
accurate handling, because we do make assumptions about the bit layout
there).

> break;
> case 32:
> bitmap_to_arr32(storage, tmp, 32);
> break;
> case 64:
> bitmap_to_arr64(storage, tmp, 64);
> break;
> default:
> pr_err("error\n");
> }
> result = ((u64)storage | cache_id) & MTE_HANDLE_MASK;
> goto ret;
> }
> ...
> }
>
> Yeah, it looks cumbersome, but this is the right way to go if you need a
> reliable BE-compatible driver.

What is the BE compatibility problem that you are anticipating here?

The issue that came up during testing was that on systems with
different endianness we cannot treat unaligned buffers as bitmaps,
because bitmap_write() might be touching memory past the allocation
size.
I agree therefore with the general approach of encapsulating all
bitmap operations in bitmap.h and hiding the implementation details
behind the API.
But in this particular case it seems to complicate the otherwise
trivial application of bitmap_read()/bitmap_write() to an external
buffer with guaranteed 64-bit alignment: we'll end up allocating this
temporary bitmap and doing a hard-to-optimize memcpy() between it and
the final storage.

A straightforward solution for this would be to fork u64* versions of
bitmap_read()/bitmap_write() implementations in mtecomp.c and apply
them to u64* buffers allocated in that file.
This would remove the need for the temporary bitmap (because there's
no encapsulation that mandates using bitmap.h now) and the extra
memcpy().
But I don't like this either, because forking code that exists in
headers is just wrong.

> I think it will be less scary if you wrap
> the switch with a helper, and/or move it inside mte_compress_to_buf(),
> so that the mte_compress will stay unchanged.
>
> Anyways, hope the above helped.
>
> Thanks,
> Yury



2023-10-02 07:01:15

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Fri, Sep 29, 2023 at 10:54:59AM +0200, Alexander Potapenko wrote:
> On Thu, Sep 28, 2023 at 10:02 PM Yury Norov <[email protected]> wrote:
> >
> > On Thu, Sep 28, 2023 at 05:14:55PM +0200, Alexander Potapenko wrote:
> > >
> > > So e.g. for compressing something into a 16-byte buffer using bitmaps
> > > I'd need to:
> > >
> > > 1) Allocate the buffer: buf = kmem_cache_alloc(...)
> > > 2) Allocate the bitmap: bitmap = bitmap_alloc(16*8, ...)
> > > 3) Fill the bitmap: mte_compress_to_buf(..., bitmap, 16)
> > > 4) Copy the bitmap contents to the buffer: bitmap_to_arr64(buf, bitmap, 16*8)
> > > 5) Deallocate the bitmap: bitmap_free(bitmap)
> > >
> > > instead of:
> > >
> > > buf = kmem_cache_alloc(...)
> > > mte_compress_to_buf(..., (unsigned long *)buf, 16)
> > >
> > > , correct?
> > >
> > > Given that the buffer contents are opaque and its size is aligned on 8
> > > bytes, could it be possible to somehow adopt the `buf` pointer
> > > instead?
> >
> > I didn't find an explicit typecasting where you're using
> > mte_compress_to_buf(), but now after hard 2nd look I see...
> >
> > Firstly, now that in the documentation you are explicitly describing the
> > return value of mte_compress() as 64-bit frame, the right way to go would
> > be declaring the function as: u64 mte_compress(u8 *tags).
>
> Ack.
>
> > And the general pattern should be like this:
> >
> > unsigned long mte_compress(u8 *tags)
> > {
> > DECLARE_BITMAP(tmp, MTECOMP_CACHES_MAXBITS);
> > void *storage;
> > ...
> > if (alloc_size < MTE_PAGE_TAG_STORAGE) {
> > storage = kmem_cache_alloc(cache, GFP_KERNEL);
> > mte_compress_to_buf(r_len, r_tags, r_sizes, tmp, alloc_size);
> >
> > switch (alloc_size) {
> > case 16:
> > bitmap_to_arr16(storage, tmp, 16);
>
> I might be missing something, but why do we need the switch at all?
> The buffers we are allocating always contain a whole number of u64's -
> cannot we just always call bitmap_to_arr64()?
>
> Note that for cases where alloc_size is > 8 we never make any
> assumptions about the contents of @storage, and don't care much about
> the byte order as long as swap decompression is done with the same
> endianness (which is always the case).
> (The case where alloc_size==8 is somewhat special, and needs more
> accurate handling, because we do make assumptions about the bit layout
> there).

So, this is my fault, and I'm really sorry. I read that 16-byte as
16-bit and mistook everything else. Please scratch the above.

If you allocate word-aligned memory whose size is a multiple of the word
size (which is your case) and access it only through the bitmap API, e.g.
bitmap_read()/bitmap_write(), everything should be fine.

Sorry again.

2023-10-02 11:00:19

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v5 2/5] lib/test_bitmap: add tests for bitmap_{read,write}()

On Mon, Oct 2, 2023 at 4:44 AM Yury Norov <[email protected]> wrote:
>
> On Fri, Sep 29, 2023 at 10:54:59AM +0200, Alexander Potapenko wrote:
> > On Thu, Sep 28, 2023 at 10:02 PM Yury Norov <[email protected]> wrote:
> > >
> > > On Thu, Sep 28, 2023 at 05:14:55PM +0200, Alexander Potapenko wrote:
> > > >
> > > > So e.g. for compressing something into a 16-byte buffer using bitmaps
> > > > I'd need to:
> > > >
> > > > 1) Allocate the buffer: buf = kmem_cache_alloc(...)
> > > > 2) Allocate the bitmap: bitmap = bitmap_alloc(16*8, ...)
> > > > 3) Fill the bitmap: mte_compress_to_buf(..., bitmap, 16)
> > > > 4) Copy the bitmap contents to the buffer: bitmap_to_arr64(buf, bitmap, 16*8)
> > > > 5) Deallocate the bitmap: bitmap_free(bitmap)
> > > >
> > > > instead of:
> > > >
> > > > buf = kmem_cache_alloc(...)
> > > > mte_compress_to_buf(..., (unsigned long *)buf, 16)
> > > >
> > > > , correct?
> > > >
> > > > Given that the buffer contents are opaque and its size is aligned on 8
> > > > bytes, could it be possible to somehow adopt the `buf` pointer
> > > > instead?
> > >
> > > I didn't find an explicit typecasting where you're using
> > > mte_compress_to_buf(), but now after hard 2nd look I see...
> > >
> > > Firstly, now that in the documentation you are explicitly describing the
> > > return value of mte_compress() as 64-bit frame, the right way to go would
> > > be declaring the function as: u64 mte_compress(u8 *tags).
> >
> > Ack.
> >
> > > And the general pattern should be like this:
> > >
> > > unsigned long mte_compress(u8 *tags)
> > > {
> > > DECLARE_BITMAP(tmp, MTECOMP_CACHES_MAXBITS);
> > > void *storage;
> > > ...
> > > if (alloc_size < MTE_PAGE_TAG_STORAGE) {
> > > storage = kmem_cache_alloc(cache, GFP_KERNEL);
> > > mte_compress_to_buf(r_len, r_tags, r_sizes, tmp, alloc_size);
> > >
> > > switch (alloc_size) {
> > > case 16:
> > > bitmap_to_arr16(storage, tmp, 16);
> >
> > I might be missing something, but why do we need the switch at all?
> > The buffers we are allocating always contain a whole number of u64's -
> > cannot we just always call bitmap_to_arr64()?
> >
> > Note that for cases where alloc_size is > 8 we never make any
> > assumptions about the contents of @storage, and don't care much about
> > the byte order as long as swap decompression is done with the same
> > endianness (which is always the case).
> > (The case where alloc_size==8 is somewhat special, and needs more
> > accurate handling, because we do make assumptions about the bit layout
> > there).
>
> So, this is my fault, and I'm really sorry. I read that 16-byte as
> 16-bit and mistook everything else. Please scratch the above.
>
> If you allocate word-aligned memory whose size is a multiple of the word
> size (which is your case) and access it only through the bitmap API, e.g.
> bitmap_read()/bitmap_write(), everything should be fine.

Ok, fine, I'll stick to the current implementation then.
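
(For completeness, a hedged sketch of the pattern the thread converges on:
a kmem cache whose objects are long-aligned and a whole number of longs in
size, used directly as bitmaps via bitmap_read()/bitmap_write(). The names
and sizes below are illustrative and do not come from the actual mtecomp.c:)

static struct kmem_cache *mte_comp_cache16;

static int __init mte_comp_cache_init(void)
{
        /* 16-byte objects, aligned on sizeof(long): two whole longs each. */
        mte_comp_cache16 = kmem_cache_create("mte-tags-16", 16,
                                             sizeof(unsigned long), 0, NULL);
        return mte_comp_cache16 ? 0 : -ENOMEM;
}

static unsigned long *mte_comp_buf_alloc(void)
{
        /*
         * The object is long-aligned and long-sized, so it can be passed to
         * bitmap_read()/bitmap_write() directly, without an intermediate
         * bitmap_alloc() plus bitmap_to_arr64() copy.
         */
        return kmem_cache_alloc(mte_comp_cache16, GFP_KERNEL);
}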