2022-12-16 19:27:56

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v4 0/4] cachestat: a new syscall for page cache state of files

Changelog:
v4:
* Refactor cachestat and move it to mm/filemap.c (patch 3)
(suggested by Brian Foster)
* Remove redundant checks (!folio, access_ok)
(patch 3) (suggested by Matthew Wilcox and Al Viro)
* Fix a bug in handling multipages folio.
(patch 3) (suggested by Matthew Wilcox)
* Add a selftest for shmem files, which can be used to test huge
pages (patch 4) (suggested by Johannes Weiner)
v3:
* Fix some minor formatting issues and build errors.
* Add the new syscall entry to missing architecture syscall tables.
(patch 3).
* Add flags argument for the syscall. (patch 3).
* Clean up the recency refactoring (patch 2) (suggested by Yu Zhao)
* Add the new Kconfig (CONFIG_CACHESTAT) to disable the syscall.
(patch 3) (suggested by Josh Triplett)
v2:
* len == 0 means query to EOF. len < 0 is invalid.
(patch 3) (suggested by Brian Foster)
* Make cachestat extensible by adding the `cstat_size` argument in the
syscall (patch 3)

There is currently no good way to query the page cache state of large
file sets and directory trees. There is mincore(), but it scales poorly:
the kernel writes out a lot of bitmap data that userspace has to
aggregate, when the user really doesn not care about per-page information
in that case. The user also needs to mmap and unmap each file as it goes
along, which can be quite slow as well.

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty
pages, pages marked for writeback, evicted pages etc.) of a file, in a
specified range of bytes. It also include a selftest suite that tests some
typical usage

This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore.
Relevant links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html

For comparison with mincore, I ran both syscalls on a 2TB sparse file:

Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s

Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s

This series consist of 4 patches:

Johannes Weiner (1):
workingset: fix confusion around eviction vs refault container

Nhat Pham (3):
workingset: refactor LRU refault to expose refault recency check
cachestat: implement cachestat syscall
selftests: Add selftests for cachestat

MAINTAINERS | 7 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/fs.h | 3 +
include/linux/swap.h | 1 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 9 +
init/Kconfig | 10 +
kernel/sys_ni.c | 1 +
mm/filemap.c | 137 +++++++++
mm/workingset.c | 130 ++++++---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/cachestat/.gitignore | 2 +
tools/testing/selftests/cachestat/Makefile | 8 +
.../selftests/cachestat/test_cachestat.c | 259 ++++++++++++++++++
27 files changed, 550 insertions(+), 39 deletions(-)
create mode 100644 tools/testing/selftests/cachestat/.gitignore
create mode 100644 tools/testing/selftests/cachestat/Makefile
create mode 100644 tools/testing/selftests/cachestat/test_cachestat.c

--
2.30.2


2022-12-16 19:43:49

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v4 4/4] selftests: Add selftests for cachestat

Test cachestat on a newly created file, /dev/ files, and /proc/ files.
Also test on a shmem file (which can also be tested with huge pages
since tmpfs supports huge pages).

Signed-off-by: Nhat Pham <[email protected]>
---
MAINTAINERS | 7 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/cachestat/.gitignore | 2 +
tools/testing/selftests/cachestat/Makefile | 8 +
.../selftests/cachestat/test_cachestat.c | 259 ++++++++++++++++++
5 files changed, 277 insertions(+)
create mode 100644 tools/testing/selftests/cachestat/.gitignore
create mode 100644 tools/testing/selftests/cachestat/Makefile
create mode 100644 tools/testing/selftests/cachestat/test_cachestat.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a198da986146..792a866353ec 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4552,6 +4552,13 @@ S: Supported
F: Documentation/filesystems/caching/cachefiles.rst
F: fs/cachefiles/

+CACHESTAT: PAGE CACHE STATS FOR A FILE
+M: Nhat Pham <[email protected]>
+M: Johannes Weiner <[email protected]>
+L: [email protected]
+S: Maintained
+F: tools/testing/selftests/cachestat/test_cachestat.c
+
CADENCE MIPI-CSI2 BRIDGES
M: Maxime Ripard <[email protected]>
L: [email protected]
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 0464b2c6c1e4..3cad0b38c5c2 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += amd-pstate
TARGETS += arm64
TARGETS += bpf
TARGETS += breakpoints
+TARGETS += cachestat
TARGETS += capabilities
TARGETS += cgroup
TARGETS += clone3
diff --git a/tools/testing/selftests/cachestat/.gitignore b/tools/testing/selftests/cachestat/.gitignore
new file mode 100644
index 000000000000..d6c30b43a4bb
--- /dev/null
+++ b/tools/testing/selftests/cachestat/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+test_cachestat
diff --git a/tools/testing/selftests/cachestat/Makefile b/tools/testing/selftests/cachestat/Makefile
new file mode 100644
index 000000000000..fca73aaa7d14
--- /dev/null
+++ b/tools/testing/selftests/cachestat/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := test_cachestat
+
+CFLAGS += $(KHDR_INCLUDES)
+CFLAGS += -Wall
+CFLAGS += -lrt
+
+include ../lib.mk
diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
new file mode 100644
index 000000000000..3f5cd5695c51
--- /dev/null
+++ b/tools/testing/selftests/cachestat/test_cachestat.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdbool.h>
+#include <linux/kernel.h>
+#include <linux/mman.h>
+#include <sys/mman.h>
+#include <sys/shm.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <errno.h>
+
+#include "../kselftest.h"
+
+static const char * const dev_files[] = {
+ "/dev/zero", "/dev/null", "/dev/urandom",
+ "/proc/version", "/proc"
+};
+static const int cachestat_nr = 451;
+
+void print_cachestat(struct cachestat *cs)
+{
+ ksft_print_msg(
+ "Using cachestat: Cached: %lu, Dirty: %lu, Writeback: %lu, Evicted: %lu, Recently Evicted: %lu\n",
+ cs->nr_cache, cs->nr_dirty, cs->nr_writeback,
+ cs->nr_evicted, cs->nr_recently_evicted);
+}
+
+bool write_exactly(int fd, size_t filesize)
+{
+ char data[filesize];
+ bool ret = true;
+ int random_fd = open("/dev/urandom", O_RDONLY);
+
+ if (random_fd < 0) {
+ ksft_print_msg("Unable to access urandom.\n");
+ ret = false;
+ goto out;
+ } else {
+ int remained = filesize;
+ char *cursor = data;
+
+ while (remained) {
+ ssize_t read_len = read(random_fd, cursor, remained);
+
+ if (read_len <= 0) {
+ ksft_print_msg("Unable to read from urandom.\n");
+ ret = false;
+ goto close_random_fd;
+ }
+
+ remained -= read_len;
+ cursor += read_len;
+ }
+
+ /* write random data to fd */
+ remained = filesize;
+ cursor = data;
+ while (remained) {
+ ssize_t write_len = write(fd, cursor, remained);
+
+ if (write_len <= 0) {
+ ksft_print_msg("Unable write random data to file.\n");
+ ret = false;
+ goto close_random_fd;
+ }
+
+ remained -= write_len;
+ cursor += write_len;
+ }
+ }
+
+close_random_fd:
+ close(random_fd);
+out:
+ return ret;
+}
+
+/*
+ * Open/create the file at filename, (optionally) write random data to it
+ * (exactly num_pages), then test the cachestat syscall on this file.
+ *
+ * If test_fsync == true, fsync the file, then check the number of dirty
+ * pages.
+ */
+bool test_cachestat(const char *filename, bool write_random, bool create,
+ bool test_fsync, unsigned long num_pages, int open_flags,
+ mode_t open_mode)
+{
+ size_t PS = sysconf(_SC_PAGESIZE);
+ int filesize = num_pages * PS;
+ bool ret = true;
+ long syscall_ret;
+ struct cachestat cs;
+
+ int fd = open(filename, open_flags, open_mode);
+
+ if (fd == -1) {
+ ksft_print_msg("Unable to create/open file.\n");
+ goto out;
+ } else {
+ ksft_print_msg("Create/open %s\n", filename);
+ }
+
+ if (write_random) {
+ if (!write_exactly(fd, filesize)) {
+ ksft_print_msg("Unable to access urandom.\n");
+ ret = false;
+ goto out1;
+ }
+ }
+
+ syscall_ret = syscall(cachestat_nr, fd, 0, filesize,
+ sizeof(struct cachestat), &cs, 0);
+
+ ksft_print_msg("Cachestat call returned %ld\n", syscall_ret);
+
+ if (syscall_ret) {
+ ksft_print_msg("Cachestat returned non-zero.\n");
+ ret = false;
+ goto out1;
+
+ } else {
+ print_cachestat(&cs);
+
+ if (write_random) {
+ if (cs.nr_cache + cs.nr_evicted != num_pages) {
+ ksft_print_msg(
+ "Total number of cached and evicted pages is off.\n");
+ ret = false;
+ }
+ }
+ }
+
+ if (test_fsync) {
+ if (fsync(fd)) {
+ ksft_print_msg("fsync fails.\n");
+ ret = false;
+ } else {
+ syscall_ret = syscall(cachestat_nr, fd, 0, filesize,
+ sizeof(struct cachestat), &cs, 0);
+
+ ksft_print_msg("Cachestat call (after fsync) returned %ld\n",
+ syscall_ret);
+
+ if (!syscall_ret) {
+ print_cachestat(&cs);
+
+ if (cs.nr_dirty) {
+ ret = false;
+ ksft_print_msg(
+ "Number of dirty should be zero after fsync.\n");
+ }
+ } else {
+ ksft_print_msg("Cachestat (after fsync) returned non-zero.\n");
+ ret = false;
+ goto out1;
+ }
+ }
+ }
+
+out1:
+ close(fd);
+
+ if (create)
+ remove(filename);
+out:
+ return ret;
+}
+
+bool test_cachestat_shmem(void)
+{
+ size_t PS = sysconf(_SC_PAGESIZE);
+ size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */
+ int syscall_ret;
+ off_t off = PS;
+ size_t compute_len = PS * 512;
+ char *filename = "tmpshmcstat";
+ struct cachestat cs;
+ bool ret = true;
+ unsigned long num_pages = compute_len / PS;
+ int fd = shm_open(filename, O_CREAT | O_RDWR, 0600);
+
+ if (fd < 0) {
+ ksft_print_msg("Unable to create shmem file.\n");
+ ret = false;
+ goto out;
+ }
+
+ if (ftruncate(fd, filesize)) {
+ ksft_print_msg("Unable to trucate shmem file.\n");
+ ret = false;
+ goto close_fd;
+ }
+
+ if (!write_exactly(fd, filesize)) {
+ ksft_print_msg("Unable to write to shmem file.\n");
+ ret = false;
+ goto close_fd;
+ }
+
+ syscall_ret = syscall(cachestat_nr, fd, off, compute_len,
+ sizeof(struct cachestat), &cs, 0);
+
+ if (syscall_ret) {
+ ksft_print_msg("Cachestat returned non-zero.\n");
+ ret = false;
+ goto close_fd;
+ } else {
+ print_cachestat(&cs);
+ if (cs.nr_cache + cs.nr_evicted != num_pages) {
+ ksft_print_msg(
+ "Total number of cached and evicted pages is off.\n");
+ ret = false;
+ }
+ }
+
+close_fd:
+ shm_unlink(filename);
+out:
+ return ret;
+}
+
+int main(void)
+{
+ int ret = 0;
+
+ for (int i = 0; i < 5; i++) {
+ const char *dev_filename = dev_files[i];
+
+ if (test_cachestat(dev_filename, false, false, false,
+ 4, O_RDONLY, 0400))
+ ksft_test_result_pass("cachestat works with %s\n", dev_filename);
+ else {
+ ksft_test_result_fail("cachestat fails with %s\n", dev_filename);
+ ret = 1;
+ }
+ }
+
+ if (test_cachestat("tmpfilecachestat", true, true,
+ true, 4, O_CREAT | O_RDWR, 0400 | 0600))
+ ksft_test_result_pass("cachestat works with a normal file\n");
+ else {
+ ksft_test_result_fail("cachestat fails with normal file\n");
+ ret = 1;
+ }
+
+ if (test_cachestat_shmem())
+ ksft_test_result_pass("cachestat works with a shmem file\n");
+ else {
+ ksft_test_result_fail("cachestat fails with a shmem file\n");
+ ret = 1;
+ }
+
+ return ret;
+}
--
2.30.2

2022-12-16 19:45:59

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v4 1/4] workingset: fix confusion around eviction vs refault container

From: Johannes Weiner <[email protected]>

Refault decisions are made based on the lruvec where the page was
evicted, as that determined its LRU order while it was alive. Stats
and workingset aging must then occur on the lruvec of the new page, as
that's the node and cgroup that experience the refault and that's the
lruvec whose nonresident info ages out by a new resident page. Those
lruvecs could be different when a page is shared between cgroups, or
the refaulting page is allocated on a different node.

There are currently two mix-ups:

1. When swap is available, the resident anon set must be considered
when comparing the refault distance. The comparison is made against
the right anon set, but the check for swap is not. When pages get
evicted from a cgroup with swap, and refault in one without, this
can incorrectly consider a hot refault as cold - and vice
versa. Fix that by using the eviction cgroup for the swap check.

2. The stats and workingset age are updated against the wrong lruvec
altogether: the right cgroup but the wrong NUMA node. When a page
refaults on a different NUMA node, this will have confusing stats
and distort the workingset age on a different lruvec - again
possibly resulting in hot/cold misclassifications down the line.

Fix the swap check and the refault pgdat to address both concerns.

This was found during code review. It hasn't caused notable issues in
production, suggesting that those refault-migrations are relatively
rare in practice.

Signed-off-by: Johannes Weiner <[email protected]>
Co-developed-by: Nhat Pham <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/workingset.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index ae7e984b23c6..79585d55c45d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -457,6 +457,7 @@ void workingset_refault(struct folio *folio, void *shadow)
*/
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
+ pgdat = folio_pgdat(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);

mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
@@ -474,7 +475,7 @@ void workingset_refault(struct folio *folio, void *shadow)
workingset_size += lruvec_page_state(eviction_lruvec,
NR_INACTIVE_FILE);
}
- if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+ if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
workingset_size += lruvec_page_state(eviction_lruvec,
NR_ACTIVE_ANON);
if (file) {
--
2.30.2

2022-12-16 19:54:18

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v4 2/4] workingset: refactor LRU refault to expose refault recency check

In preparation for computing recently evicted pages in cachestat,
refactor workingset_refault and lru_gen_refault to expose a helper
function that would test if an evicted page is recently evicted.

Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/swap.h | 1 +
mm/workingset.c | 129 ++++++++++++++++++++++++++++++-------------
2 files changed, 92 insertions(+), 38 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..dae6f6f955eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -361,6 +361,7 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
}

/* linux/mm/workingset.c */
+bool workingset_test_recent(void *shadow, bool file, bool *workingset);
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
void workingset_refault(struct folio *folio, void *shadow);
diff --git a/mm/workingset.c b/mm/workingset.c
index 79585d55c45d..006482c4e0bd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -244,6 +244,33 @@ static void *lru_gen_eviction(struct folio *folio)
return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
}

+/*
+ * Test if the folio is recently evicted.
+ *
+ * As a side effect, also populates the references with
+ * values unpacked from the shadow of the evicted folio.
+ */
+static bool lru_gen_test_recent(void *shadow, bool file, bool *workingset)
+{
+ struct mem_cgroup *eviction_memcg;
+ struct lruvec *lruvec;
+ struct lru_gen_struct *lrugen;
+ unsigned long min_seq;
+
+ int memcgid;
+ struct pglist_data *pgdat;
+ unsigned long token;
+
+ unpack_shadow(shadow, &memcgid, &pgdat, &token, workingset);
+ eviction_memcg = mem_cgroup_from_id(memcgid);
+
+ lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
+ lrugen = &lruvec->lrugen;
+
+ min_seq = READ_ONCE(lrugen->min_seq[file]);
+ return !((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)));
+}
+
static void lru_gen_refault(struct folio *folio, void *shadow)
{
int hist, tier, refs;
@@ -306,6 +333,11 @@ static void *lru_gen_eviction(struct folio *folio)
return NULL;
}

+static bool lru_gen_test_recent(void *shadow, bool file, bool *workingset)
+{
+ return true;
+}
+
static void lru_gen_refault(struct folio *folio, void *shadow)
{
}
@@ -373,40 +405,31 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
folio_test_workingset(folio));
}

-/**
- * workingset_refault - Evaluate the refault of a previously evicted folio.
- * @folio: The freshly allocated replacement folio.
- * @shadow: Shadow entry of the evicted folio.
+/*
+ * Test if the folio is recently evicted by checking if
+ * refault distance of shadow exceeds workingset size.
*
- * Calculates and evaluates the refault distance of the previously
- * evicted folio in the context of the node and the memcg whose memory
- * pressure caused the eviction.
+ * As a side effect, populate workingset with the value
+ * unpacked from shadow.
*/
-void workingset_refault(struct folio *folio, void *shadow)
+bool workingset_test_recent(void *shadow, bool file, bool *workingset)
{
- bool file = folio_is_file_lru(folio);
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
unsigned long workingset_size;
- struct pglist_data *pgdat;
- struct mem_cgroup *memcg;
- unsigned long eviction;
- struct lruvec *lruvec;
unsigned long refault;
- bool workingset;
+
int memcgid;
- long nr;
+ struct pglist_data *pgdat;
+ unsigned long eviction;

- if (lru_gen_enabled()) {
- lru_gen_refault(folio, shadow);
- return;
- }
+ if (lru_gen_enabled())
+ return lru_gen_test_recent(shadow, file, workingset);

- unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+ unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
eviction <<= bucket_order;

- rcu_read_lock();
/*
* Look up the memcg associated with the stored ID. It might
* have been deleted since the folio's eviction.
@@ -425,7 +448,8 @@ void workingset_refault(struct folio *folio, void *shadow)
*/
eviction_memcg = mem_cgroup_from_id(memcgid);
if (!mem_cgroup_disabled() && !eviction_memcg)
- goto out;
+ return false;
+
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
refault = atomic_long_read(&eviction_lruvec->nonresident_age);

@@ -447,21 +471,6 @@ void workingset_refault(struct folio *folio, void *shadow)
*/
refault_distance = (refault - eviction) & EVICTION_MASK;

- /*
- * The activation decision for this folio is made at the level
- * where the eviction occurred, as that is where the LRU order
- * during folio reclaim is being determined.
- *
- * However, the cgroup that will own the folio is the one that
- * is actually experiencing the refault event.
- */
- nr = folio_nr_pages(folio);
- memcg = folio_memcg(folio);
- pgdat = folio_pgdat(folio);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
- mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-
mem_cgroup_flush_stats_delayed();
/*
* Compare the distance to the existing workingset size. We
@@ -483,8 +492,51 @@ void workingset_refault(struct folio *folio, void *shadow)
NR_INACTIVE_ANON);
}
}
- if (refault_distance > workingset_size)
+
+ return refault_distance <= workingset_size;
+}
+
+/**
+ * workingset_refault - Evaluate the refault of a previously evicted folio.
+ * @folio: The freshly allocated replacement folio.
+ * @shadow: Shadow entry of the evicted folio.
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted folio in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
+ */
+void workingset_refault(struct folio *folio, void *shadow)
+{
+ bool file = folio_is_file_lru(folio);
+ struct pglist_data *pgdat;
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
+ bool workingset;
+ long nr;
+
+ if (lru_gen_enabled()) {
+ lru_gen_refault(folio, shadow);
+ return;
+ }
+
+ rcu_read_lock();
+
+ nr = folio_nr_pages(folio);
+ memcg = folio_memcg(folio);
+ pgdat = folio_pgdat(folio);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+ if (!workingset_test_recent(shadow, file, &workingset)) {
+ /*
+ * The activation decision for this folio is made at the level
+ * where the eviction occurred, as that is where the LRU order
+ * during folio reclaim is being determined.
+ *
+ * However, the cgroup that will own the folio is the one that
+ * is actually experiencing the refault event.
+ */
goto out;
+ }

folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -498,6 +550,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
out:
+ mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
rcu_read_unlock();
}

--
2.30.2

2022-12-16 20:29:15

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v4 3/4] cachestat: implement cachestat syscall

Implement a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in
a given range.

NAME
cachestat - query the page cache status of a file.

SYNOPSIS
#include <sys/mman.h>

struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};

int cachestat(unsigned int fd, off_t off, size_t len,
size_t cstat_size, struct cachestat *cstat,
unsigned int flags);

DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of (recently)
evicted pages, in the bytes range given by `off` and `len`.

These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.

The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.

`cstat_size` allows users to obtain partial results. The syscall
will copy the first `csstat_size` bytes to the specified userspace
memory. `cstat_size` must be a non-negative value that is no larger
than the current size of the cachestat struct.

The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).

RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.

ERRORS
EFAULT cstat points to an invalid address.

EINVAL invalid `cstat_size` or `flags`

EBADF invalid file descriptor.

Signed-off-by: Nhat Pham <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 9 ++
init/Kconfig | 10 ++
kernel/sys_ni.c | 1 +
mm/filemap.c | 137 ++++++++++++++++++++
20 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..1f13995d00d7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@
558 common process_mrelease sys_process_mrelease
559 common futex_waitv sys_futex_waitv
560 common set_mempolicy_home_node sys_ni_syscall
+561 common cachestat sys_cachestat
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..8ebed8a13874 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..f8c74ffeeefb 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..4f504783371f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..858d22bf275c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 8a99c998da9b..7c84a72306d1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -448,3 +448,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 2bca64f96164..937460f0a8ec 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -530,3 +530,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 799147658dee..7df0329d46cb 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
448 common process_mrelease sys_process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat sys_cachestat
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 2de85c977f54..97377e8c5025 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4398cc6fb68d..faa835f3c54a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..bc0a3c941b35 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 i386 cachestat sys_cachestat
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..227538b0ce80 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 52c94ab5c205..2b69c3c035b6 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -421,3 +421,4 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common cachestat sys_cachestat
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..8902799121c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -46,6 +46,7 @@

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
+#include <uapi/linux/mman.h>

struct backing_dev_info;
struct bdi_writeback;
@@ -830,6 +831,8 @@ void filemap_invalidate_lock_two(struct address_space *mapping1,
struct address_space *mapping2);
void filemap_invalidate_unlock_two(struct address_space *mapping1,
struct address_space *mapping2);
+void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
+ pgoff_t last_index, struct cachestat *cs);


/*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..50f8c6999d99 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
+struct cachestat;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1056,6 +1057,8 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_cachestat(unsigned int fd, off_t off, size_t len,
+ size_t cstat_size, struct cachestat __user *cstat, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..cd639fae9086 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

+#define __NR_cachestat 451
+__SYSCALL(__NR_cachestat, sys_cachestat)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..fe03ed0b7587 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -4,6 +4,7 @@

#include <asm/mman.h>
#include <asm-generic/hugetlb_encode.h>
+#include <linux/types.h>

#define MREMAP_MAYMOVE 1
#define MREMAP_FIXED 2
@@ -41,4 +42,12 @@
#define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
#define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB

+struct cachestat {
+ __u64 nr_cache;
+ __u64 nr_dirty;
+ __u64 nr_writeback;
+ __u64 nr_evicted;
+ __u64 nr_recently_evicted;
+};
+
#endif /* _UAPI_LINUX_MMAN_H */
diff --git a/init/Kconfig b/init/Kconfig
index 694f7c160c9c..ecc4f781dd6c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1798,6 +1798,16 @@ config RSEQ

If unsure, say Y.

+config CACHESTAT
+ bool "Enable cachestat() system call" if EXPERT
+ default y
+ help
+ Enable the cachestat system call, which queries the page cache
+ statistics of a file (number of cached pages, dirty pages,
+ pages marked for writeback, (recently) evicted pages).
+
+ If unsure say Y here.
+
config DEBUG_RSEQ
default n
bool "Enabled debugging of rseq() system call" if EXPERT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..04bfb1e4d377 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -299,6 +299,7 @@ COND_SYSCALL(set_mempolicy);
COND_SYSCALL(migrate_pages);
COND_SYSCALL(move_pages);
COND_SYSCALL(set_mempolicy_home_node);
+COND_SYSCALL(cachestat);

COND_SYSCALL(perf_event_open);
COND_SYSCALL(accept4);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..29ffb906caa4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/syscalls.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/file.h>
@@ -55,6 +56,9 @@
#include <linux/buffer_head.h> /* for try_to_free_buffers */

#include <asm/mman.h>
+#include <uapi/linux/mman.h>
+
+#include "swap.h"

/*
* Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
return try_to_free_buffers(folio);
}
EXPORT_SYMBOL(filemap_release_folio);
+
+#ifdef CONFIG_CACHESTAT
+/**
+ * filemap_cachestat() - compute the page cache statistics of a mapping
+ * @mapping: The mapping to compute the statistics for.
+ * @first_index: The starting page cache index.
+ * @last_index: The final page index (inclusive).
+ * @cs: the cachestat struct to write the result to.
+ *
+ * This will query the page cache statistics of a mapping in the
+ * page range of [first_index, last_index] (inclusive). THe statistics
+ * queried include: number of dirty pages, number of pages marked for
+ * writeback, and the number of (recently) evicted pages.
+ */
+void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
+ pgoff_t last_index, struct cachestat *cs)
+{
+ XA_STATE(xas, &mapping->i_pages, first_index);
+ struct folio *folio;
+
+ rcu_read_lock();
+ xas_for_each(&xas, folio, last_index) {
+ unsigned long nr_pages;
+ pgoff_t folio_first_index, folio_last_index;
+
+ if (xas_retry(&xas, folio))
+ continue;
+
+ nr_pages = folio_nr_pages(folio);
+ folio_first_index = folio_pgoff(folio);
+ folio_last_index = folio_first_index + nr_pages - 1;
+
+ /* Folios might straddle the range boundaries, only count covered subpages */
+ if (folio_first_index < first_index)
+ nr_pages -= first_index - folio_first_index;
+
+ if (folio_last_index > last_index)
+ nr_pages -= folio_last_index - last_index;
+
+ if (xa_is_value(folio)) {
+ /* page is evicted */
+ void *shadow = (void *)folio;
+ bool workingset; /* not used */
+
+ cs->nr_evicted += nr_pages;
+
+#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
+ if (shmem_mapping(mapping)) {
+ /* shmem file - in swap cache */
+ swp_entry_t swp = radix_to_swp_entry(folio);
+
+ shadow = get_shadow_from_swap_cache(swp);
+ }
+#endif
+ if (workingset_test_recent(shadow, true, &workingset))
+ cs->nr_recently_evicted += nr_pages;
+
+ continue;
+ }
+
+ /* page is in cache */
+ cs->nr_cache += nr_pages;
+
+ if (folio_test_dirty(folio))
+ cs->nr_dirty += nr_pages;
+
+ if (folio_test_writeback(folio))
+ cs->nr_writeback += nr_pages;
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(filemap_cachestat);
+
+/*
+ * The cachestat(5) system call.
+ *
+ * cachestat() returns the page cache statistics of a file in the
+ * bytes specified by `off` and `len`: number of cached pages,
+ * number of dirty pages, number of pages marked for writeback,
+ * number of (recently) evicted pages.
+ *
+ * `off` and `len` must be non-negative integers. If `len` > 0,
+ * the queried range is [`off`, `off` + `len`]. If `len` == 0,
+ * we will query in the range from `off` to the end of the file.
+ *
+ * `cstat_size` allows users to obtain partial results. The syscall
+ * will copy the first `csstat_size` bytes to the specified userspace
+ * memory. It also makes the cachestat struct extensible - new fields
+ * can be added in the future without breaking existing usage.
+ * `cstat_size` must be a non-negative value that is no larger than
+ * the current size of the cachestat struct.
+ *
+ * The `flags` argument is unused for now, but is included for future
+ * extensibility. User should pass 0 (i.e no flag specified).
+ *
+ * Because the status of a page can change after cachestat() checks it
+ * but before it returns to the application, the returned values may
+ * contain stale information.
+ *
+ * return values:
+ * zero - success
+ * -EFAULT - cstat points to an illegal address
+ * -EINVAL - invalid arguments
+ * -EBADF - invalid file descriptor
+ */
+SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
+ size_t, cstat_size, struct cachestat __user *, cstat,
+ unsigned int, flags)
+{
+ struct fd f = fdget(fd);
+ struct address_space *mapping;
+ struct cachestat cs;
+ pgoff_t first_index = off >> PAGE_SHIFT;
+ pgoff_t last_index =
+ len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
+
+ if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
+ return -EINVAL;
+
+ if (!f.file)
+ return -EBADF;
+
+ memset(&cs, 0, sizeof(struct cachestat));
+ mapping = f.file->f_mapping;
+ filemap_cachestat(mapping, first_index, last_index, &cs);
+ fdput(f);
+
+ if (copy_to_user(cstat, &cs, cstat_size))
+ return -EFAULT;
+
+ return 0;
+}
+#endif /* CONFIG_CACHESTAT */
--
2.30.2

2022-12-16 22:17:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <[email protected]> wrote:

> Implement a new syscall that queries cache state of a file and
> summarizes the number of cached pages, number of dirty pages, number of
> pages marked for writeback, number of (recently) evicted pages, etc. in
> a given range.
>
> NAME
> cachestat - query the page cache status of a file.
>
> SYNOPSIS
> #include <sys/mman.h>
>
> struct cachestat {
> __u64 nr_cache;
> __u64 nr_dirty;
> __u64 nr_writeback;
> __u64 nr_evicted;
> __u64 nr_recently_evicted;
> };
>
> int cachestat(unsigned int fd, off_t off, size_t len,
> size_t cstat_size, struct cachestat *cstat,
> unsigned int flags);
>
> DESCRIPTION
> cachestat() queries the number of cached pages, number of dirty
> pages, number of pages marked for writeback, number of (recently)
> evicted pages, in the bytes range given by `off` and `len`.

I suggest this be spelled out better: "number of evicted and number or
recently evicted pages".

I suggest this clearly tell readers what an "evicted" page is - they
aren't kernel programmers!

What is the benefit of the "recently evicted" pages? "recently" seems
very vague - what use is this to anyone?

> These values are returned in a cachestat struct, whose address is
> given by the `cstat` argument.
>
> The `off` and `len` arguments must be non-negative integers. If
> `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> 0, we will query in the range from `off` to the end of the file.
>
> `cstat_size` allows users to obtain partial results. The syscall
> will copy the first `csstat_size` bytes to the specified userspace
> memory. `cstat_size` must be a non-negative value that is no larger
> than the current size of the cachestat struct.
>
> The `flags` argument is unused for now, but is included for future
> extensibility. User should pass 0 (i.e no flag specified).

Why is `flags' here? We could add an unused flags arg to any syscall,
but we don't. What's the plan?


Are there security implications? If I know that some process has a
file open, I can use cachestat() to infer which parts of that file
they're looking at (like mincore(), I guess). And I can infer which
parts they're writing to, unlike mincore().


I suggest the [patch 1/4] fixup be separated from this series.


2022-12-16 23:04:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

Hi Nhat,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on shuah-kselftest/next]
[also build test WARNING on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: alpha-randconfig-r036-20221216
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash mm/damon/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from include/uapi/linux/mman.h:5,
from include/linux/fs.h:49,
from include/linux/highmem.h:5,
from mm/damon/vaddr.c:11:
>> arch/alpha/include/uapi/asm/mman.h:15: warning: "MAP_FIXED" redefined
15 | #define MAP_FIXED 0x100 /* Interpret addr exactly */
|
In file included from mm/damon/vaddr.c:10:
include/uapi/asm-generic/mman-common.h:22: note: this is the location of the previous definition
22 | #define MAP_FIXED 0x10 /* Interpret addr exactly */
|
>> arch/alpha/include/uapi/asm/mman.h:16: warning: "MAP_ANONYMOUS" redefined
16 | #define MAP_ANONYMOUS 0x10 /* don't use a file */
|
include/uapi/asm-generic/mman-common.h:23: note: this is the location of the previous definition
23 | #define MAP_ANONYMOUS 0x20 /* don't use a file */
|
>> arch/alpha/include/uapi/asm/mman.h:29: warning: "MAP_POPULATE" redefined
29 | #define MAP_POPULATE 0x20000 /* populate (prefault) pagetables */
|
include/uapi/asm-generic/mman-common.h:26: note: this is the location of the previous definition
26 | #define MAP_POPULATE 0x008000 /* populate (prefault) pagetables */
|
>> arch/alpha/include/uapi/asm/mman.h:30: warning: "MAP_NONBLOCK" redefined
30 | #define MAP_NONBLOCK 0x40000 /* do not block on IO */
|
include/uapi/asm-generic/mman-common.h:27: note: this is the location of the previous definition
27 | #define MAP_NONBLOCK 0x010000 /* do not block on IO */
|
>> arch/alpha/include/uapi/asm/mman.h:31: warning: "MAP_STACK" redefined
31 | #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
|
include/uapi/asm-generic/mman-common.h:28: note: this is the location of the previous definition
28 | #define MAP_STACK 0x020000 /* give out an address that is best suited for process/thread stacks */
|
>> arch/alpha/include/uapi/asm/mman.h:32: warning: "MAP_HUGETLB" redefined
32 | #define MAP_HUGETLB 0x100000 /* create a huge page mapping */
|
include/uapi/asm-generic/mman-common.h:29: note: this is the location of the previous definition
29 | #define MAP_HUGETLB 0x040000 /* create a huge page mapping */
|
>> arch/alpha/include/uapi/asm/mman.h:33: warning: "MAP_FIXED_NOREPLACE" redefined
33 | #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
|
include/uapi/asm-generic/mman-common.h:31: note: this is the location of the previous definition
31 | #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
|
>> arch/alpha/include/uapi/asm/mman.h:36: warning: "MS_SYNC" redefined
36 | #define MS_SYNC 2 /* synchronous memory sync */
|
include/uapi/asm-generic/mman-common.h:43: note: this is the location of the previous definition
43 | #define MS_SYNC 4 /* synchronous memory sync */
|
>> arch/alpha/include/uapi/asm/mman.h:37: warning: "MS_INVALIDATE" redefined
37 | #define MS_INVALIDATE 4 /* invalidate the caches */
|
include/uapi/asm-generic/mman-common.h:42: note: this is the location of the previous definition
42 | #define MS_INVALIDATE 2 /* invalidate the caches */
|
>> arch/alpha/include/uapi/asm/mman.h:50: warning: "MADV_DONTNEED" redefined
50 | #define MADV_DONTNEED 6 /* don't need these pages */
|
include/uapi/asm-generic/mman-common.h:49: note: this is the location of the previous definition
49 | #define MADV_DONTNEED 4 /* don't need these pages */
|


vim +/MAP_FIXED +15 arch/alpha/include/uapi/asm/mman.h

^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 12
746c9398f5ac2b arch/alpha/include/uapi/asm/mman.h Michael S. Tsirkin 2019-02-08 13 /* 0x01 - 0x03 are defined in linux/mman.h */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 14 #define MAP_TYPE 0x0f /* Mask for type of mapping (OSF/1 is _wrong_) */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @15 #define MAP_FIXED 0x100 /* Interpret addr exactly */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @16 #define MAP_ANONYMOUS 0x10 /* don't use a file */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 17
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 18 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 19 #define _MAP_HASSEMAPHORE 0x0200
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 20 #define _MAP_INHERIT 0x0400
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 21 #define _MAP_UNALIGNED 0x0800
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 22
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 23 /* These are linux-specific */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 24 #define MAP_GROWSDOWN 0x01000 /* stack-like segment */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 25 #define MAP_DENYWRITE 0x02000 /* ETXTBSY */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 26 #define MAP_EXECUTABLE 0x04000 /* mark it as an executable */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 27 #define MAP_LOCKED 0x08000 /* lock the mapping */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 28 #define MAP_NORESERVE 0x10000 /* don't check for reservations */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @29 #define MAP_POPULATE 0x20000 /* populate (prefault) pagetables */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @30 #define MAP_NONBLOCK 0x40000 /* do not block on IO */
90f72aa58bbf07 arch/alpha/include/asm/mman.h Arnd Bergmann 2009-09-21 @31 #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
90f72aa58bbf07 arch/alpha/include/asm/mman.h Arnd Bergmann 2009-09-21 @32 #define MAP_HUGETLB 0x100000 /* create a huge page mapping */
a4ff8e8620d3f4 arch/alpha/include/uapi/asm/mman.h Michal Hocko 2018-04-10 @33 #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 34
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 35 #define MS_ASYNC 1 /* sync memory asynchronously */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @36 #define MS_SYNC 2 /* synchronous memory sync */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @37 #define MS_INVALIDATE 4 /* invalidate the caches */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 38
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 39 #define MCL_CURRENT 8192 /* lock all currently mapped pages */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 40 #define MCL_FUTURE 16384 /* lock all additions to address space */
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 41 #define MCL_ONFAULT 32768 /* lock all pages that are faulted in */
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 42
b0f205c2a3082d arch/alpha/include/uapi/asm/mman.h Eric B Munson 2015-11-05 43 #define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 44
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 45 #define MADV_NORMAL 0 /* no further special treatment */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 46 #define MADV_RANDOM 1 /* expect random page references */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 47 #define MADV_SEQUENTIAL 2 /* expect sequential page references */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 48 #define MADV_WILLNEED 3 /* will need these pages */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 49 #define MADV_SPACEAVAIL 5 /* ensure resources are available */
^1da177e4c3f41 include/asm-alpha/mman.h Linus Torvalds 2005-04-16 @50 #define MADV_DONTNEED 6 /* don't need these pages */
5f6164f3092832 include/asm-alpha/mman.h Michael S. Tsirkin 2006-02-15 51

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (11.22 kB)
config (139.61 kB)
Download all attachments

2022-12-16 23:04:13

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: arm64-randconfig-r035-20221216
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 98b13979fb05f3ed288a900deb843e7b27589e58)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 prepare

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from arch/arm64/kernel/asm-offsets.c:10:
In file included from include/linux/arm_sdei.h:8:
In file included from include/acpi/ghes.h:5:
In file included from include/acpi/apei.h:9:
In file included from include/linux/acpi.h:15:
In file included from include/linux/device.h:32:
In file included from include/linux/device/driver.h:21:
In file included from include/linux/module.h:19:
In file included from include/linux/elf.h:6:
In file included from arch/arm64/include/asm/elf.h:141:
In file included from include/linux/fs.h:49:
In file included from include/uapi/linux/mman.h:5:
>> arch/arm64/include/asm/mman.h:15:10: error: use of undeclared identifier 'VM_ARM64_BTI'
ret |= VM_ARM64_BTI;
^
>> arch/arm64/include/asm/mman.h:18:10: error: use of undeclared identifier 'VM_MTE'
ret |= VM_MTE;
^
>> arch/arm64/include/asm/mman.h:32:10: error: use of undeclared identifier 'VM_MTE_ALLOWED'
return VM_MTE_ALLOWED;
^
arch/arm64/include/asm/mman.h:59:22: error: use of undeclared identifier 'VM_MTE'
return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
^
arch/arm64/include/asm/mman.h:59:45: error: use of undeclared identifier 'VM_MTE_ALLOWED'
return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
^
5 errors generated.
make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:1270: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:231: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +/VM_ARM64_BTI +15 arch/arm64/include/asm/mman.h

8ef8f360cf30be Dave Martin 2020-03-16 8
8ef8f360cf30be Dave Martin 2020-03-16 9 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
8ef8f360cf30be Dave Martin 2020-03-16 10 unsigned long pkey __always_unused)
8ef8f360cf30be Dave Martin 2020-03-16 11 {
9f3419315f3cdc Catalin Marinas 2019-11-27 12 unsigned long ret = 0;
9f3419315f3cdc Catalin Marinas 2019-11-27 13
8ef8f360cf30be Dave Martin 2020-03-16 14 if (system_supports_bti() && (prot & PROT_BTI))
9f3419315f3cdc Catalin Marinas 2019-11-27 @15 ret |= VM_ARM64_BTI;
8ef8f360cf30be Dave Martin 2020-03-16 16
9f3419315f3cdc Catalin Marinas 2019-11-27 17 if (system_supports_mte() && (prot & PROT_MTE))
9f3419315f3cdc Catalin Marinas 2019-11-27 @18 ret |= VM_MTE;
9f3419315f3cdc Catalin Marinas 2019-11-27 19
9f3419315f3cdc Catalin Marinas 2019-11-27 20 return ret;
8ef8f360cf30be Dave Martin 2020-03-16 21 }
8ef8f360cf30be Dave Martin 2020-03-16 22 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
8ef8f360cf30be Dave Martin 2020-03-16 23
9f3419315f3cdc Catalin Marinas 2019-11-27 24 static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
9f3419315f3cdc Catalin Marinas 2019-11-27 25 {
9f3419315f3cdc Catalin Marinas 2019-11-27 26 /*
9f3419315f3cdc Catalin Marinas 2019-11-27 27 * Only allow MTE on anonymous mappings as these are guaranteed to be
9f3419315f3cdc Catalin Marinas 2019-11-27 28 * backed by tags-capable memory. The vm_flags may be overridden by a
9f3419315f3cdc Catalin Marinas 2019-11-27 29 * filesystem supporting MTE (RAM-based).
9f3419315f3cdc Catalin Marinas 2019-11-27 30 */
9f3419315f3cdc Catalin Marinas 2019-11-27 31 if (system_supports_mte() && (flags & MAP_ANONYMOUS))
9f3419315f3cdc Catalin Marinas 2019-11-27 @32 return VM_MTE_ALLOWED;
9f3419315f3cdc Catalin Marinas 2019-11-27 33
9f3419315f3cdc Catalin Marinas 2019-11-27 34 return 0;
9f3419315f3cdc Catalin Marinas 2019-11-27 35 }
9f3419315f3cdc Catalin Marinas 2019-11-27 36 #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
9f3419315f3cdc Catalin Marinas 2019-11-27 37

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (6.22 kB)
config (227.74 kB)
Download all attachments

2022-12-16 23:18:34

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: s390-allmodconfig
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

In file included from include/uapi/asm-generic/mman.h:5,
from ./arch/s390/include/generated/uapi/asm/mman.h:1,
from include/uapi/linux/mman.h:5,
from include/linux/fs.h:49,
from include/linux/huge_mm.h:8,
from include/linux/mm.h:737,
from include/linux/kvm_host.h:16,
from arch/s390/kvm/kvm-s390.h:17,
from arch/s390/kvm/gaccess.c:16:
>> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant
16 | #define PROT_NONE 0x0 /* page can not be accessed */
| ^~~
arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE'
493 | PROT_NONE,
| ^~~~~~~~~
arch/s390/kvm/gaccess.c: In function 'trans_exc_ending':
>> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value
516 | case PROT_TYPE_LA:
| ^~~~
arch/s390/kvm/gaccess.c:509:17: note: previously used here
509 | case PROT_NONE:
| ^~~~
In file included from include/linux/compiler_types.h:79,
from <command-line>:
>> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label
223 | # define fallthrough __attribute__((__fallthrough__))
| ^~~~~~~~~~~~~
arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough'
515 | fallthrough;
| ^~~~~~~~~~~


vim +16 include/uapi/asm-generic/mman-common.h

5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 9
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 10 #define PROT_READ 0x1 /* page can be read */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 11 #define PROT_WRITE 0x2 /* page can be written */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 12 #define PROT_EXEC 0x4 /* page can be executed */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 13 #define PROT_SEM 0x8 /* page may be used for atomic ops */
d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 14 /* 0x10 reserved for arch-specific use */
d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 15 /* 0x20 reserved for arch-specific use */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 @16 #define PROT_NONE 0x0 /* page can not be accessed */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 17 #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 18 #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 19

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (5.18 kB)
config (123.41 kB)
Download all attachments

2022-12-17 02:25:58

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

Hi Nhat,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on shuah-kselftest/next]
[also build test ERROR on shuah-kselftest/fixes tip/x86/asm akpm-mm/mm-everything linus/master v6.1 next-20221216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
config: arm64-allyesconfig
compiler: aarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 prepare

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from include/uapi/linux/mman.h:5,
from include/linux/fs.h:49,
from arch/arm64/include/asm/elf.h:141,
from include/linux/elf.h:6,
from include/linux/module.h:19,
from include/linux/device/driver.h:21,
from include/linux/device.h:32,
from include/linux/acpi.h:15,
from include/acpi/apei.h:9,
from include/acpi/ghes.h:5,
from include/linux/arm_sdei.h:8,
from arch/arm64/kernel/asm-offsets.c:10:
arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_prot_bits':
>> arch/arm64/include/asm/mman.h:15:24: error: 'VM_ARM64_BTI' undeclared (first use in this function); did you mean 'ARM64_BTI'?
15 | ret |= VM_ARM64_BTI;
| ^~~~~~~~~~~~
| ARM64_BTI
arch/arm64/include/asm/mman.h:15:24: note: each undeclared identifier is reported only once for each function it appears in
>> arch/arm64/include/asm/mman.h:18:24: error: 'VM_MTE' undeclared (first use in this function)
18 | ret |= VM_MTE;
| ^~~~~~
arch/arm64/include/asm/mman.h: In function 'arch_calc_vm_flag_bits':
>> arch/arm64/include/asm/mman.h:32:24: error: 'VM_MTE_ALLOWED' undeclared (first use in this function)
32 | return VM_MTE_ALLOWED;
| ^~~~~~~~~~~~~~
arch/arm64/include/asm/mman.h: In function 'arch_validate_flags':
arch/arm64/include/asm/mman.h:59:29: error: 'VM_MTE' undeclared (first use in this function)
59 | return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
| ^~~~~~
arch/arm64/include/asm/mman.h:59:52: error: 'VM_MTE_ALLOWED' undeclared (first use in this function)
59 | return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
| ^~~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:118: arch/arm64/kernel/asm-offsets.s] Error 1
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:1270: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:231: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +15 arch/arm64/include/asm/mman.h

8ef8f360cf30be Dave Martin 2020-03-16 8
8ef8f360cf30be Dave Martin 2020-03-16 9 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
8ef8f360cf30be Dave Martin 2020-03-16 10 unsigned long pkey __always_unused)
8ef8f360cf30be Dave Martin 2020-03-16 11 {
9f3419315f3cdc Catalin Marinas 2019-11-27 12 unsigned long ret = 0;
9f3419315f3cdc Catalin Marinas 2019-11-27 13
8ef8f360cf30be Dave Martin 2020-03-16 14 if (system_supports_bti() && (prot & PROT_BTI))
9f3419315f3cdc Catalin Marinas 2019-11-27 @15 ret |= VM_ARM64_BTI;
8ef8f360cf30be Dave Martin 2020-03-16 16
9f3419315f3cdc Catalin Marinas 2019-11-27 17 if (system_supports_mte() && (prot & PROT_MTE))
9f3419315f3cdc Catalin Marinas 2019-11-27 @18 ret |= VM_MTE;
9f3419315f3cdc Catalin Marinas 2019-11-27 19
9f3419315f3cdc Catalin Marinas 2019-11-27 20 return ret;
8ef8f360cf30be Dave Martin 2020-03-16 21 }
8ef8f360cf30be Dave Martin 2020-03-16 22 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
8ef8f360cf30be Dave Martin 2020-03-16 23
9f3419315f3cdc Catalin Marinas 2019-11-27 24 static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
9f3419315f3cdc Catalin Marinas 2019-11-27 25 {
9f3419315f3cdc Catalin Marinas 2019-11-27 26 /*
9f3419315f3cdc Catalin Marinas 2019-11-27 27 * Only allow MTE on anonymous mappings as these are guaranteed to be
9f3419315f3cdc Catalin Marinas 2019-11-27 28 * backed by tags-capable memory. The vm_flags may be overridden by a
9f3419315f3cdc Catalin Marinas 2019-11-27 29 * filesystem supporting MTE (RAM-based).
9f3419315f3cdc Catalin Marinas 2019-11-27 30 */
9f3419315f3cdc Catalin Marinas 2019-11-27 31 if (system_supports_mte() && (flags & MAP_ANONYMOUS))
9f3419315f3cdc Catalin Marinas 2019-11-27 @32 return VM_MTE_ALLOWED;
9f3419315f3cdc Catalin Marinas 2019-11-27 33
9f3419315f3cdc Catalin Marinas 2019-11-27 34 return 0;
9f3419315f3cdc Catalin Marinas 2019-11-27 35 }
9f3419315f3cdc Catalin Marinas 2019-11-27 36 #define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
9f3419315f3cdc Catalin Marinas 2019-11-27 37

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (6.58 kB)
config (369.20 kB)
Download all attachments

2022-12-20 14:52:10

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> Implement a new syscall that queries cache state of a file and
> summarizes the number of cached pages, number of dirty pages, number of
> pages marked for writeback, number of (recently) evicted pages, etc. in
> a given range.
>
> NAME
> cachestat - query the page cache status of a file.
>
> SYNOPSIS
> #include <sys/mman.h>
>
> struct cachestat {
> __u64 nr_cache;
> __u64 nr_dirty;
> __u64 nr_writeback;
> __u64 nr_evicted;
> __u64 nr_recently_evicted;
> };
>
> int cachestat(unsigned int fd, off_t off, size_t len,
> size_t cstat_size, struct cachestat *cstat,
> unsigned int flags);
>
> DESCRIPTION
> cachestat() queries the number of cached pages, number of dirty
> pages, number of pages marked for writeback, number of (recently)
> evicted pages, in the bytes range given by `off` and `len`.
>
> These values are returned in a cachestat struct, whose address is
> given by the `cstat` argument.
>
> The `off` and `len` arguments must be non-negative integers. If
> `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> 0, we will query in the range from `off` to the end of the file.
>
> `cstat_size` allows users to obtain partial results. The syscall
> will copy the first `csstat_size` bytes to the specified userspace
> memory. `cstat_size` must be a non-negative value that is no larger
> than the current size of the cachestat struct.
>
> The `flags` argument is unused for now, but is included for future
> extensibility. User should pass 0 (i.e no flag specified).
>
> RETURN VALUE
> On success, cachestat returns 0. On error, -1 is returned, and errno
> is set to indicate the error.
>
> ERRORS
> EFAULT cstat points to an invalid address.
>
> EINVAL invalid `cstat_size` or `flags`
>
> EBADF invalid file descriptor.
>
> Signed-off-by: Nhat Pham <[email protected]>
> ---

Hi Nhat,

Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
Just a couple more random nitty things, if you happen to care about
them..

> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> include/linux/fs.h | 3 +
> include/linux/syscalls.h | 3 +
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/mman.h | 9 ++
> init/Kconfig | 10 ++
> kernel/sys_ni.c | 1 +
> mm/filemap.c | 137 ++++++++++++++++++++
> 20 files changed, 180 insertions(+), 1 deletion(-)
>
...
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 08341616ae7a..29ffb906caa4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
...
> @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> return try_to_free_buffers(folio);
> }
> EXPORT_SYMBOL(filemap_release_folio);
> +
> +#ifdef CONFIG_CACHESTAT
> +/**
> + * filemap_cachestat() - compute the page cache statistics of a mapping
> + * @mapping: The mapping to compute the statistics for.
> + * @first_index: The starting page cache index.
> + * @last_index: The final page index (inclusive).
> + * @cs: the cachestat struct to write the result to.
> + *
> + * This will query the page cache statistics of a mapping in the
> + * page range of [first_index, last_index] (inclusive). THe statistics
> + * queried include: number of dirty pages, number of pages marked for
> + * writeback, and the number of (recently) evicted pages.
> + */
> +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> + pgoff_t last_index, struct cachestat *cs)
> +{

Not sure why the internal helper needs to be wrapped in a config option?
Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
wrap the syscall bits with that..?

I would also think it might make things simpler to split out all the
syscall bits into a separate patch from filemap_cachestat().

> + XA_STATE(xas, &mapping->i_pages, first_index);
> + struct folio *folio;
> +
> + rcu_read_lock();
> + xas_for_each(&xas, folio, last_index) {
> + unsigned long nr_pages;
> + pgoff_t folio_first_index, folio_last_index;
> +
> + if (xas_retry(&xas, folio))
> + continue;
> +
> + nr_pages = folio_nr_pages(folio);
> + folio_first_index = folio_pgoff(folio);
> + folio_last_index = folio_first_index + nr_pages - 1;
> +
> + /* Folios might straddle the range boundaries, only count covered subpages */
> + if (folio_first_index < first_index)
> + nr_pages -= first_index - folio_first_index;
> +
> + if (folio_last_index > last_index)
> + nr_pages -= folio_last_index - last_index;
> +
> + if (xa_is_value(folio)) {
> + /* page is evicted */
> + void *shadow = (void *)folio;
> + bool workingset; /* not used */
> +
> + cs->nr_evicted += nr_pages;
> +
> +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> + if (shmem_mapping(mapping)) {
> + /* shmem file - in swap cache */
> + swp_entry_t swp = radix_to_swp_entry(folio);
> +
> + shadow = get_shadow_from_swap_cache(swp);
> + }
> +#endif
> + if (workingset_test_recent(shadow, true, &workingset))
> + cs->nr_recently_evicted += nr_pages;
> +
> + continue;
> + }
> +
> + /* page is in cache */
> + cs->nr_cache += nr_pages;
> +
> + if (folio_test_dirty(folio))
> + cs->nr_dirty += nr_pages;
> +
> + if (folio_test_writeback(folio))
> + cs->nr_writeback += nr_pages;

I'm not sure this is an issue right now (or if it will ever be for your
use cases), but from a filesystem perspective it is possible to have
large or variable sized folios in cache. At the moment I think the fs
(or XFS+iomap at least) manages dirty/writeback state at folio
granularity, but that may change in the near future if/when iomap
sub-page dirty tracking comes along. I suspect that means it may become
possible to have a large folio of some N number of pages where only a
subset of those pages are actually in dirty/writeback state, and thus
introduces some inaccuracy here because this assumes that folio state
applies to folio_nr_pages() worth of pages. Just something to be aware
of..

Brian

> + }
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_cachestat);
> +
> +/*
> + * The cachestat(5) system call.
> + *
> + * cachestat() returns the page cache statistics of a file in the
> + * bytes specified by `off` and `len`: number of cached pages,
> + * number of dirty pages, number of pages marked for writeback,
> + * number of (recently) evicted pages.
> + *
> + * `off` and `len` must be non-negative integers. If `len` > 0,
> + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> + * we will query in the range from `off` to the end of the file.
> + *
> + * `cstat_size` allows users to obtain partial results. The syscall
> + * will copy the first `csstat_size` bytes to the specified userspace
> + * memory. It also makes the cachestat struct extensible - new fields
> + * can be added in the future without breaking existing usage.
> + * `cstat_size` must be a non-negative value that is no larger than
> + * the current size of the cachestat struct.
> + *
> + * The `flags` argument is unused for now, but is included for future
> + * extensibility. User should pass 0 (i.e no flag specified).
> + *
> + * Because the status of a page can change after cachestat() checks it
> + * but before it returns to the application, the returned values may
> + * contain stale information.
> + *
> + * return values:
> + * zero - success
> + * -EFAULT - cstat points to an illegal address
> + * -EINVAL - invalid arguments
> + * -EBADF - invalid file descriptor
> + */
> +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> + size_t, cstat_size, struct cachestat __user *, cstat,
> + unsigned int, flags)
> +{
> + struct fd f = fdget(fd);
> + struct address_space *mapping;
> + struct cachestat cs;
> + pgoff_t first_index = off >> PAGE_SHIFT;
> + pgoff_t last_index =
> + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> +
> + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> + return -EINVAL;
> +
> + if (!f.file)
> + return -EBADF;
> +
> + memset(&cs, 0, sizeof(struct cachestat));
> + mapping = f.file->f_mapping;
> + filemap_cachestat(mapping, first_index, last_index, &cs);
> + fdput(f);
> +
> + if (copy_to_user(cstat, &cs, cstat_size))
> + return -EFAULT;
> +
> + return 0;
> +}
> +#endif /* CONFIG_CACHESTAT */
> --
> 2.30.2
>

2022-12-22 01:22:52

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <[email protected]> wrote:
>
> On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <[email protected]> wrote:
>
> > Implement a new syscall that queries cache state of a file and
> > summarizes the number of cached pages, number of dirty pages, number of
> > pages marked for writeback, number of (recently) evicted pages, etc. in
> > a given range.
> >
> > NAME
> > cachestat - query the page cache status of a file.
> >
> > SYNOPSIS
> > #include <sys/mman.h>
> >
> > struct cachestat {
> > __u64 nr_cache;
> > __u64 nr_dirty;
> > __u64 nr_writeback;
> > __u64 nr_evicted;
> > __u64 nr_recently_evicted;
> > };
> >
> > int cachestat(unsigned int fd, off_t off, size_t len,
> > size_t cstat_size, struct cachestat *cstat,
> > unsigned int flags);
> >
> > DESCRIPTION
> > cachestat() queries the number of cached pages, number of dirty
> > pages, number of pages marked for writeback, number of (recently)
> > evicted pages, in the bytes range given by `off` and `len`.
>
> I suggest this be spelled out better: "number of evicted and number or
> recently evicted pages".
>
> I suggest this clearly tell readers what an "evicted" page is - they
> aren't kernel programmers!

Valid points - I'll try to explain this more clearly in the future
versions of this patch series, especially in the draft man page.

>
> What is the benefit of the "recently evicted" pages? "recently" seems
> very vague - what use is this to anyone?

This eviction recency semantics comes from the LRU's refault
computation. Users of cachestat might be interested in two very
different questions:

1. How many pages are not resident in the page cache.
2. How many pages are recently evicted (recently enough that
their refault will be construed as memory pressure).

The first question is answered with nr_evicted, whereas the second
is answered with nr_recently_evicted.

I will figure out a way to explain this better in the next version. Users
definitely do not need to know the nitty gritty details of LRU logic,
but they should know the general idea of each field at least.

>
> > These values are returned in a cachestat struct, whose address is
> > given by the `cstat` argument.
> >
> > The `off` and `len` arguments must be non-negative integers. If
> > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > 0, we will query in the range from `off` to the end of the file.
> >
> > `cstat_size` allows users to obtain partial results. The syscall
> > will copy the first `csstat_size` bytes to the specified userspace
> > memory. `cstat_size` must be a non-negative value that is no larger
> > than the current size of the cachestat struct.
> >
> > The `flags` argument is unused for now, but is included for future
> > extensibility. User should pass 0 (i.e no flag specified).
>
> Why is `flags' here? We could add an unused flags arg to any syscall,
> but we don't. What's the plan?

I included this field to ensure that cachestat can be extended safely,
especially when different users might want different things out of it.

For instance, in the future there might be new fields/computations
that are too heavy for certain use cases - a flag could be used to
disable/skip such fields/computations.

Another thing it might be used for is the huge page counting -
we have not implemented this in this version yet, but it might
introduce murky semantics to new/existing fields in struct
cachestat. Or maybe not - but worst case scenario we can
leave this decision to the users to decide through flags.

I'm sure there are more potential pitfalls that the flags could
save us from, but these are the two on top of my head.

>
> Are there security implications? If I know that some process has a
> file open, I can use cachestat() to infer which parts of that file
> they're looking at (like mincore(), I guess). And I can infer which
> parts they're writing to, unlike mincore().

This one, I'm not 100% sure, but it is a valid concern. Let me think
about it and discuss with more security-oriented minds before
responding to this.

>
> I suggest the [patch 1/4] fixup be separated from this series.

Sounds good! I'll loop Johannes in about this breakup as well.

2022-12-22 23:00:59

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <[email protected]> wrote:
>
> On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > Implement a new syscall that queries cache state of a file and
> > summarizes the number of cached pages, number of dirty pages, number of
> > pages marked for writeback, number of (recently) evicted pages, etc. in
> > a given range.
> >
> > NAME
> > cachestat - query the page cache status of a file.
> >
> > SYNOPSIS
> > #include <sys/mman.h>
> >
> > struct cachestat {
> > __u64 nr_cache;
> > __u64 nr_dirty;
> > __u64 nr_writeback;
> > __u64 nr_evicted;
> > __u64 nr_recently_evicted;
> > };
> >
> > int cachestat(unsigned int fd, off_t off, size_t len,
> > size_t cstat_size, struct cachestat *cstat,
> > unsigned int flags);
> >
> > DESCRIPTION
> > cachestat() queries the number of cached pages, number of dirty
> > pages, number of pages marked for writeback, number of (recently)
> > evicted pages, in the bytes range given by `off` and `len`.
> >
> > These values are returned in a cachestat struct, whose address is
> > given by the `cstat` argument.
> >
> > The `off` and `len` arguments must be non-negative integers. If
> > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > 0, we will query in the range from `off` to the end of the file.
> >
> > `cstat_size` allows users to obtain partial results. The syscall
> > will copy the first `csstat_size` bytes to the specified userspace
> > memory. `cstat_size` must be a non-negative value that is no larger
> > than the current size of the cachestat struct.
> >
> > The `flags` argument is unused for now, but is included for future
> > extensibility. User should pass 0 (i.e no flag specified).
> >
> > RETURN VALUE
> > On success, cachestat returns 0. On error, -1 is returned, and errno
> > is set to indicate the error.
> >
> > ERRORS
> > EFAULT cstat points to an invalid address.
> >
> > EINVAL invalid `cstat_size` or `flags`
> >
> > EBADF invalid file descriptor.
> >
> > Signed-off-by: Nhat Pham <[email protected]>
> > ---
>
> Hi Nhat,
>
> Thanks for the tweaks. FWIW, this by and large looks reasonable to me.

Thanks for suggesting the tweaks! I tried refactoring it out and it
did look cleaner to me (and one fewer file that I have to keep
track of...)

> Just a couple more random nitty things, if you happen to care about
> them..
>
> > arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> > arch/arm/tools/syscall.tbl | 1 +
> > arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> > arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> > arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> > arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> > arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> > arch/s390/kernel/syscalls/syscall.tbl | 1 +
> > arch/sh/kernel/syscalls/syscall.tbl | 1 +
> > arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> > include/linux/fs.h | 3 +
> > include/linux/syscalls.h | 3 +
> > include/uapi/asm-generic/unistd.h | 5 +-
> > include/uapi/linux/mman.h | 9 ++
> > init/Kconfig | 10 ++
> > kernel/sys_ni.c | 1 +
> > mm/filemap.c | 137 ++++++++++++++++++++
> > 20 files changed, 180 insertions(+), 1 deletion(-)
> >
> ...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 08341616ae7a..29ffb906caa4 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> ...
> > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> > return try_to_free_buffers(folio);
> > }
> > EXPORT_SYMBOL(filemap_release_folio);
> > +
> > +#ifdef CONFIG_CACHESTAT
> > +/**
> > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > + * @mapping: The mapping to compute the statistics for.
> > + * @first_index: The starting page cache index.
> > + * @last_index: The final page index (inclusive).
> > + * @cs: the cachestat struct to write the result to.
> > + *
> > + * This will query the page cache statistics of a mapping in the
> > + * page range of [first_index, last_index] (inclusive). THe statistics
> > + * queried include: number of dirty pages, number of pages marked for
> > + * writeback, and the number of (recently) evicted pages.
> > + */
> > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > + pgoff_t last_index, struct cachestat *cs)
> > +{
>
> Not sure why the internal helper needs to be wrapped in a config option?
> Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> wrap the syscall bits with that..?

No particularly strong reasons - I was just bundling the two together
because I was not entirely sure if it's interesting or has any use
case outside of the syscall itself.

Do you have something in mind?

>
> I would also think it might make things simpler to split out all the
> syscall bits into a separate patch from filemap_cachestat().

Same as above.

>
> > + XA_STATE(xas, &mapping->i_pages, first_index);
> > + struct folio *folio;
> > +
> > + rcu_read_lock();
> > + xas_for_each(&xas, folio, last_index) {
> > + unsigned long nr_pages;
> > + pgoff_t folio_first_index, folio_last_index;
> > +
> > + if (xas_retry(&xas, folio))
> > + continue;
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + folio_first_index = folio_pgoff(folio);
> > + folio_last_index = folio_first_index + nr_pages - 1;
> > +
> > + /* Folios might straddle the range boundaries, only count covered subpages */
> > + if (folio_first_index < first_index)
> > + nr_pages -= first_index - folio_first_index;
> > +
> > + if (folio_last_index > last_index)
> > + nr_pages -= folio_last_index - last_index;
> > +
> > + if (xa_is_value(folio)) {
> > + /* page is evicted */
> > + void *shadow = (void *)folio;
> > + bool workingset; /* not used */
> > +
> > + cs->nr_evicted += nr_pages;
> > +
> > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > + if (shmem_mapping(mapping)) {
> > + /* shmem file - in swap cache */
> > + swp_entry_t swp = radix_to_swp_entry(folio);
> > +
> > + shadow = get_shadow_from_swap_cache(swp);
> > + }
> > +#endif
> > + if (workingset_test_recent(shadow, true, &workingset))
> > + cs->nr_recently_evicted += nr_pages;
> > +
> > + continue;
> > + }
> > +
> > + /* page is in cache */
> > + cs->nr_cache += nr_pages;
> > +
> > + if (folio_test_dirty(folio))
> > + cs->nr_dirty += nr_pages;
> > +
> > + if (folio_test_writeback(folio))
> > + cs->nr_writeback += nr_pages;
>
> I'm not sure this is an issue right now (or if it will ever be for your
> use cases), but from a filesystem perspective it is possible to have
> large or variable sized folios in cache. At the moment I think the fs
> (or XFS+iomap at least) manages dirty/writeback state at folio
> granularity, but that may change in the near future if/when iomap
> sub-page dirty tracking comes along. I suspect that means it may become
> possible to have a large folio of some N number of pages where only a
> subset of those pages are actually in dirty/writeback state, and thus
> introduces some inaccuracy here because this assumes that folio state
> applies to folio_nr_pages() worth of pages. Just something to be aware
> of..
>
> Brian

Oof, I coded this with the mental framework of folio-as-a-unit, and
assumed that the dirty/writeback state is managed at the granularity of folio.
Thanks for bringing this up Brian! I'll have to watch out for this as iomap
evolves (and the subpage tracking becomes a thing).

>
> > + }
> > + rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL(filemap_cachestat);
> > +
> > +/*
> > + * The cachestat(5) system call.
> > + *
> > + * cachestat() returns the page cache statistics of a file in the
> > + * bytes specified by `off` and `len`: number of cached pages,
> > + * number of dirty pages, number of pages marked for writeback,
> > + * number of (recently) evicted pages.
> > + *
> > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > + * we will query in the range from `off` to the end of the file.
> > + *
> > + * `cstat_size` allows users to obtain partial results. The syscall
> > + * will copy the first `csstat_size` bytes to the specified userspace
> > + * memory. It also makes the cachestat struct extensible - new fields
> > + * can be added in the future without breaking existing usage.
> > + * `cstat_size` must be a non-negative value that is no larger than
> > + * the current size of the cachestat struct.
> > + *
> > + * The `flags` argument is unused for now, but is included for future
> > + * extensibility. User should pass 0 (i.e no flag specified).
> > + *
> > + * Because the status of a page can change after cachestat() checks it
> > + * but before it returns to the application, the returned values may
> > + * contain stale information.
> > + *
> > + * return values:
> > + * zero - success
> > + * -EFAULT - cstat points to an illegal address
> > + * -EINVAL - invalid arguments
> > + * -EBADF - invalid file descriptor
> > + */
> > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > + size_t, cstat_size, struct cachestat __user *, cstat,
> > + unsigned int, flags)
> > +{
> > + struct fd f = fdget(fd);
> > + struct address_space *mapping;
> > + struct cachestat cs;
> > + pgoff_t first_index = off >> PAGE_SHIFT;
> > + pgoff_t last_index =
> > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > +
> > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > + return -EINVAL;
> > +
> > + if (!f.file)
> > + return -EBADF;
> > +
> > + memset(&cs, 0, sizeof(struct cachestat));
> > + mapping = f.file->f_mapping;
> > + filemap_cachestat(mapping, first_index, last_index, &cs);
> > + fdput(f);
> > +
> > + if (copy_to_user(cstat, &cs, cstat_size))
> > + return -EFAULT;
> > +
> > + return 0;
> > +}
> > +#endif /* CONFIG_CACHESTAT */
> > --
> > 2.30.2
> >
>

2022-12-31 09:34:10

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Fri, Dec 16, 2022 at 2:27 PM kernel test robot <[email protected]> wrote:
>
> Hi Nhat,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on shuah-kselftest/next]
> [also build test ERROR on akpm-mm/mm-everything linus/master v6.1 next-20221216]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
> base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
> patch link: https://lore.kernel.org/r/20221216192149.3902877-4-nphamcs%40gmail.com
> patch subject: [PATCH v4 3/4] cachestat: implement cachestat syscall
> config: s390-allmodconfig
> compiler: s390-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # https://github.com/intel-lab-lkp/linux/commit/85fdb04584b3d9a3b90d2c63794ec65e8d996b18
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Nhat-Pham/cachestat-a-new-syscall-for-page-cache-state-of-files/20221217-032254
> git checkout 85fdb04584b3d9a3b90d2c63794ec65e8d996b18
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
>
> All error/warnings (new ones prefixed by >>):
>
> In file included from include/uapi/asm-generic/mman.h:5,
> from ./arch/s390/include/generated/uapi/asm/mman.h:1,
> from include/uapi/linux/mman.h:5,
> from include/linux/fs.h:49,
> from include/linux/huge_mm.h:8,
> from include/linux/mm.h:737,
> from include/linux/kvm_host.h:16,
> from arch/s390/kvm/kvm-s390.h:17,
> from arch/s390/kvm/gaccess.c:16:
> >> include/uapi/asm-generic/mman-common.h:16:25: error: expected identifier before numeric constant
> 16 | #define PROT_NONE 0x0 /* page can not be accessed */
> | ^~~
> arch/s390/kvm/gaccess.c:493:9: note: in expansion of macro 'PROT_NONE'
> 493 | PROT_NONE,
> | ^~~~~~~~~
> arch/s390/kvm/gaccess.c: In function 'trans_exc_ending':
> >> arch/s390/kvm/gaccess.c:516:17: error: duplicate case value
> 516 | case PROT_TYPE_LA:
> | ^~~~
> arch/s390/kvm/gaccess.c:509:17: note: previously used here
> 509 | case PROT_NONE:
> | ^~~~
> In file included from include/linux/compiler_types.h:79,
> from <command-line>:
> >> include/linux/compiler_attributes.h:223:41: warning: attribute 'fallthrough' not preceding a case label or default label
> 223 | # define fallthrough __attribute__((__fallthrough__))
> | ^~~~~~~~~~~~~
> arch/s390/kvm/gaccess.c:515:25: note: in expansion of macro 'fallthrough'
> 515 | fallthrough;
> | ^~~~~~~~~~~
>
>
> vim +16 include/uapi/asm-generic/mman-common.h
>
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 9
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 10 #define PROT_READ 0x1 /* page can be read */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 11 #define PROT_WRITE 0x2 /* page can be written */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 12 #define PROT_EXEC 0x4 /* page can be executed */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 13 #define PROT_SEM 0x8 /* page may be used for atomic ops */
> d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 14 /* 0x10 reserved for arch-specific use */
> d41938d2cbee92 include/uapi/asm-generic/mman-common.h Dave Martin 2019-12-11 15 /* 0x20 reserved for arch-specific use */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 @16 #define PROT_NONE 0x0 /* page can not be accessed */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 17 #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 18 #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
> 5f6164f3092832 include/asm-generic/mman.h Michael S. Tsirkin 2006-02-15 19
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp

These build errors (and the ones reported by the later kernel test robot emails)
seem to come from my refactoring, which adds:

#include <uapi/linux/mman.h>

to include/linux/fs.h. Several other files (which chain-include fs.h)
also define
several same constants elsewhere. This include seems unnecessary, as
I can just replace it with:

struct cachestat;

This should be fixed in the next version of the patch series.

2023-01-03 19:00:44

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote:
> On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <[email protected]> wrote:
> >
> > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > > Implement a new syscall that queries cache state of a file and
> > > summarizes the number of cached pages, number of dirty pages, number of
> > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > a given range.
> > >
> > > NAME
> > > cachestat - query the page cache status of a file.
> > >
> > > SYNOPSIS
> > > #include <sys/mman.h>
> > >
> > > struct cachestat {
> > > __u64 nr_cache;
> > > __u64 nr_dirty;
> > > __u64 nr_writeback;
> > > __u64 nr_evicted;
> > > __u64 nr_recently_evicted;
> > > };
> > >
> > > int cachestat(unsigned int fd, off_t off, size_t len,
> > > size_t cstat_size, struct cachestat *cstat,
> > > unsigned int flags);
> > >
> > > DESCRIPTION
> > > cachestat() queries the number of cached pages, number of dirty
> > > pages, number of pages marked for writeback, number of (recently)
> > > evicted pages, in the bytes range given by `off` and `len`.
> > >
> > > These values are returned in a cachestat struct, whose address is
> > > given by the `cstat` argument.
> > >
> > > The `off` and `len` arguments must be non-negative integers. If
> > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > > 0, we will query in the range from `off` to the end of the file.
> > >
> > > `cstat_size` allows users to obtain partial results. The syscall
> > > will copy the first `csstat_size` bytes to the specified userspace
> > > memory. `cstat_size` must be a non-negative value that is no larger
> > > than the current size of the cachestat struct.
> > >
> > > The `flags` argument is unused for now, but is included for future
> > > extensibility. User should pass 0 (i.e no flag specified).
> > >
> > > RETURN VALUE
> > > On success, cachestat returns 0. On error, -1 is returned, and errno
> > > is set to indicate the error.
> > >
> > > ERRORS
> > > EFAULT cstat points to an invalid address.
> > >
> > > EINVAL invalid `cstat_size` or `flags`
> > >
> > > EBADF invalid file descriptor.
> > >
> > > Signed-off-by: Nhat Pham <[email protected]>
> > > ---
> >
> > Hi Nhat,
> >
> > Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
>
> Thanks for suggesting the tweaks! I tried refactoring it out and it
> did look cleaner to me (and one fewer file that I have to keep
> track of...)
>
> > Just a couple more random nitty things, if you happen to care about
> > them..
> >
> > > arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> > > arch/arm/tools/syscall.tbl | 1 +
> > > arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> > > arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> > > arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> > > arch/s390/kernel/syscalls/syscall.tbl | 1 +
> > > arch/sh/kernel/syscalls/syscall.tbl | 1 +
> > > arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> > > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> > > include/linux/fs.h | 3 +
> > > include/linux/syscalls.h | 3 +
> > > include/uapi/asm-generic/unistd.h | 5 +-
> > > include/uapi/linux/mman.h | 9 ++
> > > init/Kconfig | 10 ++
> > > kernel/sys_ni.c | 1 +
> > > mm/filemap.c | 137 ++++++++++++++++++++
> > > 20 files changed, 180 insertions(+), 1 deletion(-)
> > >
> > ...
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..29ffb906caa4 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > ...
> > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> > > return try_to_free_buffers(folio);
> > > }
> > > EXPORT_SYMBOL(filemap_release_folio);
> > > +
> > > +#ifdef CONFIG_CACHESTAT
> > > +/**
> > > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > > + * @mapping: The mapping to compute the statistics for.
> > > + * @first_index: The starting page cache index.
> > > + * @last_index: The final page index (inclusive).
> > > + * @cs: the cachestat struct to write the result to.
> > > + *
> > > + * This will query the page cache statistics of a mapping in the
> > > + * page range of [first_index, last_index] (inclusive). THe statistics
> > > + * queried include: number of dirty pages, number of pages marked for
> > > + * writeback, and the number of (recently) evicted pages.
> > > + */
> > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > > + pgoff_t last_index, struct cachestat *cs)
> > > +{
> >
> > Not sure why the internal helper needs to be wrapped in a config option?
> > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> > wrap the syscall bits with that..?
>
> No particularly strong reasons - I was just bundling the two together
> because I was not entirely sure if it's interesting or has any use
> case outside of the syscall itself.
>
> Do you have something in mind?
>

Not immediately, though it looks like a straightforward and potentially
useful helper. It wouldn't surprise me if it grew more in-kernel users
eventually. :)

That said, I only suggested this for cleanup/consistency reasons. I.e.,
it seems there is plenty of precedent for CONFIG_*_SYSCALL config
options, and the one or two I happened to peek at looked like they only
wrapped the syscall bits as opposed to the entire implementation (which
also seems like a clean and elegant approach to me). Somebody could
easily come along and make that change later if/when they might want to
use the helper (though it might be annoying to change the name of the
config option), so I have no strong opinion on either suggestion.

Brian

> >
> > I would also think it might make things simpler to split out all the
> > syscall bits into a separate patch from filemap_cachestat().
>
> Same as above.
>
> >
> > > + XA_STATE(xas, &mapping->i_pages, first_index);
> > > + struct folio *folio;
> > > +
> > > + rcu_read_lock();
> > > + xas_for_each(&xas, folio, last_index) {
> > > + unsigned long nr_pages;
> > > + pgoff_t folio_first_index, folio_last_index;
> > > +
> > > + if (xas_retry(&xas, folio))
> > > + continue;
> > > +
> > > + nr_pages = folio_nr_pages(folio);
> > > + folio_first_index = folio_pgoff(folio);
> > > + folio_last_index = folio_first_index + nr_pages - 1;
> > > +
> > > + /* Folios might straddle the range boundaries, only count covered subpages */
> > > + if (folio_first_index < first_index)
> > > + nr_pages -= first_index - folio_first_index;
> > > +
> > > + if (folio_last_index > last_index)
> > > + nr_pages -= folio_last_index - last_index;
> > > +
> > > + if (xa_is_value(folio)) {
> > > + /* page is evicted */
> > > + void *shadow = (void *)folio;
> > > + bool workingset; /* not used */
> > > +
> > > + cs->nr_evicted += nr_pages;
> > > +
> > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > > + if (shmem_mapping(mapping)) {
> > > + /* shmem file - in swap cache */
> > > + swp_entry_t swp = radix_to_swp_entry(folio);
> > > +
> > > + shadow = get_shadow_from_swap_cache(swp);
> > > + }
> > > +#endif
> > > + if (workingset_test_recent(shadow, true, &workingset))
> > > + cs->nr_recently_evicted += nr_pages;
> > > +
> > > + continue;
> > > + }
> > > +
> > > + /* page is in cache */
> > > + cs->nr_cache += nr_pages;
> > > +
> > > + if (folio_test_dirty(folio))
> > > + cs->nr_dirty += nr_pages;
> > > +
> > > + if (folio_test_writeback(folio))
> > > + cs->nr_writeback += nr_pages;
> >
> > I'm not sure this is an issue right now (or if it will ever be for your
> > use cases), but from a filesystem perspective it is possible to have
> > large or variable sized folios in cache. At the moment I think the fs
> > (or XFS+iomap at least) manages dirty/writeback state at folio
> > granularity, but that may change in the near future if/when iomap
> > sub-page dirty tracking comes along. I suspect that means it may become
> > possible to have a large folio of some N number of pages where only a
> > subset of those pages are actually in dirty/writeback state, and thus
> > introduces some inaccuracy here because this assumes that folio state
> > applies to folio_nr_pages() worth of pages. Just something to be aware
> > of..
> >
> > Brian
>
> Oof, I coded this with the mental framework of folio-as-a-unit, and
> assumed that the dirty/writeback state is managed at the granularity of folio.
> Thanks for bringing this up Brian! I'll have to watch out for this as iomap
> evolves (and the subpage tracking becomes a thing).
>
> >
> > > + }
> > > + rcu_read_unlock();
> > > +}
> > > +EXPORT_SYMBOL(filemap_cachestat);
> > > +
> > > +/*
> > > + * The cachestat(5) system call.
> > > + *
> > > + * cachestat() returns the page cache statistics of a file in the
> > > + * bytes specified by `off` and `len`: number of cached pages,
> > > + * number of dirty pages, number of pages marked for writeback,
> > > + * number of (recently) evicted pages.
> > > + *
> > > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > > + * we will query in the range from `off` to the end of the file.
> > > + *
> > > + * `cstat_size` allows users to obtain partial results. The syscall
> > > + * will copy the first `csstat_size` bytes to the specified userspace
> > > + * memory. It also makes the cachestat struct extensible - new fields
> > > + * can be added in the future without breaking existing usage.
> > > + * `cstat_size` must be a non-negative value that is no larger than
> > > + * the current size of the cachestat struct.
> > > + *
> > > + * The `flags` argument is unused for now, but is included for future
> > > + * extensibility. User should pass 0 (i.e no flag specified).
> > > + *
> > > + * Because the status of a page can change after cachestat() checks it
> > > + * but before it returns to the application, the returned values may
> > > + * contain stale information.
> > > + *
> > > + * return values:
> > > + * zero - success
> > > + * -EFAULT - cstat points to an illegal address
> > > + * -EINVAL - invalid arguments
> > > + * -EBADF - invalid file descriptor
> > > + */
> > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > > + size_t, cstat_size, struct cachestat __user *, cstat,
> > > + unsigned int, flags)
> > > +{
> > > + struct fd f = fdget(fd);
> > > + struct address_space *mapping;
> > > + struct cachestat cs;
> > > + pgoff_t first_index = off >> PAGE_SHIFT;
> > > + pgoff_t last_index =
> > > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > > +
> > > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > > + return -EINVAL;
> > > +
> > > + if (!f.file)
> > > + return -EBADF;
> > > +
> > > + memset(&cs, 0, sizeof(struct cachestat));
> > > + mapping = f.file->f_mapping;
> > > + filemap_cachestat(mapping, first_index, last_index, &cs);
> > > + fdput(f);
> > > +
> > > + if (copy_to_user(cstat, &cs, cstat_size))
> > > + return -EFAULT;
> > > +
> > > + return 0;
> > > +}
> > > +#endif /* CONFIG_CACHESTAT */
> > > --
> > > 2.30.2
> > >
> >
>

2023-01-04 14:37:46

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Tue, Jan 3, 2023 at 10:42 AM Brian Foster <[email protected]> wrote:
>
> On Thu, Dec 22, 2022 at 01:50:19PM -0800, Nhat Pham wrote:
> > On Tue, Dec 20, 2022 at 6:37 AM Brian Foster <[email protected]> wrote:
> > >
> > > On Fri, Dec 16, 2022 at 11:21:48AM -0800, Nhat Pham wrote:
> > > > Implement a new syscall that queries cache state of a file and
> > > > summarizes the number of cached pages, number of dirty pages, number of
> > > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > > a given range.
> > > >
> > > > NAME
> > > > cachestat - query the page cache status of a file.
> > > >
> > > > SYNOPSIS
> > > > #include <sys/mman.h>
> > > >
> > > > struct cachestat {
> > > > __u64 nr_cache;
> > > > __u64 nr_dirty;
> > > > __u64 nr_writeback;
> > > > __u64 nr_evicted;
> > > > __u64 nr_recently_evicted;
> > > > };
> > > >
> > > > int cachestat(unsigned int fd, off_t off, size_t len,
> > > > size_t cstat_size, struct cachestat *cstat,
> > > > unsigned int flags);
> > > >
> > > > DESCRIPTION
> > > > cachestat() queries the number of cached pages, number of dirty
> > > > pages, number of pages marked for writeback, number of (recently)
> > > > evicted pages, in the bytes range given by `off` and `len`.
> > > >
> > > > These values are returned in a cachestat struct, whose address is
> > > > given by the `cstat` argument.
> > > >
> > > > The `off` and `len` arguments must be non-negative integers. If
> > > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > > > 0, we will query in the range from `off` to the end of the file.
> > > >
> > > > `cstat_size` allows users to obtain partial results. The syscall
> > > > will copy the first `csstat_size` bytes to the specified userspace
> > > > memory. `cstat_size` must be a non-negative value that is no larger
> > > > than the current size of the cachestat struct.
> > > >
> > > > The `flags` argument is unused for now, but is included for future
> > > > extensibility. User should pass 0 (i.e no flag specified).
> > > >
> > > > RETURN VALUE
> > > > On success, cachestat returns 0. On error, -1 is returned, and errno
> > > > is set to indicate the error.
> > > >
> > > > ERRORS
> > > > EFAULT cstat points to an invalid address.
> > > >
> > > > EINVAL invalid `cstat_size` or `flags`
> > > >
> > > > EBADF invalid file descriptor.
> > > >
> > > > Signed-off-by: Nhat Pham <[email protected]>
> > > > ---
> > >
> > > Hi Nhat,
> > >
> > > Thanks for the tweaks. FWIW, this by and large looks reasonable to me.
> >
> > Thanks for suggesting the tweaks! I tried refactoring it out and it
> > did look cleaner to me (and one fewer file that I have to keep
> > track of...)
> >
> > > Just a couple more random nitty things, if you happen to care about
> > > them..
> > >
> > > > arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/arm/tools/syscall.tbl | 1 +
> > > > arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/s390/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/sh/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> > > > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > > > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> > > > include/linux/fs.h | 3 +
> > > > include/linux/syscalls.h | 3 +
> > > > include/uapi/asm-generic/unistd.h | 5 +-
> > > > include/uapi/linux/mman.h | 9 ++
> > > > init/Kconfig | 10 ++
> > > > kernel/sys_ni.c | 1 +
> > > > mm/filemap.c | 137 ++++++++++++++++++++
> > > > 20 files changed, 180 insertions(+), 1 deletion(-)
> > > >
> > > ...
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 08341616ae7a..29ffb906caa4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > ...
> > > > @@ -3949,3 +3953,136 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
> > > > return try_to_free_buffers(folio);
> > > > }
> > > > EXPORT_SYMBOL(filemap_release_folio);
> > > > +
> > > > +#ifdef CONFIG_CACHESTAT
> > > > +/**
> > > > + * filemap_cachestat() - compute the page cache statistics of a mapping
> > > > + * @mapping: The mapping to compute the statistics for.
> > > > + * @first_index: The starting page cache index.
> > > > + * @last_index: The final page index (inclusive).
> > > > + * @cs: the cachestat struct to write the result to.
> > > > + *
> > > > + * This will query the page cache statistics of a mapping in the
> > > > + * page range of [first_index, last_index] (inclusive). THe statistics
> > > > + * queried include: number of dirty pages, number of pages marked for
> > > > + * writeback, and the number of (recently) evicted pages.
> > > > + */
> > > > +void filemap_cachestat(struct address_space *mapping, pgoff_t first_index,
> > > > + pgoff_t last_index, struct cachestat *cs)
> > > > +{
> > >
> > > Not sure why the internal helper needs to be wrapped in a config option?
> > > Perhaps it makes more sense to define CONFIG_CACHESTAT_SYSCALL and just
> > > wrap the syscall bits with that..?
> >
> > No particularly strong reasons - I was just bundling the two together
> > because I was not entirely sure if it's interesting or has any use
> > case outside of the syscall itself.
> >
> > Do you have something in mind?
> >
>
> Not immediately, though it looks like a straightforward and potentially
> useful helper. It wouldn't surprise me if it grew more in-kernel users
> eventually. :)

Sounds good! I'll move it out in the next version, unless there is some
unexpected issue with that.

> That said, I only suggested this for cleanup/consistency reasons. I.e.,
> it seems there is plenty of precedent for CONFIG_*_SYSCALL config
> options, and the one or two I happened to peek at looked like they only
> wrapped the syscall bits as opposed to the entire implementation (which
> also seems like a clean and elegant approach to me). Somebody could
> easily come along and make that change later if/when they might want to
> use the helper (though it might be annoying to change the name of the
> config option), so I have no strong opinion on either suggestion.
>
> Brian
>
> > >
> > > I would also think it might make things simpler to split out all the
> > > syscall bits into a separate patch from filemap_cachestat().
> >
> > Same as above.
> >
> > >
> > > > + XA_STATE(xas, &mapping->i_pages, first_index);
> > > > + struct folio *folio;
> > > > +
> > > > + rcu_read_lock();
> > > > + xas_for_each(&xas, folio, last_index) {
> > > > + unsigned long nr_pages;
> > > > + pgoff_t folio_first_index, folio_last_index;
> > > > +
> > > > + if (xas_retry(&xas, folio))
> > > > + continue;
> > > > +
> > > > + nr_pages = folio_nr_pages(folio);
> > > > + folio_first_index = folio_pgoff(folio);
> > > > + folio_last_index = folio_first_index + nr_pages - 1;
> > > > +
> > > > + /* Folios might straddle the range boundaries, only count covered subpages */
> > > > + if (folio_first_index < first_index)
> > > > + nr_pages -= first_index - folio_first_index;
> > > > +
> > > > + if (folio_last_index > last_index)
> > > > + nr_pages -= folio_last_index - last_index;
> > > > +
> > > > + if (xa_is_value(folio)) {
> > > > + /* page is evicted */
> > > > + void *shadow = (void *)folio;
> > > > + bool workingset; /* not used */
> > > > +
> > > > + cs->nr_evicted += nr_pages;
> > > > +
> > > > +#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
> > > > + if (shmem_mapping(mapping)) {
> > > > + /* shmem file - in swap cache */
> > > > + swp_entry_t swp = radix_to_swp_entry(folio);
> > > > +
> > > > + shadow = get_shadow_from_swap_cache(swp);
> > > > + }
> > > > +#endif
> > > > + if (workingset_test_recent(shadow, true, &workingset))
> > > > + cs->nr_recently_evicted += nr_pages;
> > > > +
> > > > + continue;
> > > > + }
> > > > +
> > > > + /* page is in cache */
> > > > + cs->nr_cache += nr_pages;
> > > > +
> > > > + if (folio_test_dirty(folio))
> > > > + cs->nr_dirty += nr_pages;
> > > > +
> > > > + if (folio_test_writeback(folio))
> > > > + cs->nr_writeback += nr_pages;
> > >
> > > I'm not sure this is an issue right now (or if it will ever be for your
> > > use cases), but from a filesystem perspective it is possible to have
> > > large or variable sized folios in cache. At the moment I think the fs
> > > (or XFS+iomap at least) manages dirty/writeback state at folio
> > > granularity, but that may change in the near future if/when iomap
> > > sub-page dirty tracking comes along. I suspect that means it may become
> > > possible to have a large folio of some N number of pages where only a
> > > subset of those pages are actually in dirty/writeback state, and thus
> > > introduces some inaccuracy here because this assumes that folio state
> > > applies to folio_nr_pages() worth of pages. Just something to be aware
> > > of..
> > >
> > > Brian
> >
> > Oof, I coded this with the mental framework of folio-as-a-unit, and
> > assumed that the dirty/writeback state is managed at the granularity of folio.
> > Thanks for bringing this up Brian! I'll have to watch out for this as iomap
> > evolves (and the subpage tracking becomes a thing).
> >
> > >
> > > > + }
> > > > + rcu_read_unlock();
> > > > +}
> > > > +EXPORT_SYMBOL(filemap_cachestat);
> > > > +
> > > > +/*
> > > > + * The cachestat(5) system call.
> > > > + *
> > > > + * cachestat() returns the page cache statistics of a file in the
> > > > + * bytes specified by `off` and `len`: number of cached pages,
> > > > + * number of dirty pages, number of pages marked for writeback,
> > > > + * number of (recently) evicted pages.
> > > > + *
> > > > + * `off` and `len` must be non-negative integers. If `len` > 0,
> > > > + * the queried range is [`off`, `off` + `len`]. If `len` == 0,
> > > > + * we will query in the range from `off` to the end of the file.
> > > > + *
> > > > + * `cstat_size` allows users to obtain partial results. The syscall
> > > > + * will copy the first `csstat_size` bytes to the specified userspace
> > > > + * memory. It also makes the cachestat struct extensible - new fields
> > > > + * can be added in the future without breaking existing usage.
> > > > + * `cstat_size` must be a non-negative value that is no larger than
> > > > + * the current size of the cachestat struct.
> > > > + *
> > > > + * The `flags` argument is unused for now, but is included for future
> > > > + * extensibility. User should pass 0 (i.e no flag specified).
> > > > + *
> > > > + * Because the status of a page can change after cachestat() checks it
> > > > + * but before it returns to the application, the returned values may
> > > > + * contain stale information.
> > > > + *
> > > > + * return values:
> > > > + * zero - success
> > > > + * -EFAULT - cstat points to an illegal address
> > > > + * -EINVAL - invalid arguments
> > > > + * -EBADF - invalid file descriptor
> > > > + */
> > > > +SYSCALL_DEFINE6(cachestat, unsigned int, fd, off_t, off, size_t, len,
> > > > + size_t, cstat_size, struct cachestat __user *, cstat,
> > > > + unsigned int, flags)
> > > > +{
> > > > + struct fd f = fdget(fd);
> > > > + struct address_space *mapping;
> > > > + struct cachestat cs;
> > > > + pgoff_t first_index = off >> PAGE_SHIFT;
> > > > + pgoff_t last_index =
> > > > + len == 0 ? ULONG_MAX : (off + len - 1) >> PAGE_SHIFT;
> > > > +
> > > > + if (off < 0 || cstat_size > sizeof(struct cachestat) || flags != 0)
> > > > + return -EINVAL;
> > > > +
> > > > + if (!f.file)
> > > > + return -EBADF;
> > > > +
> > > > + memset(&cs, 0, sizeof(struct cachestat));
> > > > + mapping = f.file->f_mapping;
> > > > + filemap_cachestat(mapping, first_index, last_index, &cs);
> > > > + fdput(f);
> > > > +
> > > > + if (copy_to_user(cstat, &cs, cstat_size))
> > > > + return -EFAULT;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +#endif /* CONFIG_CACHESTAT */
> > > > --
> > > > 2.30.2
> > > >
> > >
> >
>

2023-01-11 18:35:13

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] cachestat: implement cachestat syscall

On Wed, Dec 21, 2022 at 4:37 PM Nhat Pham <[email protected]> wrote:
>
> On Fri, Dec 16, 2022 at 1:48 PM Andrew Morton <[email protected]> wrote:
> >
> > On Fri, 16 Dec 2022 11:21:48 -0800 Nhat Pham <[email protected]> wrote:
> >
> > > Implement a new syscall that queries cache state of a file and
> > > summarizes the number of cached pages, number of dirty pages, number of
> > > pages marked for writeback, number of (recently) evicted pages, etc. in
> > > a given range.
> > >
> > > NAME
> > > cachestat - query the page cache status of a file.
> > >
> > > SYNOPSIS
> > > #include <sys/mman.h>
> > >
> > > struct cachestat {
> > > __u64 nr_cache;
> > > __u64 nr_dirty;
> > > __u64 nr_writeback;
> > > __u64 nr_evicted;
> > > __u64 nr_recently_evicted;
> > > };
> > >
> > > int cachestat(unsigned int fd, off_t off, size_t len,
> > > size_t cstat_size, struct cachestat *cstat,
> > > unsigned int flags);
> > >
> > > DESCRIPTION
> > > cachestat() queries the number of cached pages, number of dirty
> > > pages, number of pages marked for writeback, number of (recently)
> > > evicted pages, in the bytes range given by `off` and `len`.
> >
> > I suggest this be spelled out better: "number of evicted and number or
> > recently evicted pages".
> >
> > I suggest this clearly tell readers what an "evicted" page is - they
> > aren't kernel programmers!
>
> Valid points - I'll try to explain this more clearly in the future
> versions of this patch series, especially in the draft man page.
>
> >
> > What is the benefit of the "recently evicted" pages? "recently" seems
> > very vague - what use is this to anyone?
>
> This eviction recency semantics comes from the LRU's refault
> computation. Users of cachestat might be interested in two very
> different questions:
>
> 1. How many pages are not resident in the page cache.
> 2. How many pages are recently evicted (recently enough that
> their refault will be construed as memory pressure).
>
> The first question is answered with nr_evicted, whereas the second
> is answered with nr_recently_evicted.
>
> I will figure out a way to explain this better in the next version. Users
> definitely do not need to know the nitty gritty details of LRU logic,
> but they should know the general idea of each field at least.
>
> >
> > > These values are returned in a cachestat struct, whose address is
> > > given by the `cstat` argument.
> > >
> > > The `off` and `len` arguments must be non-negative integers. If
> > > `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> > > 0, we will query in the range from `off` to the end of the file.
> > >
> > > `cstat_size` allows users to obtain partial results. The syscall
> > > will copy the first `csstat_size` bytes to the specified userspace
> > > memory. `cstat_size` must be a non-negative value that is no larger
> > > than the current size of the cachestat struct.
> > >
> > > The `flags` argument is unused for now, but is included for future
> > > extensibility. User should pass 0 (i.e no flag specified).
> >
> > Why is `flags' here? We could add an unused flags arg to any syscall,
> > but we don't. What's the plan?
>
> I included this field to ensure that cachestat can be extended safely,
> especially when different users might want different things out of it.
>
> For instance, in the future there might be new fields/computations
> that are too heavy for certain use cases - a flag could be used to
> disable/skip such fields/computations.
>
> Another thing it might be used for is the huge page counting -
> we have not implemented this in this version yet, but it might
> introduce murky semantics to new/existing fields in struct
> cachestat. Or maybe not - but worst case scenario we can
> leave this decision to the users to decide through flags.
>
> I'm sure there are more potential pitfalls that the flags could
> save us from, but these are the two on top of my head.
>
> >
> > Are there security implications? If I know that some process has a
> > file open, I can use cachestat() to infer which parts of that file
> > they're looking at (like mincore(), I guess). And I can infer which
> > parts they're writing to, unlike mincore().
>
> This one, I'm not 100% sure, but it is a valid concern. Let me think
> about it and discuss with more security-oriented minds before
> responding to this.

Hmm I've given it some more thought. The syscall does not seem to
expose any extra security issue, given that the user already has
read permission to the file itself (since they have an fd to that file).
This means that the user can already know the underlying data in its
entirety, which seems like much more information (and as a result,
security risk) than the cache status itself.

Do you have something concrete in mind that I might have missed?

>
> >
> > I suggest the [patch 1/4] fixup be separated from this series.
>
> Sounds good! I'll loop Johannes in about this breakup as well.