2020-10-07 01:46:05

by Jerome Glisse

Subject: [PATCH 00/14] Small step toward KSM for file back page.

From: Jérôme Glisse <[email protected]>

This patchset is a step toward a larger objective: generalizing the
existing KSM code into a mechanism allowing exclusive write control
for a page, whether anonymous memory (like KSM today) or a file-backed
page (modulo GUP, which would block such control just as it does for
KSM today).

Exclusive write control of a page allows multiple different features
to be implemented:

- KSM kernel shared page, ie de-duplicating pages with the same
  content so that a single read-only page backs them all. We have
  that today for anonymous memory only. The overall patchset extends
  it to file-backed pages, ie sharing the same struct page across
  different files or within the same file. This can be useful for
  containers for instance, or for deduplication within a single file.

- NUMA duplication, ie duplicating a page into multiple local
  read-only copies. This is the opposite of KSM in a sense: instead
  of saving memory, it uses more memory to get better memory access
  performance. For instance, duplicating libc code into a per-node
  copy, or duplicating a big read-only dataset on each node.

- Exclusive write access, ie the owner of the page write protection
  is the only one that can write to the page (and must still abide by
  fs rules for file-backed pages with respect to writeback...). One
  use case is fast atomic operations using non-atomic instructions.
  For instance, with a PCIe device: if all mappings of the page are
  read-only, then the PCIe device driver knows a device write can not
  race with a CPU write. This is a performance optimization.

- Use main memory as a cache for persistent memory, ie the page is
  read-only and a write triggers a callback, so different strategies
  can be used, like write combining (ie accumulating changes in main
  memory before copying them to persistent memory).

Like KSM today, such protection can be broken at _any_ time. The
owner of the protection gets a callback (the KSM code, for instance,
gets called) so that it can unprotect the page. Breaking protection
should not block and must happen quickly (as in the KSM code today).
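
To make that contract concrete, here is a minimal sketch of what the
owner-side callback could look like; the names below are hypothetical
and not part of this series:

    /*
     * Hypothetical sketch of the protection-owner contract described
     * above; these names are illustrative only.
     */
    struct pxa_protection_owner {
            /*
             * Called when any write path (write fault, write
             * syscall, GUP, ...) needs to write to a protected
             * page. Must not block and must unprotect the page
             * quickly, like KSM's copy-on-write breaking today.
             */
            void (*break_protection)(struct page *page);
    };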


Converting the existing KSM code into a generic mechanism is
straightforward for anonymous pages (just factor the KSM code that
deals with page protection out of the KSM code that deals with
de-duplication).


The big change here is the support for file-backed pages. The key
observation is that we almost always have the mapping a page belongs
to within the call stack, as we operate on such a page either from:
- Syscall/kernel against a file (file -> inode -> mapping).
- Syscall/kernel against a virtual address (vma -> file -> mapping).
- Writeback for a given mapping (mapping -> pages).

There are a few exceptions:
- Reclaim, but reclaim does not care about the mapping. Reclaim wants
  to unmap a page to free it up. So all we have to do is provide a
  special path to do that, just like KSM does today for anonymous
  pages.

- Compaction, but again we do not care about the mapping for
  compaction. All we need is a way to move pages (ie migrate).

- Data cache flushing: on some architectures the cache lines are
  tagged with the virtual address, so when flushing a page we need to
  find all of its virtual addresses. Again we do not care about the
  mapping, we just need a way to find every virtual address, in every
  process, pointing to the page.

- GUP users that want to set a page dirty. This is easy: we just do
  not allow protection to work on a GUPed page, and GUP will also
  break the protection. There is just no way to synchronize with GUP
  users as they violate all mm and fs rules anyway.

- Some procfs and other memory debugging APIs. Here we do not care
  about the mapping but about the page state. Also, some of those
  APIs work on virtual addresses, for which we can easily get the vma
  and thus the mapping.


So when we have the mapping for a page from the context, and not from
page->mapping, we can use it as a key to look up the private and index
field values for the page.

To avoid any regression risk, only protected pages see their fields
overloaded. This means that if you are not using the page protection,
then page->mapping, page->private and page->index all stay as they are
today. Also, page->mapping is always used as the canonical place to
look up the mapping of an unprotected page, so any existing code will
keep working as it does today even if the mapping we get from the
context does not match page->mapping. More on this below.


Overview:
=========

The core idea is pretty simple: pass a new mapping argument to every
function that gets a page and needs the mapping corresponding to that
page. Most of the changes are done through semantic patches. Adding a
new function argument does not by itself bring any risk. The risk is
in making sure that the mapping we pass as the argument is the one
corresponding to the page. To avoid any regression we keep using
page->mapping as the canonical mapping even if it does not match what
we get as the new argument, ie:

Today:

    foo(struct page *page)
    {
        ... local variable declaration ...
        struct address_space *mapping = page->mapping;

        ... use mapping or inode from mapping ...
    }

After:

    foo(struct page *page, struct address_space *mapping)
    {
        ... local variable declaration ...

        mapping = fs_page_mapping(page, mapping);

        ... use mapping or inode from mapping ...
    }

With:

    struct address_space *fs_page_mapping(struct page *page,
                                          struct address_space *mapping)
    {
        if (!PageProtected(page)) {
            WARN_ON(page->mapping != mapping);
            return page->mapping;
        }
        return mapping;
    }

So as long as you are not using page protection, the end result
before and after the patchset is the same, ie the same mapping value
will be used.
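
For instance, a converted callback would end up shaped something like
the following (a minimal sketch; example_releasepage() is made up,
while fs_page_mapping() is the helper introduced by this series):

    /* Sketch only: example_releasepage() is illustrative, not an
     * actual callback from this series. */
    static int example_releasepage(struct address_space *__mapping,
                                   struct page *page, gfp_t gfp)
    {
            struct address_space *mapping;

            /* For an unprotected page this returns page->mapping,
             * so behavior is unchanged from before the series. */
            mapping = fs_page_mapping(page, __mapping);

            if (PagePrivate(page))
                    return 0;
            return try_to_free_buffers(page);
    }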


Strategy for passing down the mapping of a page:
================================================

To make it easier to review that the correct mapping is passed down,
the changes are split into 3 steps:
- First, add a new __mapping argument to all functions that will need
  it and do not have it already (or an inode, as we can get the
  mapping from the inode). The argument is never used, and thus in
  this step call sites pass MAPPING_NULL, which is just a macro that
  aliases to NULL but is easier to grep for (see the sketch after
  this list).

- Second, replace MAPPING_NULL with the __mapping argument wherever
  possible. The rationale is that if a function is passing down the
  mapping and already has it as a function argument, then the real
  difficulty is not in that function but in its callers, where we
  must ascertain that the mapping we pass down is the correct one.

  Replace any local MAPPING_NULL with a local mapping variable or
  inode (some conversions are done manually when there are multiple
  possible choices or when the automatic conversion would be wrong).

  At the end of this step there should not be any MAPPING_NULL left
  anywhere.

- Finally, replace any local mapping variable with __mapping and add
  a check, ie going from:

      foo(struct address_space *__mapping, struct page *page, ...)
      {
          struct address_space *mapping = page->mapping;
          ...
      }

  To:

      foo(struct address_space *mapping, struct page *page, ...)
      {
          ...
          mapping = fs_page_mapping(page, mapping);
      }

  With:

      struct address_space *fs_page_mapping(struct page *page,
                                            struct address_space *mapping)
      {
          if (!PageProtected(page))
              return page->mapping;
          return mapping;
      }

  Do the same for local inode variables that are looked up from
  page->mapping.

So as long as you do not use the protection mechanism you will
use the same mapping/inode as today.
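
For reference, a minimal sketch of the MAPPING_NULL macro mentioned
in the first step (the exact definition in the series may differ):

    /* Greppable alias for NULL, marking call sites where the mapping
     * argument has not been wired up yet. */
    #define MAPPING_NULL ((struct address_space *)NULL)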

I hope that over time we can build enough confidence that people
start using page protection and KSM for file-backed pages without any
fear of regression. But in the meantime, as explained in the section
above, code will keep working as it does today if a page is not
protected.


----------------------------------------------------------------------


The present patchset just adds a mapping argument to the various vfs
callbacks. It does not make use of that new parameter, to avoid
regressions. I am posting this whole thing as a small self-contained
patchset as it is rather big and I would like to make progress step by
step.


----------------------------------------------------------------------
FAQ:
----------------------------------------------------------------------

Why multiple passes in the coccinelle patches?

Two reasons for this:
- To modify a callback you first need to identify the callback. It is
  common enough to have a callback defined in one file and set as a
  callback in another. For that to work you need one pass to identify
  all functions used as callbacks, then another pass, executed for
  each of the callbacks identified in the first pass, to update each
  of them (like adding a new argument).

- Run time: some of the coccinelle patches are complex enough that
  they have a long runtime, so to avoid spending hours on semantic
  patching it is better to run a first pass (with a very simple
  semantic patch) that identifies all the files that will need to be
  updated, and then a second pass on each of those files with the
  actual update.


Why wrap page->flags/index/private/mapping?
A protected page can (though it is not necessarily the case) alias
multiple original pages (also see "page alias" in the glossary). Each
alias corresponds to a different original page, which had a different
file (and thus mapping), likely at a different offset (page->index)
and with different properties (page->flags and page->private). So we
need to keep all the original information around, and we use the page
and mapping (or anon_vma) as a key to look up the alias and get the
correct information for it.


Why keep page->private for each page alias?
For KSM or NUMA duplication we expect the page to be uptodate and
clean, and thus we should be able to free the buffers (or fs-specific
struct), so there is no need to keep private around.

However, for page exclusive access we will not alias the page with
any other page, and we might want to keep the page in a writeable
state, so we need to keep page->private around.


Why do a lot of patches move local variable initialization out of
the declaration?
Many patches modify code from:

    foo(...) {
        some_type1 some_name1 = some_arg1;
        some_type2 some_name2 = some_arg2 || some_name1;
        DECLARATION...

        STATEMENT...
    }

To:

    foo(...) {
        some_type1 some_name1;
        some_type2 some_name2;
        DECLARATION...

        some_name1 = some_arg1;
        some_name2 = some_arg2 || some_name1;

        STATEMENT...
    }

This is done so that we can add a test checking whether the
mapping/inode we get as an argument matches the page->mapping field.
We need to add such a test after all declarations but before any
assignment to a local variable; hence declaration initializers need
to become regular statements placed after all declarations (see the
sketch below).

Sadly there is no way around that, and it leads to a lot of code
churn.
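
A minimal sketch of the resulting shape (the names here are
illustrative):

    int foo(struct address_space *mapping, struct page *page)
    {
            struct inode *inode;

            /* The check must run before anything dereferences the
             * mapping, hence no declaration initializers above. */
            mapping = fs_page_mapping(page, mapping);

            /* Former declaration initializers, now plain statements. */
            inode = mapping->host;
            ...
    }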


----------------------------------------------------------------------
Glossary:
----------------------------------------------------------------------

page alias
A protected page can alias with different files and/or vmas (and thus
anon_vmas). A page alias is just the information corresponding to each
of those files/vmas for that page.

For instance, let's say PA was a file-backed page; it had:
    PA->mapping == mapping corresponding to the file PA belongs to
    PA->index   == offset into the file for PA
    PA->private == fs-specific information (like buffer_head)
    PA->flags   == fs/mm-specific information

PB was an anonymous page; it had:
    PB->mapping == anon_vma for the vma into which PB is mapped
    PB->index   == offset into the vma
    PB->private == usually 0 but can contain a swap entry
    PB->flags   == fs/mm-specific information

Now if PA and PB are merged together into a protected page, then the
protected page will have an alias struct for each of the original
pages, ie:

    struct page_alias {
            struct list_head list;
            unsigned long flags;
            unsigned long index;
            void *mapping;
            void *private;
    };

    struct pxa {
            struct list_head aliases;
            ...
    };

So now for the protected page PP (backed by a struct pxa) that
replaces PA and PB we have:
    PP->mapping == pointer to the struct pxa for the page

Note that PP->index/private are undefined for now but could be used
for optimizations like storing the most recent alias lookup.

And the alias list will have one entry for PA and one for PB:
    PA_alias.mapping = PA.mapping
    PA_alias.private = PA.private
    PA_alias.index   = PA.index
    PA_alias.flags   = PA.flags

    PB_alias.mapping = PB.mapping
    PB_alias.private = PB.private
    PB_alias.index   = PB.index
    PB_alias.flags   = PB.flags
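
A hypothetical lookup keyed on (page, mapping), as described in the
FAQ above, could then look like this (sketch only, assuming the
struct layout above; pxa_lookup_alias() is not a function from this
series):

    static struct page_alias *pxa_lookup_alias(struct page *page,
                                               void *mapping)
    {
            struct pxa *pxa = (struct pxa *)page->mapping;
            struct page_alias *alias;

            /* Find the alias whose original mapping (or anon_vma)
             * matches the one supplied by the caller's context. */
            list_for_each_entry(alias, &pxa->aliases, list)
                    if (alias->mapping == mapping)
                            return alias;
            return NULL;
    }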


raw page
A page that is not used as anonymous or file-backed memory. There are
many such cases: a free page, a page used by a device driver, or a
page used by some kernel feature. Each case can sometimes make use of
the various struct page fields (mapping, private, index, lru, ...).

To clearly separate such usage, this patchset wraps access to those
fields in that code with raw_page*() helpers (a sketch follows below).
This also makes it clear what kind of page we expect to see in a given
driver/feature.
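
A minimal sketch of what such wrappers could look like (the exact
helper set in the series may differ):

    static inline void *raw_page_private(struct page *page)
    {
            /* Caller asserts this is a raw page, not an anonymous
             * or file-backed one. */
            return (void *)page->private;
    }

    static inline void raw_page_set_private(struct page *page,
                                            void *data)
    {
            page->private = (unsigned long)data;
    }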


Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Josef Bacik <[email protected]>


Jérôme Glisse (14):
mm/pxa: page exclusive access add header file for all helpers.
fs: define filler_t as a function pointer type
fs: directly use a_ops->freepage() instead of a local copy of it.
mm: add struct address_space to readpage() callback
mm: add struct address_space to writepage() callback
mm: add struct address_space to set_page_dirty() callback
mm: add struct address_space to invalidatepage() callback
mm: add struct address_space to releasepage() callback
mm: add struct address_space to freepage() callback
mm: add struct address_space to putback_page() callback
mm: add struct address_space to launder_page() callback
mm: add struct address_space to is_partially_uptodate() callback
mm: add struct address_space to isolate_page() callback
mm: add struct address_space to is_dirty_writeback() callback

drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 3 +-
drivers/video/fbdev/core/fb_defio.c | 3 +-
fs/9p/vfs_addr.c | 24 ++++++---
fs/adfs/inode.c | 6 ++-
fs/affs/file.c | 9 ++--
fs/affs/symlink.c | 4 +-
fs/afs/dir.c | 15 ++++--
fs/afs/file.c | 18 ++++---
fs/afs/internal.h | 7 +--
fs/afs/write.c | 9 ++--
fs/befs/linuxvfs.c | 13 +++--
fs/bfs/file.c | 6 ++-
fs/block_dev.c | 10 ++--
fs/btrfs/ctree.h | 3 +-
fs/btrfs/disk-io.c | 16 +++---
fs/btrfs/extent_io.c | 5 +-
fs/btrfs/file.c | 2 +-
fs/btrfs/free-space-cache.c | 2 +-
fs/btrfs/inode.c | 24 +++++----
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/relocation.c | 2 +-
fs/btrfs/send.c | 2 +-
fs/buffer.c | 14 +++--
fs/cachefiles/rdwr.c | 6 +--
fs/ceph/addr.c | 26 +++++----
fs/cifs/cifssmb.c | 2 +-
fs/cifs/file.c | 15 ++++--
fs/coda/symlink.c | 4 +-
fs/cramfs/inode.c | 3 +-
fs/ecryptfs/mmap.c | 8 ++-
fs/efs/inode.c | 3 +-
fs/efs/symlink.c | 4 +-
fs/erofs/data.c | 4 +-
fs/erofs/super.c | 8 +--
fs/erofs/zdata.c | 4 +-
fs/exfat/inode.c | 6 ++-
fs/ext2/inode.c | 11 ++--
fs/ext4/inode.c | 35 +++++++-----
fs/f2fs/checkpoint.c | 8 +--
fs/f2fs/data.c | 20 ++++---
fs/f2fs/f2fs.h | 8 +--
fs/f2fs/node.c | 8 +--
fs/fat/inode.c | 6 ++-
fs/freevxfs/vxfs_immed.c | 6 ++-
fs/freevxfs/vxfs_subr.c | 6 ++-
fs/fuse/dir.c | 4 +-
fs/fuse/file.c | 9 ++--
fs/gfs2/aops.c | 29 ++++++----
fs/gfs2/inode.h | 3 +-
fs/gfs2/meta_io.c | 4 +-
fs/hfs/inode.c | 9 ++--
fs/hfsplus/inode.c | 10 ++--
fs/hostfs/hostfs_kern.c | 6 ++-
fs/hpfs/file.c | 6 ++-
fs/hpfs/namei.c | 4 +-
fs/hugetlbfs/inode.c | 3 +-
fs/iomap/buffered-io.c | 15 +++---
fs/iomap/seek.c | 2 +-
fs/isofs/compress.c | 3 +-
fs/isofs/inode.c | 3 +-
fs/isofs/rock.c | 4 +-
fs/jffs2/file.c | 11 ++--
fs/jffs2/os-linux.h | 3 +-
fs/jfs/inode.c | 6 ++-
fs/jfs/jfs_metapage.c | 15 ++++--
fs/libfs.c | 13 +++--
fs/minix/inode.c | 6 ++-
fs/mpage.c | 2 +-
fs/nfs/dir.c | 14 ++---
fs/nfs/file.c | 16 +++---
fs/nfs/read.c | 6 ++-
fs/nfs/symlink.c | 7 +--
fs/nfs/write.c | 9 ++--
fs/nilfs2/inode.c | 11 ++--
fs/nilfs2/mdt.c | 3 +-
fs/nilfs2/page.c | 2 +-
fs/nilfs2/segment.c | 4 +-
fs/ntfs/aops.c | 10 ++--
fs/ntfs/file.c | 2 +-
fs/ocfs2/aops.c | 9 ++--
fs/ocfs2/symlink.c | 4 +-
fs/omfs/file.c | 6 ++-
fs/orangefs/inode.c | 38 +++++++------
fs/qnx4/inode.c | 3 +-
fs/qnx6/inode.c | 3 +-
fs/reiserfs/inode.c | 20 ++++---
fs/romfs/super.c | 3 +-
fs/squashfs/file.c | 4 +-
fs/squashfs/symlink.c | 4 +-
fs/sysv/itree.c | 6 ++-
fs/ubifs/file.c | 21 +++++---
fs/udf/file.c | 7 ++-
fs/udf/inode.c | 8 +--
fs/udf/symlink.c | 4 +-
fs/ufs/inode.c | 6 ++-
fs/vboxsf/file.c | 6 ++-
fs/xfs/xfs_aops.c | 6 +--
fs/zonefs/super.c | 6 ++-
include/linux/balloon_compaction.h | 11 ++--
include/linux/buffer_head.h | 12 +++--
include/linux/fs.h | 38 +++++++------
include/linux/iomap.h | 15 +++---
include/linux/mm.h | 11 +++-
include/linux/nfs_fs.h | 5 +-
include/linux/page-xa.h | 66 +++++++++++++++++++++++
include/linux/pagemap.h | 6 +--
include/linux/swap.h | 10 ++--
mm/balloon_compaction.c | 5 +-
mm/filemap.c | 33 ++++++------
mm/migrate.c | 6 +--
mm/page-writeback.c | 16 +++---
mm/page_io.c | 11 ++--
mm/readahead.c | 6 +--
mm/shmem.c | 5 +-
mm/truncate.c | 9 ++--
mm/vmscan.c | 12 ++---
mm/z3fold.c | 6 ++-
mm/zsmalloc.c | 6 ++-
118 files changed, 722 insertions(+), 385 deletions(-)
create mode 100644 include/linux/page-xa.h

--
2.26.2


2020-10-07 01:46:37

by Jerome Glisse

Subject: [PATCH 11/14] mm: add struct address_space to launder_page() callback

From: Jérôme Glisse <[email protected]>

This is part of a patchset to remove the dependency on the struct
page.mapping field so that we can temporarily update it to point to a
special structure tracking temporary page state (note that the
original mapping pointer is preserved and can still be accessed, but
at a cost).

Add struct address_space to the launder_page() callback arguments.

Note that this patch does not make use of the new argument, nor does
it pass a valid one at call sites (by default this patch just uses
NULL for the new argument's value).

Use the following script (from the root of the linux kernel tree):

./that-script.sh that-semantic-patch.spatch

%<--------------------------------------------------------------------
#!/bin/sh
spatch_file=$1

echo PART1 ===========================================================

# P1 find callback functions name
spatch --dir . --no-includes -D part1 --sp-file $spatch_file

echo PART2 ===========================================================

# P2 change callback function prototype
cat /tmp/unicorn-functions | sort | uniq | while read func ; do
for file in $( git grep -l $func -- '*.[ch]' ) ; do
echo $file
spatch --no-includes --in-place -D part2 \
-D fn=$func --sp-file $spatch_file $file
done
done

echo PART 3 ==========================================================

# P3 find all function which call the callback
spatch --dir . --include-headers -D part3 --sp-file $spatch_file

echo PART 4===========================================================

# P4 change all functions which call the callback
cat /tmp/unicorn-files | sort | uniq | while read file ; do
echo $file
spatch --no-includes --in-place -D part4 \
--sp-file $spatch_file $file
done
-------------------------------------------------------------------->%

With the following semantic patch:

%<--------------------------------------------------------------------
virtual part1, part2, part3, part4

// ----------------------------------------------------------------------------
// Part 1 greps all functions that are used as the launder_page callback.

// initialize the file where we collect all function names (erase it)
@initialize:python depends on part1@
@@
file=open('/tmp/unicorn-functions', 'w')
file.close()

// match function names used as a callback
@p1r2 depends on part1@
identifier I1, FN;
@@
struct address_space_operations I1 = {..., .launder_page = FN, ...};

@script:python p1r3 depends on p1r2@
funcname << p1r2.FN;
@@
if funcname != "NULL":
    file=open('/tmp/unicorn-functions', 'a')
    file.write(funcname + '\n')
    file.close()

// -------------------------------------------------------------------
// Part 2 modifies the callback

// Add the address_space argument to the function (the launder_page callback)
@p2r1 depends on part2@
identifier virtual.fn;
identifier I1;
type T1;
@@
int fn(
+struct address_space *__mapping,
T1 I1) { ... }

@p2r2 depends on part2@
identifier virtual.fn;
identifier I1;
type T1;
@@
int fn(
+struct address_space *__mapping,
T1 I1);

@p2r3 depends on part2@
identifier virtual.fn;
type T1;
@@
int fn(
+struct address_space *__mapping,
T1);

@p2r4 depends on part2@
identifier virtual.fn;
expression E1;
@@
fn(
+MAPPING_NULL,
E1)

// ----------------------------------------------------------------------------
// Part 3 greps all functions that use the launder_page callback.

// initialize the file where we collect all file names (erase it)
@initialize:python depends on part3@
@@
file=open('/tmp/unicorn-files', 'w')
file.write("./include/linux/pagemap.h\n")
file.write("./include/linux/mm.h\n")
file.write("./include/linux/fs.h\n")
file.write("./mm/readahead.c\n")
file.write("./mm/filemap.c\n")
file.close()

@p3r1 depends on part3 exists@
expression E1, E2;
identifier FN;
position P;
@@
FN@P(...) {...
(
E1.a_ops->launder_page(E2)
|
E1->a_ops->launder_page(E2)
)
...}

@script:python p3r2 depends on p3r1@
P << p3r1.P;
@@
file=open('/tmp/unicorn-files', 'a')
file.write(P[0].file + '\n')
file.close()

// -------------------------------------------------------------------
// Part 4 generic modification
@p4r1 depends on part4@
@@
struct address_space_operations { ... int (*launder_page)(
+struct address_space *,
struct page *); ... };

@p4r2 depends on part4@
expression E1, E2;
@@
E1.a_ops->launder_page(
+MAPPING_NULL,
E2)

@p4r3 depends on part4@
expression E1, E2;
@@
E1->a_ops->launder_page(
+MAPPING_NULL,
E2)
-------------------------------------------------------------------->%

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Josef Bacik <[email protected]>
---
fs/9p/vfs_addr.c | 3 ++-
fs/afs/internal.h | 2 +-
fs/afs/write.c | 2 +-
fs/cifs/file.c | 3 ++-
fs/fuse/file.c | 3 ++-
fs/nfs/file.c | 3 ++-
fs/orangefs/inode.c | 17 +++++++++--------
include/linux/fs.h | 2 +-
mm/truncate.c | 2 +-
9 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 0ae4f31b3d7f2..0cbf9a9050d0c 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -209,7 +209,8 @@ static int v9fs_vfs_writepage(struct address_space *__mapping,
* Returns 0 on success.
*/

-static int v9fs_launder_page(struct page *page)
+static int v9fs_launder_page(struct address_space *__mapping,
+ struct page *page)
{
int retval;
struct inode *inode = page->mapping->host;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 264f28759c737..2cdf86d4200a8 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1436,7 +1436,7 @@ extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *);
extern int afs_fsync(struct file *, loff_t, loff_t, int);
extern vm_fault_t afs_page_mkwrite(struct vm_fault *vmf);
extern void afs_prune_wb_keys(struct afs_vnode *);
-extern int afs_launder_page(struct page *);
+extern int afs_launder_page(struct address_space *__mapping, struct page *);

/*
* xattr.c
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 199cbf73b9be4..652b783cd280c 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -890,7 +890,7 @@ void afs_prune_wb_keys(struct afs_vnode *vnode)
/*
* Clean up a page during invalidation.
*/
-int afs_launder_page(struct page *page)
+int afs_launder_page(struct address_space *__mapping, struct page *page)
{
struct address_space *mapping = page->mapping;
struct afs_vnode *vnode = AFS_FS_I(mapping->host);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 38d79a9eafa76..c6cd5ce627e22 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4713,7 +4713,8 @@ static void cifs_invalidate_page(struct address_space *__mapping,
cifs_fscache_invalidate_page(page, &cifsi->vfs_inode);
}

-static int cifs_launder_page(struct page *page)
+static int cifs_launder_page(struct address_space *__mapping,
+ struct page *page)
{
int rc = 0;
loff_t range_start = page_offset(page);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 66b31387e878f..4b0f85d0a0641 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2256,7 +2256,8 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
return copied;
}

-static int fuse_launder_page(struct page *page)
+static int fuse_launder_page(struct address_space *__mapping,
+ struct page *page)
{
int err = 0;
if (clear_page_dirty_for_io(page)) {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index ddfe95d3da057..b1ba143de48d9 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -474,7 +474,8 @@ static void nfs_check_dirty_writeback(struct page *page,
* - Caller holds page lock
* - Return 0 if successful, -error otherwise
*/
-static int nfs_launder_page(struct page *page)
+static int nfs_launder_page(struct address_space *__mapping,
+ struct page *page)
{
struct inode *inode = page_file_mapping(page)->host;
struct nfs_inode *nfsi = NFS_I(inode);
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 8b47bcbf0ca4d..883f78b5c9fcb 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -244,7 +244,7 @@ static int orangefs_writepages(struct address_space *mapping,
return ret;
}

-static int orangefs_launder_page(struct page *);
+static int orangefs_launder_page(struct address_space *, struct page *);

static int orangefs_readpage(struct file *file,
struct address_space *__mapping,
@@ -273,7 +273,7 @@ static int orangefs_readpage(struct file *file,
read_size = 524288;

if (PageDirty(page))
- orangefs_launder_page(page);
+ orangefs_launder_page(MAPPING_NULL, page);

off = page_offset(page);
index = off >> PAGE_SHIFT;
@@ -381,7 +381,7 @@ static int orangefs_write_begin(struct file *file,
* since we don't know what's dirty. This will WARN in
* orangefs_writepage_locked.
*/
- ret = orangefs_launder_page(page);
+ ret = orangefs_launder_page(MAPPING_NULL, page);
if (ret)
return ret;
}
@@ -394,7 +394,7 @@ static int orangefs_write_begin(struct file *file,
wr->len += len;
goto okay;
} else {
- ret = orangefs_launder_page(page);
+ ret = orangefs_launder_page(MAPPING_NULL, page);
if (ret)
return ret;
}
@@ -517,7 +517,7 @@ static void orangefs_invalidatepage(struct address_space *__mapping,
* Thus the following runs if wr was modified above.
*/

- orangefs_launder_page(page);
+ orangefs_launder_page(MAPPING_NULL, page);
}

static int orangefs_releasepage(struct address_space *__mapping,
@@ -532,7 +532,8 @@ static void orangefs_freepage(struct address_space *__mapping,
kfree(detach_page_private(page));
}

-static int orangefs_launder_page(struct page *page)
+static int orangefs_launder_page(struct address_space *__mapping,
+ struct page *page)
{
int r = 0;
struct writeback_control wbc = {
@@ -701,7 +702,7 @@ vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf)
* since we don't know what's dirty. This will WARN in
* orangefs_writepage_locked.
*/
- if (orangefs_launder_page(page)) {
+ if (orangefs_launder_page(MAPPING_NULL, page)) {
ret = VM_FAULT_LOCKED|VM_FAULT_RETRY;
goto out;
}
@@ -714,7 +715,7 @@ vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf)
wr->len = PAGE_SIZE;
goto okay;
} else {
- if (orangefs_launder_page(page)) {
+ if (orangefs_launder_page(MAPPING_NULL, page)) {
ret = VM_FAULT_LOCKED|VM_FAULT_RETRY;
goto out;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4d0b9c14a5017..3854da5a1bcb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -410,7 +410,7 @@ struct address_space_operations {
struct page *, struct page *, enum migrate_mode);
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct address_space *, struct page *);
- int (*launder_page) (struct page *);
+ int (*launder_page) (struct address_space *, struct page *);
int (*is_partially_uptodate) (struct page *, unsigned long,
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
diff --git a/mm/truncate.c b/mm/truncate.c
index e24688115c903..c0719e141e34e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -668,7 +668,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
return 0;
if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
return 0;
- return mapping->a_ops->launder_page(page);
+ return mapping->a_ops->launder_page(MAPPING_NULL, page);
}

/**
--
2.26.2

2020-10-07 01:48:20

by Jerome Glisse

Subject: [PATCH 08/14] mm: add struct address_space to releasepage() callback

From: Jérôme Glisse <[email protected]>

This is part of a patchset to remove the dependency on the struct
page.mapping field so that we can temporarily update it to point to a
special structure tracking temporary page state (note that the
original mapping pointer is preserved and can still be accessed, but
at a cost).

Add struct address_space to the releasepage() callback arguments.

Note that this patch does not make use of the new argument, nor does
it pass a valid one at call sites (by default this patch just uses
NULL for the new argument's value).

Use the following script (from the root of the linux kernel tree):

./that-script.sh that-semantic-patch.spatch

%<--------------------------------------------------------------------
#!/bin/sh
spatch_file=$1

echo PART1 ===========================================================

# P1 find callback functions name
spatch --dir . --no-includes -D part1 --sp-file $spatch_file

echo PART2 ===========================================================

# P2 change callback function prototype
cat /tmp/unicorn-functions | sort | uniq | while read func ; do
for file in $( git grep -l $func -- '*.[ch]' ) ; do
echo $file
spatch --no-includes --in-place -D part2 \
-D fn=$func --sp-file $spatch_file $file
done
done

echo PART 3 ==========================================================

# P3 find all function which call the callback
spatch --dir . --include-headers -D part3 --sp-file $spatch_file

echo PART 4===========================================================

# P4 change all functions which call the callback
cat /tmp/unicorn-files | sort | uniq | while read file ; do
echo $file
spatch --no-includes --in-place -D part4 \
--sp-file $spatch_file $file
done
-------------------------------------------------------------------->%

With the following semantic patch:

%<--------------------------------------------------------------------
virtual part1, part2, part3, part4

// ----------------------------------------------------------------------------
// Part 1 greps all functions that are used as the releasepage callback.

// initialize the file where we collect all function names (erase it)
@initialize:python depends on part1@
@@
file=open('/tmp/unicorn-functions', 'w')
file.close()

// match function names used as a callback
@p1r2 depends on part1@
identifier I1, FN;
@@
struct address_space_operations I1 = {..., .releasepage = FN, ...};

@script:python p1r3 depends on p1r2@
funcname << p1r2.FN;
@@
if funcname != "NULL":
    file=open('/tmp/unicorn-functions', 'a')
    file.write(funcname + '\n')
    file.close()

// -------------------------------------------------------------------
// Part 2 modifies the callback

// Add the address_space argument to the function (the releasepage callback)
@p2r1 depends on part2@
identifier virtual.fn;
identifier I1, I2;
type T1, T2;
@@
int fn(
+struct address_space *__mapping,
T1 I1, T2 I2) { ... }

@p2r2 depends on part2@
identifier virtual.fn;
identifier I1, I2;
type T1, T2;
@@
int fn(
+struct address_space *__mapping,
T1 I1, T2 I2);

@p2r3 depends on part2@
identifier virtual.fn;
expression E1, E2;
@@
fn(
+MAPPING_NULL,
E1, E2)

// ----------------------------------------------------------------------------
// Part 3 greps all functions that use the releasepage callback.

// initialize the file where we collect all file names (erase it)
@initialize:python depends on part3@
@@
file=open('/tmp/unicorn-files', 'w')
file.write("./include/linux/pagemap.h\n")
file.write("./include/linux/mm.h\n")
file.write("./include/linux/fs.h\n")
file.write("./mm/readahead.c\n")
file.write("./mm/filemap.c\n")
file.close()

@p3r1 depends on part3 exists@
expression E1, E2, E3;
identifier FN;
position P;
@@
FN@P(...) {...
(
E1.a_ops->releasepage(E2, E3)
|
E1->a_ops->releasepage(E2, E3)
)
...}

@script:python p3r2 depends on p3r1@
P << p3r1.P;
@@
file=open('/tmp/unicorn-files', 'a')
file.write(P[0].file + '\n')
file.close()

// -------------------------------------------------------------------
// Part 4 generic modification
@p4r1 depends on part4@
@@
struct address_space_operations { ... int (*releasepage)(
+struct address_space *,
struct page *, ...); ... };

@p4r2 depends on part4@
expression E1, E2, E3;
@@
E1.a_ops->releasepage(
+MAPPING_NULL,
E2, E3)

@p4r3 depends on part4@
expression E1, E2, E3;
@@
E1->a_ops->releasepage(
+MAPPING_NULL,
E2, E3)
-------------------------------------------------------------------->%

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Josef Bacik <[email protected]>
---
fs/9p/vfs_addr.c | 3 ++-
fs/afs/dir.c | 6 ++++--
fs/afs/file.c | 6 ++++--
fs/block_dev.c | 3 ++-
fs/btrfs/disk-io.c | 5 +++--
fs/btrfs/inode.c | 5 +++--
fs/ceph/addr.c | 3 ++-
fs/cifs/file.c | 3 ++-
fs/erofs/super.c | 5 +++--
fs/ext4/inode.c | 3 ++-
fs/f2fs/data.c | 3 ++-
fs/f2fs/f2fs.h | 3 ++-
fs/gfs2/aops.c | 3 ++-
fs/gfs2/inode.h | 3 ++-
fs/hfs/inode.c | 3 ++-
fs/hfsplus/inode.c | 3 ++-
fs/iomap/buffered-io.c | 3 ++-
fs/jfs/jfs_metapage.c | 5 +++--
fs/nfs/file.c | 3 ++-
fs/ocfs2/aops.c | 3 ++-
fs/orangefs/inode.c | 3 ++-
fs/reiserfs/inode.c | 3 ++-
fs/ubifs/file.c | 3 ++-
include/linux/fs.h | 2 +-
include/linux/iomap.h | 3 ++-
mm/filemap.c | 3 ++-
26 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 357f2e5049c48..0ae4f31b3d7f2 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -123,7 +123,8 @@ static int v9fs_vfs_readpages(struct file *filp, struct address_space *mapping,
* Returns 1 if the page can be released, false otherwise.
*/

-static int v9fs_release_page(struct page *page, gfp_t gfp)
+static int v9fs_release_page(struct address_space *__mapping,
+ struct page *page, gfp_t gfp)
{
if (PagePrivate(page))
return 0;
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index d77c13c213d2d..c27524f35281e 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -40,7 +40,8 @@ static int afs_symlink(struct inode *dir, struct dentry *dentry,
static int afs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry,
unsigned int flags);
-static int afs_dir_releasepage(struct page *page, gfp_t gfp_flags);
+static int afs_dir_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_flags);
static void afs_dir_invalidatepage(struct address_space *__mapping,
struct page *page, unsigned int offset,
unsigned int length);
@@ -1971,7 +1972,8 @@ static int afs_rename(struct inode *old_dir, struct dentry *old_dentry,
* Release a directory page and clean up its private state if it's not busy
* - return true if the page can now be released, false if not
*/
-static int afs_dir_releasepage(struct page *page, gfp_t gfp_flags)
+static int afs_dir_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_flags)
{
struct afs_vnode *dvnode = AFS_FS_I(page->mapping->host);

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 43edfa65c7ac7..496595240f12b 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,7 +22,8 @@ static int afs_readpage(struct file *file, struct address_space *__mapping,
static void afs_invalidatepage(struct address_space *__mapping,
struct page *page, unsigned int offset,
unsigned int length);
-static int afs_releasepage(struct page *page, gfp_t gfp_flags);
+static int afs_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t gfp_flags);

static int afs_readpages(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
@@ -645,7 +646,8 @@ static void afs_invalidatepage(struct address_space *__mapping,
* release a page and clean up its private state if it's not busy
* - return true if the page can now be released, false if not
*/
-static int afs_releasepage(struct page *page, gfp_t gfp_flags)
+static int afs_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t gfp_flags)
{
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
unsigned long priv;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index b50c93932dfdf..e4cb73598e3c1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1935,7 +1935,8 @@ EXPORT_SYMBOL_GPL(blkdev_read_iter);
* Try to release a page associated with block device when the system
* is under memory pressure.
*/
-static int blkdev_releasepage(struct page *page, gfp_t wait)
+static int blkdev_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t wait)
{
struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d57d0a6dd2621..c9d640cfa51cb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -957,7 +957,8 @@ static int btree_readpage(struct file *file, struct address_space *__mapping,
return extent_read_full_page(page, btree_get_extent, 0);
}

-static int btree_releasepage(struct page *page, gfp_t gfp_flags)
+static int btree_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_flags)
{
if (PageWriteback(page) || PageDirty(page))
return 0;
@@ -972,7 +973,7 @@ static void btree_invalidatepage(struct address_space *__mapping,
struct extent_io_tree *tree;
tree = &BTRFS_I(page->mapping->host)->io_tree;
extent_invalidatepage(tree, page, offset);
- btree_releasepage(page, GFP_NOFS);
+ btree_releasepage(MAPPING_NULL, page, GFP_NOFS);
if (PagePrivate(page)) {
btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info,
"page private not zero on page %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 062886fc0e750..95bf86a871ffb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8057,7 +8057,8 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
return ret;
}

-static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
+static int btrfs_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_flags)
{
if (PageWriteback(page) || PageDirty(page))
return 0;
@@ -8116,7 +8117,7 @@ static void btrfs_invalidatepage(struct address_space *__mapping,

tree = &BTRFS_I(inode)->io_tree;
if (offset) {
- btrfs_releasepage(page, GFP_NOFS);
+ btrfs_releasepage(MAPPING_NULL, page, GFP_NOFS);
return;
}

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index ed555b0d48bfa..f6739a7b9ad35 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -172,7 +172,8 @@ static void ceph_invalidatepage(struct address_space *__mapping,
ClearPagePrivate(page);
}

-static int ceph_releasepage(struct page *page, gfp_t g)
+static int ceph_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t g)
{
dout("%p releasepage %p idx %lu (%sdirty)\n", page->mapping->host,
page, page->index, PageDirty(page) ? "" : "not ");
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 84cb64821036c..38d79a9eafa76 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4694,7 +4694,8 @@ static int cifs_write_begin(struct file *file, struct address_space *mapping,
return rc;
}

-static int cifs_release_page(struct page *page, gfp_t gfp)
+static int cifs_release_page(struct address_space *__mapping,
+ struct page *page, gfp_t gfp)
{
if (PagePrivate(page))
return 0;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 3c0e10d1b4e19..d4082102c180f 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -281,7 +281,8 @@ static int erofs_fc_parse_param(struct fs_context *fc,
#ifdef CONFIG_EROFS_FS_ZIP
static const struct address_space_operations managed_cache_aops;

-static int erofs_managed_cache_releasepage(struct page *page, gfp_t gfp_mask)
+static int erofs_managed_cache_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_mask)
{
int ret = 1; /* 0 - busy */
struct address_space *const mapping = page->mapping;
@@ -308,7 +309,7 @@ static void erofs_managed_cache_invalidatepage(struct address_space *__mapping,
DBG_BUGON(stop > PAGE_SIZE || stop < length);

if (offset == 0 && stop == PAGE_SIZE)
- while (!erofs_managed_cache_releasepage(page, GFP_NOFS))
+ while (!erofs_managed_cache_releasepage(MAPPING_NULL, page, GFP_NOFS))
cond_resched();
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27b8d57349d88..2fd0c674136cc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3283,7 +3283,8 @@ static void ext4_journalled_invalidatepage(struct address_space *__mapping,
WARN_ON(__ext4_journalled_invalidatepage(page, offset, length) < 0);
}

-static int ext4_releasepage(struct page *page, gfp_t wait)
+static int ext4_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t wait)
{
journal_t *journal = EXT4_JOURNAL(page->mapping->host);

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b13e430e62435..c6444ffd7d6e9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -3720,7 +3720,8 @@ void f2fs_invalidate_page(struct address_space *__mapping, struct page *page,
f2fs_clear_page_private(page);
}

-int f2fs_release_page(struct page *page, gfp_t wait)
+int f2fs_release_page(struct address_space *__mapping, struct page *page,
+ gfp_t wait)
{
/* If this is dirty page, keep PagePrivate */
if (PageDirty(page))
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index eb6f9aa4007c6..6bb30d2192842 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3472,7 +3472,8 @@ int f2fs_write_single_data_page(struct page *page, int *submitted,
void f2fs_invalidate_page(struct address_space *__mapping, struct page *page,
unsigned int offset,
unsigned int length);
-int f2fs_release_page(struct page *page, gfp_t wait);
+int f2fs_release_page(struct address_space *__mapping, struct page *page,
+ gfp_t wait);
#ifdef CONFIG_MIGRATION
int f2fs_migrate_page(struct address_space *mapping, struct page *newpage,
struct page *page, enum migrate_mode mode);
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0c6d2e99a5243..77efc65a412ec 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -723,7 +723,8 @@ static void gfs2_invalidatepage(struct address_space *__mapping,
* Returns: 1 if the page was put or else 0
*/

-int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
+int gfs2_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t gfp_mask)
{
struct address_space *mapping = page->mapping;
struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
index b52ecf4ffe634..f1e878353cebb 100644
--- a/fs/gfs2/inode.h
+++ b/fs/gfs2/inode.h
@@ -12,7 +12,8 @@
#include <linux/mm.h>
#include "util.h"

-extern int gfs2_releasepage(struct page *page, gfp_t gfp_mask);
+extern int gfs2_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_mask);
extern int gfs2_internal_read(struct gfs2_inode *ip,
char *buf, loff_t *pos, unsigned size);
extern void gfs2_set_aops(struct inode *inode);
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 101cc5e10524f..8986c8a0a23b2 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -72,7 +72,8 @@ static sector_t hfs_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping, block, hfs_get_block);
}

-static int hfs_releasepage(struct page *page, gfp_t mask)
+static int hfs_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t mask)
{
struct inode *inode = page->mapping->host;
struct super_block *sb = inode->i_sb;
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1654ee206e7e5..0534280978457 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -66,7 +66,8 @@ static sector_t hfsplus_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping, block, hfsplus_get_block);
}

-static int hfsplus_releasepage(struct page *page, gfp_t mask)
+static int hfsplus_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t mask)
{
struct inode *inode = page->mapping->host;
struct super_block *sb = inode->i_sb;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b94729b7088a7..091f6656f3d6b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -476,7 +476,8 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);

int
-iomap_releasepage(struct page *page, gfp_t gfp_mask)
+iomap_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t gfp_mask)
{
trace_iomap_releasepage(page->mapping->host, page_offset(page),
PAGE_SIZE);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 5be751fa11e0b..435b55faca4ff 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -528,7 +528,8 @@ static int metapage_readpage(struct file *fp, struct address_space *__mapping,
return -EIO;
}

-static int metapage_releasepage(struct page *page, gfp_t gfp_mask)
+static int metapage_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t gfp_mask)
{
struct metapage *mp;
int ret = 1;
@@ -565,7 +566,7 @@ static void metapage_invalidatepage(struct address_space *__mapping,

BUG_ON(PageWriteback(page));

- metapage_releasepage(page, 0);
+ metapage_releasepage(MAPPING_NULL, page, 0);
}

const struct address_space_operations jfs_metapage_aops = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 381288d686386..ddfe95d3da057 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -426,7 +426,8 @@ static void nfs_invalidate_page(struct address_space *__mapping,
* - Caller holds page lock
* - Return true (may release page) or false (may not)
*/
-static int nfs_release_page(struct page *page, gfp_t gfp)
+static int nfs_release_page(struct address_space *__mapping,
+ struct page *page, gfp_t gfp)
{
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index c597a104e0af4..fdd3c6a55d817 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -501,7 +501,8 @@ static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
return status;
}

-static int ocfs2_releasepage(struct page *page, gfp_t wait)
+static int ocfs2_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t wait)
{
if (!page_has_buffers(page))
return 0;
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 6ea0ec45754dc..1534dc2df6e5c 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -520,7 +520,8 @@ static void orangefs_invalidatepage(struct address_space *__mapping,
orangefs_launder_page(page);
}

-static int orangefs_releasepage(struct page *page, gfp_t foo)
+static int orangefs_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t foo)
{
return !PagePrivate(page);
}
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index d35c03a7d3f5b..efd149cc897a9 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3230,7 +3230,8 @@ static int reiserfs_set_page_dirty(struct address_space *__mapping,
* even in -o notail mode, we can't be sure an old mount without -o notail
* didn't create files with tails.
*/
-static int reiserfs_releasepage(struct page *page, gfp_t unused_gfp_flags)
+static int reiserfs_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t unused_gfp_flags)
{
struct inode *inode = page->mapping->host;
struct reiserfs_journal *j = SB_JOURNAL(inode->i_sb);
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 11ed42c4859f0..7e00370ca3ed1 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1486,7 +1486,8 @@ static int ubifs_migrate_page(struct address_space *mapping,
}
#endif

-static int ubifs_releasepage(struct page *page, gfp_t unused_gfp_flags)
+static int ubifs_releasepage(struct address_space *__mapping,
+ struct page *page, gfp_t unused_gfp_flags)
{
struct inode *inode = page->mapping->host;
struct ubifs_info *c = inode->i_sb->s_fs_info;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b471e82546001..989e505de9182 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -399,7 +399,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct address_space *, struct page *,
unsigned int, unsigned int);
- int (*releasepage) (struct page *, gfp_t);
+ int (*releasepage) (struct address_space *, struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 45f23d2268365..cb4b207974756 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -159,7 +159,8 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
int iomap_set_page_dirty(struct address_space *__mapping, struct page *page);
int iomap_is_partially_uptodate(struct page *page, unsigned long from,
unsigned long count);
-int iomap_releasepage(struct page *page, gfp_t gfp_mask);
+int iomap_releasepage(struct address_space *__mapping, struct page *page,
+ gfp_t gfp_mask);
void iomap_invalidatepage(struct address_space *__mapping, struct page *page,
unsigned int offset,
unsigned int len);
diff --git a/mm/filemap.c b/mm/filemap.c
index b67a253e5e6c7..eccd5d0554851 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3694,7 +3694,8 @@ int try_to_release_page(struct page *page, gfp_t gfp_mask)
return 0;

if (mapping && mapping->a_ops->releasepage)
- return mapping->a_ops->releasepage(page, gfp_mask);
+ return mapping->a_ops->releasepage(MAPPING_NULL, page,
+ gfp_mask);
return try_to_free_buffers(page);
}

--
2.26.2

2020-10-07 03:49:00

by Matthew Wilcox

Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:
> The present patchset just adds a mapping argument to the various vfs
> callbacks. It does not make use of that new parameter, to avoid
> regressions. I am posting this whole thing as a small self-contained
> patchset as it is rather big and I would like to make progress step
> by step.

Well, that's the problem. This patch set is gigantic and unreviewable.
And it has no benefits. The idea you present here was discussed at
LSFMM in Utah and I recall absolutely nobody being in favour of it.
You claim many wonderful features will be unlocked by this, but I think
they can all be achieved without doing any of this very disruptive work.

> 118 files changed, 722 insertions(+), 385 deletions(-)

mmm.

2020-10-07 17:25:29

by Jerome Glisse

Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:
> > The present patchset just adds a mapping argument to the various vfs
> > callbacks. It does not make use of that new parameter, to avoid
> > regressions. I am posting this whole thing as a small self-contained
> > patchset as it is rather big and I would like to make progress step
> > by step.
>
> Well, that's the problem. This patch set is gigantic and unreviewable.
> And it has no benefits. The idea you present here was discussed at
> LSFMM in Utah and I recall absolutely nobody being in favour of it.
> You claim many wonderful features will be unlocked by this, but I think
> they can all be achieved without doing any of this very disruptive work.

Do you have any ideas on how to achieve them without such a change? I
will be more than happy with a simpler solution, but I fail to see how
you can work around the need for a pointer inside struct page. Given
that struct page can not grow, you need to be able to overload one of
the existing fields; at least I do not see any other way.

Cheers,
Jérôme

2020-10-07 18:22:38

by Jerome Glisse

Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 06:05:58PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:
> > > > The present patchset just adds a mapping argument to the various vfs
> > > > callbacks. It does not make use of that new parameter, to avoid
> > > > regressions. I am posting this whole thing as a small self-contained
> > > > patchset as it is rather big and I would like to make progress step
> > > > by step.
> > >
> > > Well, that's the problem. This patch set is gigantic and unreviewable.
> > > And it has no benefits. The idea you present here was discussed at
> > > LSFMM in Utah and I recall absolutely nobody being in favour of it.
> > > You claim many wonderful features will be unlocked by this, but I think
> > > they can all be achieved without doing any of this very disruptive work.
> >
> > Do you have any ideas on how to achieve them without such a change? I
> > will be more than happy with a simpler solution, but I fail to see how
> > you can work around the need for a pointer inside struct page. Given
> > that struct page can not grow, you need to be able to overload one of
> > the existing fields; at least I do not see any other way.
>
> The one I've spent the most time thinking about is sharing pages between
> reflinked files. My approach is to pull DAX entries into the main page
> cache and have them reference the PFN directly. It's not a struct page,
> but we can find a struct page from it if we need it. The struct page
> would belong to a mapping that isn't part of the file.

You would need to do a lot of filesystem-specific changes to make sure
the fs understands the special mapping. It is doable, but I feel it
would have a lot of fs-specific parts.


> For other things (NUMA distribution), we can point to something which
> isn't a struct page and can be distinguished from a real struct page by a
> bit somewhere (I have ideas for at least three bits in struct page that
> could be used for this). Then use a pointer in that data structure to
> point to the real page. Or do NUMA distribution at the inode level.
> Have a way to get from (inode, node) to an address_space which contains
> just regular pages.

How do you find all the copies? KSM maintains a list for a reason.
The same would be needed here, because if you want to break the write
protection you need to find all the copies first. If you intend to
walk page tables, how do you synchronize to avoid more copies spawning
while you walk the reverse mapping? We could lock the struct page, I
guess. Also, how do you walk device page tables, which are completely
hidden from the core mm?


> Using main memory to cache DAX could be done today without any data
> structure changes. It just needs the DAX entries pulled up into the
> main pagecache. See earlier item.
>
> Exclusive write access ... you could put a magic value in the pagecache
> for pages which are exclusively for someone else's use and handle those
> specially. I don't entirely understand this use case.

For this use case you need a callback to break the protection, and it
needs to handle all cases, ie not only writes by the CPU through a
file mapping but also the file write syscall and other syscalls that
can write to a page (pipe, ...).


> I don't have time to work on all of these. If there's one that
> particularly interests you, let's dive deep into it and figure out how

I care about KSM, duplicate NUMA copies (not only for the CPU but
also for devices) and write protection or exclusive write access. In
each case you need a list of all the copies (for KSM, of the
deduplicated page). Having a special entry in the page cache does not
sound like a good option: in many code paths you would need to
re-look-up the page cache to find out if the page is in a special
state. If you use a bit flag in struct page, how do you get to the
callback or to the copy/alias? Walk all the page tables?

> you can do it without committing this kind of violence to struct page.

I do not see how I am doing violence to struct page :) The basis of
my approach is to pass down the mapping. We always have the mapping
at the top of the stack (either at the syscall entry point on a file,
or through the vma when working on a virtual address).

But we rarely pass this mapping down the stack into the fs code. I am
only passing the mapping down through the bottom of the stack so we
do not need to rely on page->mapping all the time. I am not trying to
remove the page->mapping field; it is still useful. I just want to be
able to overload it so that we can make the KSM code generic and
reuse that generic part for other use cases.

Cheers,
Jérôme

2020-10-07 18:34:51

by Matthew Wilcox

Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> On Wed, Oct 07, 2020 at 06:05:58PM +0100, Matthew Wilcox wrote:
> > On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> > > On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:
> > > > > The present patchset just adds a mapping argument to the various vfs
> > > > > callbacks. It does not make use of that new parameter, to avoid
> > > > > regressions. I am posting this whole thing as a small contained
> > > > > patchset as it is rather big and I would like to make progress step
> > > > > by step.
> > > >
> > > > Well, that's the problem. This patch set is gigantic and unreviewable.
> > > > And it has no benefits. The idea you present here was discussed at
> > > > LSFMM in Utah and I recall absolutely nobody being in favour of it.
> > > > You claim many wonderful features will be unlocked by this, but I think
> > > > they can all be achieved without doing any of this very disruptive work.
> > >
> > > Do you have any ideas on how to achieve them without such a change? I
> > > will be more than happy with a simpler solution, but I fail to see how
> > > you can work around the need for a pointer inside struct page. Given
> > > that struct page cannot grow, you need to be able to overload one of
> > > the existing fields; at least I do not see any other way.
> >
> > The one I've spent the most time thinking about is sharing pages between
> > reflinked files. My approach is to pull DAX entries into the main page
> > cache and have them reference the PFN directly. It's not a struct page,
> > but we can find a struct page from it if we need it. The struct page
> > would belong to a mapping that isn't part of the file.
>
> You would need a lot of filesystem-specific changes to make sure the
> fs understands the special mapping. It is doable, but I feel it would
> involve a lot of fs-specific parts.

I can't see any way to make it work without filesystem cooperation.

> > For other things (NUMA distribution), we can point to something which
> > isn't a struct page and can be distinguished from a real struct page by a
> > bit somewhere (I have ideas for at least three bits in struct page that
> > could be used for this). Then use a pointer in that data structure to
> > point to the real page. Or do NUMA distribution at the inode level.
> > Have a way to get from (inode, node) to an address_space which contains
> > just regular pages.
>
> How do you find all the copies? KSM maintains a list for a reason.
> The same would be needed here, because to break the write protection
> you need to find all the copies first. If you intend to walk the page
> tables, how do you synchronize to avoid more copies spawning while
> you walk the reverse mapping? We could lock the struct page, I guess.
> Also, how do you walk device page tables, which are completely hidden
> from core mm?

You have the inode and you iterate over each mapping, looking up the page
that's in each mapping. Or you use the i_mmap tree to find the pages.
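
A rough sketch of that iteration (not from the patchset: the locking
and interval-tree helpers are real kernel primitives, but the function
itself and the loop body are illustrative only):

/* Visit every VMA that maps the given file offset. */
static void for_each_vma_of_index(struct address_space *mapping,
                                  pgoff_t index)
{
        struct vm_area_struct *vma;

        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
                unsigned long addr = vma->vm_start +
                        ((index - vma->vm_pgoff) << PAGE_SHIFT);

                /* unmap or write-protect the PTE at addr here */
        }
        i_mmap_unlock_read(mapping);
}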

> > Using main memory to cache DAX could be done today without any data
> > structure changes. It just needs the DAX entries pulled up into the
> > main pagecache. See earlier item.
> >
> > Exclusive write access ... you could put a magic value in the pagecache
> > for pages which are exclusively for someone else's use and handle those
> > specially. I don't entirely understand this use case.
>
> For this use case you need a callback to break the protection, and it
> needs to handle all cases, i.e. not only writes by the CPU through a
> file mapping but also the file write syscall and other syscalls that
> can write to the page (pipe, ...).

If the page can't be found in the page cache, then by definition you
won't be able to write to it.

> > I don't have time to work on all of these. If there's one that
> > particularly interests you, let's dive deep into it and figure out how
>
> I care about KSM, duplicated NUMA copies (not only for the CPU but
> also for devices), and write protection or exclusive write access.
> In each case you need a list of all the copies (for KSM, of the
> deduplicated page). Having a special entry in the page cache does not
> sound like a good option: in many code paths you would need to re-look
> up the page cache to find out whether the page is in a special state.
> If you use a bit flag in struct page, how do you get to the callback
> or to the copy/alias? Walk all the page tables?

Like I said, something that _looks_ like a struct page. At least looks
enough like a struct page that you can pull a pointer out of the page
cache and check the bit. But since it's not actually a struct page,
you can use the rest of the data structure for pointers to things you
want to track. Like the real struct page.
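
Illustratively, such a stand-in could look like this (no structure of
this kind exists in the kernel today; the names are made up):

/*
 * An object that can sit in the page cache where a struct page
 * normally would.  A reserved bit in the first word tells it apart
 * from a real page; the remaining fields are free for bookkeeping.
 */
struct page_proxy {
        unsigned long flags;    /* one bit set: "not a real page" */
        struct page *real;      /* the backing struct page */
        void *private;          /* per-copy / per-alias state */
};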

> I do not see how I am doing violence to struct page :) The basis of
> my approach is to pass down the mapping. We always have the mapping
> at the top of the stack (either at the syscall entry point on a file,
> or through the vma when working on a virtual address).

Yes, you explained all that in Utah. I wasn't impressed then, and I'm
not impressed now.

2020-10-07 19:27:46

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:
> > > The present patchset just adds a mapping argument to the various vfs
> > > callbacks. It does not make use of that new parameter, to avoid
> > > regressions. I am posting this whole thing as a small contained
> > > patchset as it is rather big and I would like to make progress step
> > > by step.
> >
> > Well, that's the problem. This patch set is gigantic and unreviewable.
> > And it has no benefits. The idea you present here was discussed at
> > LSFMM in Utah and I recall absolutely nobody being in favour of it.
> > You claim many wonderful features will be unlocked by this, but I think
> > they can all be achieved without doing any of this very disruptive work.
>
> Do you have any ideas on how to achieve them without such a change? I
> will be more than happy with a simpler solution, but I fail to see how
> you can work around the need for a pointer inside struct page. Given
> that struct page cannot grow, you need to be able to overload one of
> the existing fields; at least I do not see any other way.

The one I've spent the most time thinking about is sharing pages between
reflinked files. My approach is to pull DAX entries into the main page
cache and have them reference the PFN directly. It's not a struct page,
but we can find a struct page from it if we need it. The struct page
would belong to a mapping that isn't part of the file.

For other things (NUMA distribution), we can point to something which
isn't a struct page and can be distinguished from a real struct page by a
bit somewhere (I have ideas for at least three bits in struct page that
could be used for this). Then use a pointer in that data structure to
point to the real page. Or do NUMA distribution at the inode level.
Have a way to get from (inode, node) to an address_space which contains
just regular pages.

Using main memory to cache DAX could be done today without any data
structure changes. It just needs the DAX entries pulled up into the
main pagecache. See earlier item.

Exclusive write access ... you could put a magic value in the pagecache
for pages which are exclusively for someone else's use and handle those
specially. I don't entirely understand this use case.

I don't have time to work on all of these. If there's one that
particularly interests you, let's dive deep into it and figure out how
you can do it without committing this kind of violence to struct page.

2020-10-07 21:47:52

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 07:33:16PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 06:05:58PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 07, 2020 at 10:48:35AM -0400, Jerome Glisse wrote:
> > > > On Wed, Oct 07, 2020 at 04:20:13AM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Oct 06, 2020 at 09:05:49PM -0400, [email protected] wrote:

[...]

> > > For other things (NUMA distribution), we can point to something which
> > > isn't a struct page and can be distinguished from a real struct page by a
> > > bit somewhere (I have ideas for at least three bits in struct page that
> > > could be used for this). Then use a pointer in that data structure to
> > > point to the real page. Or do NUMA distribution at the inode level.
> > > Have a way to get from (inode, node) to an address_space which contains
> > > just regular pages.
> >
> > How do you find all the copies? KSM maintains a list for a reason.
> > The same would be needed here, because to break the write protection
> > you need to find all the copies first. If you intend to walk the page
> > tables, how do you synchronize to avoid more copies spawning while
> > you walk the reverse mapping? We could lock the struct page, I guess.
> > Also, how do you walk device page tables, which are completely hidden
> > from core mm?
>
> You have the inode and you iterate over each mapping, looking up the page
> that's in each mapping. Or you use the i_mmap tree to find the pages.

This would slow things down for everyone, as we would have to walk all
mappings each time we try to write to a page. Also, we have a mechanism
for page writeback that avoids races between threads trying to write
and writeback; we would need something similar here. Without mediating
this through struct page, I do not see how to keep this reasonable from
a performance point of view.


> > > I don't have time to work on all of these. If there's one that
> > > particularly interests you, let's dive deep into it and figure out how
> >
> > I care about KSM, duplicated NUMA copies (not only for the CPU but
> > also for devices), and write protection or exclusive write access.
> > In each case you need a list of all the copies (for KSM, of the
> > deduplicated page). Having a special entry in the page cache does not
> > sound like a good option: in many code paths you would need to re-look
> > up the page cache to find out whether the page is in a special state.
> > If you use a bit flag in struct page, how do you get to the callback
> > or to the copy/alias? Walk all the page tables?
>
> Like I said, something that _looks_ like a struct page. At least looks
> enough like a struct page that you can pull a pointer out of the page
> cache and check the bit. But since it's not actually a struct page,
> you can use the rest of the data structure for pointers to things you
> want to track. Like the real struct page.

What I fear is the added cost, because it means we need to do this
lookup every time to check, and we also need proper locking to avoid
races. Adding an ancillary struct and trying to keep everything
synchronized seems harder to me.

>
> > I do not see how I am doing violence to struct page :) The basis of
> > my approach is to pass down the mapping. We always have the mapping
> > at the top of the stack (either at the syscall entry point on a file,
> > or through the vma when working on a virtual address).
>
> Yes, you explained all that in Utah. I wasn't impressed then, and I'm
> not impressed now.

Is this more of a taste thing, or is there something specific you do
not like?

Cheers,
Jérôme

2020-10-07 22:12:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> > For other things (NUMA distribution), we can point to something which
> > isn't a struct page and can be distinguished from a real struct page by a
> > bit somewhere (I have ideas for at least three bits in struct page that
> > could be used for this). Then use a pointer in that data structure to
> > point to the real page. Or do NUMA distribution at the inode level.
> > Have a way to get from (inode, node) to an address_space which contains
> > just regular pages.
>
> How do you find all the copies? KSM maintains a list for a reason.
> The same would be needed here, because to break the write protection
> you need to find all the copies first. If you intend to walk the page
> tables, how do you synchronize to avoid more copies spawning while
> you walk the reverse mapping? We could lock the struct page, I guess.
> Also, how do you walk device page tables, which are completely hidden
> from core mm?

So ... why don't you put a PageKsm page in the page cache? That way you
can share code with the current KSM implementation. You'd need
something like this:

+++ b/mm/filemap.c
@@ -1622,6 +1622,9 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
lock_page(page);
/* Has the page been truncated? */
if (unlikely(page->mapping != mapping)) {
+ if (PageKsm(page)) {
+ ...
+ }
unlock_page(page);
put_page(page);
goto repeat;
@@ -1655,6 +1658,7 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
* * %FGP_WRITE - The page will be written
* * %FGP_NOFS - __GFP_FS will get cleared in gfp mask
* * %FGP_NOWAIT - Don't get blocked by page lock
+ * * %FGP_KSM - Return KSM pages
*
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
* if the %GFP flags specified for %FGP_CREAT are atomic.
@@ -1687,6 +1691,11 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,

/* Has the page been truncated? */
if (unlikely(page->mapping != mapping)) {
+ if (PageKsm(page)) {
+ if (fgp_flags & FGP_KSM)
+ return page;
+ ...
+ }
unlock_page(page);
put_page(page);
goto repeat;

I don't know what you want to do when you find a KSM page, so I just left
an ellipsis.

2020-10-08 15:56:45

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Wed, Oct 07, 2020 at 11:09:16PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 07, 2020 at 01:54:19PM -0400, Jerome Glisse wrote:
> > > For other things (NUMA distribution), we can point to something which
> > > isn't a struct page and can be distinguished from a real struct page by a
> > > bit somewhere (I have ideas for at least three bits in struct page that
> > > could be used for this). Then use a pointer in that data structure to
> > > point to the real page. Or do NUMA distribution at the inode level.
> > > Have a way to get from (inode, node) to an address_space which contains
> > > just regular pages.
> >
> > How do you find all the copies? KSM maintains a list for a reason.
> > The same would be needed here, because to break the write protection
> > you need to find all the copies first. If you intend to walk the page
> > tables, how do you synchronize to avoid more copies spawning while
> > you walk the reverse mapping? We could lock the struct page, I guess.
> > Also, how do you walk device page tables, which are completely hidden
> > from core mm?
>
> So ... why don't you put a PageKsm page in the page cache? That way you
> can share code with the current KSM implementation. You'd need
> something like this:

I do just that, but there is no need to change anything in the page
cache, so the code below is not necessary. What you need is a way to
find all the copies: on a write fault (or any write access), from
that fault you get the mapping and offset, and you use those to look
up the fs-specific information and de-duplicate the page into a new
page carrying that fs-specific information. Hence the filesystem code
does not need to know anything; it all happens in generic common code.

So the flow is:

Same as before:
 1 - write fault (address, vma)
 2 - regular write fault handler -> find page in page cache

New to common page fault code:
 3 - KSM check in write fault common code (same as KSM does today
     in the anonymous page fault code path).
 4 - break KSM (address, vma) -> (file offset, mapping)
     4.a - use the mapping and file offset to look up the proper
           fs-specific information that was saved when the
           page was made KSM.
     4.b - allocate a new page and initialize it with that
           information (and the page content), update the page
           cache and the mappings, i.e. all the PTEs that were
           pointing to the KSM page for that mapping at that
           offset now use the new page (like KSM for anonymous
           pages today).

Resume regular code path:
 mkwrite / set pte ...

Roughly the same applies to the write ioctl (other cases go through
GUP, which itself goes through the page fault code path). There is no
need to change the page cache in any way, just the common code paths
that enable writes to file-backed pages. A sketch of that fault-time
flow follows below.
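
To make step 4 concrete, here is a minimal sketch of the fault-time
path; ksm_alias, ksm_alias_lookup() and the final replacement step are
hypothetical, only the allocation and copy helpers are real kernel
APIs:

static vm_fault_t break_file_ksm(struct vm_fault *vmf,
                                 struct address_space *mapping,
                                 pgoff_t index)
{
        /* 4.a: recover the fs-specific state saved at dedup time */
        struct ksm_alias *alias = ksm_alias_lookup(mapping, index);
        struct page *new;

        /* 4.b: allocate a private copy and restore that state */
        new = alloc_page(GFP_HIGHUSER_MOVABLE);
        if (!new)
                return VM_FAULT_OOM;
        copy_highpage(new, vmf->page);
        new->index = alias->index;
        set_page_private(new, alias->private);

        /*
         * Replace the shared KSM page in the page cache and in every
         * PTE of (mapping, index), then resume the regular mkwrite
         * path on the new page.
         */
        return 0;
}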

The fs-specific information is page->private, some of the flags
(page->flags), and page->index (the file offset). Every time a page
is deduplicated, a copy of that information is saved in an alias
struct which you can get to from the shared KSM page (page->mapping
is a pointer to a KSM root struct, which has a pointer to the list
of all aliases).
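
Something like the following layout would match that description (the
struct names are made up for illustration and are not the names used
in the patches):

struct ksm_root {                       /* what page->mapping points to */
        unsigned long tag;              /* tells it apart from an address_space */
        spinlock_t lock;                /* protects the alias list */
        struct list_head aliases;      /* one entry per (mapping, offset) */
};

struct ksm_alias {                      /* fs-specific state saved per copy */
        struct list_head node;         /* linked into ksm_root.aliases */
        struct address_space *mapping; /* the mapping this copy came from */
        pgoff_t index;                 /* saved page->index (file offset) */
        unsigned long private;         /* saved page->private */
        unsigned long flags;           /* saved page->flags of interest */
};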

>
> +++ b/mm/filemap.c
> @@ -1622,6 +1622,9 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
> lock_page(page);
> /* Has the page been truncated? */
> if (unlikely(page->mapping != mapping)) {
> + if (PageKsm(page)) {
> + ...
> + }
> unlock_page(page);
> put_page(page);
> goto repeat;
> @@ -1655,6 +1658,7 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
> * * %FGP_WRITE - The page will be written
> * * %FGP_NOFS - __GFP_FS will get cleared in gfp mask
> * * %FGP_NOWAIT - Don't get blocked by page lock
> + * * %FGP_KSM - Return KSM pages
> *
> * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
> * if the %GFP flags specified for %FGP_CREAT are atomic.
> @@ -1687,6 +1691,11 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
>
> /* Has the page been truncated? */
> if (unlikely(page->mapping != mapping)) {
> + if (PageKsm(page)) {
> + if (fgp_flags & FGP_KSM)
> + return page;
> + ...
> + }
> unlock_page(page);
> put_page(page);
> goto repeat;
>
> I don't know what you want to do when you find a KSM page, so I just left
> an ellipsis.
>

2020-10-08 15:58:40

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Thu, Oct 08, 2020 at 11:30:28AM -0400, Jerome Glisse wrote:
> On Wed, Oct 07, 2020 at 11:09:16PM +0100, Matthew Wilcox wrote:
> > So ... why don't you put a PageKsm page in the page cache? That way you
> > can share code with the current KSM implementation. You'd need
> > something like this:
>
> I do just that, but there is no need to change anything in the page cache.

That's clearly untrue. If you just put a PageKsm page in the page
cache today, here's what will happen on a truncate:

void truncate_inode_pages_range(struct address_space *mapping,
                                loff_t lstart, loff_t lend)
{
        ...
        struct page *page = find_lock_page(mapping, start - 1);

find_lock_page() does this:

        return pagecache_get_page(mapping, offset, FGP_LOCK, 0);

pagecache_get_page():

repeat:
        page = find_get_entry(mapping, index);
        ...
        if (fgp_flags & FGP_LOCK) {
                ...
                if (unlikely(compound_head(page)->mapping != mapping)) {
                        unlock_page(page);
                        put_page(page);
                        goto repeat;

so it's just going to spin. There are plenty of other codepaths that
would need to be checked. If you haven't found them, that shows you
don't understand the problem deeply enough yet.

I believe we should solve this problem, but I don't think you're going
about it the right way.

> So the flow is:
>
> Same as before:
>  1 - write fault (address, vma)
>  2 - regular write fault handler -> find page in page cache
>
> New to common page fault code:
>  3 - KSM check in write fault common code (same as KSM does today
>      in the anonymous page fault code path).
>  4 - break KSM (address, vma) -> (file offset, mapping)
>      4.a - use the mapping and file offset to look up the proper
>            fs-specific information that was saved when the
>            page was made KSM.
>      4.b - allocate a new page and initialize it with that
>            information (and the page content), update the page
>            cache and the mappings, i.e. all the PTEs that were
>            pointing to the KSM page for that mapping at that
>            offset now use the new page (like KSM for anonymous
>            pages today).

But by putting that logic in the page fault path, you've missed
the truncate path. And maybe other places. Putting the logic
down in pagecache_get_page() means you _don't_ need to find
all the places that call pagecache_get_page().

2020-10-08 22:20:16

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 00/14] Small step toward KSM for file back page.

On Thu, Oct 08, 2020 at 04:43:41PM +0100, Matthew Wilcox wrote:
> On Thu, Oct 08, 2020 at 11:30:28AM -0400, Jerome Glisse wrote:
> > On Wed, Oct 07, 2020 at 11:09:16PM +0100, Matthew Wilcox wrote:
> > > So ... why don't you put a PageKsm page in the page cache? That way you
> > > can share code with the current KSM implementation. You'd need
> > > something like this:
> >
> > I do just that, but there is no need to change anything in the page cache.
>
> That's clearly untrue. If you just put a PageKsm page in the page
> cache today, here's what will happen on a truncate:
>
> void truncate_inode_pages_range(struct address_space *mapping,
>                                 loff_t lstart, loff_t lend)
> {
>         ...
>         struct page *page = find_lock_page(mapping, start - 1);
>
> find_lock_page() does this:
>
>         return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
>
> pagecache_get_page():
>
> repeat:
>         page = find_get_entry(mapping, index);
>         ...
>         if (fgp_flags & FGP_LOCK) {
>                 ...
>                 if (unlikely(compound_head(page)->mapping != mapping)) {
>                         unlock_page(page);
>                         put_page(page);
>                         goto repeat;
>
> so it's just going to spin. There are plenty of other codepaths that
> would need to be checked. If you haven't found them, that shows you
> don't understand the problem deeply enough yet.

I also changed truncate, splice, and a few other special cases that
do not go through GUP/page fault/mkwrite (memory debugging too, but
that's a different beast).


> I believe we should solve this problem, but I don't think you're going
> about it the right way.

I have done much more than what I posted, but there is a bug that I
need to hammer down before posting everything, and I wanted to get
the discussion started. I guess I will finish tracking that one down
and post the whole thing.


> > So the flow is:
> >
> > Same as before:
> >  1 - write fault (address, vma)
> >  2 - regular write fault handler -> find page in page cache
> >
> > New to common page fault code:
> >  3 - KSM check in write fault common code (same as KSM does today
> >      in the anonymous page fault code path).
> >  4 - break KSM (address, vma) -> (file offset, mapping)
> >      4.a - use the mapping and file offset to look up the proper
> >            fs-specific information that was saved when the
> >            page was made KSM.
> >      4.b - allocate a new page and initialize it with that
> >            information (and the page content), update the page
> >            cache and the mappings, i.e. all the PTEs that were
> >            pointing to the KSM page for that mapping at that
> >            offset now use the new page (like KSM for anonymous
> >            pages today).
>
> But by putting that logic in the page fault path, you've missed
> the truncate path. And maybe other places. Putting the logic
> down in pagecache_get_page() means you _don't_ need to find
> all the places that call pagecache_get_page().

There are cases where the page cache is not even in the loop, i.e.
you already have the page and do not need to look it up (page fault,
some fs common code, anything that goes through GUP, memory reclaim,
...). Making all those places go through the page cache every time
would slow them down, and many are hot code paths that I do not
believe we want to slow down even when the feature is not in use.

Cheers,
Jérôme