2014-02-17 18:39:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

I've rewritten it from scratch since v1. Locking multiple pages at once was
not that great an idea: lock_page() is a mess. It seems we don't have good
lock ordering rules for lock_page() vs. lock_page(), and it triggers
deadlocks.

Now we have ->fault_nonblock() to ask the filesystem for a page, if it's
reachable without blocking. We request one page at a time. It's not terribly
efficient and I will probably re-think the interface once again to expose an
iterator or something...

Synthetic tests show pretty impressive numbers: time to complete is
reduced by 80% on sequential multi-threaded access to a file.

Unfortunately, the scripting use-case doesn't show any change. I've tried the git
test-suite, a kernel build and a kernel rebuild: no significant difference.

I will try reducing FAULT_AROUND_PAGES to see if it makes any difference.

One important use-case which is not covered is shmem/tmpfs: we need to
implement a separate ->fault_nonblock() for it.

Yet another feature to implement is MAP_POPULATE|MAP_NONBLOCK on top of
->fault_nonblock(). It should be useful for the dynamic linker, for example.
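
For example, a dynamic linker could then do something along these lines
(sketch only; the flag wiring is not part of these patches):

#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * Sketch: map a library segment and ask the kernel to populate whatever
 * is already in the page cache, without blocking on I/O, assuming
 * MAP_POPULATE|MAP_NONBLOCK gets implemented on top of ->fault_nonblock().
 */
static void *map_library_segment(int fd, size_t len, off_t off)
{
	return mmap(NULL, len, PROT_READ | PROT_EXEC,
		    MAP_PRIVATE | MAP_POPULATE | MAP_NONBLOCK, fd, off);
}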

Benchmark data is below.

Any comments?

=========================================================================

Git test-suite make -j60 test:

Base:
1,529,278,935,718 cycles ( +- 0.23% ) [100.00%]
799,283,349,594 instructions # 0.52 insns per cycle
# 3.14 stalled cycles per insn ( +- 0.12% ) [100.00%]
2,509,889,276,415 stalled-cycles-frontend # 164.12% frontend cycles idle ( +- 0.18% )
115,493,976 minor-faults ( +- 0.00% ) [100.00%]
1 major-faults

59.686645587 seconds time elapsed ( +- 0.30% )
Patched:
1,613,998,434,527 cycles ( +- 0.24% ) [100.00%]
865,776,308,589 instructions # 0.54 insns per cycle
# 3.07 stalled cycles per insn ( +- 0.18% ) [100.00%]
2,661,812,494,310 stalled-cycles-frontend # 164.92% frontend cycles idle ( +- 0.17% )
47,428,068 minor-faults ( +- 0.00% ) [100.00%]
1 major-faults ( +- 50.00% )

60.241766430 seconds time elapsed ( +- 0.85% )

Run make -j60 on clean allmodconfig kernel tree:

Base:
19,511,808,321,876 cycles [100.00%]
17,551,075,994,147 instructions # 0.90 insns per cycle
# 1.55 stalled cycles per insn [100.00%]
27,227,785,020,522 stalled-cycles-frontend # 139.55% frontend cycles idle
268,039,365 minor-faults [100.00%]
0 major-faults

132.830612471 seconds time elapsed
Patched:
19,421,069,663,037 cycles [100.00%]
17,587,643,648,161 instructions # 0.91 insns per cycle
# 1.54 stalled cycles per insn [100.00%]
27,063,152,011,598 stalled-cycles-frontend # 139.35% frontend cycles idle
193,550,437 minor-faults [100.00%]
0 major-faults

132.851823758 seconds time elapsed

Run make -j60 on already built allmodconfig kernel tree:

Base:
286,539,116,594 cycles ( +- 0.05% ) [100.00%]
386,447,067,412 instructions # 1.35 insns per cycle
# 0.96 stalled cycles per insn ( +- 0.00% ) [100.00%]
372,549,722,846 stalled-cycles-frontend # 130.02% frontend cycles idle ( +- 0.04% )
4,967,540 minor-faults ( +- 0.06% ) [100.00%]
0 major-faults

27.215434226 seconds time elapsed ( +- 0.18% )
Patched:
286,435,472,207 cycles ( +- 0.07% ) [100.00%]
388,217,937,147 instructions # 1.36 insns per cycle
# 0.96 stalled cycles per insn ( +- 0.01% ) [100.00%]
373,681,951,018 stalled-cycles-frontend # 130.46% frontend cycles idle ( +- 0.10% )
2,285,563 minor-faults ( +- 0.26% ) [100.00%]
0 major-faults

27.292854546 seconds time elapsed ( +- 0.29% )

Access 16g file sequential:

Threads Base, seconds Patched, seconds Diff, %
1 8.284232583 5.424205564 -34.52
8 18.338648302 6.330400940 -65.48
32 54.004043136 10.774241948 -80.04
60 88.928810991 16.571960471 -81.36
120 144.729413453 28.875873029 -80.04

Access 1g file random:

Threads Base, seconds Patched, seconds Diff, %
1 18.247095738 15.003616508 -17.77
8 15.226716773 14.239478275 -6.48
32 22.080582106 16.797777162 -23.92
60 25.759900660 22.768611526 -11.61
120 44.443953873 38.197095497 -14.05

Below is `perf stat' output for the tests above.

=========================================================================

# usemem -t <threads> -f testfile -o 16g #

## 1 thread ##
### base ###
28,284,111,426 cycles ( +- 1.53% ) [100.00%]
44,120,803,329 instructions # 1.56 insns per cycle
# 1.02 stalled cycles per insn ( +- 0.00% ) [100.00%]
45,148,666,958 stalled-cycles-frontend # 159.63% frontend cycles idle ( +- 1.89% )
4,194,709 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

8.284232583 seconds time elapsed ( +- 1.67% )
### patched ###
18,615,784,628 cycles ( +- 8.46% ) [100.00%]
41,590,187,000 instructions # 2.23 insns per cycle
# 0.66 stalled cycles per insn ( +- 0.02% ) [100.00%]
27,374,671,499 stalled-cycles-frontend # 147.05% frontend cycles idle ( +- 11.48% )
131,260 minor-faults ( +- 0.03% ) [100.00%]
0 major-faults

5.424205564 seconds time elapsed ( +- 8.54% )

## 8 threads ##
### base ###
447,114,910,645 cycles ( +- 3.22% ) [100.00%]
350,993,124,470 instructions # 0.79 insns per cycle
# 2.29 stalled cycles per insn ( +- 0.01% ) [100.00%]
803,040,468,490 stalled-cycles-frontend # 179.60% frontend cycles idle ( +- 3.59% )
33,556,008 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

18.338648302 seconds time elapsed ( +- 3.20% )
### patched ###
136,437,995,805 cycles ( +- 0.87% ) [100.00%]
328,096,731,149 instructions # 2.40 insns per cycle
# 0.60 stalled cycles per insn ( +- 0.02% ) [100.00%]
197,237,317,698 stalled-cycles-frontend # 144.56% frontend cycles idle ( +- 1.19% )
1,338,884 minor-faults ( +- 3.11% ) [100.00%]
0 major-faults

6.330400940 seconds time elapsed ( +- 1.50% )

## 32 threads ##
### base ###
4,844,333,104,028 cycles ( +- 2.17% ) [100.00%]
1,401,233,609,645 instructions # 0.29 insns per cycle
# 6.61 stalled cycles per insn ( +- 0.02% ) [100.00%]
9,265,429,354,789 stalled-cycles-frontend # 191.26% frontend cycles idle ( +- 1.81% )
134,228,040 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

54.004043136 seconds time elapsed ( +- 2.03% )
### patched ###
512,226,985,443 cycles ( +- 2.21% ) [100.00%]
1,307,350,894,141 instructions # 2.55 insns per cycle
# 0.52 stalled cycles per insn ( +- 0.02% ) [100.00%]
677,259,628,341 stalled-cycles-frontend # 132.22% frontend cycles idle ( +- 1.56% )
4,820,979 minor-faults ( +- 3.88% ) [100.00%]
0 major-faults

10.774241948 seconds time elapsed ( +- 2.40% )
## 60 threads ##
### base ###
14,878,749,780,310 cycles ( +- 1.24% ) [100.00%]
2,633,166,794,060 instructions # 0.18 insns per cycle
# 10.75 stalled cycles per insn ( +- 0.01% ) [100.00%]
28,301,405,498,867 stalled-cycles-frontend # 190.21% frontend cycles idle ( +- 0.88% )
251,687,643 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

88.928810991 seconds time elapsed ( +- 1.44% )
### patched ###
1,156,172,977,262 cycles ( +- 3.24% ) [100.00%]
2,450,915,465,588 instructions # 2.12 insns per cycle
# 0.63 stalled cycles per insn ( +- 0.01% ) [100.00%]
1,534,996,614,819 stalled-cycles-frontend # 132.77% frontend cycles idle ( +- 4.37% )
9,239,706 minor-faults ( +- 1.66% ) [100.00%]
0 major-faults

16.571960471 seconds time elapsed ( +- 2.64% )
## 120 threads ##
### base ###
47,350,885,336,341 cycles ( +- 2.28% ) [100.00%]
5,282,921,722,672 instructions # 0.11 insns per cycle
# 8.96 stalled cycles per insn ( +- 0.04% ) [100.00%]
47,332,933,971,860 stalled-cycles-frontend # 99.96% frontend cycles idle ( +- 2.40% )
503,357,749 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

144.729413453 seconds time elapsed ( +- 1.75% )
### patched ###
3,791,855,572,557 cycles ( +- 2.15% ) [100.00%]
4,901,681,561,338 instructions # 1.29 insns per cycle
# 0.59 stalled cycles per insn ( +- 0.01% ) [100.00%]
2,880,236,854,539 stalled-cycles-frontend # 75.96% frontend cycles idle ( +- 3.69% )
18,080,707 minor-faults ( +- 1.22% ) [100.00%]
0 major-faults

28.875873029 seconds time elapsed ( +- 1.19% )

# usemem -t <threads> -R -f testfile -o 1g #

## 1 thread ##
### base ###
62,357,670,556 cycles ( +- 0.58% ) [100.00%]
11,612,293,315 instructions # 0.19 insns per cycle
# 10.31 stalled cycles per insn ( +- 0.03% ) [100.00%]
119,734,125,613 stalled-cycles-frontend # 192.01% frontend cycles idle ( +- 0.58% )
262,723 minor-faults ( +- 0.01% ) [100.00%]
0 major-faults

18.247095738 seconds time elapsed ( +- 0.58% )
### patched ###
51,421,735,331 cycles ( +- 11.53% ) [100.00%]
11,446,081,379 instructions # 0.22 insns per cycle
# 8.55 stalled cycles per insn ( +- 0.19% ) [100.00%]
97,900,718,100 stalled-cycles-frontend # 190.39% frontend cycles idle ( +- 12.09% )
8,629 minor-faults ( +- 1.20% ) [100.00%]
0 major-faults ( +-100.00% )

15.003616508 seconds time elapsed ( +- 11.56% )

## 8 threads ##
### base ###
395,785,250,840 cycles ( +- 0.60% ) [100.00%]
93,850,278,507 instructions # 0.24 insns per cycle
# 8.02 stalled cycles per insn ( +- 0.03% ) [100.00%]
752,951,585,171 stalled-cycles-frontend # 190.24% frontend cycles idle ( +- 0.61% )
2,097,988 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

15.226716773 seconds time elapsed ( +- 0.41% )
### patched ###
Performance counter stats for 'system wide' (5 runs):

369,032,514,042 cycles ( +- 0.42% ) [100.00%]
92,357,503,760 instructions # 0.25 insns per cycle
# 7.59 stalled cycles per insn ( +- 0.01% ) [100.00%]
700,555,940,601 stalled-cycles-frontend # 189.84% frontend cycles idle ( +- 0.44% )
89,186 minor-faults ( +- 1.88% ) [100.00%]
0 major-faults

14.239478275 seconds time elapsed ( +- 0.48% )

## 32 threads ##
### base ###
1,791,516,702,606 cycles ( +- 1.57% ) [100.00%]
369,936,392,907 instructions # 0.21 insns per cycle
# 9.07 stalled cycles per insn ( +- 0.07% ) [100.00%]
3,357,017,162,189 stalled-cycles-frontend # 187.38% frontend cycles idle ( +- 1.77% )
8,390,395 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

22.080582106 seconds time elapsed ( +- 9.50% )
### patched ###
1,517,335,667,819 cycles ( +- 0.63% ) [100.00%]
363,401,817,392 instructions # 0.24 insns per cycle
# 7.45 stalled cycles per insn ( +- 0.01% ) [100.00%]
2,705,627,050,641 stalled-cycles-frontend # 178.31% frontend cycles idle ( +- 1.15% )
269,287 minor-faults ( +- 0.16% ) [100.00%]
0 major-faults

16.797777162 seconds time elapsed ( +- 7.69% )

## 60 threads ##
### base ###
3,983,862,013,747 cycles ( +- 1.42% ) [100.00%]
693,310,458,766 instructions # 0.17 insns per cycle
# 10.75 stalled cycles per insn ( +- 0.01% ) [100.00%]
7,454,138,580,163 stalled-cycles-frontend # 187.11% frontend cycles idle ( +- 1.27% )
15,735,826 minor-faults ( +- 0.02% ) [100.00%]
1 major-faults ( +-100.00% )

25.759900660 seconds time elapsed ( +- 0.64% )
### patched ###
2,871,988,372,901 cycles ( +- 1.17% ) [100.00%]
681,452,765,745 instructions # 0.24 insns per cycle
# 7.61 stalled cycles per insn ( +- 0.03% ) [100.00%]
5,186,017,674,808 stalled-cycles-frontend # 180.57% frontend cycles idle ( +- 1.24% )
498,198 minor-faults ( +- 0.05% ) [100.00%]
0 major-faults

22.768611526 seconds time elapsed ( +- 3.13% )

## 120 threads ##
### base ###
14,891,491,866,813 cycles ( +- 0.83% ) [100.00%]
1,391,880,325,881 instructions # 0.09 insns per cycle
# 10.51 stalled cycles per insn ( +- 0.02% ) [100.00%]
14,629,621,106,104 stalled-cycles-frontend # 98.24% frontend cycles idle ( +- 1.04% )
31,468,057 minor-faults ( +- 0.00% ) [100.00%]
0 major-faults

44.443953873 seconds time elapsed ( +- 0.91% )
### patched ###
12,450,645,304,818 cycles ( +- 2.94% ) [100.00%]
1,367,989,003,638 instructions # 0.11 insns per cycle
# 8.98 stalled cycles per insn ( +- 0.04% ) [100.00%]
12,282,274,614,247 stalled-cycles-frontend # 98.65% frontend cycles idle ( +- 3.34% )
996,361 minor-faults ( +- 0.02% ) [100.00%]
0 major-faults

38.197095497 seconds time elapsed ( +- 2.

Kirill A. Shutemov (2):
mm: introduce vm_ops->fault_nonblock()
mm: implement ->fault_nonblock() for page cache

Documentation/filesystems/Locking | 8 ++++++++
fs/9p/vfs_file.c | 2 ++
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/ext4/file.c | 1 +
fs/f2fs/file.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/file.c | 1 +
fs/nfs/file.c | 1 +
fs/nilfs2/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/xfs/xfs_file.c | 1 +
include/linux/mm.h | 3 +++
mm/filemap.c | 35 +++++++++++++++++++++++++++++++++++
mm/memory.c | 38 +++++++++++++++++++++++++++++++++++++-
15 files changed, 95 insertions(+), 1 deletion(-)

--
1.9.0.rc3


2014-02-17 18:39:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 2/2] mm: implement ->fault_nonblock() for page cache

filemap_fault_nonblock() is a generic implementation of ->fault_nonblock()
for filesystems that use the page cache.

It should be safe to use filemap_fault_nonblock() for ->fault_nonblock()
if the filesystem uses filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/9p/vfs_file.c | 2 ++
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/ext4/file.c | 1 +
fs/f2fs/file.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/file.c | 1 +
fs/nfs/file.c | 1 +
fs/nilfs2/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/xfs/xfs_file.c | 1 +
mm/filemap.c | 35 +++++++++++++++++++++++++++++++++++
12 files changed, 47 insertions(+)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..a7f7e41bec37 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)

static const struct vm_operations_struct v9fs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
.close = v9fs_mmap_vm_close,
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..13523a63e1f3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:

static const struct vm_operations_struct btrfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = btrfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..71aff75e067c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static struct vm_operations_struct cifs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = cifs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..182ae5543a1d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,

static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = ext4_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..7c48fd2eb99c 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:

static const struct vm_operations_struct f2fs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = f2fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..e95e52ec7bc2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
static const struct vm_operations_struct fuse_file_vm_ops = {
.close = fuse_vma_close,
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = fuse_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..7c4b2f096ac8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:

static const struct vm_operations_struct gfs2_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = gfs2_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..8fbe80168d1f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:

static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = nfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..adc4aa07d7d8 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static const struct vm_operations_struct nilfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = nilfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..f27c4c401a3f 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:

static const struct vm_operations_struct ubifs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = ubifs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..bc619150c960 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {

static const struct vm_operations_struct xfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = xfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..7b7c9c600544 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1726,6 +1726,40 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+void filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ pgoff_t size;
+ struct page *page;
+
+ page = find_get_page(mapping, vmf->pgoff);
+ if (!page)
+ return;
+ if (PageReadahead(page) || PageHWPoison(page))
+ goto put;
+ if (!trylock_page(page))
+ goto put;
+ /* Truncated? */
+ if (unlikely(page->mapping != mapping))
+ goto unlock;
+ if (unlikely(!PageUptodate(page)))
+ goto unlock;
+ size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+ >> PAGE_CACHE_SHIFT;
+ if (unlikely(page->index >= size))
+ goto unlock;
+ if (file->f_ra.mmap_miss > 0)
+ file->f_ra.mmap_miss--;
+ vmf->page = page;
+ return;
+unlock:
+ unlock_page(page);
+put:
+ put_page(page);
+}
+EXPORT_SYMBOL(filemap_fault_nonblock);
+
int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
@@ -1755,6 +1789,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
--
1.9.0.rc3

2014-02-17 18:39:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 1/2] mm: introduce vm_ops->fault_nonblock()

The patch introduces a new vm_ops callback, ->fault_nonblock(), and uses it
for mapping easily accessible pages around the fault address.

On a read page fault, if the filesystem provides ->fault_nonblock(), we try to
map up to FAULT_AROUND_PAGES (32 at the moment) pages around the page fault
address in the hope of reducing the number of minor page faults.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/filesystems/Locking | 8 ++++++++
include/linux/mm.h | 3 +++
mm/memory.c | 38 +++++++++++++++++++++++++++++++++++++-
3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083d7c0e..11506b97e3b7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ locking rules:
open: yes
close: yes
fault: yes can return with page locked
+fault_nonblock yes must return with page locked
page_mkwrite: yes can return with page locked
access: yes

@@ -536,6 +537,13 @@ the page, then ensure it is not already truncated (the page lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

+ ->fault_nonblock() is called when VM tries to map easy accessible
+pages. Filesystem must find and return the page associated with the passed
+in "pgoff" in the vm_fault structure. If it's not possible to return a
+page without blocking, NULL should be returned. The page must be locked
+and filesystem must ensure page is not truncated. The VM will unlock the
+page. ->fault_nonblock() is called with page table locked.
+
->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
no truncate/invalidate races, and then return with the page locked. If
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..b9a688dbd62a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,6 +221,8 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*fault_nonblock)(struct vm_area_struct *vma,
+ struct vm_fault *vmf);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -1810,6 +1812,7 @@ extern void truncate_inode_pages_range(struct address_space *,

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern void filemap_fault_nonblock(struct vm_area_struct *, struct vm_fault *);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..f4990fb66770 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3342,6 +3342,39 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

+#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+ struct vm_fault vmf;
+ unsigned long start_addr = address & FAULT_AROUND_MASK;
+ int off = (address - start_addr) >> PAGE_SHIFT;
+ int i;
+
+ for (i = 0; i < FAULT_AROUND_PAGES; i++) {
+ unsigned long addr = start_addr + i * PAGE_SIZE;
+ pte_t *_pte = pte - off +i;
+
+ if (!pte_none(*_pte))
+ continue;
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ continue;
+
+ vmf.virtual_address = (void __user *) addr;
+ vmf.pgoff = pgoff - off + i;
+ vmf.flags = flags;
+ vmf.page = NULL;
+ vma->vm_ops->fault_nonblock(vma, &vmf);
+ if (!vmf.page)
+ continue;
+ do_set_pte(vma, addr, vmf.page, _pte, false, false);
+ unlock_page(vmf.page);
+ }
+}
+
static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3363,8 +3396,11 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}
do_set_pte(vma, address, fault_page, pte, false, false);
- pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
+
+ if (vma->vm_ops->fault_nonblock)
+ do_fault_around(vma, address, pte, pgoff, flags);
+ pte_unmap_unlock(pte, ptl);
return ret;
}

--
1.9.0.rc3

2014-02-17 19:02:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> reachable without blocking. We request one page a time. It's not terribly
> efficient and I will probably re-think the interface once again to expose
> iterator or something...

Hmm. Yeah, clearly this isn't working, since the real workloads all
end up looking like

> 115,493,976 minor-faults ( +- 0.00% ) [100.00%]
> 59.686645587 seconds time elapsed ( +- 0.30% )
becomes
> 47,428,068 minor-faults ( +- 0.00% ) [100.00%]
> 60.241766430 seconds time elapsed ( +- 0.85% )

and

> 268,039,365 minor-faults [100.00%]
> 132.830612471 seconds time elapsed
becomes
> 193,550,437 minor-faults [100.00%]
> 132.851823758 seconds time elapsed

and

> 4,967,540 minor-faults ( +- 0.06% ) [100.00%]
> 27.215434226 seconds time elapsed ( +- 0.18% )
becomes
> 2,285,563 minor-faults ( +- 0.26% ) [100.00%]
> 27.292854546 seconds time elapsed ( +- 0.29% )

ie it shows a clear reduction in faults, but the added costs clearly
eat up any wins and it all becomes (just _slightly_) slower.

Sad.

I do wonder if we really need to lock the pages we fault in. We lock
them in order to test for being up-to-date and still mapped. The
up-to-date check we don't really need to worry about: that we can test
without locking by just reading "page->flags" atomically and verifying
that it's uptodate and not locked.

The other reason to lock the page is:

- for anonymous pages we need the lock for rmap, so the VM generally
always locks the page. But that's not an issue for file-backed pages:
the "rmap" for a filebacked page is just the page mapcount and the
cgroup statistics, and those don't need the page lock.

- the whole truncation/unmapping thing

So the complex part is racing with truncate/unmapping the page. But
since we hold the page table lock, I *think* what we should be able to
do is:

- increment the page _mapcount (iow, do "page_add_file_rmap()"
early). This guarantees that any *subsequent* unmap activity on this
page will walk the file mapping lists, and become serialized by the
page table lock we hold.

- mb_after_atomic_inc() (this is generally free)

- test that the page is still unlocked and uptodate, and the page
mapping still points to our page.

- if that is true, we're all good, we can use the page, otherwise we
decrement the mapcount (page_remove_rmap()) and skip the page.

Hmm? Doing something like this means that we would never lock the
pages we prefault, and you can go back to your gang lookup rather than
that "one page at a time". And the race case is basically never going
to trigger.
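
Roughly, as an untested sketch (the helper name is made up; it assumes we
hold the page table lock and a reference on the page from the gang lookup):

/* Untested sketch of the sequence above, not a patch. */
static bool try_map_prefault_page(struct vm_area_struct *vma,
				  unsigned long addr, pte_t *pte,
				  struct address_space *mapping,
				  struct page *page)
{
	/* Make any later unmap walk the file mapping and serialize
	 * on the page table lock we already hold. */
	page_add_file_rmap(page);
	smp_mb__after_atomic_inc();

	/* Re-check state without taking the page lock. */
	if (PageLocked(page) || !PageUptodate(page) ||
	    page->mapping != mapping) {
		page_remove_rmap(page);
		return false;		/* skip this page */
	}

	/* Safe: install the pte (simplified; do_set_pte() also does
	 * the rmap accounting we already did above). */
	set_pte_at(vma->vm_mm, addr, pte,
		   mk_pte(page, vma->vm_page_prot));
	update_mmu_cache(vma, addr, pte);
	return true;
}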

Comments?

Linus

2014-02-17 19:50:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Mon, Feb 17, 2014 at 11:01:58AM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
>
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
>
> > 115,493,976 minor-faults ( +- 0.00% ) [100.00%]
> > 59.686645587 seconds time elapsed ( +- 0.30% )
> becomes
> > 47,428,068 minor-faults ( +- 0.00% ) [100.00%]
> > 60.241766430 seconds time elapsed ( +- 0.85% )
>
> and
>
> > 268,039,365 minor-faults [100.00%]
> > 132.830612471 seconds time elapsed
> becomes
> > 193,550,437 minor-faults [100.00%]
> > 132.851823758 seconds time elapsed
>
> and
>
> > 4,967,540 minor-faults ( +- 0.06% ) [100.00%]
> > 27.215434226 seconds time elapsed ( +- 0.18% )
> becomes
> > 2,285,563 minor-faults ( +- 0.26% ) [100.00%]
> > 27.292854546 seconds time elapsed ( +- 0.29% )
>
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.
>
> Sad.
>
> I do wonder if we really need to lock the pages we fault in. We lock
> them in order to test for being up-to-date and still mapped. The
> up-to-date check we don't really need to worry about: that we can test
> without locking by just reading "page->flags" atomically and verifying
> that it's uptodate and not locked.
>
> The other reason to lock the page is:
>
> - for anonymous pages we need the lock for rmap, so the VM generally
> always locks the page. But that's not an issue for file-backed pages:
> the "rmap" for a filebacked page is just the page mapcount and the
> cgroup statistics, and those don't need the page lock.
>
> - the whole truncation/unmapping thing
>
> So the complex part is racing with truncate/unmapping the page. But
> since we hold the page table lock, I *think* what we should be able to
> do is:
>
> - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
>
> - mb_after_atomic_inc() (this is generally free)
>
> - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
>
> - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
>
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
>
> Comments?

Sounds reasonable to me. I'll take a closer look tomorrow.

But it could be safer to keep the locking in place and reduce lookup cost by
exposing something like ->fault_iter_init() and ->fault_iter_next(). It
will still return one page at a time, but it will keep radix-tree context
around for cheaper next-page lookup.
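
Something like this, purely to illustrate the shape of such an interface
(the names and signatures below are hypothetical, not part of the patches):

/* Hypothetical interface sketch only. */
struct fault_iter {
	struct radix_tree_iter riter;	/* cached radix-tree position */
	void **slot;			/* current slot, for cheap next-page lookup */
	pgoff_t max_pgoff;		/* do not iterate past this index */
};

/* Start at the faulting index; keeps rcu/radix-tree context around. */
int fault_iter_init(struct vm_area_struct *vma, struct fault_iter *fi,
		    pgoff_t first_pgoff, pgoff_t max_pgoff);

/* Return the next locked, uptodate page reachable without blocking,
 * or NULL when there is nothing cheap left. */
struct page *fault_iter_next(struct vm_area_struct *vma,
			     struct fault_iter *fi);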

--
Kirill A. Shutemov

2014-02-17 20:24:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Mon, Feb 17, 2014 at 11:49 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> But it could be safer to keep locking in place and reduce lookup cost by
> exposing something like ->fault_iter_init() and ->fault_iter_next(). It
> will still return one page a time, but it will keep radix-tree context
> around for cheaper next-page lookup.

I really would prefer for the loop to be much smaller than that, and
not contain indirect calls to helpers that pretty much guarantee that
you can't generate nice code.

Plus I'd rather not have the mm layer know too much about the radix
tree iterations anyway, and try to use the existing page array
functions we already have (ie "find_get_pages()").

So I'd really prefer if we can do this with tight loops over explicit
pages, rather than some loop over an iterator.

Linus

2014-02-18 13:28:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On 02/17/2014 02:01 PM, Linus Torvalds wrote:

> - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
>
> - mb_after_atomic_inc() (this is generally free)
>
> - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
>
> - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
>
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
>
> Comments?

What would the direct io code do when it runs into a page with
elevated mapcount, but for which a mapping cannot be found yet?

Looking at the code, it looks like the above scheme could cause
some trouble with invalidate_inode_pages2_range(), which has
the following sequence:

	if (page_mapped(page)) {
		... unmap page
	}
	BUG_ON(page_mapped(page));

In other words, it looks like incrementing _mapcount first could
lead to an oops in the truncate and direct IO code.

The page lock is used to prevent such races.

*sigh*

--
All rights reversed

2014-02-18 14:16:35

by Matthew Wilcox

[permalink] [raw]
Subject: RE: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

We don't really need to lock all the pages being returned to protect against truncate. We only need to lock the one at the highest index, and check i_size while that lock is held since truncate_inode_pages_range() will block on any page that is locked.

We're still vulnerable to holepunches, but there's no locking currently between holepunches and truncate, so we're no worse off now.
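
Roughly (illustrative sketch only, not a patch; map_prefault_batch() is a
made-up helper):

static int map_batch_if_safe(struct vm_area_struct *vma,
			     struct address_space *mapping,
			     struct page **pages, int nr)
{
	struct page *last = pages[nr - 1];	/* highest index in the batch */
	pgoff_t size;
	int mapped = 0;

	if (!trylock_page(last))
		return 0;
	size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
			>> PAGE_CACHE_SHIFT;
	if (last->mapping == mapping && last->index < size) {
		/*
		 * truncate_inode_pages_range() will block on 'last', so
		 * every page in the batch at or below its index stays
		 * valid while we hold this lock.
		 */
		mapped = map_prefault_batch(vma, pages, nr);
	}
	unlock_page(last);
	return mapped;
}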
________________________________________
From: Rik van Riel [[email protected]]
Sent: February 18, 2014 5:28 AM
To: Linus Torvalds; Kirill A. Shutemov
Cc: Andrew Morton; Mel Gorman; Andi Kleen; Wilcox, Matthew R; Dave Hansen; Alexander Viro; Dave Chinner; linux-mm; linux-fsdevel; Linux Kernel Mailing List
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On 02/17/2014 02:01 PM, Linus Torvalds wrote:

> - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
>
> - mb_after_atomic_inc() (this is generally free)
>
> - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
>
> - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
>
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
>
> Comments?

What would the direct io code do when it runs into a page with
elevated mapcount, but for which a mapping cannot be found yet?

Looking at the code, it looks like the above scheme could cause
some trouble with invalidate_inode_pages2_range(), which has
the following sequence:

if (page_mapped(page)) {
... unmap page
}
BUG_ON(page_mapped(page));

In other words, it looks like incrementing _mapcount first could
lead to an oops in the truncate and direct IO code.

The page lock is used to prevent such races.

*sigh*

--
All rights reversed

2014-02-18 14:23:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 08:28:02AM -0500, Rik van Riel wrote:
> On 02/17/2014 02:01 PM, Linus Torvalds wrote:
>
> > - increment the page _mapcount (iow, do "page_add_file_rmap()"
> > early). This guarantees that any *subsequent* unmap activity on this
> > page will walk the file mapping lists, and become serialized by the
> > page table lock we hold.
> >
> > - mb_after_atomic_inc() (this is generally free)
> >
> > - test that the page is still unlocked and uptodate, and the page
> > mapping still points to our page.
> >
> > - if that is true, we're all good, we can use the page, otherwise we
> > decrement the mapcount (page_remove_rmap()) and skip the page.
> >
> > Hmm? Doing something like this means that we would never lock the
> > pages we prefault, and you can go back to your gang lookup rather than
> > that "one page at a time". And the race case is basically never going
> > to trigger.
> >
> > Comments?
>
> What would the direct io code do when it runs into a page with
> elevated mapcount, but for which a mapping cannot be found yet?
>
> Looking at the code, it looks like the above scheme could cause
> some trouble with invalidate_inode_pages2_range(), which has
> the following sequence:
>
> if (page_mapped(page)) {
> ... unmap page
> }
> BUG_ON(page_mapped(page));
>
> In other words, it looks like incrementing _mapcount first could
> lead to an oops in the truncate and direct IO code.
>
> The page lock is used to prevent such races.
>
> *sigh*

What if we retry the unmap once again before triggering the BUG()?
The second unmap will be serialized by the page table lock, right?
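
Something like this in invalidate_inode_pages2_range() (illustrative
sketch only, not a patch):

	if (page_mapped(page))
		unmap_mapping_range(mapping,
			(loff_t)page->index << PAGE_CACHE_SHIFT,
			PAGE_CACHE_SIZE, 0);
	if (page_mapped(page)) {
		/*
		 * Raced with someone mapping the page after the first
		 * unmap; the second pass serializes on the page table
		 * lock held by the prefaulting side.
		 */
		unmap_mapping_range(mapping,
			(loff_t)page->index << PAGE_CACHE_SHIFT,
			PAGE_CACHE_SIZE, 0);
	}
	BUG_ON(page_mapped(page));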

--
Kirill A. Shutemov

2014-02-18 17:51:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 5:28 AM, Rik van Riel <[email protected]> wrote:
>
> What would the direct io code do when it runs into a page with
> elevated mapcount, but for which a mapping cannot be found yet?

Actually, you cannot get into that situation, since the definition of
"found" is that you have to follow the page tables (remember: this is
a *file* mapping, not an anonymous one, so you don't actually have an
rmap list, you have the inode mapping list).

And since we hold the page table lock, you cannot actually get to the
point where you see that it's not mapped yet. See?

That said:

> Looking at the code, it looks like the above scheme could cause
> some trouble with invalidate_inode_pages2_range(), which has
> the following sequence:
>
> if (page_mapped(page)) {
> ... unmap page
> }
> BUG_ON(page_mapped(page));

The BUG_ON() itself could trigger, because it could race with us
optimistically trying to increment the page map count. And yes, we
might have to remove that.

But the actual "unmap page" logic should not be able to ever see any
difference.

See my argument?

Linus

2014-02-18 17:59:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
>
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
>
> > 115,493,976 minor-faults ( +- 0.00% ) [100.00%]
> > 59.686645587 seconds time elapsed ( +- 0.30% )
> becomes
> > 47,428,068 minor-faults ( +- 0.00% ) [100.00%]
> > 60.241766430 seconds time elapsed ( +- 0.85% )
>
> and
>
> > 268,039,365 minor-faults [100.00%]
> > 132.830612471 seconds time elapsed
> becomes
> > 193,550,437 minor-faults [100.00%]
> > 132.851823758 seconds time elapsed
>
> and
>
> > 4,967,540 minor-faults ( +- 0.06% ) [100.00%]
> > 27.215434226 seconds time elapsed ( +- 0.18% )
> becomes
> > 2,285,563 minor-faults ( +- 0.26% ) [100.00%]
> > 27.292854546 seconds time elapsed ( +- 0.29% )
>
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.

I did an experiment with setting up the pte directly in filemap_fault_nonblock() to
see how much we can get from it. And it helps:

git: -1.21s
clean build: -2.22s
rebuild: -0.63s

Is it a layering violation to set up the pte directly in ->fault_nonblock()?

perf stat output and patch are below.

Git test-suite make -j60 test:
1,591,184,058,944 cycles ( +- 0.05% ) [100.00%]
811,200,260,823 instructions # 0.51 insns per cycle
# 3.24 stalled cycles per insn ( +- 0.19% ) [100.00%]
2,631,511,271,429 stalled-cycles-frontend # 165.38% frontend cycles idle ( +- 0.08% )
47,305,697 minor-faults ( +- 0.00% ) [100.00%]
1 major-faults

59.028360009 seconds time elapsed ( +- 0.58% )

Run make -j60 on clean allmodconfig kernel tree:
19,163,958,689,310 cycles [100.00%]
17,446,888,861,177 instructions # 0.91 insns per cycle
# 1.53 stalled cycles per insn [100.00%]
26,777,884,033,091 stalled-cycles-frontend # 139.73% frontend cycles idle
193,118,569 minor-faults [100.00%]
0 major-faults

130.631767214 seconds time elapsed

Run make -j60 on already built allmodconfig kernel tree:
282,398,537,719 cycles ( +- 0.03% ) [100.00%]
385,807,937,931 instructions # 1.37 insns per cycle
# 0.95 stalled cycles per insn ( +- 0.01% ) [100.00%]
365,940,576,310 stalled-cycles-frontend # 129.58% frontend cycles idle ( +- 0.07% )
2,254,887 minor-faults ( +- 0.02% ) [100.00%]
0 major-faults

26.660708754 seconds time elapsed ( +- 0.29% )

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b9a688dbd62a..e671dd5abe27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,8 +221,8 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
- void (*fault_nonblock)(struct vm_area_struct *vma,
- struct vm_fault *vmf);
+ int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -1812,7 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_fault_nonblock(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b7c9c600544..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
+#include <linux/rmap.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -134,6 +135,9 @@ void __delete_from_page_cache(struct page *page)
__dec_zone_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
__dec_zone_page_state(page, NR_SHMEM);
+ if (page_mapped(page)) {
+ dump_page(page, "");
+ }
BUG_ON(page_mapped(page));

/*
@@ -1726,37 +1730,90 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf)
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+ struct page *page, pte_t *pte, bool write, bool anon);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte)
{
+ struct radix_tree_iter iter;
+ void **slot;
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t size;
struct page *page;
+ unsigned long address = (unsigned long) vmf->virtual_address;
+ unsigned long addr;
+ pte_t *_pte;
+ int ret = 0;

- page = find_get_page(mapping, vmf->pgoff);
- if (!page)
- return;
- if (PageReadahead(page) || PageHWPoison(page))
- goto put;
- if (!trylock_page(page))
- goto put;
- /* Truncated? */
- if (unlikely(page->mapping != mapping))
- goto unlock;
- if (unlikely(!PageUptodate(page)))
- goto unlock;
- size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
- >> PAGE_CACHE_SHIFT;
- if (unlikely(page->index >= size))
- goto unlock;
- if (file->f_ra.mmap_miss > 0)
- file->f_ra.mmap_miss--;
- vmf->page = page;
- return;
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+repeat:
+ page = radix_tree_deref_slot(slot);
+
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page)) {
+ /*
+ * Transient condition which can only trigger
+ * when entry at index 0 moves out of or back
+ * to root: none yet gotten, safe to restart.
+ */
+ WARN_ON(iter.index);
+ goto restart;
+ }
+ /*
+ * Otherwise, shmem/tmpfs must be storing a swap entry
+ * here as an exceptional entry: so skip over it -
+ * we only reach this from invalidate_mapping_pages().
+ */
+ continue;
+ }
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ if (page->index > max_pgoff) {
+ page_cache_release(page);
+ break;
+ }
+
+ if (PageReadahead(page) || PageHWPoison(page) ||
+ !PageUptodate(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+ size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+ >> PAGE_CACHE_SHIFT;
+ if (page->index >= size)
+ goto unlock;
+ if (file->f_ra.mmap_miss > 0)
+ file->f_ra.mmap_miss--;
+ addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+ _pte = pte + page->index - vmf->pgoff;
+ if (!pte_none(*_pte))
+ goto unlock;
+ do_set_pte(vma, addr, page, _pte, false, false);
+
+ unlock_page(page);
+ if (++ret == nr_pages || page->index == max_pgoff)
+ break;
+ continue;
unlock:
- unlock_page(page);
-put:
- put_page(page);
+ unlock_page(page);
+skip:
+ page_cache_release(page);
+ }
+ rcu_read_unlock();
+ return ret;
}
EXPORT_SYMBOL(filemap_fault_nonblock);

diff --git a/mm/memory.c b/mm/memory.c
index f4990fb66770..1af0e3a3f0f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
return ret;
}

-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
struct page *page, pte_t *pte, bool write, bool anon)
{
pte_t entry;
@@ -3342,37 +3343,47 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

-#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_ORDER 3
#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)

static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pgoff_t pgoff, unsigned int flags)
{
+ unsigned long start_addr;
+ pgoff_t max_pgoff;
struct vm_fault vmf;
- unsigned long start_addr = address & FAULT_AROUND_MASK;
- int off = (address - start_addr) >> PAGE_SHIFT;
- int i;
-
- for (i = 0; i < FAULT_AROUND_PAGES; i++) {
- unsigned long addr = start_addr + i * PAGE_SIZE;
- pte_t *_pte = pte - off +i;
+ int off, ret;

- if (!pte_none(*_pte))
- continue;
- if (addr < vma->vm_start || addr >= vma->vm_end)
- continue;
+ /* Do not cross vma or page table border */
+ max_pgoff = min(pgoff - pte_index(address) + PTRS_PER_PTE - 1,
+ vma_pages(vma) + vma->vm_pgoff - 1);

- vmf.virtual_address = (void __user *) addr;
- vmf.pgoff = pgoff - off + i;
- vmf.flags = flags;
- vmf.page = NULL;
- vma->vm_ops->fault_nonblock(vma, &vmf);
- if (!vmf.page)
- continue;
- do_set_pte(vma, addr, vmf.page, _pte, false, false);
- unlock_page(vmf.page);
+ start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+ if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+ BUG();
+ off = pte_index(start_addr) - pte_index(address);
+ pte += off;
+ pgoff += off;
+
+ /* Check if it makes any sense to call ->fault_nonblock */
+ while (!pte_none(*pte)) {
+ pte++;
+ pgoff++;
+ start_addr += PAGE_SIZE;
+ /* Do not cross vma or page table border */
+ if (!pte_index(start_addr) || start_addr >= vma->vm_end)
+ return;
+ if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+ BUG();
}
+
+
+ vmf.virtual_address = (void __user *) start_addr;
+ vmf.pgoff = pgoff;
+ vmf.flags = flags;
+ ret = vma->vm_ops->fault_nonblock(vma, &vmf,
+ max_pgoff, FAULT_AROUND_PAGES, pte);
}

static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
--
Kirill A. Shutemov

2014-02-18 18:02:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 6:15 AM, Wilcox, Matthew R
<[email protected]> wrote:
> We don't really need to lock all the pages being returned to protect against truncate. We only need to lock the one at the highest index, and check i_size while that lock is held since truncate_inode_pages_range() will block on any page that is locked.
>
> We're still vulnerable to holepunches, but there's no locking currently between holepunches and truncate, so we're no worse off now.

It's not "holepunches and truncate", it's "holepunches and page
mapping", and I do think we currently serialize the two - the whole
"check page->mapping still being non-NULL" before mapping it while
having the page locked does that.

Besides, that per-page locking should serialize against truncate too.
No, there is no "global" serialization, but there *is* exactly that
page-level serialization where both truncation and hole punching end
up making sure that the page no longer exists in the page cache and
isn't mapped.

I'm just claiming that *because* of the way rmap works for file
mappings (walking the i_mapped list and page tables), we should
actually be ok. The anonymous rmap list is protected by the page
lock, but the file-backed rmap is protected by the pte lock (well, and
the "i_mmap_mutex" that in turn protects the i_mmap list etc).

Linus

2014-02-18 18:08:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

Kirill A. Shutemov wrote:
> Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > > reachable without blocking. We request one page a time. It's not terribly
> > > efficient and I will probably re-think the interface once again to expose
> > > iterator or something...
> >
> > Hmm. Yeah, clearly this isn't working, since the real workloads all
> > end up looking like
> >
> > > 115,493,976 minor-faults ( +- 0.00% ) [100.00%]
> > > 59.686645587 seconds time elapsed ( +- 0.30% )
> > becomes
> > > 47,428,068 minor-faults ( +- 0.00% ) [100.00%]
> > > 60.241766430 seconds time elapsed ( +- 0.85% )
> >
> > and
> >
> > > 268,039,365 minor-faults [100.00%]
> > > 132.830612471 seconds time elapsed
> > becomes
> > > 193,550,437 minor-faults [100.00%]
> > > 132.851823758 seconds time elapsed
> >
> > and
> >
> > > 4,967,540 minor-faults ( +- 0.06% ) [100.00%]
> > > 27.215434226 seconds time elapsed ( +- 0.18% )
> > becomes
> > > 2,285,563 minor-faults ( +- 0.26% ) [100.00%]
> > > 27.292854546 seconds time elapsed ( +- 0.29% )
> >
> > ie it shows a clear reduction in faults, but the added costs clearly
> > eat up any wins and it all becomes (just _slightly_) slower.
>
> I did an experement with setup pte directly in filemap_fault_nonblock() to
> see how much we can get from it. And it helps:
>
> git: -1.21s
> clean build: -2.22s
> rebuild: -0.63s
>
> Is it a layering violation to setup pte directly in ->fault_nonblock()?
>
> perf stat and patch below.
>
> Git test-suite make -j60 test:
> 1,591,184,058,944 cycles ( +- 0.05% ) [100.00%]
> 811,200,260,823 instructions # 0.51 insns per cycle
> # 3.24 stalled cycles per insn ( +- 0.19% ) [100.00%]
> 2,631,511,271,429 stalled-cycles-frontend # 165.38% frontend cycles idle ( +- 0.08% )
> 47,305,697 minor-faults ( +- 0.00% ) [100.00%]
> 1 major-faults
>
> 59.028360009 seconds time elapsed ( +- 0.58% )
>
> Run make -j60 on clean allmodconfig kernel tree:
> 19,163,958,689,310 cycles [100.00%]
> 17,446,888,861,177 instructions # 0.91 insns per cycle
> # 1.53 stalled cycles per insn [100.00%]
> 26,777,884,033,091 stalled-cycles-frontend # 139.73% frontend cycles idle
> 193,118,569 minor-faults [100.00%]
> 0 major-faults
>
> 130.631767214 seconds time elapsed
>
> Run make -j60 on already built allmodconfig kernel tree:
> 282,398,537,719 cycles ( +- 0.03% ) [100.00%]
> 385,807,937,931 instructions # 1.37 insns per cycle
> # 0.95 stalled cycles per insn ( +- 0.01% ) [100.00%]
> 365,940,576,310 stalled-cycles-frontend # 129.58% frontend cycles idle ( +- 0.07% )
> 2,254,887 minor-faults ( +- 0.02% ) [100.00%]
> 0 major-faults
>
> 26.660708754 seconds time elapsed ( +- 0.29% )

The patch is wrong. The correct one is below.

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..a7f7e41bec37 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)

static const struct vm_operations_struct v9fs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
.close = v9fs_mmap_vm_close,
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..13523a63e1f3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:

static const struct vm_operations_struct btrfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = btrfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..71aff75e067c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static struct vm_operations_struct cifs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = cifs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..182ae5543a1d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,

static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = ext4_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..7c48fd2eb99c 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:

static const struct vm_operations_struct f2fs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = f2fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..e95e52ec7bc2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
static const struct vm_operations_struct fuse_file_vm_ops = {
.close = fuse_vma_close,
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = fuse_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..7c4b2f096ac8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:

static const struct vm_operations_struct gfs2_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = gfs2_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..8fbe80168d1f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:

static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = nfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..adc4aa07d7d8 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static const struct vm_operations_struct nilfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = nilfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..f27c4c401a3f 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:

static const struct vm_operations_struct ubifs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = ubifs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..bc619150c960 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {

static const struct vm_operations_struct xfs_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = xfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..e671dd5abe27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,6 +221,8 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -1810,6 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
+#include <linux/rmap.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -1726,6 +1730,93 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+ struct page *page, pte_t *pte, bool write, bool anon);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+ pgoff_t max_pgoff, int nr_pages, pte_t *pte)
+{
+ struct radix_tree_iter iter;
+ void **slot;
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ pgoff_t size;
+ struct page *page;
+ unsigned long address = (unsigned long) vmf->virtual_address;
+ unsigned long addr;
+ pte_t *_pte;
+ int ret = 0;
+
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+repeat:
+ page = radix_tree_deref_slot(slot);
+
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page)) {
+ /*
+ * Transient condition which can only trigger
+ * when entry at index 0 moves out of or back
+ * to root: none yet gotten, safe to restart.
+ */
+ WARN_ON(iter.index);
+ goto restart;
+ }
+ /*
+ * Otherwise, shmem/tmpfs must be storing a swap entry
+ * here as an exceptional entry: so skip over it -
+ * we only reach this from invalidate_mapping_pages().
+ */
+ continue;
+ }
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ if (page->index > max_pgoff) {
+ page_cache_release(page);
+ break;
+ }
+
+ if (PageReadahead(page) || PageHWPoison(page) ||
+ !PageUptodate(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+ size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+ >> PAGE_CACHE_SHIFT;
+ if (page->index >= size)
+ goto unlock;
+ if (file->f_ra.mmap_miss > 0)
+ file->f_ra.mmap_miss--;
+ addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+ _pte = pte + page->index - vmf->pgoff;
+ if (!pte_none(*_pte))
+ goto unlock;
+ do_set_pte(vma, addr, page, _pte, false, false);
+
+ unlock_page(page);
+ if (++ret == nr_pages || page->index == max_pgoff)
+ break;
+ continue;
+unlock:
+ unlock_page(page);
+skip:
+ page_cache_release(page);
+ }
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL(filemap_fault_nonblock);
+
int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
@@ -1755,6 +1846,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .fault_nonblock = filemap_fault_nonblock,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..e79c8d6f6f47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
return ret;
}

-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
struct page *page, pte_t *pte, bool write, bool anon)
{
pte_t entry;
@@ -3342,6 +3343,49 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

+#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+ unsigned long start_addr;
+ pgoff_t max_pgoff;
+ struct vm_fault vmf;
+ int off, ret;
+
+ /* Do not cross vma or page table border */
+ max_pgoff = min(pgoff - pte_index(address) + PTRS_PER_PTE - 1,
+ vma_pages(vma) + vma->vm_pgoff - 1);
+
+ start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+ if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+ BUG();
+ off = pte_index(start_addr) - pte_index(address);
+ pte += off;
+ pgoff += off;
+
+ /* Check if it makes any sense to call ->fault_nonblock */
+ while (!pte_none(*pte)) {
+ pte++;
+ pgoff++;
+ start_addr += PAGE_SIZE;
+ /* Do not cross vma or page table border */
+ if (!pte_index(start_addr) || start_addr >= vma->vm_end)
+ return;
+ if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+ BUG();
+ }
+
+
+ vmf.virtual_address = (void __user *) start_addr;
+ vmf.pgoff = pgoff;
+ vmf.flags = flags;
+ ret = vma->vm_ops->fault_nonblock(vma, &vmf,
+ max_pgoff, FAULT_AROUND_PAGES, pte);
+}
+
static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3363,8 +3407,11 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}
do_set_pte(vma, address, fault_page, pte, false, false);
- pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
+
+ if (vma->vm_ops->fault_nonblock)
+ do_fault_around(vma, address, pte, pgoff, flags);
+ pte_unmap_unlock(pte, ptl);
return ret;
}

--
Kirill A. Shutemov

2014-02-18 18:28:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 10:07 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Patch is wrong. Correct one is below.

Hmm. I don't hate this. Looking through it, it's fairly simple
conceptually, and the code isn't that complex either. I can live with
this.

I think it's a bit odd how you pass both "max_pgoff" and "nr_pages" to
the fault-around function, though. In fact, I'd consider that a bug.
Passing in "FAULT_AROUND_PAGES" is just wrong, since the code cannot -
and in fact *must* not - actually fault in that many pages, since the
starting/ending address can be limited by other things.

So I think that part of the code is bogus. You need to remove
nr_pages, because any use of it is just incorrect. I don't think it
can actually matter, since the max_pgoff checks are more restrictive,
but if you think it can matter please explain how and why it wouldn't
be a major bug?
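
For reference, the window the patch computes is easy to reproduce outside the
kernel. A minimal userspace sketch, reusing the constants from the patch and
assuming x86-64-style parameters (4 KiB pages, 512 ptes per page table, a
mapping with vm_pgoff == 0) - the names and numbers below are illustrative,
not kernel code:

/*
 * Userspace sketch of the fault-around window arithmetic from the patch.
 * Assumes 4 KiB pages and 512 ptes per page table; not kernel code.
 */
#include <stdio.h>

#define PAGE_SHIFT              12
#define PAGE_SIZE               (1UL << PAGE_SHIFT)
#define PTRS_PER_PTE            512UL
#define FAULT_AROUND_ORDER      5
#define FAULT_AROUND_PAGES      (1UL << FAULT_AROUND_ORDER)
#define FAULT_AROUND_MASK       ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)

static unsigned long pte_index(unsigned long addr)
{
        return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

int main(void)
{
        /* a 16 MiB mapping starting at a 2 MiB-aligned address */
        unsigned long vm_start = 0x7f0000200000UL;
        unsigned long vm_end   = vm_start + (16UL << 20);
        unsigned long address  = vm_start + 3 * PAGE_SIZE; /* fault near the start */

        unsigned long pgoff = (address - vm_start) >> PAGE_SHIFT; /* vm_pgoff == 0 */
        unsigned long vma_pages = (vm_end - vm_start) >> PAGE_SHIFT;

        /* do_fault_around(): clamp to end of page table and end of vma */
        unsigned long max_pgoff = pgoff - pte_index(address) + PTRS_PER_PTE - 1;
        if (max_pgoff > vma_pages - 1)
                max_pgoff = vma_pages - 1;

        /* align the start down, but never below the vma */
        unsigned long start_addr = address & FAULT_AROUND_MASK;
        if (start_addr < vm_start)
                start_addr = vm_start;
        long off = (long)pte_index(start_addr) - (long)pte_index(address);
        unsigned long start_pgoff = pgoff + off;

        printf("start_pgoff=%lu max_pgoff=%lu window=%lu pages (nr_pages cap=%lu)\n",
               start_pgoff, max_pgoff, max_pgoff - start_pgoff + 1,
               FAULT_AROUND_PAGES);
        return 0;
}

For a fault a few pages into a 2 MiB-aligned mapping this prints a window of
512 pages from start_pgoff to max_pgoff, so in that case the separate nr_pages
argument is the only thing keeping the walk down to 32 pages.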

Apart from that, I'd really like to see numbers for different ranges
of FAULT_AROUND_ORDER, because I think 5 is pretty high, but on the
whole I don't find this horrible, and you still lock the page so it
doesn't involve any new rules. I'm not hugely happy with another raw
radix-tree user, but it's not horrible.

Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
necessary? I'd be almost more inclined to just make it just do a
"break;" to break out of the loop and stop doing anything clever at
all.

IOW, from a quick look there's a couple of small details I don't like
that look odd, but ..

Linus

2014-02-18 18:53:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 10:02:26AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 6:15 AM, Wilcox, Matthew R
> <[email protected]> wrote:
> > We don't really need to lock all the pages being returned to protect
> > against truncate. We only need to lock the one at the highest index,
> > and check i_size while that lock is held since truncate_inode_pages_range()
> > will block on any page that is locked.
> >
> > We're still vulnerable to holepunches, but there's no locking currently
> > between holepunches and truncate, so we're no worse off now.
>
> It's not "holepunches and truncate", it's "holepunches and page
> mapping", and I do think we currently serialize the two - the whole
> "check page->mapping still being non-NULL" before mapping it while
> having the page locked does that.

Yes, I did mean "holepunches and page faults". But here's the race I see:

Process A                               Process B
                                        ext4_fallocate()
                                          ext4_punch_hole()
                                            filemap_write_and_wait_range()
                                            mutex_lock(&inode->i_mutex);
                                            truncate_pagecache_range()
                                              unmap_mapping_range()
__do_fault()
  filemap_fault()
    lock_page_or_retry()
    (page->mapping == mapping at this point)
  set_pte_at()
  unlock_page()
                                              truncate_inode_pages_range()
(now the pte is pointing at a page that
 is no longer attached to this file)
                                            mutex_unlock(&inode->i_mutex);

Would we solve the problem by putting in a second call to
unmap_mapping_range() after calling truncate_inode_pages_range() in
truncate_pagecache_range(), like truncate_pagecache() does?
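
For concreteness, a rough sketch of that suggestion; the body of
truncate_pagecache_range() below is an approximation of the helper as it looks
around this time, not a verbatim copy, and the second call is the proposed
addition:

void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
{
        struct address_space *mapping = inode->i_mapping;
        loff_t unmap_start = round_up(lstart, PAGE_SIZE);
        loff_t unmap_end = round_down(1 + lend, PAGE_SIZE) - 1;

        if ((u64)unmap_end > (u64)unmap_start)
                unmap_mapping_range(mapping, unmap_start,
                                    1 + unmap_end - unmap_start, 0);
        truncate_inode_pages_range(mapping, lstart, lend);
        /*
         * Proposed addition: unmap again, the way truncate_pagecache()
         * does, to catch ptes that were set up between the first unmap
         * and truncate_inode_pages_range().  Whether this second call
         * should also zap private COWed copies (the even_cows argument)
         * is a separate question.
         */
        if ((u64)unmap_end > (u64)unmap_start)
                unmap_mapping_range(mapping, unmap_start,
                                    1 + unmap_end - unmap_start, 0);
}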

> Besides, that per-page locking should serialize against truncate too.
> No, there is no "global" serialization, but there *is* exactly that
> page-level serialization where both truncation and hole punching end
> up making sure that the page no longer exists in the page cache and
> isn't mapped.

What I'm suggesting is going back to Kirill's earlier patch, but only
locking the page with the highest index instead of all of the pages.
truncate() will block on that page and then we'll notice that some or
all of the other pages are also now past i_size and give up.

2014-02-18 19:07:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 10:53 AM, Matthew Wilcox <[email protected]> wrote:
>
> Yes, I did mean "holepunches and page faults". But here's the race I see:

Hmm. With truncate, we should be protected by i_size being changed
first (iirc - I didn't actually check), but I think you're right that
hole punching might race with a page being mapped at the same time.

> What I'm suggesting is going back to Kirill's earlier patch, but only
> locking the page with the highest index instead of all of the pages.
> truncate() will block on that page and then we'll notice that some or
> all of the other pages are also now past i_size and give up.

Actually, Kirill's latest patch seems to solve the problem with
locking - by simply never locking more than one page at a time. So I
guess it's all moot at least wrt the page preload..

Linus

2014-02-18 23:57:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 10:28:11AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 10:07 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > Patch is wrong. Correct one is below.
>
> Hmm. I don't hate this. Looking through it, it's fairly simple
> conceptually, and the code isn't that complex either. I can live with
> this.
>
> I think it's a bit odd how you pass both "max_pgoff" and "nr_pages" to
> the fault-around function, though. In fact, I'd consider that a bug.
> Passing in "FAULT_AROUND_PAGES" is just wrong, since the code cannot -
> and in fact *must* not - actually fault in that many pages, since the
> starting/ending address can be limited by other things.
>
> So I think that part of the code is bogus. You need to remove
> nr_pages, because any use of it is just incorrect. I don't think it
> can actually matter, since the max_pgoff checks are more restrictive,
> but if you think it can matter please explain how and why it wouldn't
> be a major bug?

I don't like this either...

Currently, max_pgoff is the end of the page table (or the end of the vma, if that comes first).

If we drop nr_pages but keep the current max_pgoff, we can potentially set up
PTRS_PER_PTE pages at a time: e.g. a page fault on the first page of a page
table when all the pages are already in the page cache. nr_pages limits that number.

It's not necessarily a bad idea to populate the whole page table at once. I need
to measure how much latency we would add by doing that.

The only problem I see is that we would hold the ptl a bit too long. But with
split ptl it only affects the page table we populate.

The other approach is to limit ourselves to FAULT_AROUND_PAGES from start_addr.
In that case we will sometimes do a useless radix-tree lookup even when we had
a chance to populate pages further into the page table.
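
A sketch of how that second approach could look in do_fault_around(), after
pgoff has been adjusted to the start of the window (illustrative only, not
part of the patch):

        /* cap the walk at FAULT_AROUND_PAGES on top of the page-table/vma clamp */
        max_pgoff = min(max_pgoff, pgoff + FAULT_AROUND_PAGES - 1);

With that, the separate nr_pages argument could go away.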

> Apart from that, I'd really like to see numbers for different ranges
> of FAULT_AROUND_ORDER, because I think 5 is pretty high, but on the
> whole I don't find this horrible, and you still lock the page so it
> doesn't involve any new rules. I'm not hugely happy with another raw
> radix-tree user, but it's not horrible.
>
> Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
> necessary? I'd be almost more inclined to just make it just do a
> "break;" to break out of the loop and stop doing anything clever at
> all.

The code is not ready yet. I'll rework it. It's just what I had by the end
of the day. I wanted to know whether setting up ptes directly from ->fault_nonblock()
is an acceptable approach or considered a layering violation.

--
Kirill A. Shutemov

2014-02-19 00:29:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On Tue, Feb 18, 2014 at 3:57 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Currently, max_pgoff is the end of the page table (or the end of the vma, if that comes first).

Yeah, but that should be trivial to do, and limit it to FAULT_AROUND_ORDER.

> The other approach is to limit ourselves to FAULT_AROUND_PAGES from start_addr.
> In that case we will sometimes do a useless radix-tree lookup even when we had
> a chance to populate pages further into the page table.

So the reason I'd prefer to limit the whole thing to that is to not
generate too many extra cache misses. It would be lovely if we stayed
within one or two cachelines of the page table entry that we have to
modify anyway.
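
To put rough numbers on that, assuming 8-byte ptes and 64-byte cache lines (as
on x86-64), a quick sketch of how many cache lines of ptes each
FAULT_AROUND_ORDER touches:

#include <stdio.h>

int main(void)
{
        const unsigned int pte_size = 8;      /* bytes per pte, x86-64 assumed */
        const unsigned int cacheline = 64;    /* bytes per cache line */
        unsigned int order;

        for (order = 1; order <= 6; order++) {
                unsigned int ptes = 1u << order;
                unsigned int bytes = ptes * pte_size;
                unsigned int lines = (bytes + cacheline - 1) / cacheline;
                printf("FAULT_AROUND_ORDER=%u: %2u ptes, %3u bytes, >= %u cache line(s)\n",
                       order, ptes, bytes, lines);
        }
        return 0;
}

i.e. order 3 stays within a single cache line of ptes and order 4 within two,
while the order-5 window in the patch spans four (plus whatever the radix-tree
walk itself pulls in).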

But it would be really interesting to see the numbers for different
FAULT_AROUND_ORDER and perhaps different variations of this.

>> Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
>> necessary? I'd be almost more inclined to just make it just do a
>> "break;" to break out of the loop and stop doing anything clever at
>> all.
>
> The code is not ready yet. I'll rework it. It's just what I had by the end
> of the day. I wanted to know whether setting up ptes directly from ->fault_nonblock()
> is an acceptable approach or considered a layering violation.

Ok. Maybe somebody else screams bloody murder, but considering that
you got 1%+ performance improvements (if I read your numbers right), I
think it looks quite promising, and not overly horrid.

Having some complexity and layering violation that is strictly all in
mm/filemap.c I don't see as horrid.

I would probably *not* like random drivers starting to use that new
'fault_nonblock' thing, though.

Linus