2022-06-19 15:14:12

by Matthew Wilcox

Subject: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed

We had an off-by-one error which meant that we never marked the first page
in a read as accessed. This was visible as a slowdown when re-reading
a file as pages were being evicted from cache too soon. In reviewing
this code, we noticed a second bug where a multi-page folio would be
marked as accessed multiple times when doing reads that were less than
the size of the folio.

Abstract the comparison of whether two file positions are in the same
folio into a new function, fixing both of these bugs.
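
To make the off-by-one concrete, here is a minimal userspace sketch
(illustration only, not part of the patch; it assumes 4 KiB pages and
made-up file positions). ra->prev_pos is left one byte past the end of
the previous read, so the old per-page comparison treats the first page
of the next sequential read as already accessed:

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

int main(void)
{
	long long prev_pos = 4096;	/* one past the last byte of the previous read */
	long long ki_pos = 4096;	/* the next sequential read starts here */

	/* Old check: both positions shift down to the same page number,
	 * so the first page of the new read is never marked accessed. */
	int old_check = (ki_pos >> PAGE_SHIFT) != (prev_pos >> PAGE_SHIFT);

	/* Fixed check: compare against the last byte actually read. */
	int new_check = (ki_pos >> PAGE_SHIFT) != ((prev_pos - 1) >> PAGE_SHIFT);

	printf("old marks: %d, fixed marks: %d\n", old_check, new_check);	/* 0, 1 */
	return 0;
}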

Reported-by: Yu Kuai <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ac3775c1ce4c..577068868449 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2629,6 +2629,13 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 	return err;
 }
 
+static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
+{
+	unsigned int shift = folio_shift(folio);
+
+	return (pos1 >> shift == pos2 >> shift);
+}
+
 /**
  * filemap_read - Read data from the page cache.
  * @iocb: The iocb to read.
@@ -2700,11 +2707,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		writably_mapped = mapping_writably_mapped(mapping);
 
 		/*
-		 * When a sequential read accesses a page several times, only
+		 * When a read accesses the same folio several times, only
 		 * mark it as accessed the first time.
 		 */
-		if (iocb->ki_pos >> PAGE_SHIFT !=
-		    ra->prev_pos >> PAGE_SHIFT)
+		if (!pos_same_folio(iocb->ki_pos, ra->prev_pos - 1,
+				    fbatch.folios[0]))
 			folio_mark_accessed(fbatch.folios[0]);
 
 		for (i = 0; i < folio_batch_count(&fbatch); i++) {
--
2.35.1
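
For the second bug, a similar sketch of what the folio-granular check
buys over the old per-page check (again purely illustrative: it assumes
a hypothetical four-page, order-2, 16 KiB folio, so folio_shift() would
return PAGE_SHIFT + 2 = 14):

#include <stdbool.h>
#include <stdio.h>

/* userspace stand-in for the kernel helper; 'shift' replaces folio_shift(folio) */
static bool pos_same_folio(long long pos1, long long pos2, unsigned int shift)
{
	return (pos1 >> shift) == (pos2 >> shift);
}

int main(void)
{
	/* Reads at offsets 4096 and 8192 land in different pages of the
	 * same 16 KiB folio: the old per-page check would mark the folio
	 * accessed twice, the per-folio check only once. */
	printf("same page?  %d\n", pos_same_folio(4096, 8192, 12));	/* 0 */
	printf("same folio? %d\n", pos_same_folio(4096, 8192, 14));	/* 1 */
	return 0;
}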


2022-06-20 06:38:38

by Christoph Hellwig

Subject: Re: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed

On Sun, Jun 19, 2022 at 04:11:41PM +0100, Matthew Wilcox (Oracle) wrote:
> +static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
> +{
> +	unsigned int shift = folio_shift(folio);
> +
> +	return (pos1 >> shift == pos2 >> shift);
> +}

The calling convention with the folio as the last argument seems a bit
odd to me. But then again so does passing just one folio for checking
if something is in the same folio. But as I can't come up with
something better I'll just deposit these mumblings here instead of
actually complaining.

2022-06-27 09:17:31

by Oliver Sang

Subject: [filemap] 91fd26c76b: phoronix-test-suite.fio.SequentialRead.LinuxAIO.Yes.No.4KB.DefaultTestDirectory.mb_s -8.9% regression



Greetings,

FYI, we noticed a -8.9% regression of phoronix-test-suite.fio.SequentialRead.LinuxAIO.Yes.No.4KB.DefaultTestDirectory.mb_s due to commit:


commit: 91fd26c76b86231276b6d7ac71d02d6fde78e297 ("[PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed")
url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/Fixes-for-5-19b/20220619-231336
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 4b35035bcf80ddb47c0112c4fbd84a63a2836a18
patch link: https://lore.kernel.org/linux-mm/[email protected]

in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
with following parameters:

test: fio-1.14.1
option_a: Sequential Read
option_b: Linux AIO
option_c: Yes
option_d: No
option_e: 4KB
option_f: Default Test Directory
cpufreq_governor: performance
ucode: 0x500320a

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available, providing an extensible framework to which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/



If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/option_d/option_e/option_f/rootfs/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/Sequential Read/Linux AIO/Yes/No/4KB/Default Test Directory/debian-x86_64-phoronix/lkp-csl-2sp7/fio-1.14.1/phoronix-test-suite/0x500320a

commit:
4b35035bcf ("Merge tag 'nfs-for-5.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs")
91fd26c76b ("filemap: Correct the conditions for marking a folio as accessed")

4b35035bcf80ddb4 91fd26c76b86231276b6d7ac71d
---------------- ---------------------------
      %stddev      %change       %stddev
          \           |              \
477611 -8.9% 435055 phoronix-test-suite.fio.SequentialRead.LinuxAIO.Yes.No.4KB.DefaultTestDirectory.iops
1865 -8.9% 1699 phoronix-test-suite.fio.SequentialRead.LinuxAIO.Yes.No.4KB.DefaultTestDirectory.mb_s
2.871e+08 -8.8% 2.619e+08 phoronix-test-suite.time.file_system_inputs
925561 +45.0% 1342367 ± 4% numa-meminfo.node0.Active
89584 ± 15% +466.8% 507739 ± 11% numa-meminfo.node0.Active(file)
22395 ± 15% +464.5% 126419 ± 11% numa-vmstat.node0.nr_active_file
22395 ± 15% +464.5% 126422 ± 11% numa-vmstat.node0.nr_zone_active_file
1168858 -8.9% 1064851 vmstat.io.bi
20386 -7.7% 18809 vmstat.system.cs
950958 +43.6% 1365179 ± 3% meminfo.Active
112156 +369.8% 526961 ± 7% meminfo.Active(file)
2764971 -14.5% 2364011 meminfo.Inactive
1502049 -26.2% 1108866 meminfo.Inactive(file)
31273678 -7.0% 29071736 perf-stat.i.cache-misses
20797 -7.8% 19169 perf-stat.i.context-switches
6002418 -8.5% 5491342 perf-stat.i.node-loads
6357381 -7.6% 5874070 perf-stat.i.node-stores
239.46 ± 9% +14.4% 273.86 ± 5% perf-stat.overall.cycles-between-cache-misses
30990966 -7.1% 28805139 perf-stat.ps.cache-misses
20608 -7.8% 18993 perf-stat.ps.context-switches
5947870 -8.5% 5440625 perf-stat.ps.node-loads
6299557 -7.6% 5819731 perf-stat.ps.node-stores
28038 +370.1% 131814 ± 7% proc-vmstat.nr_active_file
98818 -1.6% 97280 proc-vmstat.nr_anon_pages
375391 -26.2% 277223 proc-vmstat.nr_inactive_file
28038 +370.1% 131814 ± 7% proc-vmstat.nr_zone_active_file
375391 -26.2% 277223 proc-vmstat.nr_zone_inactive_file
511.83 ± 14% +79.3% 917.83 ± 63% proc-vmstat.numa_hint_faults_local
40090769 -7.8% 36977817 proc-vmstat.numa_hit
39810116 -7.7% 36751505 proc-vmstat.numa_local
316086 +10350.9% 33034013 proc-vmstat.pgactivate
39795174 -7.9% 36645119 proc-vmstat.pgalloc_normal
39417989 -8.0% 36280723 proc-vmstat.pgfree
1.436e+08 -8.8% 1.31e+08 proc-vmstat.pgpgin
18.38 ± 10% -2.5 15.86 ± 2% perf-profile.calltrace.cycles-pp.ret_from_fork
18.38 ± 10% -2.5 15.86 ± 2% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
17.11 ± 10% -2.5 14.62 ± 2% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
15.09 ± 10% -2.4 12.67 ± 2% perf-profile.calltrace.cycles-pp.loop_process_work.process_one_work.worker_thread.kthread.ret_from_fork
15.28 ± 10% -2.4 12.88 ± 2% perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork
14.03 ± 9% -2.3 11.71 ± 3% perf-profile.calltrace.cycles-pp.do_iter_read.loop_process_work.process_one_work.worker_thread.kthread
13.60 ± 9% -2.2 11.36 ± 2% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work.worker_thread
13.51 ± 9% -2.2 11.28 ± 2% perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work
11.84 ± 9% -1.9 9.90 ± 3% perf-profile.calltrace.cycles-pp.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work
11.76 ± 9% -1.9 9.83 ± 3% perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
0.00 +0.6 0.62 ± 4% perf-profile.calltrace.cycles-pp.__folio_activate.pagevec_lru_move_fn.folio_mark_accessed.filemap_read.aio_read
0.00 +0.7 0.71 ± 6% perf-profile.calltrace.cycles-pp.workingset_activation.folio_mark_accessed.filemap_read.aio_read.io_submit_one
0.00 +1.1 1.14 ± 5% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.folio_mark_accessed.filemap_read.aio_read.io_submit_one
11.35 ± 8% +1.5 12.80 ± 2% perf-profile.calltrace.cycles-pp.filemap_read.aio_read.io_submit_one.__x64_sys_io_submit.do_syscall_64
0.26 ±100% +2.3 2.54 ± 6% perf-profile.calltrace.cycles-pp.folio_mark_accessed.filemap_read.aio_read.io_submit_one.__x64_sys_io_submit
51.18 ± 2% +2.3 53.50 perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
56.89 ± 2% +2.4 59.30 perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
56.38 ± 2% +2.4 58.79 perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
56.40 ± 2% +2.4 58.82 perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
56.40 ± 2% +2.4 58.82 perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
55.10 ± 3% +2.5 57.56 perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
18.39 ± 10% -2.5 15.86 ± 2% perf-profile.children.cycles-pp.ret_from_fork
18.38 ± 10% -2.5 15.86 ± 2% perf-profile.children.cycles-pp.kthread
17.11 ± 10% -2.5 14.62 ± 2% perf-profile.children.cycles-pp.worker_thread
15.11 ± 10% -2.4 12.69 ± 2% perf-profile.children.cycles-pp.loop_process_work
15.28 ± 10% -2.4 12.88 ± 2% perf-profile.children.cycles-pp.process_one_work
14.04 ± 9% -2.3 11.72 ± 2% perf-profile.children.cycles-pp.do_iter_read
13.61 ± 9% -2.2 11.36 ± 2% perf-profile.children.cycles-pp.do_iter_readv_writev
13.52 ± 9% -2.2 11.28 ± 2% perf-profile.children.cycles-pp.shmem_file_read_iter
15.62 ± 5% -2.2 13.40 ± 2% perf-profile.children.cycles-pp.copy_page_to_iter
11.78 ± 9% -1.9 9.84 ± 3% perf-profile.children.cycles-pp._copy_to_iter
1.07 ± 10% -0.2 0.92 ± 4% perf-profile.children.cycles-pp.shmem_getpage_gfp
0.53 ± 5% -0.1 0.39 ± 11% perf-profile.children.cycles-pp.security_file_permission
0.46 ± 5% -0.1 0.32 ± 13% perf-profile.children.cycles-pp.apparmor_file_permission
0.32 ± 6% -0.1 0.25 ± 9% perf-profile.children.cycles-pp.touch_atime
0.26 ± 6% -0.1 0.20 ± 11% perf-profile.children.cycles-pp.atime_needs_update
0.50 -0.1 0.45 ± 5% perf-profile.children.cycles-pp.free_pcppages_bulk
0.38 ± 6% -0.0 0.33 ± 2% perf-profile.children.cycles-pp.ksys_read
0.42 ± 5% -0.0 0.38 ± 5% perf-profile.children.cycles-pp.read_tsc
0.37 ± 6% -0.0 0.33 perf-profile.children.cycles-pp.vfs_read
0.24 ± 9% -0.0 0.21 ± 2% perf-profile.children.cycles-pp.seq_read_iter
0.13 ± 6% -0.0 0.10 ± 14% perf-profile.children.cycles-pp.current_time
0.12 ± 5% -0.0 0.10 ± 8% perf-profile.children.cycles-pp.__libc_read
0.02 ±141% +0.1 0.08 ± 22% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.02 ±142% +0.1 0.08 ± 9% perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
0.00 +0.1 0.07 ± 18% perf-profile.children.cycles-pp.__count_memcg_events
0.17 ± 12% +0.1 0.26 ± 7% perf-profile.children.cycles-pp.__mod_node_page_state
0.25 ± 13% +0.1 0.37 ± 8% perf-profile.children.cycles-pp.__mod_lruvec_state
2.61 ± 5% +0.3 2.86 ± 4% perf-profile.children.cycles-pp.tick_sched_timer
2.04 ± 6% +0.3 2.32 ± 7% perf-profile.children.cycles-pp.update_process_times
0.00 +0.5 0.46 ± 11% perf-profile.children.cycles-pp.workingset_age_nonresident
0.00 +0.6 0.64 ± 4% perf-profile.children.cycles-pp.__folio_activate
0.00 +0.7 0.72 ± 7% perf-profile.children.cycles-pp.workingset_activation
0.00 +1.1 1.14 ± 5% perf-profile.children.cycles-pp.pagevec_lru_move_fn
0.51 ± 9% +2.1 2.59 ± 6% perf-profile.children.cycles-pp.folio_mark_accessed
56.89 ± 2% +2.4 59.29 perf-profile.children.cycles-pp.do_idle
56.89 ± 2% +2.4 59.30 perf-profile.children.cycles-pp.secondary_startup_64_no_verify
56.89 ± 2% +2.4 59.30 perf-profile.children.cycles-pp.cpu_startup_entry
56.40 ± 2% +2.4 58.82 perf-profile.children.cycles-pp.start_secondary
55.60 ± 2% +2.5 58.06 perf-profile.children.cycles-pp.cpuidle_idle_call
11.69 ± 9% -1.9 9.78 ± 3% perf-profile.self.cycles-pp._copy_to_iter
0.37 ± 6% -0.1 0.24 ± 15% perf-profile.self.cycles-pp.apparmor_file_permission
0.13 ± 12% -0.0 0.10 ± 14% perf-profile.self.cycles-pp.atime_needs_update
0.08 ± 17% +0.0 0.11 ± 6% perf-profile.self.cycles-pp.__mod_lruvec_state
0.18 ± 12% +0.1 0.26 ± 12% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.16 ± 14% +0.1 0.25 ± 8% perf-profile.self.cycles-pp.__mod_node_page_state
0.49 ± 9% +0.2 0.71 ± 9% perf-profile.self.cycles-pp.folio_mark_accessed
0.00 +0.2 0.25 ± 5% perf-profile.self.cycles-pp.workingset_activation
0.00 +0.3 0.30 ± 9% perf-profile.self.cycles-pp.pagevec_lru_move_fn
0.00 +0.3 0.34 ± 9% perf-profile.self.cycles-pp.__folio_activate
0.00 +0.5 0.46 ± 11% perf-profile.self.cycles-pp.workingset_age_nonresident




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
config-5.19.0-rc2-00242-g91fd26c76b86 (166.42 kB)
job-script (8.08 kB)
job.yaml (5.25 kB)
reproduce (306.00 B)

2022-06-30 07:53:03

by Dominique Martinet

Subject: Major btrfs fiemap slowdown on file with many extents once in cache (RCU stalls?) (Was: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed)

Hi Willy, linux-btrfs@vger,

Matthew Wilcox (Oracle) wrote on Sun, Jun 19, 2022 at 04:11:41PM +0100:
> We had an off-by-one error which meant that we never marked the first page
> in a read as accessed. This was visible as a slowdown when re-reading
> a file as pages were being evicted from cache too soon. In reviewing
> this code, we noticed a second bug where a multi-page folio would be
> marked as accessed multiple times when doing reads that were less than
> the size of the folio.

When debugging an unrelated issue (short reads on btrfs with io_uring
and O_DIRECT[1]), I noticed that copy speeds on my horrible big file fell
from ~2GB/s (there's compression and lots of zeroes) to ~100MB/s the
second time I was copying it with cp.

I've taken a moment to bisect this and it came down to this patch.

[1] https://lore.kernel.org/all/[email protected]/T/#u



Dropping caches (echo 3 > /proc/sys/vm/drop_caches) restores the speed,
so there appears to be some bad effect from having the file in cache for
fiemap?
To be fair that file is pretty horrible:
---
# compsize bigfile
Processed 1 file, 194955 regular extents (199583 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       15%      3.7G         23G          23G
none       100%      477M         477M         514M
zstd        14%      3.2G         23G          23G
---

Here's what perf has to say about it on top of this patch when running
`cp bigfile /dev/null` the first time:

    98.97%  0.00%  cp  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
          - 93.40% ksys_read
             - 93.36% vfs_read
                - 93.25% new_sync_read
                   - 93.20% filemap_read
                      - 83.38% filemap_get_pages
                         - 82.76% page_cache_ra_unbounded
                            + 59.72% folio_alloc
                            + 13.43% read_pages
                            + 8.75% filemap_add_folio
                              0.64% xa_load
                              0.52% filemap_get_read_batch
                      + 8.75% copy_page_to_iter
          - 4.73% __x64_sys_ioctl
             - 4.72% do_vfs_ioctl
                - btrfs_fiemap
                   - 4.70% extent_fiemap
                      + 3.95% btrfs_check_shared
                      + 0.70% get_extent_skip_holes

and the second time:

    99.90%  0.00%  cp  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
          - 94.62% __x64_sys_ioctl
               do_vfs_ioctl
               btrfs_fiemap
             - extent_fiemap
                - 50.01% get_extent_skip_holes
                   - 50.00% btrfs_get_extent_fiemap
                      - 49.97% count_range_bits
                           rb_next
                + 28.72% lock_extent_bits
                + 15.55% __clear_extent_bit
          - 5.21% ksys_read
             + 5.21% vfs_read

(if this isn't readable, 95% of the time is spent on fiemap the second
time around)




I've also been observing RCU stalls on my laptop with the same workload
(cp to /dev/null). Unfortunately I could not reproduce them in qemu, so I
could not take traces to confirm they are caused by the same commit, but
given the workload I'd say that's it.
I can rebuild a kernel for my laptop and confirm if you think it could
be something else.


I didn't look at the patch itself (yet), so I have no suggestion at this
point - it's plausible the patch fixed something and just exposed slow
code that had been there all along, so it might be better to look at the
btrfs side first; I don't know.
If you don't manage to reproduce I'll be happy to test anything thrown
at me at the very least.


Thanks,
--
Dominique Martinet | Asmadeus

2022-06-30 15:02:20

by Theodore Ts'o

Subject: Re: Major btrfs fiemap slowdown on file with many extents once in cache (RCU stalls?) (Was: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed)

On Thu, Jun 30, 2022 at 04:29:05PM +0900, Dominique MARTINET wrote:
> Hi Willy, linux-btrfs@vger,
>
> Matthew Wilcox (Oracle) wrote on Sun, Jun 19, 2022 at 04:11:41PM +0100:
> > We had an off-by-one error which meant that we never marked the first page
> > in a read as accessed. This was visible as a slowdown when re-reading
> > a file as pages were being evicted from cache too soon. In reviewing
> > this code, we noticed a second bug where a multi-page folio would be
> > marked as accessed multiple times when doing reads that were less than
> > the size of the folio.
>
> when debugging an unrelated issue (short reads on btrfs with io_uring
> and O_DIRECT[1]), I noticed that my horrible big file copy speeds fell
> down from ~2GB/s (there's compression and lots of zeroes) to ~100MB/s
> the second time I was copying it with cp.
>
> I've taken a moment to bisect this and came down to this patch.

I think you may have forgotten to include the commit-id that was the
result of your bisect...?

- Ted

2022-06-30 22:49:37

by Dominique Martinet

Subject: Re: Major btrfs fiemap slowdown on file with many extents once in cache (RCU stalls?) (Was: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed)

Theodore Ts'o wrote on Thu, Jun 30, 2022 at 10:37:58AM -0400:
> On Thu, Jun 30, 2022 at 04:29:05PM +0900, Dominique MARTINET wrote:
> > Hi Willy, linux-btrfs@vger,
> >
> > Matthew Wilcox (Oracle) wrote on Sun, Jun 19, 2022 at 04:11:41PM +0100:
> > > We had an off-by-one error which meant that we never marked the first page
> > > in a read as accessed. This was visible as a slowdown when re-reading
> > > a file as pages were being evicted from cache too soon. In reviewing
> > > this code, we noticed a second bug where a multi-page folio would be
> > > marked as accessed multiple times when doing reads that were less than
> > > the size of the folio.
> >
> > when debugging an unrelated issue (short reads on btrfs with io_uring
> > and O_DIRECT[1]), I noticed that my horrible big file copy speeds fell
> > down from ~2GB/s (there's compression and lots of zeroes) to ~100MB/s
> > the second time I was copying it with cp.
> >
> > I've taken a moment to bisect this and came down to this patch.
>
> I think you may have forgotten to include the commit-id that was the
> results of your bisect.... ?

Sorry, this is the patch I replied to, and it was recent enough that I
assumed it'd still be in mailboxes, but you're right, it's better with a
commit id. This was merged as 5ccc944dce3d ("filemap: Correct the
conditions for marking a folio as accessed").


Thanks,
--
Dominique

2022-07-07 08:47:54

by Dominique Martinet

Subject: Re: Major btrfs fiemap slowdown on file with many extents once in cache (RCU stalls?) (Was: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed)

(added btrfs maintainers in direct cc)

Yu Kuai wrote on Fri, Jul 01, 2022 at 09:55:31AM +0800:
> With this patch, folio_activate() will be called the second time that a
> page is marked accessed, which has some extra overhead; however, 2GB/s
> -> 100MB/s is insane. I'm not sure how this is possible, but it seems
> like it has something to do with this change. (Note that it's
> problematic that the page would not be marked accessed before this
> patch.)
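
(For context: a simplified sketch of folio_mark_accessed() as of the
v5.19 timeframe (mm/swap.c), trimmed to the path relevant here. The
first access only sets a flag; the second access promotes the folio to
the active LRU, which is the pagevec_lru_move_fn/workingset_activation
overhead visible in the lkp profile above:)

void folio_mark_accessed(struct folio *folio)
{
	if (!folio_test_referenced(folio)) {
		/* first access: just set the referenced flag (cheap) */
		folio_set_referenced(folio);
	} else if (!folio_test_active(folio)) {
		/* second access: move the folio to the active LRU and
		 * record a workingset activation (the expensive part) */
		folio_activate(folio);
		folio_clear_referenced(folio);
		workingset_activation(folio);
	}
}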

I honestly don't understand why a folio being marked as accessed affects
how fiemap is processed...
My guess would be that this indeed "just fixes" things: pages didn't get
marked as accessed -> were dropped from cache -> that kept the inode
io_tree small -> fiemap was fast; and it's really just a problem that the
fiemap algorithm doesn't scale, but I haven't really checked if I'm
right here.


So I don't think we should focus so much on the regression part as on
figuring out what's actually different the second time around and making
that faster.


Checking with 'perf script', btrfs_get_extent_fiemap() spends most of
its time on this:
	delalloc_len = count_range_bits(&inode->io_tree, &delalloc_start,
					end, len, EXTENT_DELALLOC, 1);

I have no idea what delalloc is supposed to be, but I can guess there
are just way too many nodes in the io_tree: why is that, and why weren't
there so many the first time around? I would have assumed that as the
file gets read it is put into cache, so the end of the first read should
slow down as well, but it didn't; so I'm sure I misunderstood something
and I'm wasting everyone's time. Feel free to ignore me and find the
issue instead :)
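
(To put a number on that guess, here is a toy userspace model -- an
assumption about the shape of the problem, not btrfs code -- of what the
profile suggests: one count_range_bits()-style walk over the cached
extent states per extent that fiemap reports, which turns quadratic once
all ~200k extent states sit in the io_tree:)

#include <stdio.h>

#define EXTENTS 200000L	/* roughly the extent count compsize reported */

/* stand-in for one count_range_bits() call: it walks the cached extent
 * states from 'from' to the end, modelling the rb_next() loop that
 * dominates the second-run profile */
static long scan_states(long from, long total)
{
	return total - from;
}

int main(void)
{
	long long total_steps = 0;

	/* one lookup per extent reported by fiemap */
	for (long ext = 0; ext < EXTENTS; ext++)
		total_steps += scan_states(ext, EXTENTS);

	/* ~2e10 modelled tree steps; per the guess above, the first run
	 * is fast only because evicted pages kept the io_tree small */
	printf("%lld modelled io_tree steps for %ld extents\n",
	       total_steps, EXTENTS);
	return 0;
}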


> BTW, during my test, the speed of buffered reads on ext4 only fell a
> little.

For "normal" files that don't have ~200k extents full of holes and
compression changes and whatever else this file has gone through, I can
confirm the slowdown is not as bad -- almost unnoticeable with few
extents.
But I still have my laptop crashing when I'm copying this file twice
(well, I -could- just turn off panic_on_stall...), so it can go from a
little to infinity...


Thanks,

(Leaving rest of the message for anyone catching up now; if there's
anything you'd like me to do feel free to ask.)

> > I've taken a moment to bisect this and came down to this patch.
> > (5ccc944dce3d ("filemap: Correct the conditions for marking a folio
> > as accessed"))
> >
> > [1] https://lore.kernel.org/all/[email protected]/T/#u
> >
> >
> >
> > Dropping caches (echo 3 > /proc/sys/vm/drop_caches) restore the speed,
> > so there appears to be some bad effect to having the file in cache for
> > fiemap?
> > To be fair that file is pretty horrible:
> > ---
> > # compsize bigfile
> > Processed 1 file, 194955 regular extents (199583 refs), 0 inline.
> > Type Perc Disk Usage Uncompressed Referenced
> > TOTAL 15% 3.7G 23G 23G
> > none 100% 477M 477M 514M
> > zstd 14% 3.2G 23G 23G
> > ---
> >
> > Here's what perf has to say about it on top of this patch when running
> > `cp bigfile /dev/null` the first time:
> >
> > 98.97% 0.00% cp [kernel.kallsyms] [k]
> > entry_SYSCALL_64_after_hwframe
> > entry_SYSCALL_64_after_hwframe
> > do_syscall_64
> > - 93.40% ksys_read
> > - 93.36% vfs_read
> > - 93.25% new_sync_read
> > - 93.20% filemap_read
> > - 83.38% filemap_get_pages
> > - 82.76% page_cache_ra_unbounded
> > + 59.72% folio_alloc
> > + 13.43% read_pages
> > + 8.75% filemap_add_folio
> > 0.64% xa_load
> > 0.52% filemap_get_read_batch
> > + 8.75% copy_page_to_iter
> > - 4.73% __x64_sys_ioctl
> > - 4.72% do_vfs_ioctl
> > - btrfs_fiemap
> > - 4.70% extent_fiemap
> > + 3.95% btrfs_check_shared
> > + 0.70% get_extent_skip_holes
> >
> > and second time:
> > 99.90% 0.00% cp [kernel.kallsyms] [k]
> > entry_SYSCALL_64_after_hwframe
> > entry_SYSCALL_64_after_hwframe
> > do_syscall_64
> > - 94.62% __x64_sys_ioctl
> > do_vfs_ioctl
> > btrfs_fiemap
> > - extent_fiemap
> > - 50.01% get_extent_skip_holes
> > - 50.00% btrfs_get_extent_fiemap
> > - 49.97% count_range_bits
> > rb_next
> > + 28.72% lock_extent_bits
> > + 15.55% __clear_extent_bit
> > - 5.21% ksys_read
> > + 5.21% vfs_read
> >
> > (if this isn't readable, 95% of the time is spent on fiemap the second
> > time around)
> >
> >
> >
> >
> > I've also been observing RCU stalls on my laptop with the same workload
> > (cp to /dev/null), but unfortunately I could not reproduce in qemu so I
> > could not take traces to confirm they are caused by the same commit but
> > given the workload I'd say that is it?
> > I can rebuild a kernel for my laptop and confirm if you think it should
> > be something else.
> >
> >
> > I didn't look at the patch itself (yet) so have no suggestion at this
> > point - it's plausible the patch fixed something and just exposed slow
> > code that had been there all along so it might be better to look at the
> > btrfs side first, I don't know.
> > If you don't manage to reproduce I'll be happy to test anything thrown
> > at me at the very least.

--
Dominique