Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:
- Task 1 generates negative dentries:
$ pwd
~/test
$ mkdir es && cd es/ && ./create_and_delete_files.sh
[ After generating tens of GB of dentries ]
$ cd ~/test && rm -rf es
[ It takes a long time to finish ]
- Task 2 attempts to look up the 'test/' directory
$ pwd
~/test
$ ls
The 'ls' command in Task 2 stalls for a long time while Task 1
is deleting the dentries.
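For completeness, here is a minimal sketch of what the create_and_delete_files.sh
step in Task 1 could look like (hypothetical; the real script is not shown here).
Each create-then-unlink cycle leaves a negative dentry behind, so the dentry
count in the es/ directory keeps growing:

#!/bin/bash
# Hypothetical reproducer: repeatedly create and delete files in the
# current directory. Without this patch, every unlink leaves a negative
# dentry behind, so dentry memory keeps growing over time.
i=0
while true; do
        name="file-$i"
        : > "$name"        # create the file (instantiates a dentry)
        rm -f "$name"      # unlink it; the dentry is kept as a negative dentry
        i=$((i + 1))
done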
We've devised a solution to address both issues: delete the associated
dentry when removing a file. Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected, citing the
absence of tangible issues caused by negative dentries. Given our current
challenges, we're resubmitting the proposal. All relevant stakeholders from
the previous discussions have been included for reference.

Some alternative solutions are also under discussion[2][3], such as
shrinking child dentries outside of the parent inode lock or even
asynchronously shrinking child dentries. However, given the straightforward
nature of the current solution, I believe this approach is still necessary.
[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/[email protected]
[2]. https://lore.kernel.org/linux-fsdevel/[email protected]/
[3]. https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Wangkai <[email protected]>
Cc: Colin Walters <[email protected]>
---
fs/dcache.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..2ffdb98e9166 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2360,19 +2360,17 @@ EXPORT_SYMBOL(d_hash_and_lookup);
* - unhash this dentry and free it.
*
* Usually, we want to just turn this into
- * a negative dentry, but if anybody else is
- * currently using the dentry or the inode
- * we can't do that and we fall back on removing
- * it from the hash queues and waiting for
- * it to be deleted later when it has no users
+ * a negative dentry, but certain workloads can
+ * generate a large number of negative dentries.
+ * Therefore, it would be better to simply
+ * unhash it.
*/
/**
* d_delete - delete a dentry
* @dentry: The dentry to delete
*
- * Turn the dentry into a negative dentry if possible, otherwise
- * remove it from the hash queues so it can be deleted later
+ * Remove the dentry from the hash queues so it can be deleted later.
*/
void d_delete(struct dentry * dentry)
@@ -2381,14 +2379,14 @@ void d_delete(struct dentry * dentry)
spin_lock(&inode->i_lock);
spin_lock(&dentry->d_lock);
+ __d_drop(dentry);
+
/*
* Are we the only user?
*/
if (dentry->d_lockref.count == 1) {
- dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_unlink_inode(dentry);
} else {
- __d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
}
--
2.30.1 (Apple Git-130)
Oliver,
is there any chance you could run this through the test robot
performance suite? The original full patch at
https://lore.kernel.org/all/[email protected]/
and it would be interesting if the test robot could see if the patch
makes any difference on any other loads?
Thanks,
Linus
On Wed, 15 May 2024 at 02:17, Yafang Shao <[email protected]> wrote:
>
> Our applications, built on Elasticsearch[0], frequently create and delete
> files. These applications operate within containers, some with a memory
> limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> dentries within these containers can amount to tens of gigabytes.
>
> Upon container exit, directories are deleted. However, due to the numerous
> associated dentries, this process can be time-consuming. Our users have
> expressed frustration with this prolonged exit duration, which constitutes
> our first issue.
hi, Linus,
On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
> is there any chance you could run this through the test robot
> performance suite? The original full patch at
>
> https://lore.kernel.org/all/[email protected]/
>
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?
sure, it's our great pleasure. thanks a lot for involving us.

the tests have been started, but due to resource constraints, it may take
2-3 days to run enough tests.

will keep you guys updated, Thanks!
>
> Thanks,
> Linus
>
> On Wed, 15 May 2024 at 02:17, Yafang Shao <[email protected]> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
Hello,
kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:
commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")
url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/vfs-Delete-the-associated-dentry-when-deleting-a-file/20240516-144340
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3c999d1ae3c75991902a1a7dad0cb62c2a3008b4
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH] vfs: Delete the associated dentry when deleting a file
testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:
nr_threads: 100%
disk: 1HDD
testtime: 60s
fs: ext4
test: touch
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240522/[email protected]
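If you just want a rough local approximation of this job rather than the full
lkp setup, something like the following should exercise the same stressor
(this invocation is an assumption on our side; the exact lkp job options are
not listed here):

# run 64 'touch' workers (create/remove files) for 60 seconds on an ext4 mount
$ cd /path/to/ext4/mount && stress-ng --touch 64 --timeout 60s --metrics-brief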
=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-13/performance/1HDD/ext4/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/touch/stress-ng/60s
commit:
3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
3681ce3644 ("vfs: Delete the associated dentry when deleting a file")
3c999d1ae3c75991 3681ce364442ce2ec7c7fbe90ad
---------------- ---------------------------
%stddev %change %stddev
\ | \
9.884e+08 -49.1% 5.027e+08 ? 4% cpuidle..time
23.46 ? 2% -38.3% 14.47 ? 5% iostat.cpu.idle
75.71 +11.7% 84.60 iostat.cpu.system
620100 ? 5% -24.0% 471157 ? 6% numa-numastat.node0.local_node
650333 ? 3% -22.3% 505122 ? 5% numa-numastat.node0.numa_hit
736690 ? 3% -21.7% 577182 ? 4% numa-numastat.node1.local_node
772759 ? 3% -21.1% 609537 ? 3% numa-numastat.node1.numa_hit
23.42 -38.1% 14.50 ? 5% vmstat.cpu.id
54.39 ? 3% +13.4% 61.67 ? 2% vmstat.procs.r
75507 ? 6% +24.6% 94113 ? 5% vmstat.system.cs
177863 +14.2% 203176 vmstat.system.in
21.30 ? 2% -9.5 11.76 ? 7% mpstat.cpu.all.idle%
0.37 ? 2% +0.1 0.42 ? 2% mpstat.cpu.all.irq%
0.14 -0.0 0.11 ? 2% mpstat.cpu.all.soft%
77.37 +9.4 86.79 mpstat.cpu.all.sys%
0.81 +0.1 0.92 ? 2% mpstat.cpu.all.usr%
68.24 -11.5% 60.40 stress-ng.time.elapsed_time
68.24 -11.5% 60.40 stress-ng.time.elapsed_time.max
175478 +5.3% 184765 stress-ng.time.involuntary_context_switches
5038 +12.4% 5663 stress-ng.time.percent_of_cpu_this_job_got
259551 ? 2% +6.7% 277010 stress-ng.touch.ops_per_sec
454143 ? 2% -9.2% 412504 ? 2% meminfo.Active
445748 ? 2% -9.3% 404154 ? 2% meminfo.Active(anon)
1714332 ? 2% -92.8% 123761 ? 2% meminfo.KReclaimable
7111721 -23.7% 5423125 meminfo.Memused
1714332 ? 2% -92.8% 123761 ? 2% meminfo.SReclaimable
291963 -35.6% 188122 meminfo.SUnreclaim
2006296 -84.5% 311883 meminfo.Slab
7289014 -24.0% 5539852 meminfo.max_used_kB
875347 ? 3% -91.3% 75880 ? 30% numa-meminfo.node0.KReclaimable
875347 ? 3% -91.3% 75880 ? 30% numa-meminfo.node0.SReclaimable
161663 ? 7% -31.3% 111087 ? 12% numa-meminfo.node0.SUnreclaim
1037010 ? 3% -82.0% 186968 ? 16% numa-meminfo.node0.Slab
838678 ? 3% -94.3% 48016 ? 50% numa-meminfo.node1.KReclaimable
838678 ? 3% -94.3% 48016 ? 50% numa-meminfo.node1.SReclaimable
130259 ? 8% -40.8% 77055 ? 18% numa-meminfo.node1.SUnreclaim
968937 ? 3% -87.1% 125072 ? 25% numa-meminfo.node1.Slab
218872 ? 3% -91.3% 18961 ? 31% numa-vmstat.node0.nr_slab_reclaimable
40425 ? 7% -31.3% 27772 ? 12% numa-vmstat.node0.nr_slab_unreclaimable
649973 ? 3% -22.3% 504715 ? 5% numa-vmstat.node0.numa_hit
619740 ? 5% -24.0% 470750 ? 6% numa-vmstat.node0.numa_local
209697 ? 3% -94.3% 12019 ? 50% numa-vmstat.node1.nr_slab_reclaimable
32567 ? 8% -40.8% 19264 ? 18% numa-vmstat.node1.nr_slab_unreclaimable
771696 ? 3% -21.1% 608506 ? 3% numa-vmstat.node1.numa_hit
735626 ? 3% -21.7% 576154 ? 4% numa-vmstat.node1.numa_local
111469 ? 2% -9.3% 101103 ? 2% proc-vmstat.nr_active_anon
446.10 -36.6% 283.05 proc-vmstat.nr_dirtied
18812 +2.5% 19287 proc-vmstat.nr_kernel_stack
428165 ? 2% -92.8% 30970 ? 2% proc-vmstat.nr_slab_reclaimable
72954 -35.5% 47032 proc-vmstat.nr_slab_unreclaimable
111469 ? 2% -9.3% 101103 ? 2% proc-vmstat.nr_zone_active_anon
1424982 -21.7% 1116423 proc-vmstat.numa_hit
1358679 -22.7% 1050103 ? 2% proc-vmstat.numa_local
4261989 ? 3% -14.9% 3625822 ? 3% proc-vmstat.pgalloc_normal
399826 -4.6% 381396 proc-vmstat.pgfault
3996977 ? 3% -15.9% 3359964 ? 3% proc-vmstat.pgfree
13500 -5.9% 12708 proc-vmstat.pgpgout
0.13 ? 3% -0.0 0.08 ? 4% perf-profile.children.cycles-pp.d_lookup
0.12 ? 4% -0.0 0.08 perf-profile.children.cycles-pp.__d_lookup
0.11 ? 2% +0.0 0.13 ? 3% perf-profile.children.cycles-pp.irq_exit_rcu
0.06 ? 7% +0.0 0.09 ? 2% perf-profile.children.cycles-pp.__slab_free
0.02 ?100% +0.0 0.06 ? 8% perf-profile.children.cycles-pp.__memcg_slab_free_hook
0.07 ? 7% +0.0 0.12 ? 4% perf-profile.children.cycles-pp.run_ksoftirqd
0.09 ? 5% +0.0 0.14 ? 4% perf-profile.children.cycles-pp.smpboot_thread_fn
0.18 ? 3% +0.1 0.24 ? 3% perf-profile.children.cycles-pp.kmem_cache_free
0.10 ? 7% +0.1 0.16 ? 7% perf-profile.children.cycles-pp.kthread
0.10 ? 7% +0.1 0.16 ? 7% perf-profile.children.cycles-pp.ret_from_fork
0.10 ? 7% +0.1 0.16 ? 7% perf-profile.children.cycles-pp.ret_from_fork_asm
0.14 ? 4% +0.1 0.20 ? 3% perf-profile.children.cycles-pp.rcu_do_batch
0.15 ? 3% +0.1 0.21 ? 2% perf-profile.children.cycles-pp.rcu_core
0.18 ? 3% +0.1 0.24 ? 2% perf-profile.children.cycles-pp.handle_softirqs
0.00 +0.1 0.06 ? 7% perf-profile.children.cycles-pp.list_lru_del
0.82 ? 2% +0.1 0.89 ? 3% perf-profile.children.cycles-pp._raw_spin_lock
0.01 ?238% +0.1 0.08 ? 4% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.1 0.08 ? 3% perf-profile.children.cycles-pp.d_lru_del
0.00 +0.2 0.17 ? 3% perf-profile.children.cycles-pp.__dentry_kill
0.33 ? 11% +0.2 0.50 ? 6% perf-profile.children.cycles-pp.dput
0.09 ? 4% -0.0 0.05 ? 22% perf-profile.self.cycles-pp.__d_lookup
0.06 ? 5% +0.0 0.09 ? 5% perf-profile.self.cycles-pp.__slab_free
0.74 ? 2% +0.1 0.79 perf-profile.self.cycles-pp._raw_spin_lock
2.55 ? 4% -40.6% 1.51 ? 5% perf-stat.i.MPKI
1.219e+10 +12.8% 1.376e+10 perf-stat.i.branch-instructions
0.27 ? 2% -0.1 0.21 ? 2% perf-stat.i.branch-miss-rate%
22693772 +14.3% 25941855 perf-stat.i.branch-misses
39.43 -4.6 34.83 perf-stat.i.cache-miss-rate%
95834044 ? 3% +10.9% 1.063e+08 ? 4% perf-stat.i.cache-misses
2.658e+08 ? 2% +14.8% 3.051e+08 ? 4% perf-stat.i.cache-references
77622 ? 6% +25.5% 97395 ? 5% perf-stat.i.context-switches
2.82 +3.3% 2.91 perf-stat.i.cpi
1.821e+11 +12.5% 2.049e+11 perf-stat.i.cpu-cycles
9159 ? 3% +22.9% 11257 ? 3% perf-stat.i.cpu-migrations
6.223e+10 +12.8% 7.019e+10 perf-stat.i.instructions
0.36 -4.8% 0.34 perf-stat.i.ipc
1.19 ? 9% +27.6% 1.51 ? 5% perf-stat.i.metric.K/sec
4716 +6.3% 5011 perf-stat.i.minor-faults
4716 +6.3% 5011 perf-stat.i.page-faults
35.93 -1.2 34.77 perf-stat.overall.cache-miss-rate%
1.214e+10 +11.5% 1.353e+10 perf-stat.ps.branch-instructions
22276184 +13.3% 25240906 perf-stat.ps.branch-misses
95133798 ? 3% +9.8% 1.045e+08 ? 4% perf-stat.ps.cache-misses
2.648e+08 ? 2% +13.5% 3.004e+08 ? 4% perf-stat.ps.cache-references
77612 ? 6% +23.8% 96101 ? 5% perf-stat.ps.context-switches
1.813e+11 +11.1% 2.015e+11 perf-stat.ps.cpu-cycles
9093 ? 3% +21.5% 11047 ? 2% perf-stat.ps.cpu-migrations
6.192e+10 +11.5% 6.901e+10 perf-stat.ps.instructions
4639 +5.6% 4899 perf-stat.ps.minor-faults
4640 +5.6% 4899 perf-stat.ps.page-faults
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
hi, Linus, hi, Yafang Shao,
On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
> is there any chance you could run this through the test robot
> performance suite? The original full patch at
>
> https://lore.kernel.org/all/[email protected]/
>
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?
>
we just reported a stress-ng performance improvement by this patch [1].

the test robot applied this patch on top of
3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

filesystem is not our team's major domain, so we only made a limited review
of the results, and decided to send out the report FYI.

as a first stage, we decided to check the below categories of tests as a priority:

stress-ng filesystem
filebench mailserver
reaim fileserver

we also picked sysbench-fileio and blogbench into the coverage.

here is a summary.

for stress-ng, besides [1] which was already reported, we got the below data
comparing this patch to 3c999d1ae3.

either there is no significant performance change, or the change is smaller
than the noise (which would make the test robot's bisect fail), so this
information is just FYI. if you have any doubts about any subtest, could you
let us know so we can check further?

(also included some net test results)
12.87 ? 6% -0.6% 12.79 stress-ng.xattr.ops_per_sec
6721 ? 5% +7.5% 7224 ? 27% stress-ng.rawdev.ops_per_sec
9002 ? 7% -8.7% 8217 stress-ng.dirmany.ops_per_sec
8594743 ? 4% -3.0% 8337417 stress-ng.rawsock.ops_per_sec
2056 ? 3% +2.9% 2116 stress-ng.dirdeep.ops_per_sec
4307 ? 21% -6.9% 4009 stress-ng.dir.ops_per_sec
137946 ? 18% +5.8% 145942 stress-ng.fiemap.ops_per_sec
22413006 ? 2% +2.5% 22982512 ? 2% stress-ng.sockdiag.ops_per_sec
286714 ? 2% -3.8% 275876 ? 5% stress-ng.udp-flood.ops_per_sec
82904 ? 46% -31.6% 56716 stress-ng.sctp.ops_per_sec
9853408 -0.3% 9826387 stress-ng.ping-sock.ops_per_sec
84667 ? 12% -26.7% 62050 ? 17% stress-ng.dccp.ops_per_sec
61750 ? 25% -24.2% 46821 ? 38% stress-ng.open.ops_per_sec
583443 ? 3% -3.4% 563822 stress-ng.file-ioctl.ops_per_sec
11919 ? 28% -34.3% 7833 stress-ng.dentry.ops_per_sec
18.59 ? 12% -23.9% 14.15 ? 27% stress-ng.swap.ops_per_sec
246.37 ? 2% +15.9% 285.58 ? 12% stress-ng.aiol.ops_per_sec
7.45 -4.8% 7.10 ? 7% stress-ng.fallocate.ops_per_sec
207.97 ? 7% +5.2% 218.70 stress-ng.copy-file.ops_per_sec
69.87 ? 7% +5.8% 73.93 ? 5% stress-ng.fpunch.ops_per_sec
0.25 ? 21% +24.0% 0.31 stress-ng.inode-flags.ops_per_sec
849.35 ? 6% +1.4% 861.51 stress-ng.mknod.ops_per_sec
926144 ? 4% -5.2% 877558 stress-ng.lease.ops_per_sec
82924 -2.1% 81220 stress-ng.fcntl.ops_per_sec
6.19 ?124% -50.7% 3.05 stress-ng.chattr.ops_per_sec
676.90 ? 4% -1.9% 663.94 ? 5% stress-ng.iomix.ops_per_sec
0.93 ? 6% +5.6% 0.98 ? 7% stress-ng.symlink.ops_per_sec
1703608 -3.8% 1639057 ? 3% stress-ng.eventfd.ops_per_sec
1735861 -0.6% 1726072 stress-ng.sockpair.ops_per_sec
85440 -2.0% 83705 stress-ng.rawudp.ops_per_sec
6198 +0.6% 6236 stress-ng.sockabuse.ops_per_sec
39226 +0.0% 39234 stress-ng.sock.ops_per_sec
1358 +0.3% 1363 stress-ng.tun.ops_per_sec
9794021 -1.7% 9623340 stress-ng.icmp-flood.ops_per_sec
1324728 +0.3% 1328244 stress-ng.epoll.ops_per_sec
146150 -2.0% 143231 stress-ng.rawpkt.ops_per_sec
6381112 -0.4% 6352696 stress-ng.udp.ops_per_sec
1234258 +0.2% 1236738 stress-ng.sockfd.ops_per_sec
23954 -0.1% 23932 stress-ng.sockmany.ops_per_sec
257030 -0.1% 256860 stress-ng.netdev.ops_per_sec
6337097 +0.1% 6341130 stress-ng.flock.ops_per_sec
173212 -0.3% 172728 stress-ng.rename.ops_per_sec
199.69 +0.6% 200.82 stress-ng.sync-file.ops_per_sec
606.57 +0.8% 611.53 stress-ng.chown.ops_per_sec
183549 -0.9% 181975 stress-ng.handle.ops_per_sec
1299 +0.0% 1299 stress-ng.hdd.ops_per_sec
98371066 +0.2% 98571113 stress-ng.lockofd.ops_per_sec
25.49 -4.3% 24.39 stress-ng.ioprio.ops_per_sec
96745191 -1.5% 95333632 stress-ng.locka.ops_per_sec
582.35 +0.1% 582.86 stress-ng.chmod.ops_per_sec
2075897 -2.2% 2029552 stress-ng.getdent.ops_per_sec
60.47 -1.9% 59.34 stress-ng.metamix.ops_per_sec
14161 -0.3% 14123 stress-ng.io.ops_per_sec
23.98 -1.5% 23.61 stress-ng.link.ops_per_sec
27514 +0.0% 27528 stress-ng.filename.ops_per_sec
44955 +1.6% 45678 stress-ng.dnotify.ops_per_sec
160.94 +0.4% 161.51 stress-ng.inotify.ops_per_sec
2452224 +4.0% 2549607 stress-ng.lockf.ops_per_sec
6761 +0.3% 6779 stress-ng.fsize.ops_per_sec
775083 -1.5% 763487 stress-ng.fanotify.ops_per_sec
309124 -4.2% 296285 stress-ng.utime.ops_per_sec
25567 -0.1% 25530 stress-ng.dup.ops_per_sec
1858 +0.9% 1876 stress-ng.procfs.ops_per_sec
105804 -3.9% 101658 stress-ng.access.ops_per_sec
1.04 -1.9% 1.02 stress-ng.chdir.ops_per_sec
82753 -0.3% 82480 stress-ng.fstat.ops_per_sec
681128 +3.7% 706375 stress-ng.acl.ops_per_sec
11892 -0.1% 11875 stress-ng.bind-mount.ops_per_sec
for filebench, similar results, but the data is less noisy than stress-ng's,
so for most of them we regard them as not impacted by this patch.

for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
notice any significant performance changes. we even doubt whether they go
through the code path changed by this patch.

so for these, we don't list the full results here.

BTW, besides filesystem tests, this patch has also been piped into other
performance test categories such as net, scheduler, mm and others, and since
it also goes into our so-called hourly kernels, it can be run by the full set
of other performance test suites which the test robot supports. so in the
following 2-3 weeks, it's still possible for us to report other results,
including regressions.
thanks
[1] https://lore.kernel.org/all/[email protected]/
> Thanks,
> Linus
>
> On Wed, 15 May 2024 at 02:17, Yafang Shao <[email protected]> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
On Wed, 22 May 2024 at 01:12, kernel test robot <[email protected]> wrote:
>
> kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:
>
> commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")
Ok, since everything else is at least tentatively in the noise, and
the only hard numbers we have are the ones from Yafang's Elasticsearch
load and this - both of which say that this is a good patch - I
decided to just apply this ASAP just to get more testing.
I just wanted to note very explicitly that this is very much
tentative: this will be reverted very aggressively if somebody reports
some other real-world load performance issues, and we'll have to look
at other solutions. But I just don't think we'll get much more actual
testing of this without just saying "let's try it".
Also, I ended up removing the part of the patch that stopped clearing
the DCACHE_CANT_MOUNT bit. I think it's right, but it's really
unrelated to the actual problem at hand, and there are other cleanups
- like the unnecessary dget/dput pair - in this path that could also
be looked at.
Anyway, let's see if somebody notices any issues with this. And I
think we should look at the "shrink dentries" case anyway for other
reasons, since it's easy to create a ton of negative dentries with
just lots of lookups (rather than lots of unlinking of existing
files).
Of course, if you do billions of lookups of different files that do
not exist in the same directory, I suspect you just have yourself to
blame, so the "lots of negative lookups" load doesn't sound
particularly realistic.
TLDR; I applied it for testing because we're in the merge window and
there's no reason not to try.
Linus
On Wed, 22 May 2024 at 10:14, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> > Of course, if you do billions of lookups of different files that do
> > not exist in the same directory, I suspect you just have yourself to
> > blame, so the "lots of negative lookups" load doesn't sound
> > particularly realistic.
>
> Oh no. We have real customers that this hits and it's not even stupid.
Oh, negative dentries exist, and yes, they are a major feature on some
loads. Exactly because of the kinds of situations you describe.
In fact, that's the _point_. Things like PATH lookup require negative
dentries for good performance, because otherwise all the missed cases
would force a lookup all the way out to the filesystem.
So having thousands of negative dentries is normal and expected. And
it will grow for bigger machines and loads, of course.
That said, I don't think we've had people on real loads complain about
them being in the hundreds of millions like Yafang's case.
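As a rough illustration (assuming a kernel recent enough that the fifth field
of /proc/sys/fs/dentry-state is nr_negative, which it has been since around
v5.0), you can watch negative dentries pile up from nothing but failed lookups:

$ cat /proc/sys/fs/dentry-state   # nr_dentry nr_unused age_limit want_pages nr_negative dummy
$ for i in $(seq 1 100000); do stat "no-such-file-$i" >/dev/null 2>&1; done
$ cat /proc/sys/fs/dentry-state   # nr_negative should have grown by roughly 100000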
> plan9 handles this so much better because it has that union-mount stuff
> instead of search paths. So it creates one dentry that tells it which of
> those directories it actually exists in. But we're stuck with unix-style
> search paths, so life is pain.
I suspect you are just not as aware of the downsides of the plan9 models.
People tend to think plan9 was great. It had huge and fundamental
design mistakes.
Union mounts à la plan9 aren't hugely wonderful, and there's a reason
overlayfs does things differently (not that that is hugely wonderful
either).
Linus
On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> Of course, if you do billions of lookups of different files that do
> not exist in the same directory, I suspect you just have yourself to
> blame, so the "lots of negative lookups" load doesn't sound
> particularly realistic.
Oh no. We have real customers that this hits and it's not even stupid.
Every time an "event" happens, they look up something like a hash in three
different directories (like $PATH or multiple -I flags to the compiler).
Order is important, so they can't just look it up in the directory that
it's most likely to exist in. It usually fails to exist in directory A
and B, so we create dentries that say it doesn't. And those dentries are
literally never referenced again. Then directory C has the file they're
looking for (or it doesn't and it gets created because the process has
write access to C and not A or B).
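Sketched as shell (with purely hypothetical directory names, just to make the
pattern concrete), the lookup is essentially this, and every miss in A and B
manufactures a negative dentry that is never referenced again:

# hypothetical illustration of the ordered search described above
lookup_event_hash() {
        for dir in /srv/A /srv/B /srv/C; do   # order matters, like $PATH
                if [ -e "$dir/$1" ]; then     # misses in A and B leave negative dentries
                        printf '%s\n' "$dir/$1"
                        return 0
                fi
        done
        return 1    # not found anywhere; the caller may then create it in C
}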
plan9 handles this so much better because it has that union-mount stuff
instead of search paths. So it creates one dentry that tells it which of
those directories it actually exists in. But we're stuck with unix-style
search paths, so life is pain.
On Wed, May 22, 2024 at 4:51 PM Oliver Sang <[email protected]> wrote:
>
>
> hi, Linus, hi, Yafang Shao,
>
>
> On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> > Oliver,
> > is there any chance you could run this through the test robot
> > performance suite? The original full patch at
> >
> > https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> >
> > and it would be interesting if the test robot could see if the patch
> > makes any difference on any other loads?
> >
>
> we just reported a stress-ng performance improvement by this patch [1]
Awesome!
>
> test robot applied this patch upon
> 3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> filesystem is not our team's major domain, so we just made some limited review
> of the results, and decided to send out the report FYI.
>
> at first stage, we decided to check below catagories of tests as priority:
>
> stress-ng filesystem
> filebench mailserver
> reaim fileserver
>
> we also pick sysbench-fileio, blogbench into coverage.
>
> here is a summary.
>
> for stress-ng, besided [1] which was reported, we got below data that are
> about this patch comparing to 3c999d1ae3.
>
> either there is no significant performance change, or the change is smaller
> than the noise which will make test robot's bisect fail, so these information
> is just FYI. and if you have any doubt about any subtests, could you let us know
> then we could check further?
>
> (also included some net test results)
>
> 12.87 ą 6% -0.6% 12.79 stress-ng.xattr.ops_per_sec
> 6721 ą 5% +7.5% 7224 ą 27% stress-ng.rawdev.ops_per_sec
> 9002 ą 7% -8.7% 8217 stress-ng.dirmany.ops_per_sec
> 8594743 ą 4% -3.0% 8337417 stress-ng.rawsock.ops_per_sec
> 2056 ą 3% +2.9% 2116 stress-ng.dirdeep.ops_per_sec
> 4307 ą 21% -6.9% 4009 stress-ng.dir.ops_per_sec
> 137946 ą 18% +5.8% 145942 stress-ng.fiemap.ops_per_sec
> 22413006 ą 2% +2.5% 22982512 ą 2% stress-ng.sockdiag.ops_per_sec
> 286714 ą 2% -3.8% 275876 ą 5% stress-ng.udp-flood.ops_per_sec
> 82904 ą 46% -31.6% 56716 stress-ng.sctp.ops_per_sec
> 9853408 -0.3% 9826387 stress-ng.ping-sock.ops_per_sec
> 84667 ą 12% -26.7% 62050 ą 17% stress-ng.dccp.ops_per_sec
> 61750 ą 25% -24.2% 46821 ą 38% stress-ng.open.ops_per_sec
> 583443 ą 3% -3.4% 563822 stress-ng.file-ioctl.ops_per_sec
> 11919 ą 28% -34.3% 7833 stress-ng.dentry.ops_per_sec
> 18.59 ą 12% -23.9% 14.15 ą 27% stress-ng.swap.ops_per_sec
> 246.37 ą 2% +15.9% 285.58 ą 12% stress-ng.aiol.ops_per_sec
> 7.45 -4.8% 7.10 ą 7% stress-ng.fallocate.ops_per_sec
> 207.97 ą 7% +5.2% 218.70 stress-ng.copy-file.ops_per_sec
> 69.87 ą 7% +5.8% 73.93 ą 5% stress-ng.fpunch.ops_per_sec
> 0.25 ą 21% +24.0% 0.31 stress-ng.inode-flags.ops_per_sec
> 849.35 ą 6% +1.4% 861.51 stress-ng.mknod.ops_per_sec
> 926144 ą 4% -5.2% 877558 stress-ng.lease.ops_per_sec
> 82924 -2.1% 81220 stress-ng.fcntl.ops_per_sec
> 6.19 ą124% -50.7% 3.05 stress-ng.chattr.ops_per_sec
> 676.90 ą 4% -1.9% 663.94 ą 5% stress-ng.iomix.ops_per_sec
> 0.93 ą 6% +5.6% 0.98 ą 7% stress-ng.symlink.ops_per_sec
> 1703608 -3.8% 1639057 ą 3% stress-ng.eventfd.ops_per_sec
> 1735861 -0.6% 1726072 stress-ng.sockpair.ops_per_sec
> 85440 -2.0% 83705 stress-ng.rawudp.ops_per_sec
> 6198 +0.6% 6236 stress-ng.sockabuse.ops_per_sec
> 39226 +0.0% 39234 stress-ng.sock.ops_per_sec
> 1358 +0.3% 1363 stress-ng.tun.ops_per_sec
> 9794021 -1.7% 9623340 stress-ng.icmp-flood.ops_per_sec
> 1324728 +0.3% 1328244 stress-ng.epoll.ops_per_sec
> 146150 -2.0% 143231 stress-ng.rawpkt.ops_per_sec
> 6381112 -0.4% 6352696 stress-ng.udp.ops_per_sec
> 1234258 +0.2% 1236738 stress-ng.sockfd.ops_per_sec
> 23954 -0.1% 23932 stress-ng.sockmany.ops_per_sec
> 257030 -0.1% 256860 stress-ng.netdev.ops_per_sec
> 6337097 +0.1% 6341130 stress-ng.flock.ops_per_sec
> 173212 -0.3% 172728 stress-ng.rename.ops_per_sec
> 199.69 +0.6% 200.82 stress-ng.sync-file.ops_per_sec
> 606.57 +0.8% 611.53 stress-ng.chown.ops_per_sec
> 183549 -0.9% 181975 stress-ng.handle.ops_per_sec
> 1299 +0.0% 1299 stress-ng.hdd.ops_per_sec
> 98371066 +0.2% 98571113 stress-ng.lockofd.ops_per_sec
> 25.49 -4.3% 24.39 stress-ng.ioprio.ops_per_sec
> 96745191 -1.5% 95333632 stress-ng.locka.ops_per_sec
> 582.35 +0.1% 582.86 stress-ng.chmod.ops_per_sec
> 2075897 -2.2% 2029552 stress-ng.getdent.ops_per_sec
> 60.47 -1.9% 59.34 stress-ng.metamix.ops_per_sec
> 14161 -0.3% 14123 stress-ng.io.ops_per_sec
> 23.98 -1.5% 23.61 stress-ng.link.ops_per_sec
> 27514 +0.0% 27528 stress-ng.filename.ops_per_sec
> 44955 +1.6% 45678 stress-ng.dnotify.ops_per_sec
> 160.94 +0.4% 161.51 stress-ng.inotify.ops_per_sec
> 2452224 +4.0% 2549607 stress-ng.lockf.ops_per_sec
> 6761 +0.3% 6779 stress-ng.fsize.ops_per_sec
> 775083 -1.5% 763487 stress-ng.fanotify.ops_per_sec
> 309124 -4.2% 296285 stress-ng.utime.ops_per_sec
> 25567 -0.1% 25530 stress-ng.dup.ops_per_sec
> 1858 +0.9% 1876 stress-ng.procfs.ops_per_sec
> 105804 -3.9% 101658 stress-ng.access.ops_per_sec
> 1.04 -1.9% 1.02 stress-ng.chdir.ops_per_sec
> 82753 -0.3% 82480 stress-ng.fstat.ops_per_sec
> 681128 +3.7% 706375 stress-ng.acl.ops_per_sec
> 11892 -0.1% 11875 stress-ng.bind-mount.ops_per_sec
>
>
> for filebench, similar results, but data is less unstable than stress-ng, which
> means for most of them, we regarded them as that they should not be impacted by
> this patch.
>
> for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
> notice any significant performance changes. we even doubt whether they go
> through the code path changed by this patch.
>
> so for these, we don't list full results here.
>
> BTW, besides filesystem tests, this patch is also piped into other performance
> test categories such like net, scheduler, mm and others, and since it also goes
> into our so-called hourly kernels, it could run by full other performance test
> suites which test robot supports. so in following 2-3 weeks, it's still possible
> for us to report other results including regression.
>
That's great. Many thanks for your help.
--
Regards
Yafang