2020-09-18 05:07:37

by Ritesh Harjani

Subject: [PATCHv3 1/1] ext4: Optimize file overwrites

If the file already has the underlying blocks/extents allocated,
then we don't need to start a journal txn and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both the DAX & DIO paths. We can check whether the write request is
an overwrite & then directly return the mapping information.

This can give a significant perf boost for multi-threaded writes,
especially random overwrites.
On a PPC64 VM with a simulated pmem (DAX) device, a ~10x perf
improvement was seen in random writes (overwrites), largely because
this optimizes away the spinlock contention during jbd2 slab cache
allocation (jbd2_journal_handle). On an x86 VM, a ~2x perf
improvement was observed.
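
To illustrate, an "overwrite" here means a write to a range that is
already allocated, already written out, and within i_size. Below is a
minimal userspace sketch of the DIO case (the file path, size and
iteration count are made up for illustration; this is not part of the
patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SZ (64UL << 20)    /* 64M, made-up test size */
#define BLK 4096UL

int main(void)
{
        void *buf;
        off_t off;
        int fd, i;

        /* O_DIRECT so writes take the ext4_iomap_begin() DIO path */
        fd = open("/mnt/ext4/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (posix_memalign(&buf, BLK, BLK))
                return 1;
        memset(buf, 0xab, BLK);

        /*
         * Lay the file out by writing rather than fallocate(), so the
         * extents end up mapped (EXT4_MAP_MAPPED), not unwritten.
         */
        for (off = 0; off < (off_t)FILE_SZ; off += BLK)
                if (pwrite(fd, buf, BLK, off) != (ssize_t)BLK) {
                        perror("layout pwrite");
                        return 1;
                }
        fsync(fd);

        /* Random overwrites: offset + length <= i_size, blocks mapped */
        for (i = 0; i < 100000; i++) {
                off = (random() % (FILE_SZ / BLK)) * BLK;
                if (pwrite(fd, buf, BLK, off) != (ssize_t)BLK) {
                        perror("overwrite pwrite");
                        return 1;
                }
        }
        close(fd);
        return 0;
}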

Reported-by: Dan Williams <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..6eae17758ece 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3437,14 +3437,26 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;

- if (flags & IOMAP_WRITE)
+ if (flags & IOMAP_WRITE) {
+ /*
+ * If the blocks are already allocated, then we don't need to
+ * start a journal txn and can directly return the mapping
+ * information. This can boost performance, especially for
+ * multi-threaded overwrite requests.
+ */
+ if (offset + length <= i_size_read(inode)) {
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
+ goto out;
+ }
ret = ext4_iomap_alloc(inode, &map, flags);
- else
+ } else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
+ }

if (ret < 0)
return ret;
-
+out:
ext4_set_iomap(inode, iomap, &map, offset, length);

return 0;
--
2.26.2


2020-09-18 07:53:48

by Sedat Dilek

Subject: Re: [PATCHv3 1/1] ext4: Optimize file overwrites

On Fri, Sep 18, 2020 at 7:09 AM Ritesh Harjani <[email protected]> wrote:
>
> If the file already has the underlying blocks/extents allocated,
> then we don't need to start a journal txn and can directly return
> the underlying mapping. Currently ext4_iomap_begin() is used by
> both the DAX & DIO paths. We can check whether the write request is
> an overwrite & then directly return the mapping information.
>
> This can give a significant perf boost for multi-threaded writes,
> especially random overwrites.
> On a PPC64 VM with a simulated pmem (DAX) device, a ~10x perf
> improvement was seen in random writes (overwrites), largely because
> this optimizes away the spinlock contention during jbd2 slab cache
> allocation (jbd2_journal_handle). On an x86 VM, a ~2x perf
> improvement was observed.
>
> Reported-by: Dan Williams <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Ritesh Harjani <[email protected]>

I have applied your patch on top of the current Linus git tree and
boot-tested it on x86-64.

I have LTP installed here.
If you know of an LTP filesystem test case or use case suitable for
testing this patch, please let me know.

Yes, I have seen the FIO config in the cover letter.
Maybe you have a different FIO config - AFAIK, 16G is too big for my
setup here.
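
Something like this scaled-down job is roughly what I had in mind
(filename and sizes are just placeholders for my setup, untested):

[global]
filename=/mnt/ext4/fio-overwrite
fallocate=none    ; lay the file out by writing, so blocks are mapped
rw=randwrite
bs=4k
size=1G
ioengine=sync
numjobs=4
time_based
runtime=60

[overwrite]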

Feel free to add...

Tested-by: Sedat Dilek <[email protected]> # Compile and boot on x86-64 Debian/unstable

Thanks.

- Sedat -

> ---
> fs/ext4/inode.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 10dd470876b3..6eae17758ece 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3437,14 +3437,26 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>
> - if (flags & IOMAP_WRITE)
> + if (flags & IOMAP_WRITE) {
> + /*
> + * If the blocks are already allocated, then we don't need to
> + * start a journal txn and can directly return the mapping
> + * information. This can boost performance, especially for
> + * multi-threaded overwrite requests.
> + */
> + if (offset + length <= i_size_read(inode)) {
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> + goto out;
> + }
> ret = ext4_iomap_alloc(inode, &map, flags);
> - else
> + } else {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> + }
>
> if (ret < 0)
> return ret;
> -
> +out:
> ext4_set_iomap(inode, iomap, &map, offset, length);
>
> return 0;
> --
> 2.26.2
>

2020-09-25 07:16:51

by Chen, Rong A

Subject: [ext4] 4e8fc10115: fio.write_iops 330.6% improvement

Greetings,

FYI, we noticed a 330.6% improvement in fio.write_iops due to commit:


commit: 4e8fc10115a6978060fe8a90f6a3a05463fa0660 ("[PATCHv3 1/1] ext4: Optimize file overwrites")
url: https://github.com/0day-ci/linux/commits/Ritesh-Harjani/Optimize-ext4-file-overwrites-perf-improvement/20200918-131139
base: https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git dev

in testcase: fio-basic
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 256G memory
with following parameters:

disk: 2pmem
fs: ext4
mount_option: dax
runtime: 200s
nr_task: 50%
time_based: tb
rw: write
bs: 4k
ioengine: sync
test_size: 200G
cpufreq_governor: performance
ucode: 0x5002f01

test-description: Fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user.
test-url: https://github.com/axboe/fio





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/mount_option/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based/ucode:
4k/gcc-9/performance/2pmem/ext4/sync/x86_64-rhel-8.3/dax/50%/debian-10.4-x86_64-20200603.cgz/200s/write/lkp-csl-2sp6/200G/fio-basic/tb/0x5002f01

commit:
27bc446e2d ("ext4: limit the length of per-inode prealloc list")
4e8fc10115 ("ext4: Optimize file overwrites")

27bc446e2def38db 4e8fc10115a6978060fe8a90f6a
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.12 ±106% -0.1 0.01 fio.latency_100us%
51.38 ± 23% -48.5 2.85 ± 20% fio.latency_20us%
0.01 +16.6 16.64 ± 28% fio.latency_2us%
0.24 ±135% +54.7 54.89 ± 3% fio.latency_4us%
32.62 ± 18% -31.7 0.91 ± 15% fio.latency_50us%
14780 ± 3% -9.4% 13390 fio.time.involuntary_context_switches
9299 -7.0% 8647 fio.time.system_time
228.71 ± 4% +281.9% 873.42 ± 6% fio.time.user_time
23448 -6.5% 21915 fio.time.voluntary_context_switches
5.426e+08 ± 5% +330.6% 2.337e+09 ± 6% fio.workload
10597 ± 5% +330.6% 45638 ± 6% fio.write_bw_MBps
26944 ± 8% -76.8% 6240 ± 9% fio.write_clat_90%_us
30368 ± 8% -72.0% 8512 ± 11% fio.write_clat_95%_us
38016 ± 9% -49.0% 19392 ± 4% fio.write_clat_99%_us
17448 ± 5% -77.9% 3855 ± 7% fio.write_clat_mean_us
11052 ± 32% -68.3% 3502 ± 10% fio.write_clat_stddev
2713004 ± 5% +330.6% 11683335 ± 6% fio.write_iops
13639680 ± 7% +26.6% 17267712 ± 5% meminfo.DirectMap2M
2704 ± 97% +131.9% 6269 ± 26% numa-meminfo.node0.PageTables
676.50 ± 96% +131.1% 1563 ± 26% numa-vmstat.node0.nr_page_table_pages
48.36 -6.8% 45.09 iostat.cpu.system
1.21 ± 4% +271.5% 4.51 ± 6% iostat.cpu.user
0.74 ± 2% +0.1 0.81 ± 5% mpstat.cpu.all.irq%
1.22 ± 4% +3.3 4.55 ± 6% mpstat.cpu.all.usr%
541348 +1.4% 548949 proc-vmstat.nr_file_pages
245833 +2.9% 252840 proc-vmstat.nr_unevictable
245833 +2.9% 252840 proc-vmstat.nr_zone_unevictable
695285 ± 20% -12.6% 607417 ± 17% proc-vmstat.pgfree
601976 ± 2% +22.0% 734594 ± 2% sched_debug.cpu.avg_idle.avg
1001923 +9.0% 1092207 ± 5% sched_debug.cpu.avg_idle.max
372963 -25.8% 276657 ± 6% sched_debug.cpu.avg_idle.stddev
22130 ± 17% +36.2% 30133 ± 14% sched_debug.cpu.nr_switches.max
3374 ± 18% +28.5% 4336 ± 10% sched_debug.cpu.nr_switches.stddev
-47.00 -45.7% -25.50 sched_debug.cpu.nr_uninterruptible.min
2816 ± 21% +36.5% 3844 ± 13% sched_debug.cpu.sched_count.stddev
26.69 ± 13% -44.0% 14.94 ± 17% sched_debug.cpu.sched_goidle.min
1424 ± 21% +36.2% 1941 ± 13% sched_debug.cpu.sched_goidle.stddev
1411 ± 18% +31.9% 1861 ± 12% sched_debug.cpu.ttwu_count.stddev
15.42 ± 3% -82.8% 2.66 ± 8% perf-stat.i.MPKI
3.417e+09 ± 4% +239.7% 1.161e+10 ± 6% perf-stat.i.branch-instructions
0.72 -0.1 0.64 perf-stat.i.branch-miss-rate%
24883051 ± 3% +181.5% 70036819 ± 4% perf-stat.i.branch-misses
97563341 ± 12% -58.3% 40638724 ± 14% perf-stat.i.cache-misses
2.96e+08 ± 2% -48.4% 1.529e+08 ± 11% perf-stat.i.cache-references
7.06 ± 4% -70.7% 2.06 ± 5% perf-stat.i.cpi
1461 ± 14% +170.2% 3948 ± 19% perf-stat.i.cycles-between-cache-misses
6.17e+09 ± 4% +243.3% 2.119e+10 ± 6% perf-stat.i.dTLB-loads
0.00 ± 11% -0.0 0.00 ± 3% perf-stat.i.dTLB-store-miss-rate%
3.978e+09 ± 4% +257.1% 1.421e+10 ± 6% perf-stat.i.dTLB-stores
83.61 +7.2 90.82 perf-stat.i.iTLB-load-miss-rate%
25688726 ± 3% +126.2% 58108368 ± 5% perf-stat.i.iTLB-load-misses
4852201 +17.7% 5709608 ± 2% perf-stat.i.iTLB-loads
1.962e+10 ± 4% +243.4% 6.738e+10 ± 6% perf-stat.i.instructions
774.43 ± 2% +50.4% 1165 perf-stat.i.instructions-per-iTLB-miss
0.15 ± 4% +235.9% 0.51 ± 6% perf-stat.i.ipc
0.25 ± 2% +51.6% 0.37 ± 3% perf-stat.i.metric.K/sec
144.73 ± 4% +239.5% 491.37 ± 6% perf-stat.i.metric.M/sec
89.29 +2.6 91.93 perf-stat.i.node-load-miss-rate%
12691022 ± 8% -56.3% 5550053 ± 12% perf-stat.i.node-load-misses
1504953 ± 13% -64.4% 535348 ± 15% perf-stat.i.node-loads
9964107 ± 8% -58.8% 4108905 ± 17% perf-stat.i.node-store-misses
15.10 ± 3% -84.9% 2.28 ± 11% perf-stat.overall.MPKI
0.73 -0.1 0.60 perf-stat.overall.branch-miss-rate%
6.86 ± 4% -71.0% 1.99 ± 6% perf-stat.overall.cpi
1401 ± 13% +139.9% 3361 ± 14% perf-stat.overall.cycles-between-cache-misses
0.00 ± 30% -0.0 0.00 ± 45% perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 22% -0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
84.11 +6.9 91.02 perf-stat.overall.iTLB-load-miss-rate%
763.81 ± 2% +51.8% 1159 perf-stat.overall.instructions-per-iTLB-miss
0.15 ± 4% +245.0% 0.50 ± 6% perf-stat.overall.ipc
89.44 +1.8 91.23 perf-stat.overall.node-load-miss-rate%
7276 -20.3% 5801 perf-stat.overall.path-length
3.401e+09 ± 4% +239.6% 1.155e+10 ± 6% perf-stat.ps.branch-instructions
24776511 ± 3% +181.3% 69696643 ± 4% perf-stat.ps.branch-misses
97040508 ± 12% -58.3% 40436979 ± 14% perf-stat.ps.cache-misses
2.945e+08 ± 2% -48.3% 1.522e+08 ± 11% perf-stat.ps.cache-references
6.141e+09 ± 4% +243.2% 2.108e+10 ± 6% perf-stat.ps.dTLB-loads
3.959e+09 ± 4% +257.0% 1.414e+10 ± 6% perf-stat.ps.dTLB-stores
25562318 ± 3% +126.2% 57814503 ± 5% perf-stat.ps.iTLB-load-misses
4826722 +17.7% 5679789 ± 2% perf-stat.ps.iTLB-loads
1.953e+10 ± 4% +243.3% 6.704e+10 ± 6% perf-stat.ps.instructions
12624818 ± 8% -56.3% 5522769 ± 12% perf-stat.ps.node-load-misses
1497174 ± 13% -64.4% 532776 ± 15% perf-stat.ps.node-loads
9912289 ± 8% -58.8% 4087930 ± 17% perf-stat.ps.node-store-misses
3.947e+12 ± 4% +243.4% 1.355e+13 ± 6% perf-stat.total.instructions
290.75 ± 51% -78.1% 63.75 ±128% interrupts.CPU17.RES:Rescheduling_interrupts
6339 ± 25% -35.3% 4101 ± 52% interrupts.CPU19.NMI:Non-maskable_interrupts
6339 ± 25% -35.3% 4101 ± 52% interrupts.CPU19.PMI:Performance_monitoring_interrupts
166.00 ± 46% -91.6% 14.00 ± 72% interrupts.CPU2.RES:Rescheduling_interrupts
429.75 ± 2% +14.0% 490.00 ± 12% interrupts.CPU20.CAL:Function_call_interrupts
6339 ± 25% -35.3% 4100 ± 52% interrupts.CPU20.NMI:Non-maskable_interrupts
6339 ± 25% -35.3% 4100 ± 52% interrupts.CPU20.PMI:Performance_monitoring_interrupts
6338 ± 25% -31.1% 4364 ± 46% interrupts.CPU21.NMI:Non-maskable_interrupts
6338 ± 25% -31.1% 4364 ± 46% interrupts.CPU21.PMI:Performance_monitoring_interrupts
6339 ± 25% -50.8% 3121 ± 14% interrupts.CPU23.NMI:Non-maskable_interrupts
6339 ± 25% -50.8% 3121 ± 14% interrupts.CPU23.PMI:Performance_monitoring_interrupts
68.50 ± 54% +202.2% 207.00 interrupts.CPU24.RES:Rescheduling_interrupts
3328 ± 45% +76.5% 5876 ± 33% interrupts.CPU25.NMI:Non-maskable_interrupts
3328 ± 45% +76.5% 5876 ± 33% interrupts.CPU25.PMI:Performance_monitoring_interrupts
39.75 ± 79% +423.9% 208.25 ± 2% interrupts.CPU25.RES:Rescheduling_interrupts
1766 ±112% -75.2% 438.25 ± 4% interrupts.CPU27.CAL:Function_call_interrupts
82.75 ± 49% -64.0% 29.75 ±122% interrupts.CPU27.TLB:TLB_shootdowns
439.50 ± 2% +74.2% 765.50 ± 38% interrupts.CPU3.CAL:Function_call_interrupts
494.25 ± 5% -10.5% 442.25 ± 5% interrupts.CPU30.CAL:Function_call_interrupts
61.00 ±127% +230.7% 201.75 interrupts.CPU30.RES:Rescheduling_interrupts
56.50 ±140% +255.3% 200.75 interrupts.CPU31.RES:Rescheduling_interrupts
1633 ±123% -73.3% 435.50 ± 3% interrupts.CPU32.CAL:Function_call_interrupts
56.75 ±141% +252.4% 200.00 interrupts.CPU33.RES:Rescheduling_interrupts
56.75 ±139% +227.3% 185.75 ± 12% interrupts.CPU34.RES:Rescheduling_interrupts
56.50 ±142% +185.8% 161.50 ± 39% interrupts.CPU35.RES:Rescheduling_interrupts
79.75 ± 36% -56.4% 34.75 ± 91% interrupts.CPU36.TLB:TLB_shootdowns
65.25 ±117% +176.6% 180.50 ± 30% interrupts.CPU39.RES:Rescheduling_interrupts
78.50 ± 44% -54.1% 36.00 ± 83% interrupts.CPU39.TLB:TLB_shootdowns
62.25 ±120% +151.8% 156.75 ± 45% interrupts.CPU43.RES:Rescheduling_interrupts
86.00 ± 45% -54.4% 39.25 ± 97% interrupts.CPU43.TLB:TLB_shootdowns
487.50 ± 10% -10.8% 434.75 ± 3% interrupts.CPU44.CAL:Function_call_interrupts
93.00 ± 46% -64.5% 33.00 ±119% interrupts.CPU46.TLB:TLB_shootdowns
7330 ± 12% -41.4% 4293 ± 33% interrupts.CPU5.NMI:Non-maskable_interrupts
7330 ± 12% -41.4% 4293 ± 33% interrupts.CPU5.PMI:Performance_monitoring_interrupts
169.25 ± 36% -90.8% 15.50 ± 71% interrupts.CPU5.RES:Rescheduling_interrupts
3285 ± 45% +92.3% 6318 ± 25% interrupts.CPU57.NMI:Non-maskable_interrupts
3285 ± 45% +92.3% 6318 ± 25% interrupts.CPU57.PMI:Performance_monitoring_interrupts
7323 ± 12% -51.2% 3572 ± 34% interrupts.CPU6.NMI:Non-maskable_interrupts
7323 ± 12% -51.2% 3572 ± 34% interrupts.CPU6.PMI:Performance_monitoring_interrupts
32.50 ± 78% +580.0% 221.00 ±125% interrupts.CPU63.TLB:TLB_shootdowns
7323 ± 12% -41.5% 4286 ± 33% interrupts.CPU7.NMI:Non-maskable_interrupts
7323 ± 12% -41.5% 4286 ± 33% interrupts.CPU7.PMI:Performance_monitoring_interrupts
175.50 ± 27% -80.3% 34.50 ± 37% interrupts.CPU72.RES:Rescheduling_interrupts
93.25 ± 45% -57.1% 40.00 ±115% interrupts.CPU72.TLB:TLB_shootdowns
7868 -45.2% 4311 ± 32% interrupts.CPU73.NMI:Non-maskable_interrupts
7868 -45.2% 4311 ± 32% interrupts.CPU73.PMI:Performance_monitoring_interrupts
7330 ± 12% -41.4% 4297 ± 33% interrupts.CPU75.NMI:Non-maskable_interrupts
7330 ± 12% -41.4% 4297 ± 33% interrupts.CPU75.PMI:Performance_monitoring_interrupts
163.50 ± 41% -84.9% 24.75 ±127% interrupts.CPU77.RES:Rescheduling_interrupts
7324 ± 12% -41.4% 4294 ± 33% interrupts.CPU78.NMI:Non-maskable_interrupts
7324 ± 12% -41.4% 4294 ± 33% interrupts.CPU78.PMI:Performance_monitoring_interrupts
161.25 ± 45% -91.5% 13.75 ±109% interrupts.CPU80.RES:Rescheduling_interrupts
7325 ± 12% -41.5% 4287 ± 33% interrupts.CPU81.NMI:Non-maskable_interrupts
7325 ± 12% -41.5% 4287 ± 33% interrupts.CPU81.PMI:Performance_monitoring_interrupts
95.00 ± 50% -59.7% 38.25 ±117% interrupts.CPU92.TLB:TLB_shootdowns
8991 ±108% +161.3% 23491 ± 19% softirqs.CPU2.SCHED
67870 ± 5% +8.4% 73546 ± 2% softirqs.CPU2.TIMER
23244 ± 25% -88.7% 2626 softirqs.CPU24.SCHED
83405 ± 17% -23.4% 63886 ± 2% softirqs.CPU24.TIMER
23963 ± 12% -88.4% 2784 ± 2% softirqs.CPU25.SCHED
83623 ± 19% -23.5% 63968 ± 2% softirqs.CPU25.TIMER
4276 ± 5% +97.6% 8448 ± 13% softirqs.CPU26.RCU
14129 ± 74% -81.4% 2631 ± 4% softirqs.CPU26.SCHED
17203 ± 53% -70.0% 5163 ± 89% softirqs.CPU27.SCHED
70966 ± 5% -10.4% 63583 ± 5% softirqs.CPU27.TIMER
19121 ± 47% -74.6% 4863 ± 88% softirqs.CPU28.SCHED
72354 ± 6% -10.4% 64858 ± 2% softirqs.CPU29.TIMER
9275 ±101% +151.3% 23309 ± 19% softirqs.CPU3.SCHED
19928 ± 46% -84.7% 3042 ± 7% softirqs.CPU30.SCHED
72106 ± 7% -11.8% 63632 ± 2% softirqs.CPU30.TIMER
19845 ± 45% -84.7% 3030 ± 6% softirqs.CPU31.SCHED
72345 ± 6% -10.8% 64523 softirqs.CPU31.TIMER
19559 ± 47% -84.2% 3094 ± 8% softirqs.CPU32.SCHED
19689 ± 47% -83.0% 3352 ± 2% softirqs.CPU33.SCHED
71873 ± 7% -9.4% 65131 softirqs.CPU33.TIMER
16286 ± 48% -63.6% 5928 ± 76% softirqs.CPU34.SCHED
11784 ± 76% +118.7% 25776 softirqs.CPU4.SCHED
70606 ± 5% -9.8% 63713 softirqs.CPU48.TIMER
71122 ± 4% -10.2% 63890 ± 5% softirqs.CPU49.TIMER
8863 ±108% +190.0% 25702 softirqs.CPU5.SCHED
20026 ± 49% -87.1% 2587 ± 5% softirqs.CPU50.SCHED
70832 ± 4% -10.7% 63286 softirqs.CPU50.TIMER
18874 ± 50% -86.1% 2631 ± 4% softirqs.CPU51.SCHED
71694 ± 5% -13.7% 61847 ± 3% softirqs.CPU51.TIMER
17403 ± 56% -85.3% 2560 softirqs.CPU52.SCHED
71831 ± 8% -11.0% 63942 ± 3% softirqs.CPU52.TIMER
20860 ± 49% -87.1% 2689 ± 2% softirqs.CPU53.SCHED
81014 ± 19% -23.0% 62345 ± 2% softirqs.CPU53.TIMER
20180 ± 50% -87.7% 2480 ± 9% softirqs.CPU54.SCHED
71917 ± 5% -12.3% 63071 softirqs.CPU54.TIMER
74057 ± 12% -16.4% 61946 ± 2% softirqs.CPU55.TIMER
20135 ± 50% -86.8% 2667 ± 4% softirqs.CPU56.SCHED
73377 ± 7% -13.4% 63523 ± 3% softirqs.CPU56.TIMER
23019 ± 19% -64.3% 8226 ±118% softirqs.CPU57.SCHED
75540 ± 5% -14.6% 64485 ± 4% softirqs.CPU57.TIMER
20267 ± 49% -59.4% 8236 ±118% softirqs.CPU58.SCHED
72755 ± 7% -11.1% 64699 ± 3% softirqs.CPU58.TIMER
72871 ± 7% -10.9% 64896 ± 4% softirqs.CPU59.TIMER
8781 ±108% +192.7% 25703 softirqs.CPU6.SCHED
72683 ± 7% -10.9% 64778 ± 4% softirqs.CPU60.TIMER
72665 ± 8% -11.1% 64612 ± 4% softirqs.CPU61.TIMER
72308 ± 5% -10.1% 64991 ± 6% softirqs.CPU65.TIMER
20301 ± 49% -58.5% 8419 ±118% softirqs.CPU66.SCHED
11380 ± 79% +123.7% 25453 softirqs.CPU7.SCHED
4027 ± 5% +111.8% 8530 ± 32% softirqs.CPU71.RCU
5823 ± 96% +357.6% 26649 softirqs.CPU72.SCHED
2461 ± 12% +952.7% 25914 softirqs.CPU73.SCHED
8475 ±117% +176.7% 23452 ± 20% softirqs.CPU75.SCHED
8462 ±116% +178.9% 23601 ± 19% softirqs.CPU76.SCHED
8459 ±117% +211.7% 26366 ± 2% softirqs.CPU77.SCHED
8511 ±117% +205.5% 26002 ± 2% softirqs.CPU79.SCHED
8854 ±105% +186.2% 25341 ± 2% softirqs.CPU8.SCHED
8450 ±116% +215.1% 26629 ± 2% softirqs.CPU80.SCHED
8496 ±117% +206.5% 26038 softirqs.CPU81.SCHED
4144 ± 6% +83.5% 7603 ± 21% softirqs.CPU82.RCU
8429 ±117% +179.7% 23575 ± 18% softirqs.CPU82.SCHED
8393 ±117% +138.6% 20028 ± 30% softirqs.CPU84.SCHED
8422 ±116% +140.8% 20281 ± 28% softirqs.CPU92.SCHED
4021 ± 7% +93.4% 7778 ± 29% softirqs.CPU95.RCU
415214 +63.4% 678631 ± 6% softirqs.RCU
38.06 ± 7% -38.1 0.00 perf-profile.calltrace.cycles-pp.__ext4_journal_start_sb.ext4_iomap_begin.iomap_apply.dax_iomap_rw.ext4_file_write_iter
36.28 ± 7% -36.3 0.00 perf-profile.calltrace.cycles-pp.jbd2__journal_start.__ext4_journal_start_sb.ext4_iomap_begin.iomap_apply.dax_iomap_rw
36.07 ± 7% -36.1 0.00 perf-profile.calltrace.cycles-pp.start_this_handle.jbd2__journal_start.__ext4_journal_start_sb.ext4_iomap_begin.iomap_apply
63.15 ± 7% -31.9 31.29 ± 12% perf-profile.calltrace.cycles-pp.ext4_iomap_begin.iomap_apply.dax_iomap_rw.ext4_file_write_iter.new_sync_write
11.15 ± 9% -11.1 0.00 perf-profile.calltrace.cycles-pp.__ext4_journal_stop.ext4_iomap_begin.iomap_apply.dax_iomap_rw.ext4_file_write_iter
10.95 ± 9% -11.0 0.00 perf-profile.calltrace.cycles-pp.jbd2_journal_stop.__ext4_journal_stop.ext4_iomap_begin.iomap_apply.dax_iomap_rw
8.81 ± 7% -8.8 0.00 perf-profile.calltrace.cycles-pp.stop_this_handle.jbd2_journal_stop.__ext4_journal_stop.ext4_iomap_begin.iomap_apply
8.49 ± 6% -8.5 0.00 perf-profile.calltrace.cycles-pp.add_transaction_credits.start_this_handle.jbd2__journal_start.__ext4_journal_start_sb.ext4_iomap_begin
5.93 ± 6% -5.9 0.00 perf-profile.calltrace.cycles-pp._raw_read_lock.start_this_handle.jbd2__journal_start.__ext4_journal_start_sb.ext4_iomap_begin
0.99 ± 9% +0.4 1.44 ± 19% perf-profile.calltrace.cycles-pp.ext4_write_checks.ext4_file_write_iter.new_sync_write.vfs_write.ksys_write
0.00 +1.0 0.96 ± 17% perf-profile.calltrace.cycles-pp.ext4_es_lookup_extent.ext4_map_blocks.ext4_iomap_begin.iomap_apply.dax_iomap_rw
0.00 +1.1 1.10 ± 20% perf-profile.calltrace.cycles-pp.__check_block_validity.ext4_map_blocks.ext4_iomap_begin.iomap_apply.dax_iomap_rw
0.00 +2.2 2.19 ± 17% perf-profile.calltrace.cycles-pp.ext4_map_blocks.ext4_iomap_begin.iomap_apply.dax_iomap_rw.ext4_file_write_iter
1.94 ± 16% +6.6 8.49 ± 13% perf-profile.calltrace.cycles-pp.__copy_user_nocache.__copy_user_flushcache._copy_from_iter_flushcache.dax_iomap_actor.iomap_apply
1.95 ± 16% +6.6 8.54 ± 13% perf-profile.calltrace.cycles-pp.__copy_user_flushcache._copy_from_iter_flushcache.dax_iomap_actor.iomap_apply.dax_iomap_rw
1.99 ± 16% +6.7 8.70 ± 13% perf-profile.calltrace.cycles-pp._copy_from_iter_flushcache.dax_iomap_actor.iomap_apply.dax_iomap_rw.ext4_file_write_iter
7.86 ± 11% +12.8 20.70 ± 13% perf-profile.calltrace.cycles-pp._raw_read_lock.jbd2_transaction_committed.ext4_set_iomap.ext4_iomap_begin.iomap_apply
1.73 ± 15% +13.7 15.42 ± 27% perf-profile.calltrace.cycles-pp.__srcu_read_unlock.dax_iomap_actor.iomap_apply.dax_iomap_rw.ext4_file_write_iter
12.86 ± 7% +14.8 27.69 ± 13% perf-profile.calltrace.cycles-pp.jbd2_transaction_committed.ext4_set_iomap.ext4_iomap_begin.iomap_apply.dax_iomap_rw
13.14 ± 7% +15.7 28.81 ± 13% perf-profile.calltrace.cycles-pp.ext4_set_iomap.ext4_iomap_begin.iomap_apply.dax_iomap_rw.ext4_file_write_iter
3.87 ± 14% +20.9 24.76 ± 20% perf-profile.calltrace.cycles-pp.dax_iomap_actor.iomap_apply.dax_iomap_rw.ext4_file_write_iter.new_sync_write
38.74 ± 7% -38.1 0.65 ± 8% perf-profile.children.cycles-pp.__ext4_journal_start_sb
36.93 ± 7% -36.3 0.61 ± 7% perf-profile.children.cycles-pp.jbd2__journal_start
36.73 ± 7% -36.1 0.60 ± 7% perf-profile.children.cycles-pp.start_this_handle
63.15 ± 7% -31.9 31.30 ± 12% perf-profile.children.cycles-pp.ext4_iomap_begin
11.21 ± 9% -11.2 0.01 ±173% perf-profile.children.cycles-pp.__ext4_journal_stop
11.01 ± 9% -11.0 0.01 ±173% perf-profile.children.cycles-pp.jbd2_journal_stop
8.83 ± 7% -8.8 0.00 perf-profile.children.cycles-pp.stop_this_handle
8.64 ± 7% -8.5 0.14 ± 8% perf-profile.children.cycles-pp.add_transaction_credits
0.00 +0.1 0.05 ± 8% perf-profile.children.cycles-pp.timestamp_truncate
0.00 +0.1 0.06 ± 15% perf-profile.children.cycles-pp.pmem_dax_direct_access
0.00 +0.1 0.06 ± 14% perf-profile.children.cycles-pp.fsnotify_parent
0.00 +0.1 0.06 ± 11% perf-profile.children.cycles-pp.file_modified
0.00 +0.1 0.07 ± 12% perf-profile.children.cycles-pp.aa_file_perm
0.00 +0.1 0.07 ± 12% perf-profile.children.cycles-pp.apparmor_file_permission
0.00 +0.1 0.07 ± 15% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
0.00 +0.1 0.08 ± 10% perf-profile.children.cycles-pp.__pmem_direct_access
0.00 +0.1 0.09 ± 9% perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
0.00 +0.1 0.09 ± 7% perf-profile.children.cycles-pp.__might_sleep
0.00 +0.1 0.09 ± 13% perf-profile.children.cycles-pp._cond_resched
0.00 +0.1 0.10 ± 12% perf-profile.children.cycles-pp.___might_sleep
0.00 +0.1 0.12 ± 12% perf-profile.children.cycles-pp.fsnotify
0.04 ± 57% +0.1 0.18 ± 7% perf-profile.children.cycles-pp.__fdget_pos
0.00 +0.1 0.14 ± 7% perf-profile.children.cycles-pp.__fget_light
0.00 +0.2 0.15 ± 10% perf-profile.children.cycles-pp.up_write
0.01 ±173% +0.2 0.17 ± 6% perf-profile.children.cycles-pp.current_time
0.00 +0.2 0.16 ± 11% perf-profile.children.cycles-pp.dax_direct_access
0.06 ± 7% +0.2 0.23 ± 11% perf-profile.children.cycles-pp.__sb_start_write
0.00 +0.2 0.18 ± 72% perf-profile.children.cycles-pp.generic_write_checks
0.04 ± 57% +0.2 0.22 ± 8% perf-profile.children.cycles-pp.__srcu_read_lock
0.06 ± 7% +0.2 0.26 ± 11% perf-profile.children.cycles-pp.entry_SYSCALL_64
0.06 +0.2 0.26 ± 14% perf-profile.children.cycles-pp.common_file_perm
0.05 ± 9% +0.2 0.28 ± 11% perf-profile.children.cycles-pp.down_write
0.00 +0.2 0.23 ± 60% perf-profile.children.cycles-pp.ext4_generic_write_checks
0.09 ± 5% +0.3 0.34 ± 13% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.09 ± 5% +0.3 0.37 ± 14% perf-profile.children.cycles-pp.security_file_permission
0.10 ± 8% +0.4 0.54 ± 25% perf-profile.children.cycles-pp.ext4_inode_block_valid
0.99 ± 9% +0.4 1.44 ± 19% perf-profile.children.cycles-pp.ext4_write_checks
0.04 ± 57% +0.5 0.51 ± 31% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ±173% +0.5 0.65 ± 42% perf-profile.children.cycles-pp.start_kernel
0.17 ± 11% +0.8 0.96 ± 17% perf-profile.children.cycles-pp.ext4_es_lookup_extent
0.19 ± 14% +0.9 1.11 ± 20% perf-profile.children.cycles-pp.__check_block_validity
0.39 ± 12% +1.8 2.20 ± 17% perf-profile.children.cycles-pp.ext4_map_blocks
1.94 ± 16% +6.6 8.50 ± 13% perf-profile.children.cycles-pp.__copy_user_nocache
1.95 ± 16% +6.6 8.54 ± 13% perf-profile.children.cycles-pp.__copy_user_flushcache
1.99 ± 16% +6.7 8.70 ± 13% perf-profile.children.cycles-pp._copy_from_iter_flushcache
13.96 ± 9% +7.1 21.04 ± 13% perf-profile.children.cycles-pp._raw_read_lock
1.73 ± 15% +13.7 15.43 ± 27% perf-profile.children.cycles-pp.__srcu_read_unlock
12.87 ± 7% +14.8 27.70 ± 13% perf-profile.children.cycles-pp.jbd2_transaction_committed
13.15 ± 7% +15.7 28.82 ± 13% perf-profile.children.cycles-pp.ext4_set_iomap
3.88 ± 14% +20.9 24.78 ± 20% perf-profile.children.cycles-pp.dax_iomap_actor
21.95 ± 7% -21.6 0.35 ± 8% perf-profile.self.cycles-pp.start_this_handle
8.79 ± 7% -8.8 0.00 perf-profile.self.cycles-pp.stop_this_handle
8.60 ± 7% -8.5 0.14 ± 8% perf-profile.self.cycles-pp.add_transaction_credits
0.00 +0.1 0.05 ± 8% perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
0.00 +0.1 0.06 ± 9% perf-profile.self.cycles-pp.current_time
0.00 +0.1 0.06 ± 11% perf-profile.self.cycles-pp.aa_file_perm
0.00 +0.1 0.06 ± 20% perf-profile.self.cycles-pp.apparmor_file_permission
0.00 +0.1 0.07 ± 20% perf-profile.self.cycles-pp.generic_write_checks
0.00 +0.1 0.07 ± 15% perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
0.00 +0.1 0.08 ± 6% perf-profile.self.cycles-pp.__might_sleep
0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.__pmem_direct_access
0.00 +0.1 0.08 ± 13% perf-profile.self.cycles-pp.__sb_start_write
0.00 +0.1 0.09 ± 13% perf-profile.self.cycles-pp.ksys_write
0.00 +0.1 0.10 ± 12% perf-profile.self.cycles-pp.___might_sleep
0.00 +0.1 0.11 ± 16% perf-profile.self.cycles-pp.dax_iomap_rw
0.00 +0.1 0.11 ± 11% perf-profile.self.cycles-pp.fsnotify
0.00 +0.1 0.12 ± 67% perf-profile.self.cycles-pp.file_update_time
0.00 +0.1 0.13 ± 8% perf-profile.self.cycles-pp.__fget_light
0.00 +0.1 0.13 ± 9% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.00 +0.1 0.14 ± 15% perf-profile.self.cycles-pp.ext4_map_blocks
0.00 +0.2 0.15 ± 12% perf-profile.self.cycles-pp._copy_from_iter_flushcache
0.04 ± 57% +0.2 0.19 ± 15% perf-profile.self.cycles-pp.common_file_perm
0.00 +0.2 0.15 ± 10% perf-profile.self.cycles-pp.up_write
0.00 +0.2 0.17 ± 10% perf-profile.self.cycles-pp.down_write
0.04 ± 57% +0.2 0.21 ± 10% perf-profile.self.cycles-pp.dax_iomap_actor
0.01 ±173% +0.2 0.20 ± 11% perf-profile.self.cycles-pp.vfs_write
0.00 +0.2 0.18 ± 15% perf-profile.self.cycles-pp.do_syscall_64
0.08 ± 5% +0.2 0.28 ± 8% perf-profile.self.cycles-pp.ext4_iomap_begin
0.06 ± 15% +0.2 0.25 ± 11% perf-profile.self.cycles-pp.ext4_es_lookup_extent
0.06 ± 7% +0.2 0.26 ± 11% perf-profile.self.cycles-pp.entry_SYSCALL_64
0.01 ±173% +0.2 0.22 ± 10% perf-profile.self.cycles-pp.__srcu_read_lock
0.09 ± 5% +0.3 0.34 ± 13% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.00 +0.3 0.31 ± 80% perf-profile.self.cycles-pp.new_sync_write
0.11 ± 7% +0.3 0.45 ± 9% perf-profile.self.cycles-pp.iomap_apply
0.04 ± 57% +0.4 0.47 ± 32% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.10 ± 8% +0.4 0.53 ± 25% perf-profile.self.cycles-pp.ext4_inode_block_valid
0.25 ± 12% +0.5 0.70 ± 25% perf-profile.self.cycles-pp.ext4_file_write_iter
0.09 ± 27% +0.5 0.56 ± 21% perf-profile.self.cycles-pp.__check_block_validity
0.27 ± 18% +0.8 1.11 ± 28% perf-profile.self.cycles-pp.ext4_set_iomap
4.99 ± 6% +2.0 6.95 ± 14% perf-profile.self.cycles-pp.jbd2_transaction_committed
1.93 ± 16% +6.5 8.46 ± 13% perf-profile.self.cycles-pp.__copy_user_nocache
13.90 ± 9% +7.0 20.92 ± 13% perf-profile.self.cycles-pp._raw_read_lock
1.73 ± 15% +13.6 15.35 ± 27% perf-profile.self.cycles-pp.__srcu_read_unlock



fio.write_bw_MBps

60000 +-------------------------------------------------------------------+
55000 |-+ O |
| O O O |
50000 |-+ O O O O O O |
45000 |-+ O O O O O O O |
40000 |-O O O O O |
35000 |-+ |
| |
30000 |-+ |
25000 |-+ |
20000 |-+ |
15000 |-+ |
|.+..+.+.+.+..+.+.+.+..+.+.+. .+. .+..+.+.+.+..+.+.+. .+. .+.|
10000 |-+ +. + +..+ +.+. |
5000 +-------------------------------------------------------------------+


fio.write_iops

1.6e+07 +-----------------------------------------------------------------+
| O |
1.4e+07 |-+ |
| O O O O O |
1.2e+07 |-+ O O O O O |
| O O O O O O O O O O |
1e+07 |-+ O |
| |
8e+06 |-+ |
| |
6e+06 |-+ |
| |
4e+06 |-+ |
|.+.+..+.+.+.+.+..+.+.+.+.+..+.+.+.+.+..+.+.+.+.+..+.+.+.+.+..+.+.|
2e+06 +-----------------------------------------------------------------+


fio.write_clat_mean_us

20000 +-------------------------------------------------------------------+
| +.+.. |
18000 |-+ .+..+.+.+.. .+..+. + |
16000 |.+..+. .+.+..+. .+.+.. .+.+ +.+.+.+..+.+.+ + +.|
| + + + |
14000 |-+ |
12000 |-+ |
| |
10000 |-+ |
8000 |-+ |
| |
6000 |-+ |
4000 |-O O O O O O O O O |
| O O O O O O O O O O O O O |
2000 +-------------------------------------------------------------------+


fio.write_clat_90%_us

35000 +-------------------------------------------------------------------+
| |
30000 |-+ + .+. .+.. .+ + |
|.+.. +. .+ : + .+. .+. + .+. +.+.+.+. : : + .+.|
25000 |-+ +. + +. + : +..+ + + +. .. : : +. |
| + + + + |
20000 |-+ |
| |
15000 |-+ |
| |
10000 |-+ |
| O O O O O O O O O O |
5000 |-+ O O O O O O O O O O O O |
| |
0 +-------------------------------------------------------------------+


fio.write_clat_95%_us

40000 +-------------------------------------------------------------------+
| |
35000 |-+ .+. .+.. + |
| +. +. + +. + +. .+ :+ |
30000 |.+.. : +..+ : +.. + + + +.+.+ + +.+.+. : : +..+.|
| +. : + : + + + + : : |
25000 |-+ + + + + |
| |
20000 |-+ |
| |
15000 |-+ |
| |
10000 |-+ O O O O O O O O |
| O O O O O O O O O O O O O |
5000 +-------------------------------------------------------------------+


fio.latency_4us%

70 +----------------------------------------------------------------------+
| O |
60 |-+ O |
| O O O O O O O O O |
50 |-+ O O O O O O |
| O O O O |
40 |-+ O |
| |
30 |-+ |
| |
20 |-+ |
| |
10 |-+ |
| |
0 +----------------------------------------------------------------------+


fio.latency_50us%

45 +----------------------------------------------------------------------+
| + |
40 |-+ .+ :: |
35 |-+ + + +.+..+.+ .+ : : : .+ |
| + :+ + :: +. .. : +. .+. : : +. :|
30 |+++ : + + + : : : + : : + : : :|
25 |-+ + : + + : +.. : +..+.+ : : : |
| +.: + + + : : |
20 |-+ + + + |
15 |-+ |
| |
10 |-+ |
5 |-+ |
| |
0 +----------------------------------------------------------------------+


fio.workload

3e+09 +-----------------------------------------------------------------+
| |
| O O O |
2.5e+09 |-+ O O O O O |
| O O O O O O |
| O O O O O O O |
2e+09 |-+ |
| |
1.5e+09 |-+ |
| |
| |
1e+09 |-+ |
| |
|. .+..+.+.+. .+..+. .+. .+.. |
5e+08 +-----------------------------------------------------------------+


fio.time.user_time

1100 +--------------------------------------------------------------------+
| O |
1000 |-+ O O O O |
900 |-+ O O O O O |
| O O O O O O |
800 |-O O O O O |
700 |-+ |
| |
600 |-+ |
500 |-+ |
| |
400 |-+ + |
300 |-+.. + |
|.+ +.+..+.+.+.+..+.+.+..+. .+.+..+.+.+..+.+. .+.+. .+.|
200 +--------------------------------------------------------------------+


fio.time.system_time

9400 +--------------------------------------------------------------------+
9300 |-+ .+.+..+.+. .+.. .+.+.. |
|.+.. +.+..+.+.+.+..+.+.+..+ +.+..+.+.+..+.+ +.+ +.|
9200 |-+ + |
9100 |-+ + |
| |
9000 |-+ |
8900 |-+ |
8800 |-+ |
| O |
8700 |-O O O O O O O O O O |
8600 |-+ O O O O O |
| O O O O |
8500 |-+ O |
8400 +--------------------------------------------------------------------+


fio.time.voluntary_context_switches

24500 +-------------------------------------------------------------------+
| + + |
24000 |-+ + : : + |
|: + + +. + : : + + |
|: + + + .. +.+. .+. .+ + +. .+. + + |
23500 |-+ + +.+ +..+ + +.+.+. +.+.+..+ +.+..+.|
| |
23000 |-+ |
| |
22500 |-+ |
| O O |
| O O |
22000 |-+ O O O O O O O O |
| O O O O O O O O O O |
21500 +-------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
config-5.8.0-rc4-00047-g4e8fc10115a69 (172.08 kB)
job-script (8.46 kB)
job.yaml (5.86 kB)
reproduce (971.00 B)

2020-10-03 04:50:53

by Theodore Ts'o

Subject: Re: [PATCHv3 1/1] ext4: Optimize file overwrites

On Fri, Sep 18, 2020 at 10:36:35AM +0530, Ritesh Harjani wrote:
> If the file already has the underlying blocks/extents allocated,
> then we don't need to start a journal txn and can directly return
> the underlying mapping. Currently ext4_iomap_begin() is used by
> both the DAX & DIO paths. We can check whether the write request is
> an overwrite & then directly return the mapping information.
>
> This can give a significant perf boost for multi-threaded writes,
> especially random overwrites.
> On a PPC64 VM with a simulated pmem (DAX) device, a ~10x perf
> improvement was seen in random writes (overwrites), largely because
> this optimizes away the spinlock contention during jbd2 slab cache
> allocation (jbd2_journal_handle). On an x86 VM, a ~2x perf
> improvement was observed.
>
> Reported-by: Dan Williams <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Ritesh Harjani <[email protected]>

Thanks, applied.

- Ted