2020-12-18 20:44:21

by Matteo Croce

Subject: discard and data=writeback

Hi,

I noticed a big slowdown on file removal, so I tried removing the
discard option, and it helped a lot.
Obviously discarding blocks will have some overhead, but the strange
thing is that it only shows up when using data=writeback:

Ordered:

$ dmesg |grep EXT4
[ 0.243372] EXT4-fs (vda1): mounted filesystem with ordered data
mode. Opts: (null)

$ grep -w / /proc/mounts
/dev/root / ext4 rw,noatime 0 0
$ time rm -rf linux-5.10

real 0m0.454s
user 0m0.029s
sys 0m0.409s

$ grep -w / /proc/mounts
/dev/root / ext4 rw,noatime,discard 0 0
$ time rm -rf linux-5.10

real 0m0.554s
user 0m0.051s
sys 0m0.403s

Writeback:

$ dmesg |grep EXT4
[ 0.243909] EXT4-fs (vda1): mounted filesystem with writeback data
mode. Opts: (null)

$ grep -w / /proc/mounts
/dev/root / ext4 rw,noatime 0 0
$ time rm -rf linux-5.10

real 0m0.440s
user 0m0.030s
sys 0m0.407s

$ grep -w / /proc/mounts
/dev/root / ext4 rw,noatime,discard 0 0
$ time rm -rf linux-5.10

real 0m3.763s
user 0m0.030s
sys 0m0.876s

It seems that ext4_issue_discard() is called ~300 times with data=ordered
and ~50k times with data=writeback.
I'm using vanilla 5.10.1 kernel.

Any thoughts?

Regards,
--
per aspera ad upstream


2020-12-21 03:06:05

by Theodore Ts'o

Subject: Re: discard and data=writeback

On Fri, Dec 18, 2020 at 07:40:09PM +0100, Matteo Croce wrote:
>
> I noticed a big slowdown on file removal, so I tried removing the
> discard option, and it helped a lot.
> Obviously discarding blocks will have some overhead, but the strange
> thing is that it only shows up when using data=writeback:

If the data=ordered mount option is enabled, when you have allocating
buffered writes pending, the data block writes are forced out *before*
we write out the journal blocks, followed by a cache flush, followed
by the commit block (which is either written with the Forced Unit
Attention bit set if the storage device supports this, or the commit
block is followed by another cache flush). After the journal commit
block is written out, if the discard mount option is enabled, all
blocks that were released during the last journal transaction are
then discarded.

If data=writeback is enabled, then we do *not* flush out any dirty
pages in the page cache that were allocated during the previous
transaction. This means that if you crash, it is possible that
freshly created inodes containing freshly allocated blocks may have
stale data in those newly allocated blocks. These blocks might include
some other users' e-mails, medical records, cryptographic keys, or
other PII. Which is why data=ordered is the default.

So if data=ordered vs data=writeback makes any difference, the first
question I'd have to ask is whether there were any dirty pages in the
page cache, or any background writes happening in parallel with the
rm -rf command.
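
A quick way to check is to look at the Dirty and Writeback counters in
/proc/meminfo just before starting the rm, for example:

$ grep -E '^(Dirty|Writeback):' /proc/meminfo

If both are at (or very close to) zero, there is no buffered data still
waiting to be written back.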

> It seems that ext4_issue_discard() is called ~300 times with data=ordered
> and ~50k times with data=writeback.

ext4_issue_discard() gets called for each contiguous set of blocks
that were released in a particular jbd2 transaction. So if you are
deleting 100 files, and all of those files are unlinked in a single
transaction, and all of those blocks belonging to those files belong
to a single contiguous block region, then ext4_issue_discard() will be
called only once. If you delete a single file, but all of its blocks
are heavily fragmented, then ext4_issue_discard() might be called a
thousand times.

If you delete 100 files, all of which are contiguous, but each file is
in a different part of the disk, then ext4_issue_discard() might be
called 100 times.
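
If you want to sanity check how many discard requests actually reach the
device (as opposed to counting ext4_issue_discard() calls), one option is
to diff the discard counters that the block layer keeps in sysfs around
the rm -rf, e.g. something like:

$ cat /sys/block/vda/stat

On a 5.10 kernel the 12th field should be the number of completed discard
requests and the 14th the number of discarded sectors.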

So that implies that your experiment may not be repeatable; did you
make sure the file system was freshly reformatted before you wrote out
the files in the directory you are deleting? And was the directory
written out in exactly the same way? And did you make sure all of the
writes were flushed out to disk before you tried timing the "rm -rf"
command? And did you make sure that there weren't any other processes
running that might be issuing other file system operations (either
data or metadata heavy) that might be interfering with the "rm -rf"
operation? What kind of storage device were you using? (An SSD; a
USB thumb drive; some kind of Cloud emulated block device?)

Note that benchmarking the file system operations is *hard*. When I
worked with a graduate student working on a paper describing a
prototype of a file system enhancement to ext4 to optimize ext4 for
drive-managed SMR drives[1], the graduate student spent *way* more
time getting reliable, repeatable benchmarks than making changes to
ext4 for the prototype. (It turns out the SMR GC operations caused
variations in write speeds, which meant the writeback throughput
measurements would fluctuate wildly, which then influenced the
writeback cache ratio, which in turn massively influenced how
aggressively the writeback threads would behave, which in turn
massively influenced the filebench and postmark numbers.)

[1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev

So there can be variability caused by how blocks are allocated at the
file system; how the SSD is assigning blocks to flash erase blocks;
how the SSD's GC operation influences its write speed, which can in
turn influence the kernel's measured writeback throughput; different
SSD's or Cloud block devices can have very different discard
performance that can vary based on past write history, yadda, yadda,
yadda.

Cheers,

- Ted

2020-12-22 15:43:46

by Matteo Croce

Subject: Re: discard and data=writeback

On Mon, Dec 21, 2020 at 4:04 AM Theodore Y. Ts'o <[email protected]> wrote:
>
> So that implies that your experiment may not be repeatable; did you
> make sure the file system was freshly reformatted before you wrote out
> the files in the directory you are deleting? And was the directory
> written out in exactly the same way? And did you make sure all of the
> writes were flushed out to disk before you tried timing the "rm -rf"
> command? And did you make sure that there weren't any other processes
> running that might be issuing other file system operations (either
> data or metadata heavy) that might be interfering with the "rm -rf"
> operation? What kind of storage device were you using? (An SSD; a
> USB thumb drive; some kind of Cloud emulated block device?)
>

I got another machine with a faster NVMe disk. I discarded the whole
drive before partitioning it; this drive is very fast at discarding
blocks:
# time blkdiscard -f /dev/nvme0n1p1

real 0m1.356s
user 0m0.003s
sys 0m0.000s

Also, the drive is pretty big compared to the dataset size, so it's
unlikely to be fragmented:

# lsblk /dev/nvme0n1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 1.7T 0 disk
└─nvme0n1p1 259:1 0 1.7T 0 part /media
# df -h /media
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 1.8T 1.2G 1.7T 1% /media
# du -sh /media/linux-5.10/
1.1G /media/linux-5.10/

I'm issuing sync + sleep(10) after the extraction, so the writes
should all be flushed.
Also, I repeated the test three times, with very similar results:

# dmesg |grep EXT4-fs
[12807.847559] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: data=ordered,discard

# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 0m1.607s
user 0m0.048s
sys 0m1.559s
# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 0m1.634s
user 0m0.080s
sys 0m1.553s
# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 0m1.604s
user 0m0.052s
sys 0m1.552s


# dmesg |grep EXT4-fs
[13133.953978] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
data mode. Opts: data=writeback,discard

# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 1m29.443s
user 0m0.073s
sys 0m2.520s
# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 1m29.409s
user 0m0.081s
sys 0m2.518s
# tar xf ~/linux-5.10.tar ; sync ; sleep 10
# time rm -rf linux-5.10/

real 1m19.283s
user 0m0.068s
sys 0m2.505s

> Note that benchmarking the file system operations is *hard*. When I
> worked with a graduate student working on a paper describing a
> prototype of a file system enhancement to ext4 to optimize ext4 for
> drive-managed SMR drives[1], the graduate student spent *way* more
> time getting reliable, repeatable benchmarks than making changes to
> ext4 for the prototype. (It turns out the SMR GC operations caused
> variations in write speeds, which meant the writeback throughput
> measurements would fluctuate wildly, which then influenced the
> writeback cache ratio, which in turn massively influenced how
> aggressively the writeback threads would behave, which in turn
> massively influenced the filebench and postmark numbers.)
>
> [1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev
>

Interesting!

Cheers,
--
per aspera ad upstream

2020-12-22 16:35:39

by Theodore Ts'o

Subject: Re: discard and data=writeback

On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
>
> I'm issuing sync + sleep(10) after the extraction, so the writes
> should all be flushed.
> Also, I repeated the test three times, with very similar results:

So that means the problem is not due to page cache writeback
interfering with the discards. So it's most likely that the problem
is due to how the blocks are allocated and laid out when using
data=ordered vs data=writeback.

Some experiments to try next. After extracting the files with
data=ordered and data=writeback on a freshly formatted file system,
use "e2freefrag" to see how the free space is fragmented. This will
tell us how the file system is doing from a holistic perspective, in
terms of blocks allocated to the extracted files. (E2freefrag is
showing you the blocks *not* allocated, of course, but that's a mirror
image dual of the blocks that *are* allocated, especially if you start
from an identical known state; hence the use of a freshly formatted
file system.)

Next, we can see what individual files look like with respect to
fragmentation. This can be done by running filefrag on all of the
files, e.g.:

find . -type f -print0 | xargs -0 filefrag
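
If there are a lot of files, the output can be summarized with something
like:

find . -type f -print0 | xargs -0 filefrag | awk -F: '{print $2}' | sort | uniq -c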

Another way to get similar (although not identical) information is via
running "e2fsck -E fragcheck" on a file system. How they differ is
especially more of a big deal on ext3 file systems without extents and
flex_bg, since filefrag tries to take into account metadata blocks
such as indirect blocks and extent tree blocks, and e2fsck -E
fragcheck does not; but it's good enough for getting a good gestalt
for the files' overall fragmentation --- and note that as long as the
average fragment size is at least a megabyte or two, some
fragmentation really isn't that much of a problem from a real-world
performance perspective. People can get way too invested in trying to
get to perfection with 100% fragmentation-free files. The problem
with doing this at the expense of all else is that you can end up
making the overall free space fragmentation worse as the file system
ages, at which point the file system performance really dives through
the floor as the file system approaches 100%, or even 80-90% full,
especially on HDD's. For SSD's fragmentation doesn't matter quite so
much, unless the average fragment size is *really* small, and when you
are discarding freed blocks.

Even if the files are showing no substantial difference in
fragmentation, and the free space is equally A-OK with respect to
fragmentation, the other possibility is that the *layout* of the blocks is
such that the order in which they are deleted using rm -rf ends up
being less friendly from a discard perspective. This can happen if
the directory hierarchy is big enough, and/or the journal size is
small enough, that the rm -rf requires multiple journal transactions
to complete. That's because with mount -o discard, we do the discards
after each transaction commit, and it might be that even though the
used blocks are perfectly contiguous, because of the order in which
the files end up getting deleted, we end up needing to discard them in
smaller chunks.

For example, one could imagine a case where you have a million 4k
files, and they are allocated contiguously, but if you get
super-unlucky, such that in the first transaction you delete all of
the odd-numbered files, and in second transaction you delete all of
the even-numbered files, you might need to do a million 4k discards
--- but if all of the deletes could fit into a single transaction, you
would only need to do a single million block discard operation.
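
(You can check how big the journal actually is with something like
"dumpe2fs -h /dev/nvme0n1p1 | grep -i journal"; the device name here is
just an example.)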

Finally, you may want to consider whether or not mount -o discard
really makes sense or not. For most SSD's, especially high-end SSD's,
it probably doesn't make that much difference. That's because when
you overwrite a sector, the SSD knows (or should know; this might not
be true for some really cheap, crappy low-end flash devices, but on
those devices discard might not be making much of a difference anyway)
that the old contents of the sector are no longer needed. Hence an
overwrite effectively is an "implied discard". So long as there is a
sufficient number of free erase blocks, the SSD might be able to keep
up doing the GC for those "implied discards", and so accelerating the
process by sending explicit discards after every journal transaction
might not be necessary. Or, maybe it's sufficient to run "fstrim"
every week at Sunday 3am local time; or maybe even fstrim once a night
or fstrim once a month --- your mileage may vary.
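
If you do go the periodic fstrim route, note that util-linux already
ships a systemd timer for it, so on most distributions it is simply a
matter of:

# systemctl enable --now fstrim.timer

or, equivalently, a weekly cron job that runs "fstrim -av".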

It's going to vary from SSD to SSD and from workload to workload, but
you might find that mount -o discard isn't buying you all that much
--- if you run a random write workload, and you don't notice any
performance degradation, and you don't notice an increase in the SSD's
write amplification numbers (if they are provided by your SSD), then
you might very well find that it's not worth it to use mount -o
discard.

I personally don't bother using mount -o discard, and instead
periodically run fstrim, on my personal machines. Part of that is
because I'm mostly just reading and replying to emails, building
kernels and editing text files, and that is not nearly as stressful on
the FTL as a full-blown random write workload (for example, if you
were running a database supporting a transaction processing workload).

Cheers,

- Ted

2020-12-22 22:55:01

by Andreas Dilger

Subject: Re: discard and data=writeback

On Dec 22, 2020, at 9:34 AM, Theodore Y. Ts'o <[email protected]> wrote:
>
> On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
>>
>> I'm issuing sync + sleep(10) after the extraction, so the writes
>> should all be flushed.
>> Also, I repeated the test three times, with very similar results:
>
> So that means the problem is not due to page cache writeback
> interfering with the discards. So it's most likely that the problem
> is due to how the blocks are allocated and laid out when using
> data=ordered vs data=writeback.
>
> Some experiments to try next. After extracting the files with
> data=ordered and data=writeback on a freshly formatted file system,
> use "e2freefrag" to see how the free space is fragmented. This will
> tell us how the file system is doing from a holistic perspective, in
> terms of blocks allocated to the extracted files. (E2freefrag is
> showing you the blocks *not* allocated, of course, but that's a mirror
> image dual of the blocks that *are* allocated, especially if you start
> from an identical known state; hence the use of a freshly formatted
> file system.)
>
> Next, we can see what individual files look like with respect to
> fragmentation. This can be done by running filefrag on all of the
> files, e.g.:
>
> find . -type f -print0 | xargs -0 filefrag
>
> Another way to get similar (although not identical) information is via
> running "e2fsck -E fragcheck" on a file system. How they differ is
> especially more of a big deal on ext3 file systems without extents and
> flex_bg, since filefrag tries to take into account metadata blocks
> such as indirect blocks and extent tree blocks, and e2fsck -E
> fragcheck does not; but it's good enough for getting a good gestalt
> for the files' overall fragmentation --- and note that as long as the
> average fragment size is at least a megabyte or two, some
> fragmentation really isn't that much of a problem from a real-world
> performance perspective. People can get way too invested in trying to
> get to perfection with 100% fragmentation-free files. The problem
> with doing this at the expense of all else is that you can end up
> making the overall free space fragmentation worse as the file system
> ages, at which point the file system performance really dives through
> the floor as the file system approaches 100%, or even 80-90% full,
> especially on HDD's. For SSD's fragmentation doesn't matter quite so
> much, unless the average fragment size is *really* small, and when you
> are discarding freed blocks.
>
> Even if the files are showing no substantial difference in
> fragmentation, and the free space is equally A-OK with respect to
> fragmentation, the other possibility is that the *layout* of the blocks is
> such that the order in which they are deleted using rm -rf ends up
> being less friendly from a discard perspective. This can happen if
> the directory hierarchy is big enough, and/or the journal size is
> small enough, that the rm -rf requires multiple journal transactions
> to complete. That's because with mount -o discard, we do the discards
> after each transaction commit, and it might be that even though the
> used blocks are perfectly contiguous, because of the order in which
> the files end up getting deleted, we end up needing to discard them in
> smaller chunks.
>
> For example, one could imagine a case where you have a million 4k
> files, and they are allocated contiguously, but if you get
> super-unlucky, such that in the first transaction you delete all of
> the odd-numbered files, and in second transaction you delete all of
> the even-numbered files, you might need to do a million 4k discards
> --- but if all of the deletes could fit into a single transaction, you
> would only need to do a single million block discard operation.
>
> Finally, you may want to consider whether or not mount -o discard
> really makes sense or not. For most SSD's, especially high-end SSD's,
> it probably doesn't make that much difference. That's because when
> you overwrite a sector, the SSD knows (or should know; this might not
> be true for some really cheap, crappy low-end flash devices, but on
> those devices discard might not be making much of a difference anyway)
> that the old contents of the sector are no longer needed. Hence an
> overwrite effectively is an "implied discard". So long as there is a
> sufficient number of free erase blocks, the SSD might be able to keep
> up doing the GC for those "implied discards", and so accelerating the
> process by sending explicit discards after every journal transaction
> might not be necessary. Or, maybe it's sufficient to run "fstrim"
> every week at Sunday 3am local time; or maybe even fstrim once a night
> or fstrim once a month --- your mileage may vary.
>
> It's going to vary from SSD to SSD and from workload to workload, but
> you might find that mount -o discard isn't buying you all that much
> --- if you run a random write workload, and you don't notice any
> performance degradation, and you don't notice an increase in the SSD's
> write amplification numbers (if they are provided by your SSD), then
> you might very well find that it's not worth it to use mount -o
> discard.
>
> I personally don't bother using mount -o discard, and instead
> periodically run fstrim, on my personal machines. Part of that is
> because I'm mostly just reading and replying to emails, building
> kernels and editing text files, and that is not nearly as stressful on
> the FTL as a full-blown random write workload (for example, if you
> were running a database supporting a transaction processing workload).

The problem (IMHO) with "-o discard" is that it is only trimming
*blocks* that were deleted, and these may be too small to be processed
effectively by the underlying device (e.g. the "super-unlucky" example
above, where interleaved 4KB file deletes result in 1M separate 4KB
trim requests to the device, even when the *space* that is freed by
the unlinks could be handled with far fewer large trim requests).

There was a discussion previously ("introduce EXT4_BG_WAS_TRIMMED ...")

https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/

about leveraging the persistent EXT4_BG_WAS_TRIMMED flag in the group
descriptors, and having "-o discard" only track trim on a per-group
basis rather than its current mode of doing trim on a per-block basis,
and then use the same code internally as fstrim to do a trim of free
blocks in that block group.

Using EXT4_BG_WAS_TRIMMED and tracking *groups* to be trimmed would be
a bit more lazy than the current "-o discard" implementation, but would
be more memory efficient, and also more efficient for the device (fewer,
larger trim requests submitted). It would only need to track groups
that have at least a reasonable amount of free space to be trimmed. If
the group doesn't have enough free blocks to trim now, it will be checked
again in the future when more blocks are freed.

Cheers, Andreas

2020-12-23 00:49:47

by Matteo Croce

Subject: Re: discard and data=writeback

On Tue, Dec 22, 2020 at 5:34 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
> >
> > I'm issuing sync + sleep(10) after the extraction, so the writes
> > should all be flushed.
> > Also, I repeated the test three times, with very similar results:
>
> So that means the problem is not due to page cache writeback
> interfering with the discards. So it's most likely that the problem
> is due to how the blocks are allocated and laid out when using
> data=ordered vs data=writeback.
>
> Some experiments to try next. After extracting the files with
> data=ordered and data=writeback on a freshly formatted file system,
> use "e2freefrag" to see how the free space is fragmented. This will
> tell us how the file system is doing from a holistic perspective, in
> terms of blocks allocated to the extracted files. (E2freefrag is
> showing you the blocks *not* allocated, of course, but that's a mirror
> image dual of the blocks that *are* allocated, especially if you start
> from an identical known state; hence the use of a freshly formatted
> file system.)
>

This is with data=ordered:

# e2freefrag /dev/nvme0n1p1
Device: /dev/nvme0n1p1
Blocksize: 4096 bytes
Total blocks: 468843350
Free blocks: 460922366 (98.3%)

Min. free extent: 4 KB
Max. free extent: 2064256 KB
Avg. free extent: 1976084 KB
Num. free extent: 933

# e2freefrag /dev/nvme0n1p1
Device: /dev/nvme0n1p1
Blocksize: 4096 bytes
Total blocks: 468843350
Free blocks: 460922365 (98.3%)

Min. free extent: 4 KB
Max. free extent: 2064256 KB
Avg. free extent: 1976084 KB
Num. free extent: 933

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range : Free extents Free Blocks Percent
4K... 8K- : 1 1 0.00%
8K... 16K- : 2 5 0.00%
16K... 32K- : 1 7 0.00%
2M... 4M- : 3 2400 0.00%
32M... 64M- : 2 16384 0.00%
64M... 128M- : 11 267085 0.06%
128M... 256M- : 11 650037 0.14%
256M... 512M- : 3 314957 0.07%
512M... 1024M- : 7 1387580 0.30%
1G... 2G- : 892 458283909 99.43%

and this is with data=writeback:

# e2freefrag /dev/nvme0n1p1
Device: /dev/nvme0n1p1
Blocksize: 4096 bytes
Total blocks: 468843350
Free blocks: 460922366 (98.3%)

Min. free extent: 4 KB
Max. free extent: 2064256 KB
Avg. free extent: 1976084 KB
Num. free extent: 933

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range : Free extents Free Blocks Percent
4K... 8K- : 1 1 0.00%
8K... 16K- : 2 5 0.00%
16K... 32K- : 1 7 0.00%
2M... 4M- : 3 2400 0.00%
32M... 64M- : 2 16384 0.00%
64M... 128M- : 11 267085 0.06%
128M... 256M- : 11 650038 0.14%
256M... 512M- : 3 314957 0.07%
512M... 1024M- : 7 1387580 0.30%
1G... 2G- : 892 458283909 99.43%

> Next, we can see what individual files look like with respect to
> fragmentation. This can be done by running filefrag on all of the
> files, e.g.:
>
> find . -type f -print0 | xargs -0 filefrag
>

data=ordered:

# find /media -type f -print0 | xargs -0 filefrag |awk -F: '{print$2}'
|sort |uniq -c
32 0 extents found
70570 1 extent found

data=writeback:

# find /media -type f -print0 | xargs -0 filefrag |awk -F: '{print$2}'
|sort |uniq -c
32 0 extents found
70570 1 extent found

> Another way to get similar (although not identical) information is via
> running "e2fsck -E fragcheck" on a file system. How they differ is
> especially more of a big deal on ext3 file systems without extents and
> flex_bg, since filefrag tries to take into account metadata blocks
> such as indirect blocks and extent tree blocks, and e2fsck -E
> fragcheck does not; but it's good enough for getting a good gestalt
> for the files' overall fragmentation
>

data=ordered:

# e2fsck -fE fragcheck /dev/nvme0n1p1
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
69341844(d): expecting 277356746 actual extent phys 277356748 log 1 len 2
69342337(d): expecting 277356766 actual extent phys 277356768 log 1 len 2
69346374(d): expecting 277357037 actual extent phys 277357094 log 1 len 2
69469890(d): expecting 277880969 actual extent phys 277880975 log 1 len 2
69473971(d): expecting 277881215 actual extent phys 277881219 log 1 len 2
69606373(d): expecting 278405580 actual extent phys 278405581 log 1 len 2
69732356(d): expecting 278929541 actual extent phys 278929543 log 1 len 2
69868308(d): expecting 279454129 actual extent phys 279454245 log 1 len 2
69999150(d): expecting 279978430 actual extent phys 279978439 log 1 len 2
69999150(d): expecting 279978441 actual extent phys 279978457 log 3 len 1
69999150(d): expecting 279978458 actual extent phys 279978459 log 4 len 1
69999150(d): expecting 279978460 actual extent phys 279978502 log 5 len 1
69999150(d): expecting 279978503 actual extent phys 279978511 log 6 len 2
69999150(d): expecting 279978513 actual extent phys 279978517 log 8 len 1
70000685(d): expecting 279978520 actual extent phys 279978523 log 1 len 2
70124788(d): expecting 280502371 actual extent phys 280502381 log 1 len 2
70124788(d): expecting 280502383 actual extent phys 280502394 log 3 len 1
70124788(d): expecting 280502395 actual extent phys 280502399 log 4 len 1
70126301(d): expecting 280502445 actual extent phys 280502459 log 1 len 2
70127963(d): expecting 280502526 actual extent phys 280502528 log 1 len 2
70256678(d): expecting 281026905 actual extent phys 281026913 log 1 len 2
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p1: 75365/117211136 files (0.0% non-contiguous),
7920985/468843350 blocks

data=writeback:

# e2fsck -fE fragcheck /dev/nvme0n1p1
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
91755156(d): expecting 367009992 actual extent phys 367009994 log 1 len 2
91755649(d): expecting 367010012 actual extent phys 367010014 log 1 len 2
91759686(d): expecting 367010283 actual extent phys 367010340 log 1 len 2
91883202(d): expecting 367534217 actual extent phys 367534223 log 1 len 2
91887283(d): expecting 367534463 actual extent phys 367534467 log 1 len 2
92019685(d): expecting 368058828 actual extent phys 368058829 log 1 len 2
92145668(d): expecting 368582789 actual extent phys 368582791 log 1 len 2
92281620(d): expecting 369107377 actual extent phys 369107493 log 1 len 2
92412462(d): expecting 369631678 actual extent phys 369631687 log 1 len 2
92412462(d): expecting 369631689 actual extent phys 369631705 log 3 len 1
92412462(d): expecting 369631706 actual extent phys 369631707 log 4 len 1
92412462(d): expecting 369631708 actual extent phys 369631757 log 5 len 1
92412462(d): expecting 369631758 actual extent phys 369631759 log 6 len 2
92412462(d): expecting 369631761 actual extent phys 369631766 log 8 len 1
92413997(d): expecting 369631768 actual extent phys 369631771 log 1 len 2
92538100(d): expecting 370155619 actual extent phys 370155629 log 1 len 2
92538100(d): expecting 370155631 actual extent phys 370155642 log 3 len 1
92538100(d): expecting 370155643 actual extent phys 370155647 log 4 len 1
92539613(d): expecting 370155693 actual extent phys 370155707 log 1 len 2
92541275(d): expecting 370155774 actual extent phys 370155776 log 1 len 2
92669990(d): expecting 370680153 actual extent phys 370680161 log 1 len 2
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p1: 75365/117211136 files (0.0% non-contiguous),
7920984/468843350 blocks

As an extra test I extracted the archive with data=ordered, remounted
with data=writeback and timed the rm -rf, and vice versa.
The mount option active during the removal is the one that counts; the
one used during extraction doesn't matter.

As a further test I also tried data=journal, which is as fast as ordered.

> Even if the files are showing no substantial difference in
> fragmentation, and the free space is equally A-OK with respect to
> fragmentation, the other possibility is that the *layout* of the blocks is
> such that the order in which they are deleted using rm -rf ends up
> being less friendly from a discard perspective. This can happen if
> the directory hierarchy is big enough, and/or the journal size is
> small enough, that the rm -rf requires multiple journal transactions
> to complete. That's because with mount -o discard, we do the discards
> after each transaction commit, and it might be that even though the
> used blocks are perfectly contiguous, because of the order in which
> the files end up getting deleted, we end up needing to discard them in
> smaller chunks.
>
> For example, one could imagine a case where you have a million 4k
> files, and they are allocated contiguously, but if you get
> super-unlucky, such that in the first transaction you delete all of
> the odd-numbered files, and in second transaction you delete all of
> the even-numbered files, you might need to do a million 4k discards
> --- but if all of the deletes could fit into a single transaction, you
> would only need to do a single million block discard operation.
>
> Finally, you may want to consider whether or not mount -o discard
> really makes sense or not. For most SSD's, especially high-end SSD's,
> it probably doesn't make that much difference. That's because when
> you overwrite a sector, the SSD knows (or should know; this might not
> be true for some really cheap, crappy low-end flash devices, but on
> those devices discard might not be making much of a difference anyway)
> that the old contents of the sector are no longer needed. Hence an
> overwrite effectively is an "implied discard". So long as there is a
> sufficient number of free erase blocks, the SSD might be able to keep
> up doing the GC for those "implied discards", and so accelerating the
> process by sending explicit discards after every journal transaction
> might not be necessary. Or, maybe it's sufficient to run "fstrim"
> every week at Sunday 3am local time; or maybe even fstrim once a night
> or fstrim once a month --- your mileage may vary.
>
> It's going to vary from SSD to SSD and from workload to workload, but
> you might find that mount -o discard isn't buying you all that much
> --- if you run a random write workload, and you don't notice any
> performance degradation, and you don't notice an increase in the SSD's
> write amplification numbers (if they are provided by your SSD), then
> you might very well find that it's not worth it to use mount -o
> discard.
>
> I personally don't bother using mount -o discard, and instead
> periodically run fstrim, on my personal machines. Part of that is
> because I'm mostly just reading and replying to emails, building
> kernels and editing text files, and that is not nearly as stressful on
> the FTL as a full-blown random write workload (for example, if you
> were running a database supporting a transaction processing workload).
>

That's what I'm doing locally: I issue an fstrim from time to time.
But I found discard useful in QEMU guests, because recent virtio-blk
will punch holes in the host image and save space.
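
For reference, the relevant knob on the QEMU side is enabling discard on
the drive, e.g. something like:

-drive file=guest.qcow2,if=virtio,discard=unmap

(the file name is just an example), so that the guest's discards end up
punching holes in the image on the host.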

Cheers,
--
per aspera ad upstream

2020-12-23 01:27:41

by Matteo Croce

Subject: Re: discard and data=writeback

On Tue, Dec 22, 2020 at 11:53 PM Andreas Dilger <[email protected]> wrote:
>
> On Dec 22, 2020, at 9:34 AM, Theodore Y. Ts'o <[email protected]> wrote:
> >
> > On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
> >>
> >> I'm issuing sync + sleep(10) after the extraction, so the writes
> >> should all be flushed.
> >> Also, I repeated the test three times, with very similar results:
> >
> > So that means the problem is not due to page cache writeback
> > interfering with the discards. So it's most likely that the problem
> > is due to how the blocks are allocated and laid out when using
> > data=ordered vs data=writeback.
> >
> > Some experiments to try next. After extracting the files with
> > data=ordered and data=writeback on a freshly formatted file system,
> > use "e2freefrag" to see how the free space is fragmented. This will
> > tell us how the file system is doing from a holistic perspective, in
> > terms of blocks allocated to the extracted files. (E2freefrag is
> > showing you the blocks *not* allocated, of course, but that's a mirror
> > image dual of the blocks that *are* allocated, especially if you start
> > from an identical known state; hence the use of a freshly formatted
> > file system.)
> >
> > Next, we can see what individual files look like with respect to
> > fragmentation. This can be done by running filefrag on all of the
> > files, e.g.:
> >
> > find . -type f -print0 | xargs -0 filefrag
> >
> > Another way to get similar (although not identical) information is via
> > running "e2fsck -E fragcheck" on a file system. How they differ is
> > especially more of a big deal on ext3 file systems without extents and
> > flex_bg, since filefrag tries to take into account metadata blocks
> > such as indirect blocks and extent tree blocks, and e2fsck -E
> > fragcheck does not; but it's good enough for getting a good gestalt
> > for the files' overall fragmentation --- and note that as long as the
> > average fragment size is at least a megabyte or two, some
> > fragmentation really isn't that much of a problem from a real-world
> > performance perspective. People can get way too invested in trying to
> > get to perfection with 100% fragmentation-free files. The problem
> > with doing this at the expense of all else is that you can end up
> > making the overall free space fragmentation worse as the file system
> > ages, at which point the file system performance really dives through
> > the floor as the file system approaches 100%, or even 80-90% full,
> > especially on HDD's. For SSD's fragmentation doesn't matter quite so
> > much, unless the average fragment size is *really* small, and when you
> > are discarding freed blocks.
> >
> > Even if the files are showing no substantial difference in
> > fragmentation, and the free space is equally A-OK with respect to
> > fragmentation, the other possibility is that the *layout* of the blocks is
> > such that the order in which they are deleted using rm -rf ends up
> > being less friendly from a discard perspective. This can happen if
> > the directory hierarchy is big enough, and/or the journal size is
> > small enough, that the rm -rf requires multiple journal transactions
> > to complete. That's because with mount -o discard, we do the discards
> > after each transaction commit, and it might be that even though the
> > used blocks are perfectly contiguous, because of the order in which
> > the files end up getting deleted, we end up needing to discard them in
> > smaller chunks.
> >
> > For example, one could imagine a case where you have a million 4k
> > files, and they are allocated contiguously, but if you get
> > super-unlucky, such that in the first transaction you delete all of
> > the odd-numbered files, and in second transaction you delete all of
> > the even-numbered files, you might need to do a million 4k discards
> > --- but if all of the deletes could fit into a single transaction, you
> > would only need to do a single million block discard operation.
> >
> > Finally, you may want to consider whether or not mount -o discard
> > really makes sense or not. For most SSD's, especially high-end SSD's,
> > it probably doesn't make that much difference. That's because when
> > you overwrite a sector, the SSD knows (or should know; this might not
> > be true for some really cheap, crappy low-end flash devices, but on
> > those devices discard might not be making much of a difference anyway)
> > that the old contents of the sector are no longer needed. Hence an
> > overwrite effectively is an "implied discard". So long as there is a
> > sufficient number of free erase blocks, the SSD might be able to keep
> > up doing the GC for those "implied discards", and so accelerating the
> > process by sending explicit discards after every journal transaction
> > might not be necessary. Or, maybe it's sufficient to run "fstrim"
> > every week at Sunday 3am local time; or maybe even fstrim once a night
> > or fstrim once a month --- your mileage may vary.
> >
> > It's going to vary from SSD to SSD and from workload to workload, but
> > you might find that mount -o discard isn't buying you all that much
> > --- if you run a random write workload, and you don't notice any
> > performance degradation, and you don't notice an increase in the SSD's
> > write amplification numbers (if they are provided by your SSD), then
> > you might very well find that it's not worth it to use mount -o
> > discard.
> >
> > I personally don't bother using mount -o discard, and instead
> > periodically run fstrim, on my personal machines. Part of that is
> > because I'm mostly just reading and replying to emails, building
> > kernels and editing text files, and that is not nearly as stressful on
> > the FTL as a full-blown random write workload (for example, if you
> > were running a database supporting a transaction processing workload).
>
> The problem (IMHO) with "-o discard" is that it is only trimming
> *blocks* that were deleted, and these may be too small to be processed
> effectively by the underlying device (e.g. the "super-unlucky" example
> above, where interleaved 4KB file deletes result in 1M separate 4KB
> trim requests to the device, even when the *space* that is freed by
> the unlinks could be handled with far fewer large trim requests).
>
> There was a discussion previously ("introduce EXT4_BG_WAS_TRIMMED ...")
>
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
>
> about leveraging the persistent EXT4_BG_WAS_TRIMMED flag in the group
> descriptors, and having "-o discard" only track trim on a per-group
> basis rather than its current mode of doing trim on a per-block basis,
> and then use the same code internally as fstrim to do a trim of free
> blocks in that block group.
>
> Using EXT4_BG_WAS_TRIMMED and tracking *groups* to be trimmed would be
> a bit more lazy than the current "-o discard" implementation, but would
> be more memory efficient, and also more efficient for the device (fewer,
> larger trim requests submitted). It would only need to track groups
> that have at least a reasonable amount of free space to be trimmed. If
> the group doesn't have enough free blocks to trim now, it will be checked
> again in the future when more blocks are freed.
>

Hi,

I gave it a quick run after refreshing it for 5.10, but it doesn't seem to help.
Is anything else needed besides applying the patch itself?

Regards,
--
per aspera ad upstream

2020-12-23 18:13:51

by Theodore Ts'o

Subject: Re: discard and data=writeback

On Wed, Dec 23, 2020 at 01:47:33AM +0100, Matteo Croce wrote:
> As an extra test I extracted the archive with data=ordered, remounted
> with data=writeback and timed the rm -rf, and vice versa.
> The mount option active during the removal is the one that counts;
> the one used during extraction doesn't matter.

Hmm... that's really surprising. At this point, the only thing I can
suggest is to try using blktrace to see what's going on at the block
layer when the I/O's and discard requests are being submitted. If
there are no dirty blocks in the page cache, I don't see how
data=ordered vs data=writeback would make a difference to how mount -o
discard processing would take place.
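
Something along these lines should be enough to capture a trace of the
rm (the device, path and output name here are just examples):

# blktrace -d /dev/nvme0n1 -o rmtrace &
# rm -rf /media/linux-5.10 ; sync
# kill $!
# blkparse -i rmtrace > rmtrace.txt

The parsed output will show each request, including the discards, as it
is queued, issued to the driver, and completed.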

Cheers,

- Ted

2020-12-23 19:01:55

by Matteo Croce

Subject: Re: discard and data=writeback

On Wed, Dec 23, 2020 at 7:12 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Wed, Dec 23, 2020 at 01:47:33AM +0100, Matteo Croce wrote:
> > As an extra test I extracted the archive with data=ordered, remounted
> > with data=writeback and timed the rm -rf, and vice versa.
> > The mount option active during the removal is the one that counts;
> > the one used during extraction doesn't matter.
>
> Hmm... that's really surprising. At this point, the only thing I can
> suggest is to try using blktrace to see what's going on at the block
> layer when the I/O's and discard requests are being submitted. If
> there are no dirty blocks in the page cache, I don't see how
> data=ordered vs data=writeback would make a difference to how mount -o
> discard processing would take place.
>

Hi,

these are the blktrace outputs for both journaling modes:

# dmesg |grep EXT4-fs |tail -1
[ 1594.829833] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
data mode. Opts: data=ordered,discard
# blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
[1] 3032

real 0m1.328s
user 0m0.063s
sys 0m1.231s
# === nvme0n1 ===
CPU 0: 0 events, 0 KiB data
CPU 1: 0 events, 0 KiB data
CPU 2: 0 events, 0 KiB data
CPU 3: 1461 events, 69 KiB data
CPU 4: 1 events, 1 KiB data
CPU 5: 0 events, 0 KiB data
CPU 6: 0 events, 0 KiB data
CPU 7: 0 events, 0 KiB data
Total: 1462 events (dropped 0), 69 KiB data


# dmesg |grep EXT4-fs |tail -1
[ 1734.837651] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
data mode. Opts: data=writeback,discard
# blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
[1] 3069

real 1m30.273s
user 0m0.139s
sys 0m3.084s
# === nvme0n1 ===
CPU 0: 133830 events, 6274 KiB data
CPU 1: 21878 events, 1026 KiB data
CPU 2: 46365 events, 2174 KiB data
CPU 3: 98116 events, 4600 KiB data
CPU 4: 290902 events, 13637 KiB data
CPU 5: 10926 events, 513 KiB data
CPU 6: 76861 events, 3603 KiB data
CPU 7: 17855 events, 837 KiB data
Total: 696733 events (dropped 0), 32660 KiB data

Cheers,
--
per aspera ad upstream

2020-12-24 03:18:03

by Theodore Ts'o

Subject: Re: discard and data=writeback

On Wed, Dec 23, 2020 at 07:59:13PM +0100, Matteo Croce wrote:
>
> Hi,
>
> these are the blktrace outputs for both journaling modes:

Can you send me the full trace files (or the outputs of blkparse) so we
can see what's going on in somewhat more granular detail?

They'll be huge, so you may need to make them available for download
from a web server; certainly the vger.kernel.org list server isn't
going to let an attachment that large through.

Thanks,

- Ted

2020-12-24 10:57:14

by Matteo Croce

Subject: Re: discard and data=writeback

On Thu, Dec 24, 2020 at 4:16 AM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Wed, Dec 23, 2020 at 07:59:13PM +0100, Matteo Croce wrote:
> >
> > Hi,
> >
> > these are the blktrace outputs for both journaling modes:
>
> Can you send me full trace files (or the outputs of blkparse) so we
> can see what's going on at a somewhat more granular detail?
>
> They'll be huge, so you may need to make them available for download
> from a web server; certainly the vger.kernel.org list server isn't
> going to let an attachment that large through.
>

Hi,

I've created a GDrive link; it should work for everyone:

https://drive.google.com/file/d/1b35hzgUMSnNBZeMNhooFk4rACpNvCZuQ/view?usp=sharing

Cheers,
--
per aspera ad upstream

2020-12-29 05:44:45

by Daejun Park

Subject: Re: discard and data=writeback

Hi,

> # dmesg |grep EXT4-fs |tail -1
> [ 1594.829833] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
> data mode. Opts: data=ordered,discard
> # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> [1] 3032
>
> real 0m1.328s
> user 0m0.063s
> sys 0m1.231s
> # === nvme0n1 ===
> CPU 0: 0 events, 0 KiB data
> CPU 1: 0 events, 0 KiB data
> CPU 2: 0 events, 0 KiB data
> CPU 3: 1461 events, 69 KiB data
> CPU 4: 1 events, 1 KiB data
> CPU 5: 0 events, 0 KiB data
> CPU 6: 0 events, 0 KiB data
> CPU 7: 0 events, 0 KiB data
> Total: 1462 events (dropped 0), 69 KiB data
>
>
> # dmesg |grep EXT4-fs |tail -1
> [ 1734.837651] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
> data mode. Opts: data=writeback,discard
> # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> [1] 3069
>
> real 1m30.273s
> user 0m0.139s
> sys 0m3.084s
> # === nvme0n1 ===
> CPU 0: 133830 events, 6274 KiB data
> CPU 1: 21878 events, 1026 KiB data
> CPU 2: 46365 events, 2174 KiB data
> CPU 3: 98116 events, 4600 KiB data
> CPU 4: 290902 events, 13637 KiB data
> CPU 5: 10926 events, 513 KiB data
> CPU 6: 76861 events, 3603 KiB data
> CPU 7: 17855 events, 837 KiB data
> Total: 696733 events (dropped 0), 32660 KiB data
>

In this result, there is very little I/O in ordered mode.

As I understand it (please correct me if I am wrong), with writeback +
discard, ext4_issue_discard is called immediately for each file removed
by rm. However, in ordered mode, ext4_issue_discard is called at the
end of a transaction commit, to pace the discards with the corresponding
transaction. That means the blocks are not discarded yet.

Even in ordered mode, if sync is called after the rm command,
ext4_issue_discard can be called due to the transaction commit.
So I think you will get results similar to writeback mode if you add a
sync command.

Thanks,
Daejun

2020-12-29 13:44:31

by Matteo Croce

Subject: Re: discard and data=writeback

On Tue, Dec 29, 2020 at 6:41 AM Daejun Park <[email protected]> wrote:
>
> Hi,
>
> > # dmesg |grep EXT4-fs |tail -1
> > [ 1594.829833] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
> > data mode. Opts: data=ordered,discard
> > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > [1] 3032
> >
> > real 0m1.328s
> > user 0m0.063s
> > sys 0m1.231s
> > # === nvme0n1 ===
> > CPU 0: 0 events, 0 KiB data
> > CPU 1: 0 events, 0 KiB data
> > CPU 2: 0 events, 0 KiB data
> > CPU 3: 1461 events, 69 KiB data
> > CPU 4: 1 events, 1 KiB data
> > CPU 5: 0 events, 0 KiB data
> > CPU 6: 0 events, 0 KiB data
> > CPU 7: 0 events, 0 KiB data
> > Total: 1462 events (dropped 0), 69 KiB data
> >
> >
> > # dmesg |grep EXT4-fs |tail -1
> > [ 1734.837651] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
> > data mode. Opts: data=writeback,discard
> > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > [1] 3069
> >
> > real 1m30.273s
> > user 0m0.139s
> > sys 0m3.084s
> > # === nvme0n1 ===
> > CPU 0: 133830 events, 6274 KiB data
> > CPU 1: 21878 events, 1026 KiB data
> > CPU 2: 46365 events, 2174 KiB data
> > CPU 3: 98116 events, 4600 KiB data
> > CPU 4: 290902 events, 13637 KiB data
> > CPU 5: 10926 events, 513 KiB data
> > CPU 6: 76861 events, 3603 KiB data
> > CPU 7: 17855 events, 837 KiB data
> > Total: 696733 events (dropped 0), 32660 KiB data
> >
>
> In this result, there is very little I/O in ordered mode.
>
> As I understand it (please correct me if I am wrong), with writeback +
> discard, ext4_issue_discard is called immediately for each file removed
> by rm. However, in ordered mode, ext4_issue_discard is called at the
> end of a transaction commit, to pace the discards with the corresponding
> transaction. That means the blocks are not discarded yet.
>
> Even in ordered mode, if sync is called after the rm command,
> ext4_issue_discard can be called due to the transaction commit.
> So I think you will get results similar to writeback mode if you add a
> sync command.
>

Hi,

that's what I get with data=ordered if I issue a sync after the removal:

# time rm -rf /media/linux-5.10/ ; sync ; kill $!

real 0m1.569s
user 0m0.044s
sys 0m1.508s
#
=== nvme0n1 ===
CPU 0: 10980 events, 515 KiB data
CPU 1: 0 events, 0 KiB data
CPU 2: 0 events, 0 KiB data
CPU 3: 26 events, 2 KiB data
CPU 4: 3601 events, 169 KiB data
CPU 5: 0 events, 0 KiB data
CPU 6: 21786 events, 1022 KiB data
CPU 7: 0 events, 0 KiB data
Total: 36393 events (dropped 0), 1706 KiB data

Still far fewer events than with writeback.

Cheers,
--
per aspera ad upstream

2020-12-30 05:31:40

by Daejun Park

Subject: RE: Re: discard and data=writeback

> Hi,
> >
> > > # dmesg |grep EXT4-fs |tail -1
> > > [ 1594.829833] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
> > > data mode. Opts: data=ordered,discard
> > > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > > [1] 3032
> > >
> > > real 0m1.328s
> > > user 0m0.063s
> > > sys 0m1.231s
> > > # === nvme0n1 ===
> > > CPU 0: 0 events, 0 KiB data
> > > CPU 1: 0 events, 0 KiB data
> > > CPU 2: 0 events, 0 KiB data
> > > CPU 3: 1461 events, 69 KiB data
> > > CPU 4: 1 events, 1 KiB data
> > > CPU 5: 0 events, 0 KiB data
> > > CPU 6: 0 events, 0 KiB data
> > > CPU 7: 0 events, 0 KiB data
> > > Total: 1462 events (dropped 0), 69 KiB data
> > >
> > >
> > > # dmesg |grep EXT4-fs |tail -1
> > > [ 1734.837651] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
> > > data mode. Opts: data=writeback,discard
> > > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > > [1] 3069
> > >
> > > real 1m30.273s
> > > user 0m0.139s
> > > sys 0m3.084s
> > > # === nvme0n1 ===
> > > CPU 0: 133830 events, 6274 KiB data
> > > CPU 1: 21878 events, 1026 KiB data
> > > CPU 2: 46365 events, 2174 KiB data
> > > CPU 3: 98116 events, 4600 KiB data
> > > CPU 4: 290902 events, 13637 KiB data
> > > CPU 5: 10926 events, 513 KiB data
> > > CPU 6: 76861 events, 3603 KiB data
> > > CPU 7: 17855 events, 837 KiB data
> > > Total: 696733 events (dropped 0), 32660 KiB data
> > >
> >
> > In this result, there is very little I/O in ordered mode.
> >
> > As I understand it (please correct me if I am wrong), with writeback +
> > discard, ext4_issue_discard is called immediately for each file removed
> > by rm. However, in ordered mode, ext4_issue_discard is called at the
> > end of a transaction commit, to pace the discards with the corresponding
> > transaction. That means the blocks are not discarded yet.
> >
> > Even in ordered mode, if sync is called after the rm command,
> > ext4_issue_discard can be called due to the transaction commit.
> > So I think you will get results similar to writeback mode if you add a
> > sync command.
> >
>
> Hi,
>
> that's what I get with data=ordered if I issue a sync after the removal:
>
> # time rm -rf /media/linux-5.10/ ; sync ; kill $!
>
> real 0m1.569s
> user 0m0.044s
> sys 0m1.508s
> #
> === nvme0n1 ===
> CPU 0: 10980 events, 515 KiB data
> CPU 1: 0 events, 0 KiB data
> CPU 2: 0 events, 0 KiB data
> CPU 3: 26 events, 2 KiB data
> CPU 4: 3601 events, 169 KiB data
> CPU 5: 0 events, 0 KiB data
> CPU 6: 21786 events, 1022 KiB data
> CPU 7: 0 events, 0 KiB data
> Total: 36393 events (dropped 0), 1706 KiB data
>
> Still far fewer events than with writeback.
>
The full trace you shared in this thread seems to contain only the
writeback mode. In that trace, discards are issued for each file deleted
by rm.

If you share the full trace for ordered mode, it will help us analyze the
results. The number of discards is expected to be lower than in writeback
mode, because discards can be merged in ordered mode.

Thanks,
Daejun

2020-12-30 15:19:35

by Matteo Croce

Subject: Re: Re: discard and data=writeback

On Wed, Dec 30, 2020 at 6:21 AM Daejun Park <[email protected]> wrote:
>
> > Hi,
> > >
> > > > # dmesg |grep EXT4-fs |tail -1
> > > > [ 1594.829833] EXT4-fs (nvme0n1p1): mounted filesystem with ordered
> > > > data mode. Opts: data=ordered,discard
> > > > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > > > [1] 3032
> > > >
> > > > real 0m1.328s
> > > > user 0m0.063s
> > > > sys 0m1.231s
> > > > # === nvme0n1 ===
> > > > CPU 0: 0 events, 0 KiB data
> > > > CPU 1: 0 events, 0 KiB data
> > > > CPU 2: 0 events, 0 KiB data
> > > > CPU 3: 1461 events, 69 KiB data
> > > > CPU 4: 1 events, 1 KiB data
> > > > CPU 5: 0 events, 0 KiB data
> > > > CPU 6: 0 events, 0 KiB data
> > > > CPU 7: 0 events, 0 KiB data
> > > > Total: 1462 events (dropped 0), 69 KiB data
> > > >
> > > >
> > > > # dmesg |grep EXT4-fs |tail -1
> > > > [ 1734.837651] EXT4-fs (nvme0n1p1): mounted filesystem with writeback
> > > > data mode. Opts: data=writeback,discard
> > > > # blktrace /dev/nvme0n1 & sleep 1 ; time rm -rf /media/linux-5.10/ ; kill $!
> > > > [1] 3069
> > > >
> > > > real 1m30.273s
> > > > user 0m0.139s
> > > > sys 0m3.084s
> > > > # === nvme0n1 ===
> > > > CPU 0: 133830 events, 6274 KiB data
> > > > CPU 1: 21878 events, 1026 KiB data
> > > > CPU 2: 46365 events, 2174 KiB data
> > > > CPU 3: 98116 events, 4600 KiB data
> > > > CPU 4: 290902 events, 13637 KiB data
> > > > CPU 5: 10926 events, 513 KiB data
> > > > CPU 6: 76861 events, 3603 KiB data
> > > > CPU 7: 17855 events, 837 KiB data
> > > > Total: 696733 events (dropped 0), 32660 KiB data
> > > >
> > >
> > > In this result, there is very little I/O in ordered mode.
> > >
> > > As I understand it (please correct me if I am wrong), with writeback +
> > > discard, ext4_issue_discard is called immediately for each file removed
> > > by rm. However, in ordered mode, ext4_issue_discard is called at the
> > > end of a transaction commit, to pace the discards with the corresponding
> > > transaction. That means the blocks are not discarded yet.
> > >
> > > Even in ordered mode, if sync is called after the rm command,
> > > ext4_issue_discard can be called due to the transaction commit.
> > > So I think you will get results similar to writeback mode if you add a
> > > sync command.
> > >
> >
> > Hi,
> >
> > that's what I get with data=ordered if I issue a sync after the removal:
> >
> > # time rm -rf /media/linux-5.10/ ; sync ; kill $!
> >
> > real 0m1.569s
> > user 0m0.044s
> > sys 0m1.508s
> > #
> > === nvme0n1 ===
> > CPU 0: 10980 events, 515 KiB data
> > CPU 1: 0 events, 0 KiB data
> > CPU 2: 0 events, 0 KiB data
> > CPU 3: 26 events, 2 KiB data
> > CPU 4: 3601 events, 169 KiB data
> > CPU 5: 0 events, 0 KiB data
> > CPU 6: 21786 events, 1022 KiB data
> > CPU 7: 0 events, 0 KiB data
> > Total: 36393 events (dropped 0), 1706 KiB data
> >
> > Still far fewer events than with writeback.
> >
> The full trace you shared in this thread seems to contain only the
> writeback mode. In that trace, discards are issued for each file deleted
> by rm.
>
> If you share the full trace for ordered mode, it will help us analyze the
> results. The number of discards is expected to be lower than in writeback
> mode, because discards can be merged in ordered mode.
>

Hi,

I did the same blktrace with data=ordered,discard
Find it here:

https://drive.google.com/file/d/1gqffP9WPCME3_81xlXAQCiDlTK-Gqv4_/view?usp=sharing

Thanks,
--
per aspera ad upstream

2020-12-31 01:34:10

by Daejun Park

Subject: RE: Re: Re: discard and data=writeback

Hi,

In the trace files, the total amount of discarded data is almost the same
in both modes (ordered: 1096MB / writeback: 1078MB). However, there is a
big difference in the average discard size per request (ordered: 34.2MB /
writeback: 15.6KB).
In ext4, when data is deleted in writeback mode, the discard is issued
immediately. Therefore, the average size of the discard commands is small
and the number of discard commands is large.
However, when not in writeback mode, the discard commands are issued by
JBD after being merged. Therefore, the average discard size is large and
the number of discard commands is small.

In conclusion, since discard commands are not merged in writeback mode,
many fragmented discard commands are issued, which affects the elapsed
time when deleting many files. This is not abnormal behavior of the ext4
file system.
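
For anyone who wants to reproduce these numbers from the shared traces,
something along these lines over the blkparse text output should work;
the field positions assume blkparse's default output format, where the
sixth field is the trace action, the seventh is the RWBS string, and the
request length is given in 512-byte sectors:

blkparse -i nvme0n1 | awk '$6 == "D" && $7 ~ /D/ { n++; s += $10 }
    END { if (n) printf "%d discards, avg %.1f KB\n", n, s * 512 / n / 1024 }'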

Thanks,
Daejun