2019-03-20 22:44:31

by Mikhail Morfikov

Subject: Question about ext4 extents and file fragmentation

When we have a big file on an ext4 partition, and filefrag shows
the following:

filefrag -ve /bigfile
Filesystem type is: ef53
File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:      34816..     67583:  32768:
   1:    32768..   63487:      67584..     98303:  30720:
   2:    63488..   96255:     100352..    133119:  32768:      98304:
   3:    96256..  126975:     133120..    163839:  30720:
   4:   126976..  159743:     165888..    198655:  32768:     163840:
   5:   159744..  190463:     198656..    229375:  30720:
   6:   190464..  223231:     231424..    264191:  32768:     229376:
   7:   223232..  253951:     264192..    294911:  30720:
   8:   253952..  286719:     296960..    329727:  32768:     294912:
   9:   286720..  319487:     329728..    362495:  32768:
  10:   319488..  351367:     362496..    394375:  31880:             last,eof
/bigfile: 5 extents found

1. How many fragments does this file really have? 11 or 5?
2. Should the extents 0 and 1 be treated as one fragment or two
separate ones? I know they could be one from the human
perspective, but is it really one for ext4 filesystem?
3. What does actually happen during the read in the case of
some HDD and its magnetic heads? If the head finishes reading
the whole extent (ext 0), will it be able to read the data of
the next extent (ext 1) without any delays like in the case of
raw read (for instance dd if=/dev/sda ...), or will it be
delayed because of the filesystem layer, and the head will
have to spend some time to be positioned again in order to
read the next extent?
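
For question 1, one way to see where the "5" comes from is to merge every extent that starts, both logically and physically, exactly where the previous one ends. A minimal sketch, assuming the (logical start, physical start, length) triples are copied by hand from the filefrag output above; the name count_contiguous_runs is made up for illustration and is not part of filefrag or ext4:

```python
# Extents copied from the filefrag output above:
# (logical_start, physical_start, length), all in 4096-byte blocks.
EXTENTS = [
    (0, 34816, 32768),
    (32768, 67584, 30720),
    (63488, 100352, 32768),
    (96256, 133120, 30720),
    (126976, 165888, 32768),
    (159744, 198656, 30720),
    (190464, 231424, 32768),
    (223232, 264192, 30720),
    (253952, 296960, 32768),
    (286720, 329728, 32768),
    (319488, 362496, 31880),
]


def count_contiguous_runs(extents):
    """Count runs of extents that are adjacent both logically and physically."""
    runs = 0
    prev = None
    for logical, physical, length in extents:
        # A new run starts at the first extent, or whenever this extent does
        # not begin right after the previous one (logically or physically).
        if (
            prev is None
            or logical != prev[0] + prev[2]
            or physical != prev[1] + prev[2]
        ):
            runs += 1
        prev = (logical, physical, length)
    return runs


if __name__ == "__main__":
    print(len(EXTENTS), "extent records,",
          count_contiguous_runs(EXTENTS), "contiguous runs")
```

On this data the sketch finds 11 extent records grouped into 5 contiguous runs, which matches the "5 extents found" summary line; the four breaks are exactly the rows that carry a value in the "expected" column.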



2019-03-21 03:18:39

by Theodore Ts'o

Subject: Re: Question about ext4 extents and file fragmentation

On Wed, Mar 20, 2019 at 11:44:19PM +0100, Mikhail Morfikov wrote:
> When we have a big file on an ext4 partition, and filefrag shows
> the following:
>
> filefrag -ve /bigfile
> Filesystem type is: ef53
> File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32767:      34816..     67583:  32768:
>    1:    32768..   63487:      67584..     98303:  30720:
>    2:    63488..   96255:     100352..    133119:  32768:      98304:
>    3:    96256..  126975:     133120..    163839:  30720:
>    4:   126976..  159743:     165888..    198655:  32768:     163840:
>    5:   159744..  190463:     198656..    229375:  30720:
>    6:   190464..  223231:     231424..    264191:  32768:     229376:
>    7:   223232..  253951:     264192..    294911:  30720:
>    8:   253952..  286719:     296960..    329727:  32768:     294912:
>    9:   286720..  319487:     329728..    362495:  32768:
>   10:   319488..  351367:     362496..    394375:  31880:             last,eof
> /bigfile: 5 extents found
>
> 1. How many fragments does this file really have? 11 or 5?
> 2. Should the extents 0 and 1 be treated as one fragment or two
> separate ones? I know they could be one from the human
> perspective, but is it really one for ext4 filesystem?

They are encoded as two separate physical extents on disk. Logically,
extents 0, 1, and 2 are contiguous regions on disk.

> 3. What does actually happen during the read in the case of
> some HDD and its magnetic heads? If the head finishes reading
> the whole extent (ext 0), will it be able to read the data of
> the next extent (ext 1) without any delays like in the case of
> raw read (for instance dd if=/dev/sda ...), or will it be
> delayed because of the filesystem layer, and the head will
> have to spend some time to be positioned again in order to
> read the next extent?

The delay won't be because of the file system layer, as the
information about these first three extents will all be stored on the
same block on disk. In addition, ext4 has an in-memory "extent cache"
which stores the logical->physical block mapping, and in memory, it
will be stored as a single entry in the extent cache.

It takes *time* to read 128 megabytes (32768 4k blocks), and from a
hard drive perspective, you are doing a streaming sequential read; how
the file system metadata is stored is not going to be the limiting
factor. In fact, it's likely that they won't be issued to the hard
drive as a single I/O request anyway. But that doesn't matter; the
hard drive has an I/O request queue, and so the right thing will
happen.

- Ted
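
To put rough numbers on the point about read time: a sketch of the arithmetic, where the 150 MB/s sequential throughput and the 10 ms head repositioning cost are assumed round figures for a generic hard drive, not measurements from this thread:

```python
BLOCK_SIZE = 4096            # bytes per filesystem block, from the filefrag output
EXTENT_BLOCKS = 32768        # length of extent 0, in blocks

ASSUMED_THROUGHPUT = 150e6   # bytes/second; assumed sequential HDD rate
ASSUMED_SEEK = 0.010         # seconds; assumed cost of one head repositioning

# Size of one full extent and the time needed just to stream it off the disk.
extent_bytes = BLOCK_SIZE * EXTENT_BLOCKS
transfer_seconds = extent_bytes / ASSUMED_THROUGHPUT

# Cost of one extra seek, relative to the time spent transferring the extent.
seek_overhead = ASSUMED_SEEK / transfer_seconds

if __name__ == "__main__":
    print(f"extent size:   {extent_bytes // 2**20} MiB")
    print(f"transfer time: {transfer_seconds:.2f} s")
    print(f"one extra seek adds about {seek_overhead:.1%}")
```

Under these assumptions, even one full seek per 128 MiB extent would cost on the order of one percent of the transfer time, which is why the metadata layout is not the limiting factor for a streaming read.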

2019-03-21 09:29:38

by Mikhail Morfikov

Subject: Re: Question about ext4 extents and file fragmentation

On 21/03/2019 04:18, Theodore Ts'o wrote:
> On Wed, Mar 20, 2019 at 11:44:19PM +0100, Mikhail Morfikov wrote:
>> When we have a big file on an ext4 partition, and filefrag shows
>> the following:
>>
>> filefrag -ve /bigfile
>> Filesystem type is: ef53
>> File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>    0:        0..   32767:      34816..     67583:  32768:
>>    1:    32768..   63487:      67584..     98303:  30720:
>>    2:    63488..   96255:     100352..    133119:  32768:      98304:
>>    3:    96256..  126975:     133120..    163839:  30720:
>>    4:   126976..  159743:     165888..    198655:  32768:     163840:
>>    5:   159744..  190463:     198656..    229375:  30720:
>>    6:   190464..  223231:     231424..    264191:  32768:     229376:
>>    7:   223232..  253951:     264192..    294911:  30720:
>>    8:   253952..  286719:     296960..    329727:  32768:     294912:
>>    9:   286720..  319487:     329728..    362495:  32768:
>>   10:   319488..  351367:     362496..    394375:  31880:             last,eof
>> /bigfile: 5 extents found
>>
>> 1. How many fragments does this file really have? 11 or 5?
>> 2. Should the extents 0 and 1 be treated as one fragment or two
>> separate ones? I know they could be one from the human
>> perspective, but is it really one for ext4 filesystem?
>
> They are encoded as two separate physical extents on disk. Logically,
> extents 0, 1, and 2 are contiguous regions on disk.

So 5 fragments then?

>> 3. What does actually happen during the read in the case of
>> some HDD and its magnetic heads? If the head finishes reading
>> the whole extent (ext 0), will it be able to read the data of
>> the next extent (ext 1) without any delays like in the case of
>> raw read (for instance dd if=/dev/sda ...), or will it be
>> delayed because of the filesystem layer, and the head will
>> have to spend some time to be positioned again in order to
>> read the next extent?
>
> The delay won't be because of the file system layer, as the
> information about these first three extents will all be stored on the
> same block on disk. In addition, ext4 has an in-memory "extent cache"
> which stores the logical->physical block mapping, and in memory, it
> will be stored as a single entry in the extent cache.
>
> It takes *time* to read 128 megabytes (32768 4k blocks), and from a
> hard drive perspective, you are doing a streaming sequential read; how
> the file system metadata is stored is not going to be the limiting
> factor. In fact, it's likely that they won't be issued to the hard
> drive as a single I/O request anyway. But that doesn't matter; the
> hard drive has an I/O request queue, and so the right thing will
> happen.
>

Yes, I know that many things can happen during the 128M read. But we
can assume that we have some simplified environment, where we have
only one disk, one file we want to read at the moment, and we have
time to do it without any external interference.

If I understood correctly, as long as the extents reside on a contiguous
region, they will be read sequentially without any delays, right? So if
the file in question was one big contiguous region, would it be read
sequentially from the beginning of the file to its end?

Also I have a question concerning the following sentence[1]:
"When there are more than four extents to a file, the rest of the
extents are indexed in a tree."
Does this mean that only four extents can be read sequentially in a
file that has only contiguous blocks of data, or because of the
extent cache, the whole file can be read sequentially anyway?

[1] https://en.wikipedia.org/wiki/Ext4#Features




2019-03-21 15:05:25

by Theodore Ts'o

Subject: Re: Question about ext4 extents and file fragmentation

On Thu, Mar 21, 2019 at 10:29:23AM +0100, Mikhail Morfikov wrote:
>
> Yes, I know that many things can happen during the 128M read. But we
> can assume that we have some simplified environment, where we have
> only one disk, one file we want to read at the moment, and we have
> time to do it without any external interference.
>
> If I understood correctly, as long as the extents reside on a contiguous
> region, they will be read sequentially without any delays, right? So if
> the file in question was one big contiguous region, would it be read
> sequentially from the beginning of the file to its end?

It *could* be read sequentially from the beginning of the file to the
end. There are many things that might cause that not to happen, that
have nothing to do with how we store the logical to physical map. For
example, some other process might be requesting disk reads that might
be interleaved with the reads for that file. If you try to read too
quickly, and the system stalls due to lack of space in the page cache,
that might force some writeback that will interrupt the contiguous
read. The possibilities are endless.

I hesitate to make a categorical statement, because I don't understand
why you are being monomaniacal about this.

> Also I have a question concerning the following sentence[1]:
> "When there are more than four extents to a file, the rest of the
> extents are indexed in a tree."
> Does this mean that only four extents can be read sequentially in a
> file that has only contiguous blocks of data, or because of the
> extent cache, the whole file can be read sequentially anyway?

If you really care about this, it's possible to use the ioctl
EXT4_IOC_PRECACHE_EXTENTS which will read the extent tree and cache it
in the extent status cache. The main use for this has been people who
want to make a really big file --- for example, it's possible to
create a single 10 TB file which is contiguous, and while the on-disk
extent tree might require a number of 4k blocks, it can be cached in a
single 12 byte extent status cache entry.
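
A minimal sketch of issuing that ioctl from user space. The request number is built by hand as _IO('f', 18), which is how the kernel headers define EXT4_IOC_PRECACHE_EXTENTS; the encoding used here assumes the generic Linux ioctl layout (x86, arm and similar), and the helper names are made up for illustration:

```python
import fcntl
import os


def ext4_ioc_precache_extents():
    """Return the request number for EXT4_IOC_PRECACHE_EXTENTS, i.e. _IO('f', 18).

    Assumption: the generic Linux ioctl encoding, where _IO(type, nr) is
    simply (type << 8) | nr. Some architectures encode _IO differently.
    """
    return (ord("f") << 8) | 18


def precache_extents(path):
    """Ask ext4 to read the file's extent tree into the extent status cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # The ioctl takes no argument; it raises OSError if the filesystem
        # does not support it (for example, when the file is not on ext4).
        fcntl.ioctl(fd, ext4_ioc_precache_extents())
    finally:
        os.close(fd)


# Example (hypothetical path): precache_extents("/bigfile")
```

The call only warms the in-memory mapping; it does not read or move any file data.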

The primary use case for this ioctl is for a *random* read workload if
there is a requirement for tail latencies. For certain workloads,
such as a distributed query of hundreds of disks to satisfy a single
search query, if a single read is slow, it will slow down the ability
to satisfy the entire search query. To avoid that, people will worry
about the 99th or even 99.9th percentile random read latency. And so
precaching the extent tree makes sense:

3. Fast is better than slow.
We know your time is valuable, so when you’re seeking an answer on
the web you want it right away–and we aim to please. We may be the
only people in the world who can say our goal is to have people
leave our website as quickly as possible. By shaving excess bits
and bytes from our pages and increasing the efficiency of our
serving environment, we’ve broken our own speed records many times
over, so that the average response time on a search result is a
fraction of a second....
- https://www.google.com/about/philosophy.html
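
The fan-out effect behind the tail-latency concern can be put in numbers. A sketch, assuming each disk independently has a 1% chance of answering slower than its own 99th-percentile latency; independence is a simplifying assumption, and the disk counts are illustrative:

```python
def prob_any_slow(num_disks, per_disk_slow_prob=0.01):
    """Probability that at least one of num_disks reads lands in the slow tail."""
    return 1.0 - (1.0 - per_disk_slow_prob) ** num_disks


if __name__ == "__main__":
    for n in (1, 10, 100, 300):
        print(f"{n:4d} disks: {prob_any_slow(n):.1%} of queries wait on a slow read")
```

With 100 disks per query, roughly 63% of queries would be held up by at least one tail-latency read under these assumptions, which is why the 99th or 99.9th percentile matters more than the average in that setting.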

But for a sequential read workload --- it really makes no sense to be
worried about this. For example, if you are doing a streaming video
read, the need to seek to read from the extent status tree is not
going to be noticed at all. An HD video stream is roughly 100MB /
minute. So once the system realizes that you are doing a sequential
read, read-ahead will automatically start pulling in new blocks ahead
of the video stream, and the need to seek to read the extent status
tree will be invisible. And if you are copying the file, the
percentage increase for periodically seeking to read in the extent
status tree is going to be so small it might not even be measurable.
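
Read-ahead starts on its own once the kernel detects a sequential pattern, but an application can also state its intent explicitly. A minimal sketch, assuming Linux and Python's os.posix_fadvise wrapper; the helper name read_sequentially and the 1 MiB chunk size are arbitrary choices for illustration:

```python
import os


def read_sequentially(path, chunk_size=1 << 20):
    """Read a file front to back in chunk_size pieces and return the byte count."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that access will be sequential so the kernel may read ahead
        # more aggressively. posix_fadvise is not available on every
        # platform, hence the guard.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


# Example (hypothetical path): read_sequentially("/bigfile")
```

The hint is advisory only; the kernel is free to ignore it, and the loop reads the same bytes either way.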

Which is why I'm really puzzled why you care.

- Ted


2019-03-21 15:59:47

by Mikhail Morfikov

Subject: Re: Question about ext4 extents and file fragmentation

On 21/03/2019 16:05, Theodore Ts'o wrote:
> It *could* be read sequentially from the beginning of the file to the
> end. There are many things that might cause that not to happen, that
> have nothing to do with how we store the logical to physical map.

And this is what I wanted to know, because some people say that if you
store a file in a filesystem, it can't be read sequentially as a whole
because of the filesystem layer (compared to "dd if=/dev/sda ..."). So
the filesystem layer doesn't really matter and doesn't really add any
additional delays compared to the raw read of a device when we deal with
data that is stored in contiguous blocks. I know that many things can
prevent the sequential read from happening, but I just wanted it to be
clarified.

Thank you for the answer, I really appreciate it.


