2014-07-17 03:37:59

by Andreas Dilger

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

On Jul 16, 2014, at 11:16 AM, Mason <[email protected]> wrote:
> (I hope you'll forgive me for reformatting the quote characters
> to my taste.)

Thank you.

> On 16/07/2014 17:16, John Stoffel wrote:
>> Mason wrote:
>>> I'm using Linux (3.1.10 at the moment) on an embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>>
>> Sounds like a Raspberry Pi... And have you investigated using
>> something like XFS as your filesystem instead?
>
> The system is a set-top box (DVB-S2 receiver). The system CPU is
> MIPS 74K, not ARM (not that it matters, in this case).
>
> No, I have not investigated other file systems (yet).
>
>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible". These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>>
>> Really... so you just need to create allocations of space as quickly
>> as possible,
>
> I may not have been clear. The creation needs to be fast (in UX terms,
> so less than 5-10 seconds), but it only occurs a few times during the
> lifetime of the system.
>
>> which will then be filled in later with actual data?
>
> Yes. In fact, I use the loopback device to format the file as an
> ext4 partition.
>
> The use case is
> - allocate a large file
> - stick a file system on it
> - store stuff (typically video files) inside this "private" FS
> - when the user decides he doesn't need it anymore, unmount and unlink
> (I also have a resize operation in there, but I wanted to get the
> basics before taking the hard stuff head on.)
>
> So, in the limit, we don't store anything at all: just create and
> immediately delete. This was my test.

I would agree that LVM is the real solution that you want to use.
It is specifically designed for this, and has much less overhead than
a filesystem on a loopback device on a file on another filesystem.
The amount of space overhead is tuneable, but typically the volumes
are allocated in multiples of 4MB chunks.

That said, I think you've found some kind of strange performance problem,
and it is worthwhile to figure this out.

>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Firstly, have you tried using "fallocate()" directly, instead of
posix_fallocate()? It may be (depending on your userspace) that
posix_fallocate() is writing zeroes to the file instead of using
the fallocate() syscall, and the kernel is busy cleaning up all
of the dirty pages when the file is unlinked. You could try using
strace to see what system calls are actually being used.

Secondly, where is the process actually stuck? From your output
above, the unlink() call takes no measurable time before returning,
so I don't see where it is actually stuck. Again, running your
test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
syscall is actually taking so much time to complete. I don't
think it is unlink().

Cheers, Andreas

2014-07-17 10:30:34

by Mason

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

Hello,

Andreas Dilger wrote:

> Mason wrote:
>
>> The use case is
>> - allocate a large file
>> - stick a file system on it
>> - store stuff (typically video files) inside this "private" FS
>> - when the user decides he doesn't need it anymore, unmount and unlink
>> (I also have a resize operation in there, but I wanted to get the
>> basics before taking the hard stuff head on.)
>>
>> So, in the limit, we don't store anything at all: just create and
>> immediately delete. This was my test.
>
> I would agree that LVM is the real solution that you want to use.
> It is specifically designed for this, and has much less overhead than
> a filesystem on a loopback device on a file on another filesystem.
> The amount of space overhead is tuneable, but typically the volumes
> are allocated in multiples of 4MB chunks.

I'll take a look at LVM. (But, at this point, it's too late to change
the architecture of the system.)

> That said, I think you've found some kind of strange performance problem,
> and it is worthwhile to figure this out.
>
>>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Preliminary info:

The partition was created/mounted with
$ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
$ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
(mount is busybox, in case it matters)

mke2fs 1.42.10 (18-May-2014)
/dev/sda1 contains a ext4 file system labelled 'ZOZO'
last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
Proceed anyway? (y,n) y
Creating filesystem with 104857600 4k blocks and 460800 inodes
Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

/dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
/* No support for xattr in this kernel */

# dumpe2fs -h /dev/sda1
dumpe2fs 1.42.10 (18-May-2014)
Filesystem volume name: ZOZO
Last mounted on: <not available>
Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 460800
Block count: 104857600
Reserved block count: 0
Free blocks: 104803944
Free inodes: 460789
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 999
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 144
Inode blocks per group: 9
Flex block group size: 16
Filesystem created: Thu Jul 17 11:14:27 2014
Last mount time: Thu Jul 17 11:14:29 2014
Last write time: Thu Jul 17 11:14:29 2014
Mount count: 1
Maximum mount count: -1
Last checked: Thu Jul 17 11:14:27 2014
Check interval: 0 (<none>)
Lifetime writes: 4883 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group unknown)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7

> Firstly, have you tried using "fallocate()" directly, instead of
> posix_fallocate()? It may be (depending on your userspace) that
> posix_fallocate() is writing zeroes to the file instead of using
> the fallocate() syscall, and the kernel is busy cleaning up all
> of the dirty pages when the file is unlinked. You could try using
> strace to see what system calls are actually being used.

Unfortunately, I'm using a prehistoric version of glibc (2.8)
that doesn't support the fallocate wrapper (added in 2.10).

I'm 70% sure that posix_fallocate() is not actually writing zeros
to the file, because when I tested it on ext2, creating a 300-GB
file took hours, literally (approx. 3 hours). The same operation
on ext4 takes a few seconds. (Although, now that I think of it,
it could be working asynchronously, or defer some operation, that
I eventually have to pay for on deletion.)

# time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
unlink(filename): 0 [1 ms]


12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
12:23:27.221029 SYS_4320() = 0 <0.412013>
12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231>
12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
12:23:27.643056 exit_group(0) = ?
0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+459minor)pagefaults 0swaps


AFAICT, SYS_4320() is fallocate.

/*
* Linux o32 style syscalls are in the range from 4000 to 4999.
*/
#define __NR_Linux 4000
#define __NR_fallocate (__NR_Linux + 320)


Where is the process stalling? That is a mystery. Seems it's stuck
in exit_group(), waiting for the kernel to clean up on its behalf?
Maybe I need ftrace, or something to profile the kernel?

> Secondly, where is the process actually stuck? From your output
> above, the unlink() call takes no measurable time before returning,
> so I don't see where it is actually stuck. Again, running your
> test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> syscall is actually taking so much time to complete. I don't
> think it is unlink().

See above, the process is stalled, but I don't know where!

--
Regards.

2014-07-17 10:40:24

by Lukas Czerner

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

On Thu, 17 Jul 2014, Mason wrote:

> Date: Thu, 17 Jul 2014 12:30:34 +0200
> From: Mason <[email protected]>
> To: Andreas Dilger <[email protected]>
> Cc: Ext4 Developers List <[email protected]>,
> linux-fsdevel <[email protected]>
> Subject: Re: After unlinking a large file on ext4,
> the process stalls for a long time
>
> Hello,
>
> Andreas Dilger wrote:
>
> > Mason wrote:
> >
> >> The use case is
> >> - allocate a large file
> >> - stick a file system on it
> >> - store stuff (typically video files) inside this "private" FS
> >> - when the user decides he doesn't need it anymore, unmount and unlink
> >> (I also have a resize operation in there, but I wanted to get the
> >> basics before taking the hard stuff head on.)
> >>
> >> So, in the limit, we don't store anything at all: just create and
> >> immediately delete. This was my test.
> >
> > I would agree that LVM is the real solution that you want to use.
> > It is specifically designed for this, and has much less overhead than
> > a filesystem on a loopback device on a file on another filesystem.
> > The amount of space overhead is tuneable, but typically the volumes
> > are allocated in multiples of 4MB chunks.
>
> I'll take a look at LVM. (But, at this point, it's too late to change
> the architecture of the system.)
>
> > That said, I think you've found some kind of strange performance problem,
> > and it is worthwhile to figure this out.
> >
> >>>> /tmp # time ./foo /mnt/hdd/xxx 5
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 10
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 100
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 300
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>
> Preliminary info:
>
> The partition was created/mounted with
> $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
> (mount is busybox, in case it matters)
>
> mke2fs 1.42.10 (18-May-2014)
> /dev/sda1 contains a ext4 file system labelled 'ZOZO'
> last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
> Proceed anyway? (y,n) y
> Creating filesystem with 104857600 4k blocks and 460800 inodes
> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000
>
> Allocating group tables: done
> Writing inode tables: done
> Writing superblocks and filesystem accounting information: done
>
> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
> /* No support for xattr in this kernel */
>
> # dumpe2fs -h /dev/sda1
> dumpe2fs 1.42.10 (18-May-2014)
> Filesystem volume name: ZOZO
> Last mounted on: <not available>
> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options: user_xattr acl
> Filesystem state: not clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 460800
> Block count: 104857600
> Reserved block count: 0
> Free blocks: 104803944
> Free inodes: 460789
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Reserved GDT blocks: 999
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 144
> Inode blocks per group: 9
> Flex block group size: 16
> Filesystem created: Thu Jul 17 11:14:27 2014
> Last mount time: Thu Jul 17 11:14:29 2014
> Last write time: Thu Jul 17 11:14:29 2014
> Mount count: 1
> Maximum mount count: -1
> Last checked: Thu Jul 17 11:14:27 2014
> Check interval: 0 (<none>)
> Lifetime writes: 4883 kB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group unknown)
> First inode: 11
> Inode size: 256
> Required extra isize: 28
> Desired extra isize: 28
> Default directory hash: half_md4
> Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7
>
> > Firstly, have you tried using "fallocate()" directly, instead of
> > posix_fallocate()? It may be (depending on your userspace) that
> > posix_fallocate() is writing zeroes to the file instead of using
> > the fallocate() syscall, and the kernel is busy cleaning up all
> > of the dirty pages when the file is unlinked. You could try using
> > strace to see what system calls are actually being used.
>
> Unfortunately, I'm using a prehistoric version of glibc (2.8)
> that doesn't support the fallocate wrapper (added in 2.10).
>
> I'm 70% sure that posix_fallocate() is not actually writing zeros
> to the file, because when I tested it on ext2, creating a 300-GB
> file took hours, literally (approx. 3 hours). The same operation
> on ext4 takes a few seconds. (Although, now that I think of it,
> it could be working asynchronously, or defer some operation, that
> I eventually have to pay for on deletion.)
>
> # time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
> unlink(filename): 0 [1 ms]
>
>
> 12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
> 12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
> 12:23:27.221029 SYS_4320() = 0 <0.412013>
> 12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
> 12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
> 12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
> 12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
> 12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
> 12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
> 12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231>
> 12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
> 12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
> 12:23:27.643056 exit_group(0) = ?
> 0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
> 0inputs+0outputs (0major+459minor)pagefaults 0swaps

So it really does not seem to be stalling in either fallocate or unlink.
Can you add a close() before the unlink, just to be sure of what's
happening there?

Thanks!
-Lukas


>
>
> AFAICT, SYS_4320() is fallocate.
>
> /*
> * Linux o32 style syscalls are in the range from 4000 to 4999.
> */
> #define __NR_Linux 4000
> #define __NR_fallocate (__NR_Linux + 320)
>
>
> Where is the process stalling? That is a mystery. Seems it's stuck
> in exit_group(), waiting for the kernel to clean up on its behalf?
> Maybe I need ftrace, or something to profile the kernel?
>
> > Secondly, where is the process actually stuck? From your output
> > above, the unlink() call takes no measurable time before returning,
> > so I don't see where it is actually stuck. Again, running your
> > test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> > syscall is actually taking so much time to complete. I don't
> > think it is unlink().
>
> See above, the process is stalled, but I don't know where!
>
>

2014-07-17 11:16:25

by Mason

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

Lukáš Czerner wrote:

> So it really does not seem to be stalling in either fallocate or unlink.
> Can you add a close() before the unlink, just to be sure of what's
> happening there?

Doh! Good catch! Unlinking was fast because the ref count didn't drop
to 0 on unlink; it only dropped on the implicit close done on exit,
which would explain why the process stalled "at the end".

If I unlink a closed file, it is indeed unlink that stalls.

[BTW, some of the e2fsprogs devs may be reading this. I suppose you
already know, but the cross-compile build was broken in 1.42.10.
I wrote a trivial patch to fix it (cf. the end of this message),
although I'm not sure I did it the canonical way.]


# time strace -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [412 ms]
close(fd): 0 [0 ms]
unlink(filename): 0 [111481 ms]

open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000456>
clock_gettime(CLOCK_MONOTONIC, {82152, 251657385}) = 0 <0.000085>
SYS_4320() = 0 <0.411628>
clock_gettime(CLOCK_MONOTONIC, {82152, 664179762}) = 0 <0.000089>
fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000094>
ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000128>
old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x773e4000 <0.000195>
write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000281>
clock_gettime(CLOCK_MONOTONIC, {82152, 668413115}) = 0 <0.000077>
close(3) = 0 <0.000119>
clock_gettime(CLOCK_MONOTONIC, {82152, 669249479}) = 0 <0.000129>
write(1, "close(fd): 0 [0 ms]\n", 20) = 20 <0.000145>
clock_gettime(CLOCK_MONOTONIC, {82152, 670361133}) = 0 <0.000078>
unlink("/mnt/hdd/xxx") = 0 <111.479283>
clock_gettime(CLOCK_MONOTONIC, {82264, 150551496}) = 0 <0.000080>
write(1, "unlink(filename): 0 [111481 ms]\n", 32) = 32 <0.000225>
exit_group(0) = ?

0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
0inputs+0outputs (0major+434minor)pagefaults 0swaps


For reference, here's my minimal test case:

#define _FILE_OFFSET_BITS 64
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <time.h>

#define BENCH(op) do { \
	struct timespec t0; clock_gettime(CLOCK_MONOTONIC, &t0); \
	int err = op; \
	struct timespec t1; clock_gettime(CLOCK_MONOTONIC, &t1); \
	int ms = (t1.tv_sec-t0.tv_sec)*1000 + (t1.tv_nsec-t0.tv_nsec)/1000000; \
	printf("%s: %d [%d ms]\n", #op, err, ms); } while (0)

int main(int argc, char **argv)
{
	if (argc != 3) { puts("Usage: prog filename size"); return 42; }

	char *filename = argv[1];
	int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0) { perror("open"); return 1; }

	long long size_in_GiB = atoi(argv[2]);
	BENCH(posix_fallocate(fd, 0, size_in_GiB << 30));
	BENCH(close(fd));
	BENCH(unlink(filename));
	return 0;
}


$ cat e2fsprogs-1.42.10.patch
diff -ur a/util/Makefile.in b/util/Makefile.in
--- a/util/Makefile.in	2014-05-15 19:04:08.000000000 +0200
+++ b/util/Makefile.in	2014-07-10 15:31:04.819352596 +0200
@@ -15,7 +15,7 @@
 
 .c.o:
 	$(E) "	CC $<"
-	$(Q) $(BUILD_CC) -c $(BUILD_CFLAGS) $< -o $@
+	$(Q) $(BUILD_CC) $(CPPFLAGS) -c $(BUILD_CFLAGS) $< -o $@
 	$(Q) $(CHECK_CMD) $(ALL_CFLAGS) $<
 
 PROGS= subst symlinks



--
Regards.

2014-07-17 13:37:55

by Theodore Ts'o

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

On Thu, Jul 17, 2014 at 01:17:11PM +0200, Mason wrote:
> unlink("/mnt/hdd/xxx") = 0 <111.479283>
>
> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
> 0inputs+0outputs (0major+434minor)pagefaults 0swaps

... and we're CPU bound inside the kernel.

Can you run perf so we can see exactly where we're spending the CPU?
You're not using a journal, so I'm pretty sure what you will find is
that we're spending all of our time in mb_free_blocks(), when it is
updating the internal mballoc buddy bitmaps.

With a journal, this work done by mb_free_blocks() is hidden in the
kjournal thread, and happens after the commit is completed, so it
won't block other file system operations (other than burning some
extra CPU on one of the multiple cores available on a typical x86
CPU).

Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
has native bit test/set/clear instructions, whereas the MIPS
architecture was designed by Prof. Hennessy at Stanford, who was a
doctrinaire RISC fanatic, so there would be no bitop instructions.

Even though I'm pretty sure what we'll find, knowing exactly *where*
in mb_free_blocks() or the function it calls would be helpful in
knowing what we need to optimize. So if you could try using perf
(assuming that perf is supported on MIPS; I'm not sure it is) that
would be really helpful.

Thanks,

- Ted

2014-07-17 16:07:30

by Mason

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

Theodore Ts'o wrote:

> Mason wrote:
>
>> unlink("/mnt/hdd/xxx") = 0 <111.479283>
>>
>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
>
> ... and we're CPU bound inside the kernel.
>
> Can you run perf so we can see exactly where we're spending the CPU?
> You're not using a journal, so I'm pretty sure what you will find is
> that we're spending all of our time in mb_free_blocks(), when it is
> updating the internal mballoc buddy bitmaps.
>
> With a journal, this work done by mb_free_blocks() is hidden in the
> kjournal thread, and happens after the commit is completed, so it
> won't block other file system operations (other than burning some
> extra CPU on one of the multiple cores available on a typical x86
> CPU).
>
> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
> has native bit test/set/clear instructions, whereas the MIPS
> architecture was designed by Prof. Hennessy at Stanford, who was a
> doctrinaire RISC fanatic, so there would be no bitop instructions.
>
> Even though I'm pretty sure what we'll find, knowing exactly *where*
> in mb_free_blocks() or the function it calls would be helpful in
> knowing what we need to optimize. So if you could try using perf
> (assuming that perf is supported on MIPS; I'm not sure it is) that
> would be really helpful.

Is perf "better" than oprofile? (For some metric)

I have enabled:

CONFIG_PERF_EVENTS=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_KRETPROBES=y

What command-line do you suggest I run to get the output you expect?
(I'll try to get it done, but I might have to wait two weeks before
I can run these tests.)

--
Regards.

2014-07-17 16:31:54

by Mason

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

On 17/07/2014 18:07, Mason wrote:

> Theodore Ts'o wrote:
>
>> Mason wrote:
>>
>>> unlink("/mnt/hdd/xxx") = 0 <111.479283>
>>>
>>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
>>
>> ... and we're CPU bound inside the kernel.
>>
>> Can you run perf so we can see exactly where we're spending the CPU?
>> You're not using a journal, so I'm pretty sure what you will find is
>> that we're spending all of our time in mb_free_blocks(), when it is
>> updating the internal mballoc buddy bitmaps.
>>
>> With a journal, this work done by mb_free_blocks() is hidden in the
>> kjournal thread, and happens after the commit is completed, so it
>> won't block other file system operations (other than burning some
>> extra CPU on one of the multiple cores available on a typical x86
>> CPU).
>>
>> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
>> has native bit test/set/clear instructions, whereas the MIPS
>> architecture was designed by Prof. Hennessy at Stanford, who was a
>> doctrinaire RISC fanatic, so there would be no bitop instructions.
>>
>> Even though I'm pretty sure what we'll find, knowing exactly *where*
>> in mb_free_blocks() or the function it calls would be helpful in
>> knowing what we need to optimize. So if you could try using perf
>> (assuming that the perf is supported MIPS; not sure if it does) that
>> would be really helpful.
>
> Is perf "better" than oprofile? (For some metric)
>
> I have enabled:
>
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
>
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

So much for oprofile...

CC arch/mips/oprofile/../../../drivers/oprofile/oprof.o
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function 'oprofile_init':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: (Each undeclared identifier is reported only once
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: for each function it appears in.)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function '__check_timer':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: At top level:
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared here (not in a function)
cc1: warnings being treated as errors
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: type defaults to 'int' in declaration of 'type name'
make[1]: *** [arch/mips/oprofile/../../../drivers/oprofile/oprof.o] Error 1
make: *** [arch/mips/oprofile] Error 2

Dunno if this happens on vanilla kernels, or if the ODM messed
something up (again).

$ ll tools/perf/arch/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 arm/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 powerpc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 s390/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sh/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sparc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 x86/

I'm not sure perf supports MIPS...

Or maybe it does

$ g -rni mips .
./Makefile:45: -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
Binary file ./.Makefile.swp matches
./perf.h:76:#ifdef __mips__
./perf.h:77:#include "../../arch/mips/include/asm/unistd.h"
./perf.h:79: ".set mips2\n\t" \
./perf.h:81: ".set mips0" \


--
Regards.

2014-07-18 09:29:34

by Lukas Czerner

Subject: Re: After unlinking a large file on ext4, the process stalls for a long time

On Thu, 17 Jul 2014, Mason wrote:

> Date: Thu, 17 Jul 2014 18:07:30 +0200
> From: Mason <[email protected]>
> To: Theodore Ts'o <[email protected]>
> Cc: Lukáš Czerner <[email protected]>, Andreas Dilger <[email protected]>,
> Ext4 Developers List <[email protected]>,
> linux-fsdevel <[email protected]>
> Subject: Re: After unlinking a large file on ext4,
> the process stalls for a long time
>
> Theodore Ts'o wrote:
>
> > Mason wrote:
> >
> >> unlink("/mnt/hdd/xxx") = 0 <111.479283>
> >>
> >> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
> >> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
> >
> > ... and we're CPU bound inside the kernel.
> >
> > Can you run perf so we can see exactly where we're spending the CPU?
> > You're not using a journal, so I'm pretty sure what you will find is
> > that we're spending all of our time in mb_free_blocks(), when it is
> > updating the internal mballoc buddy bitmaps.
> >
> > With a journal, this work done by mb_free_blocks() is hidden in the
> > kjournal thread, and happens after the commit is completed, so it
> > won't block other file system operations (other than burning some
> > extra CPU on one of the multiple cores available on a typical x86
> > CPU).
> >
> > Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
> > has native bit test/set/clear instructions, whereas the MIPS
> > architecture was designed by Prof. Hennessy at Stanford, who was a
> > doctrinaire RISC fanatic, so there would be no bitop instructions.
> >
> > Even though I'm pretty sure what we'll find, knowing exactly *where*
> > in mb_free_blocks() or the function it calls would be helpful in
> > knowing what we need to optimize. So if you could try using perf
> > (assuming that perf is supported on MIPS; I'm not sure it is) that
> > would be really helpful.
>
> Is perf "better" than oprofile? (For some metric)
>
> I have enabled:
>
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
>
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

If perf works on your system you can record data with

perf record -g ./test file <size>

and then report with

perf report --stdio

That should yield some interesting information about where we spend
the most time in the kernel.

Thanks!
-Lukas