2013-09-12 04:17:26

by Cuong Tran

Subject: Java Stop-the-World GC stall induced by FS flush or many large file deletions

We have seen GC stalls that are NOT due to the memory usage of our
applications.

The GC log reports the user and system CPU time of the GC threads, which
is almost 0, and the stop-the-world time, which can be multiple seconds.
This indicates the GC threads are waiting for IO, yet GC threads should
be CPU-bound in user mode.

We can reproduce the problem using a simple Java program that just
appends to a log file via log4j. If the test runs by itself, it does not
incur any GC stalls. However, if we also run a script that loops,
creating multiple large files via fallocate() and then deleting them,
GC stalls of 1+ seconds happen fairly predictably.
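
For reference, a rough sketch of the kind of script we run alongside the
Java test (the path, file count, and sizes here are illustrative, not our
exact values):

  #!/bin/bash
  # Repeatedly preallocate large files and then delete them on the same
  # filesystem the Java logger writes to.
  DIR=/data/gc-test                         # illustrative path
  mkdir -p "$DIR"
  while true; do
      for i in $(seq 1 10); do
          fallocate -l 1G "$DIR/big.$i"     # allocate blocks, no data written
      done
      rm -f "$DIR"/big.*                    # deleting them frees the blocks
      sleep 1
  done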

We can also reproduce the problem by periodically switching the log and
gzipping the older log. The IO device, a single disk drive, is overloaded
by the FS flush when this happens.

Our guess is that GC has to quiesce its threads, and if one of the
threads is stuck in the kernel (say, in uninterruptible sleep), GC has
to wait until that thread unblocks. In the meantime, it has already
stopped the world.

Another test that shows a similar problem does deferred writes to append
to a file. The latency of the deferred writes is normally very low, but
once in a while a write can take more than 1 second.

We would really appreciate it if you could shed some light on possible
causes (threads blocked on a journal checkpoint, delayed allocation that
can't proceed?). We can alleviate the problem by configuring
vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs to flush more
frequently, and thus even out the workload to the disk drive. But we
would like to know if there is a methodology to model the rate of
flushing vs. the rate of changes and the IO throughput of the drive
(SAS, 15K RPM).
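
(The tuning we experimented with looks roughly like the following; the
values are examples we tried, not a recommendation:)

  # Expire dirty data sooner and wake the writeback threads more often,
  # to spread the flush IO out instead of letting it bunch up.
  sysctl -w vm.dirty_expire_centisecs=500       # consider data old after 5s
  sysctl -w vm.dirty_writeback_centisecs=100    # run writeback every 1s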

Many thanks.


2013-09-12 05:45:10

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Awesome fix, and thanks for the very speedy response. I have some
questions. We delete files one at a time; would that lock up one core
or all cores?

And in our test, we use fallocate() without writing to the files. That
would still cause the blocks to be freed block-by-block, correct?
--Cuong

On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
<[email protected]> wrote:
> Hi,
>
> Large file deletions are likely to lock cpu for seconds if you're
> running non-preemptible kernel < 3.10.
> Make sure you have this change:
> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
> remember it right).
> Turning on preemption may be a good idea as well.
>
> Regards,
> Andrei.
>
> On 12.09.2013 00:18, Cuong Tran wrote:
>> We have seen GC stalls that are NOT due to memory usage of applications.
>>
>> GC log reports the CPU user and system time of GC threads, which are
>> almost 0, and stop-the-world time, which can be multiple seconds. This
>> indicates GC threads are waiting for IO but GC threads should be
>> CPU-bound in user mode.
>>
>> We could reproduce the problems using a simple Java program that just
>> appends to a log file via log4j. If the test just runs by itself, it
>> does not incur any GC stalls. However, if we run a script that enters
>> a loop to create multiple large file via falloc() and then deletes
>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>
>> We can also reproduce the problem by periodically switch the log and
>> gzip the older log. IO device, a single disk drive, is overloaded by
>> FS flush when this happens.
>>
>> Our guess is GC has to acquiesce its threads and if one of the threads
>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>> wait until this thread unblocks. In the mean time, it already stops
>> the world.
>>
>> Another test that shows similar problem is doing deferred writes to
>> append a file. Latency of deferred writes is very fast but once a
>> while, it can last more than 1 second.
>>
>> We would really appreciate if you could shed some light on possible
>> causes? (Threads blocked because of journal check point, delayed
>> allocation can't proceed?). We could alleviate the problem by
>> configuring expire_centisecs and writeback_centisecs to flush more
>> frequently, and thus even-out the workload to the disk drive. But we
>> would like to know if there is a methodology to model the rate of
>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>> RPM).
>>
>> Many thanks.
>>
>>
>

2013-09-12 05:47:46

by Sidorov, Andrei

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Hi,

Large file deletions are likely to lock up a CPU for seconds if you're
running a non-preemptible kernel < 3.10.
Make sure you have this change:
http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
remember it right).
Turning on preemption may be a good idea as well.
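
(A quick way to check what you are running, assuming the usual distro
layout with the kernel config under /boot:)

  uname -r                                  # kernel version
  grep PREEMPT /boot/config-$(uname -r)     # preemption model built in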

Regards,
Andrei.

On 12.09.2013 00:18, Cuong Tran wrote:
> We have seen GC stalls that are NOT due to memory usage of applications.
>
> GC log reports the CPU user and system time of GC threads, which are
> almost 0, and stop-the-world time, which can be multiple seconds. This
> indicates GC threads are waiting for IO but GC threads should be
> CPU-bound in user mode.
>
> We could reproduce the problems using a simple Java program that just
> appends to a log file via log4j. If the test just runs by itself, it
> does not incur any GC stalls. However, if we run a script that enters
> a loop to create multiple large file via falloc() and then deletes
> them, then GC stall of 1+ seconds can happen fairly predictably.
>
> We can also reproduce the problem by periodically switch the log and
> gzip the older log. IO device, a single disk drive, is overloaded by
> FS flush when this happens.
>
> Our guess is GC has to acquiesce its threads and if one of the threads
> is stuck in the kernel (say in non-interruptible mode). Then GC has to
> wait until this thread unblocks. In the mean time, it already stops
> the world.
>
> Another test that shows similar problem is doing deferred writes to
> append a file. Latency of deferred writes is very fast but once a
> while, it can last more than 1 second.
>
> We would really appreciate if you could shed some light on possible
> causes? (Threads blocked because of journal check point, delayed
> allocation can't proceed?). We could alleviate the problem by
> configuring expire_centisecs and writeback_centisecs to flush more
> frequently, and thus even-out the workload to the disk drive. But we
> would like to know if there is a methodology to model the rate of
> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
> RPM).
>
> Many thanks.
>
>


2013-09-12 06:01:54

by Sidorov, Andrei

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

It would lock up one core, whichever core jbd2/sdaX runs on. This will
usually happen upon a commit, which runs every x seconds, 5 by default
(see the "commit" mount option for ext4). I.e. deleting 5 files one by
one with a 1 second interval in between is basically the same as
deleting all of them "at once".
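
(If you want smaller but more frequent commits, something like this would
shorten the interval; the device and mount point are placeholders:)

  mount -o remount,commit=1 /dev/sdaX /mnt   # commit the journal every second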

Yes, fallocated files are the same wrt releasing blocks.

Regards,
Andrei.

On 12.09.2013 01:45, Cuong Tran wrote:
> Awesome fix and thanks for very speedy response. I have some
> questions. We delete files one at a time, and thus that would lock up
> one core or all cores?
>
> And in our test, we use falloc w/o writing to file. That would still
> cause freeing block-by-block, correct?
> --Cuong
>
> On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
> <[email protected]> wrote:
>> Hi,
>>
>> Large file deletions are likely to lock cpu for seconds if you're
>> running non-preemptible kernel < 3.10.
>> Make sure you have this change:
>> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
>> remember it right).
>> Turning on preemption may be a good idea as well.
>>
>> Regards,
>> Andrei.
>>
>> On 12.09.2013 00:18, Cuong Tran wrote:
>>> We have seen GC stalls that are NOT due to memory usage of applications.
>>>
>>> GC log reports the CPU user and system time of GC threads, which are
>>> almost 0, and stop-the-world time, which can be multiple seconds. This
>>> indicates GC threads are waiting for IO but GC threads should be
>>> CPU-bound in user mode.
>>>
>>> We could reproduce the problems using a simple Java program that just
>>> appends to a log file via log4j. If the test just runs by itself, it
>>> does not incur any GC stalls. However, if we run a script that enters
>>> a loop to create multiple large file via falloc() and then deletes
>>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>>
>>> We can also reproduce the problem by periodically switch the log and
>>> gzip the older log. IO device, a single disk drive, is overloaded by
>>> FS flush when this happens.
>>>
>>> Our guess is GC has to acquiesce its threads and if one of the threads
>>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>>> wait until this thread unblocks. In the mean time, it already stops
>>> the world.
>>>
>>> Another test that shows similar problem is doing deferred writes to
>>> append a file. Latency of deferred writes is very fast but once a
>>> while, it can last more than 1 second.
>>>
>>> We would really appreciate if you could shed some light on possible
>>> causes? (Threads blocked because of journal check point, delayed
>>> allocation can't proceed?). We could alleviate the problem by
>>> configuring expire_centisecs and writeback_centisecs to flush more
>>> frequently, and thus even-out the workload to the disk drive. But we
>>> would like to know if there is a methodology to model the rate of
>>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>>> RPM).
>>>
>>> Many thanks.
>>>
>>>
>

2013-09-12 06:08:42

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

My desktop has 8 cores, including hyperthreading. So deleting files
would lock up one core, but shouldn't that leave the GC threads
unaffected if core lock-up is the issue? Would the number of journal
records be proportional to the number of blocks deleted? And thus would
deleting N blocks one block at a time create N times more journal
records than deleting all N blocks in "one shot"?

--Cuong

On Wed, Sep 11, 2013 at 11:02 PM, Sidorov, Andrei
<[email protected]> wrote:
> It would lock-up one core whichever jdb/sdaX runs on. This will usually
> happen upon commit that runs every x seconds, 5 by default (see "commit"
> mount option for ext4). I.e. deleting 5 files one by one with 1 second
> interval in between is basically the same as deleting all of them "at once".
>
> Yes, fallocated files are the same wrt releasing blocks.
>
> Regards,
> Andrei.
>
> On 12.09.2013 01:45, Cuong Tran wrote:
>> Awesome fix and thanks for very speedy response. I have some
>> questions. We delete files one at a time, and thus that would lock up
>> one core or all cores?
>>
>> And in our test, we use falloc w/o writing to file. That would still
>> cause freeing block-by-block, correct?
>> --Cuong
>>
>> On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> Large file deletions are likely to lock cpu for seconds if you're
>>> running non-preemptible kernel < 3.10.
>>> Make sure you have this change:
>>> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
>>> remember it right).
>>> Turning on preemption may be a good idea as well.
>>>
>>> Regards,
>>> Andrei.
>>>
>>> On 12.09.2013 00:18, Cuong Tran wrote:
>>>> We have seen GC stalls that are NOT due to memory usage of applications.
>>>>
>>>> GC log reports the CPU user and system time of GC threads, which are
>>>> almost 0, and stop-the-world time, which can be multiple seconds. This
>>>> indicates GC threads are waiting for IO but GC threads should be
>>>> CPU-bound in user mode.
>>>>
>>>> We could reproduce the problems using a simple Java program that just
>>>> appends to a log file via log4j. If the test just runs by itself, it
>>>> does not incur any GC stalls. However, if we run a script that enters
>>>> a loop to create multiple large file via falloc() and then deletes
>>>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>>>
>>>> We can also reproduce the problem by periodically switch the log and
>>>> gzip the older log. IO device, a single disk drive, is overloaded by
>>>> FS flush when this happens.
>>>>
>>>> Our guess is GC has to acquiesce its threads and if one of the threads
>>>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>>>> wait until this thread unblocks. In the mean time, it already stops
>>>> the world.
>>>>
>>>> Another test that shows similar problem is doing deferred writes to
>>>> append a file. Latency of deferred writes is very fast but once a
>>>> while, it can last more than 1 second.
>>>>
>>>> We would really appreciate if you could shed some light on possible
>>>> causes? (Threads blocked because of journal check point, delayed
>>>> allocation can't proceed?). We could alleviate the problem by
>>>> configuring expire_centisecs and writeback_centisecs to flush more
>>>> frequently, and thus even-out the workload to the disk drive. But we
>>>> would like to know if there is a methodology to model the rate of
>>>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>>>> RPM).
>>>>
>>>> Many thanks.
>>>>
>>>>
>>
>

2013-09-12 07:45:12

by Zheng Liu

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Hello Cuong,

Could you please tell us the kernel version? It would also help if you
could paste the output of the following commands:
* sudo tune2fs -l ${DEV}
* cat /proc/mounts | grep ${DEV}

I want to know which features are enabled in your file system. From
your description, I guess delayed allocation is enabled, so I suggest
you try disabling it. You can do that with the following command:
* sudo mount -t ext4 -o remount,nodelalloc ${DEV} ${MNT}

Regards,
- Zheng

On Wed, Sep 11, 2013 at 09:17:26PM -0700, Cuong Tran wrote:
> We have seen GC stalls that are NOT due to memory usage of applications.
>
> GC log reports the CPU user and system time of GC threads, which are
> almost 0, and stop-the-world time, which can be multiple seconds. This
> indicates GC threads are waiting for IO but GC threads should be
> CPU-bound in user mode.
>
> We could reproduce the problems using a simple Java program that just
> appends to a log file via log4j. If the test just runs by itself, it
> does not incur any GC stalls. However, if we run a script that enters
> a loop to create multiple large file via falloc() and then deletes
> them, then GC stall of 1+ seconds can happen fairly predictably.
>
> We can also reproduce the problem by periodically switch the log and
> gzip the older log. IO device, a single disk drive, is overloaded by
> FS flush when this happens.
>
> Our guess is GC has to acquiesce its threads and if one of the threads
> is stuck in the kernel (say in non-interruptible mode). Then GC has to
> wait until this thread unblocks. In the mean time, it already stops
> the world.
>
> Another test that shows similar problem is doing deferred writes to
> append a file. Latency of deferred writes is very fast but once a
> while, it can last more than 1 second.
>
> We would really appreciate if you could shed some light on possible
> causes? (Threads blocked because of journal check point, delayed
> allocation can't proceed?). We could alleviate the problem by
> configuring expire_centisecs and writeback_centisecs to flush more
> frequently, and thus even-out the workload to the disk drive. But we
> would like to know if there is a methodology to model the rate of
> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
> RPM).
>
> Many thanks.

2013-09-12 11:46:59

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Hi Zheng, this is the output of tune2fs:
tune2fs 1.41.12 (17-May-2010)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: 358707f1-8997-4e96-8110-d9f929073e68
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file
huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: journal_data_writeback user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 23281664
Block count: 93110272
Reserved block count: 4655513
Free blocks: 44878839
Free inodes: 22074662
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1001
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Fri Oct 7 05:01:07 2011
Last mount time: Mon Sep 9 18:49:39 2013
Last write time: Wed Aug 21 10:48:03 2013
Mount count: 23
Maximum mount count: -1
Last checked: Fri Oct 7 05:01:07 2011
Check interval: 0 (<none>)
Lifetime writes: 1795 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 4207998
Default directory hash: half_md4
Directory Hash Seed: 4de319c2-fc2d-4771-b4de-b63acd610ba0
Journal backup: inode blocks

And these are the mount options:
UUID=358707f1-8997-4e96-8110-d9f929073e68 / ext4
defaults 1 1


Yes, I forgot to mention that in the deferred-write test, if we turn
off delayed allocation or use preallocation, the problem is largely
gone. I say largely gone because it still happens, but much less
frequently.
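
(By preallocation I mean something along these lines before the test
starts; the path and size are just examples:)

  fallocate -l 512M /path/to/test.log   # reserve blocks for the log up front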

Thanks.

--Cuong

On Thu, Sep 12, 2013 at 12:46 AM, Zheng Liu <[email protected]> wrote:
> Hello Cuong,
>
> Could you please tell us the kernel version? Meanwhile it would be
> better if you could paste the result of the following commands:
> * sudo tune2fs -l ${DEV}
> * cat /proc/mounts | grep ${DEV}
>
> I want to know which feature is enabled in your file system. From your
> description, I guess delayed allocation is enabled. So I suggest that
> you can try to disable it. You can disable it using the following
> command:
> * sudo mount -t ext4 -o remount,nodelalloc ${DEV} ${MNT}
>
> Regards,
> - Zheng
>
> On Wed, Sep 11, 2013 at 09:17:26PM -0700, Cuong Tran wrote:
>> We have seen GC stalls that are NOT due to memory usage of applications.
>>
>> GC log reports the CPU user and system time of GC threads, which are
>> almost 0, and stop-the-world time, which can be multiple seconds. This
>> indicates GC threads are waiting for IO but GC threads should be
>> CPU-bound in user mode.
>>
>> We could reproduce the problems using a simple Java program that just
>> appends to a log file via log4j. If the test just runs by itself, it
>> does not incur any GC stalls. However, if we run a script that enters
>> a loop to create multiple large file via falloc() and then deletes
>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>
>> We can also reproduce the problem by periodically switch the log and
>> gzip the older log. IO device, a single disk drive, is overloaded by
>> FS flush when this happens.
>>
>> Our guess is GC has to acquiesce its threads and if one of the threads
>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>> wait until this thread unblocks. In the mean time, it already stops
>> the world.
>>
>> Another test that shows similar problem is doing deferred writes to
>> append a file. Latency of deferred writes is very fast but once a
>> while, it can last more than 1 second.
>>
>> We would really appreciate if you could shed some light on possible
>> causes? (Threads blocked because of journal check point, delayed
>> allocation can't proceed?). We could alleviate the problem by
>> configuring expire_centisecs and writeback_centisecs to flush more
>> frequently, and thus even-out the workload to the disk drive. But we
>> would like to know if there is a methodology to model the rate of
>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>> RPM).
>>
>> Many thanks.

2013-09-12 15:47:33

by Sidorov, Andrei

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

If there are threads bound to that core, they probably won't be able to
make progress for the duration of the commit. I don't think the
scheduler would migrate them to a different core immediately. So, if GC
wants to talk to these threads, it would block as well.

Of course deleting many small files is much slower than deleting a
single file of the same total size. The number of records is roughly the
number of files deleted, split at allocation group boundaries. The group
size is 128M unless you enabled the bigalloc feature. That is, in the
simplest case, the release-blocks entry for a 256M file contains one
record with 2 groups to free, but the release-blocks entry for 256 1M
files will contain 256 records that have to be processed separately.
Before that change, the commit time for removing one 256M file was the
same as removing 256 1M files (with respect to the time taken to release
blocks). After the change, releasing the blocks of a 256M file takes 2
"iterations" as opposed to 256.

Basically, there is no such thing as "deleting N blocks", unless you
delete a file by progressively truncating it towards zero.

Regards,
Andrei.

On 12.09.2013 02:08, Cuong Tran wrote:
> My desk top has 8 cores, including hyperthreading. Thus deleting files
> would lock up one core but that should not affect GC threads if core
> lock-up is an issue? Would # journal records be proportional to #
> blocks deleted. And thus deleting N blocks, one block at a time would
> create N times more journal records than deleting all N blocks in "one
> shot"?
>
> --Cuong
>
> On Wed, Sep 11, 2013 at 11:02 PM, Sidorov, Andrei
> <[email protected]> wrote:
>> It would lock-up one core whichever jdb/sdaX runs on. This will usually
>> happen upon commit that runs every x seconds, 5 by default (see "commit"
>> mount option for ext4). I.e. deleting 5 files one by one with 1 second
>> interval in between is basically the same as deleting all of them "at once".
>>
>> Yes, fallocated files are the same wrt releasing blocks.
>>
>> Regards,
>> Andrei.
>>
>> On 12.09.2013 01:45, Cuong Tran wrote:
>>> Awesome fix and thanks for very speedy response. I have some
>>> questions. We delete files one at a time, and thus that would lock up
>>> one core or all cores?
>>>
>>> And in our test, we use falloc w/o writing to file. That would still
>>> cause freeing block-by-block, correct?
>>> --Cuong
>>>
>>> On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> Large file deletions are likely to lock cpu for seconds if you're
>>>> running non-preemptible kernel < 3.10.
>>>> Make sure you have this change:
>>>> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
>>>> remember it right).
>>>> Turning on preemption may be a good idea as well.
>>>>
>>>> Regards,
>>>> Andrei.
>>>>
>>>> On 12.09.2013 00:18, Cuong Tran wrote:
>>>>> We have seen GC stalls that are NOT due to memory usage of applications.
>>>>>
>>>>> GC log reports the CPU user and system time of GC threads, which are
>>>>> almost 0, and stop-the-world time, which can be multiple seconds. This
>>>>> indicates GC threads are waiting for IO but GC threads should be
>>>>> CPU-bound in user mode.
>>>>>
>>>>> We could reproduce the problems using a simple Java program that just
>>>>> appends to a log file via log4j. If the test just runs by itself, it
>>>>> does not incur any GC stalls. However, if we run a script that enters
>>>>> a loop to create multiple large file via falloc() and then deletes
>>>>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>>>>
>>>>> We can also reproduce the problem by periodically switch the log and
>>>>> gzip the older log. IO device, a single disk drive, is overloaded by
>>>>> FS flush when this happens.
>>>>>
>>>>> Our guess is GC has to acquiesce its threads and if one of the threads
>>>>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>>>>> wait until this thread unblocks. In the mean time, it already stops
>>>>> the world.
>>>>>
>>>>> Another test that shows similar problem is doing deferred writes to
>>>>> append a file. Latency of deferred writes is very fast but once a
>>>>> while, it can last more than 1 second.
>>>>>
>>>>> We would really appreciate if you could shed some light on possible
>>>>> causes? (Threads blocked because of journal check point, delayed
>>>>> allocation can't proceed?). We could alleviate the problem by
>>>>> configuring expire_centisecs and writeback_centisecs to flush more
>>>>> frequently, and thus even-out the workload to the disk drive. But we
>>>>> would like to know if there is a methodology to model the rate of
>>>>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>>>>> RPM).
>>>>>
>>>>> Many thanks.
>>>>>
>>>>>
>

2013-09-12 16:58:50

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Andrei, regarding core binding, our test program has only 1 thread,
which appends to the log. I did not explicitly bind this test to any
core. Even without core binding, would a thread already scheduled on
that core (say due to NUMA) get stuck until the core is available, or
does the kernel migrate this poor thread?

Regarding journaling, your explanation is very clear. I read that a
record takes up 1 to multiple buffers, 4 KB each, and that the default
journal size is 128 MB regardless of partition size. Thus the journal
can hold up to roughly 32,000 records before it has to checkpoint. Is
this correct?
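
(To double-check the journal size on my partition, I assume something
like this should show it; the device is a placeholder:)

  debugfs -R "stat <8>" /dev/sdaX   # inode 8 is the journal inode; look at Size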

Thanks again for your help.

--Cuong

On Thu, Sep 12, 2013 at 8:47 AM, Sidorov, Andrei
<[email protected]> wrote:
> If there are threads bound to this core, they probably won't be able to
> progress for the time of commit. I don't think scheduler would migrate
> them to a different core immediately. So, if GC wants to talk to these
> threads, it would block as well.
>
> Of course deleting many small files is much slower than deleting single
> file of the same size. Number of records is roughly the number of files
> deleted, split at allocation group boundaries. The group size is 128M
> unless you enabled bigalloc feature. That is in simplest case a release
> blocks entry for 256M file contains one record with 2 groups to free.
> But release blocks entry for 256 1M files will contain 256 records that
> are to be precessed separately. Before that change, commit time for
> removing one 256M is the same as removing 256 1M files (wrt to time
> taken to release blocks). After the change, releasing blocks of 256M
> file would take 2 "iterations" as opposed to 256.
>
> Basically, there is no thing like "deleting N blocks", unless you delete
> a file by progressively truncating it towards zero.
>
> Regards,
> Andrei.
>
> On 12.09.2013 02:08, Cuong Tran wrote:
>> My desk top has 8 cores, including hyperthreading. Thus deleting files
>> would lock up one core but that should not affect GC threads if core
>> lock-up is an issue? Would # journal records be proportional to #
>> blocks deleted. And thus deleting N blocks, one block at a time would
>> create N times more journal records than deleting all N blocks in "one
>> shot"?
>>
>> --Cuong
>>
>> On Wed, Sep 11, 2013 at 11:02 PM, Sidorov, Andrei
>> <[email protected]> wrote:
>>> It would lock-up one core whichever jdb/sdaX runs on. This will usually
>>> happen upon commit that runs every x seconds, 5 by default (see "commit"
>>> mount option for ext4). I.e. deleting 5 files one by one with 1 second
>>> interval in between is basically the same as deleting all of them "at once".
>>>
>>> Yes, fallocated files are the same wrt releasing blocks.
>>>
>>> Regards,
>>> Andrei.
>>>
>>> On 12.09.2013 01:45, Cuong Tran wrote:
>>>> Awesome fix and thanks for very speedy response. I have some
>>>> questions. We delete files one at a time, and thus that would lock up
>>>> one core or all cores?
>>>>
>>>> And in our test, we use falloc w/o writing to file. That would still
>>>> cause freeing block-by-block, correct?
>>>> --Cuong
>>>>
>>>> On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
>>>> <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> Large file deletions are likely to lock cpu for seconds if you're
>>>>> running non-preemptible kernel < 3.10.
>>>>> Make sure you have this change:
>>>>> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
>>>>> remember it right).
>>>>> Turning on preemption may be a good idea as well.
>>>>>
>>>>> Regards,
>>>>> Andrei.
>>>>>
>>>>> On 12.09.2013 00:18, Cuong Tran wrote:
>>>>>> We have seen GC stalls that are NOT due to memory usage of applications.
>>>>>>
>>>>>> GC log reports the CPU user and system time of GC threads, which are
>>>>>> almost 0, and stop-the-world time, which can be multiple seconds. This
>>>>>> indicates GC threads are waiting for IO but GC threads should be
>>>>>> CPU-bound in user mode.
>>>>>>
>>>>>> We could reproduce the problems using a simple Java program that just
>>>>>> appends to a log file via log4j. If the test just runs by itself, it
>>>>>> does not incur any GC stalls. However, if we run a script that enters
>>>>>> a loop to create multiple large file via falloc() and then deletes
>>>>>> them, then GC stall of 1+ seconds can happen fairly predictably.
>>>>>>
>>>>>> We can also reproduce the problem by periodically switch the log and
>>>>>> gzip the older log. IO device, a single disk drive, is overloaded by
>>>>>> FS flush when this happens.
>>>>>>
>>>>>> Our guess is GC has to acquiesce its threads and if one of the threads
>>>>>> is stuck in the kernel (say in non-interruptible mode). Then GC has to
>>>>>> wait until this thread unblocks. In the mean time, it already stops
>>>>>> the world.
>>>>>>
>>>>>> Another test that shows similar problem is doing deferred writes to
>>>>>> append a file. Latency of deferred writes is very fast but once a
>>>>>> while, it can last more than 1 second.
>>>>>>
>>>>>> We would really appreciate if you could shed some light on possible
>>>>>> causes? (Threads blocked because of journal check point, delayed
>>>>>> allocation can't proceed?). We could alleviate the problem by
>>>>>> configuring expire_centisecs and writeback_centisecs to flush more
>>>>>> frequently, and thus even-out the workload to the disk drive. But we
>>>>>> would like to know if there is a methodology to model the rate of
>>>>>> flush vs. rate of changes and IO throughput of the drive (SAS, 15K
>>>>>> RPM).
>>>>>>
>>>>>> Many thanks.
>>>>>>
>>>>>>
>>
>

2013-09-12 19:02:53

by Theodore Ts'o

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Are you absolutely certain your JVM isn't attempting to write to any
files in its GC thread? Say, to do some kind of logging? It might be
worth stracing the JVM and correlating the GC stalls with any syscalls
that might have been issued from the JVM GC thread.
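
(Something along these lines would give per-thread timestamps and
syscall durations to line up against the GC log; the pid is a
placeholder:)

  strace -f -tt -T -e trace=write,fsync,fdatasync -p <jvm-pid> -o jvm.strace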

Especially in the case of the FS Flush, the writeback thread isn't CPU
bound. It will wait for the writeback to complete, but while it's
waiting, other processes or threads will be allowed to run on the CPU.

Now, if the GC thread tries to do some kind of fs operation which
requires writing to the file system, and the file system is trying to
start a jbd transaction commit, file system operations can block until
all of the jbd handles associated with the previous commit complete. If
your storage devices are slow, or you are using a block cgroup to
control how much I/O bandwidth a particular cgroup can use, this can end
up causing a priority inversion: a low-priority cgroup takes a while to
complete its I/O, which stalls the jbd commit completion, which in turn
causes new ext4 operations to stall waiting to start a new jbd handle.

So you could have a stall happening, if it's taking a long time for
commits to complete, but it might be completely unrelated to a GC
stall.

If you enable the jbd2_run_stats tracepoint, you can get some
interesting numbers about how long the various phases of the jbd2
commit are taking.
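
(For example, via the ftrace interface, assuming debugfs is mounted in
the usual place:)

  cd /sys/kernel/debug/tracing
  echo 1 > events/jbd2/jbd2_run_stats/enable
  echo 1 > events/jbd2/jbd2_checkpoint_stats/enable
  cat trace_pipe    # stream the tracepoint output as commits happen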

- Ted

2013-09-13 15:25:30

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Thanks Ted, that's great info.
I did strace, and some stalls can be correlated with the GC thread
writing to the log, but others cannot.

I turned on the jbd2_run_stats tracepoint, and this is a sample trace
from creating and deleting 300 100MB files in a loop. (My desktop is
running low on disk space!) No Java test is running. I also turned on
jbd2_checkpoint_stats. Not sure whether the "running" count being in the
5000s or the frequency of jbd2_checkpoint_stats has any relevance.

<...>-534 [003] 64044.506552: jbd2_checkpoint_stats: dev
sda1 tid 5433920 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64044.506555: jbd2_run_stats: dev sda1
tid 5433921 wait 0 running 5003 locked 0 flushing 0 logging 13
handle_count 1587 blocks 6 blocks_logged 7
<...>-534 [003] 64049.505782: jbd2_checkpoint_stats: dev
sda1 tid 5433921 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64049.505784: jbd2_run_stats: dev sda1
tid 5433922 wait 0 running 5003 locked 0 flushing 0 logging 15
handle_count 1585 blocks 6 blocks_logged 7
<...>-534 [003] 64149.417563: jbd2_checkpoint_stats: dev
sda1 tid 5433941 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64149.417565: jbd2_run_stats: dev sda1
tid 5433942 wait 0 running 5003 locked 0 flushing 0 logging 14
handle_count 1600 blocks 6 blocks_logged 7
<...>-534 [003] 64234.335940: jbd2_checkpoint_stats: dev
sda1 tid 5433958 chp_time 0 forced_to_close 0 written 0 dropped 6
<...>-534 [003] 64234.335942: jbd2_run_stats: dev sda1
tid 5433959 wait 0 running 5000 locked 0 flushing 0 logging 13
handle_count 1574 blocks 6 blocks_logged 7
<...>-534 [004] 64658.799432: jbd2_run_stats: dev sda1
tid 5434029 wait 0 running 5004 locked 0 flushing 0 logging 66
handle_count 442 blocks 1139 blocks_logged 1143
<...>-534 [000] 64707.948801: jbd2_run_stats: dev sda1
tid 5434034 wait 0 running 5002 locked 0 flushing 0 logging 73
handle_count 8416 blocks 1475 blocks_logged 1480
<...>-534 [000] 64722.940015: jbd2_run_stats: dev sda1
tid 5434037 wait 0 running 5008 locked 0 flushing 0 logging 72
handle_count 8442 blocks 1479 blocks_logged 1484
<...>-534 [001] 64727.933030: jbd2_run_stats: dev sda1
tid 5434038 wait 0 running 5001 locked 0 flushing 0 logging 73
handle_count 19208 blocks 1389 blocks_logged 1394
<...>-534 [001] 64737.922818: jbd2_run_stats: dev sda1
tid 5434040 wait 0 running 5002 locked 0 flushing 0 logging 77
handle_count 10147 blocks 1405 blocks_logged 1410
<...>-534 [001] 64742.908500: jbd2_run_stats: dev sda1
tid 5434041 wait 0 running 5003 locked 0 flushing 0 logging 69
handle_count 8474 blocks 1219 blocks_logged 1223
<...>-534 [001] 64747.901419: jbd2_run_stats: dev sda1
tid 5434042 wait 0 running 5003 locked 0 flushing 0 logging 66
handle_count 8452 blocks 1206 blocks_logged 1210
<...>-534 [001] 64752.898315: jbd2_checkpoint_stats: dev
sda1 tid 5434041 chp_time 0 forced_to_close 0 written 0 dropped 1219

And this is a sample of the traces for creating and deleting one 20 GB file in a loop:

<...>-534 [001] 64511.057009: jbd2_run_stats: dev sda1
tid 5434012 wait 0 running 5460 locked 0 flushing 0 logging 47
handle_count 2569 blocks 299 blocks_logged 300
<...>-534 [001] 64517.051769: jbd2_run_stats: dev sda1
tid 5434013 wait 0 running 5592 locked 0 flushing 0 logging 52
handle_count 3847 blocks 583 blocks_logged 585
<...>-534 [000] 64522.958310: jbd2_run_stats: dev sda1
tid 5434014 wait 0 running 5905 locked 0 flushing 0 logging 54
handle_count 2570 blocks 572 blocks_logged 574
<...>-534 [001] 64528.027403: jbd2_run_stats: dev sda1
tid 5434015 wait 0 running 5085 locked 0 flushing 0 logging 47
handle_count 3845 blocks 281 blocks_logged 282
<...>-534 [000] 64534.019586: jbd2_checkpoint_stats: dev
sda1 tid 5434012 chp_time 0 forced_to_close 0 written 0 dropped 299
<...>-534 [000] 64534.019610: jbd2_run_stats: dev sda1
tid 5434016 wait 0 running 5490 locked 0 flushing 0 logging 48
handle_count 2568 blocks 298 blocks_logged 299
<...>-534 [001] 64540.255202: jbd2_checkpoint_stats: dev
sda1 tid 5434013 chp_time 0 forced_to_close 0 written 0 dropped 583
<...>-534 [001] 64540.255206: jbd2_run_stats: dev sda1
tid 5434017 wait 0 running 5581 locked 283 flushing 0 logging 54
handle_count 3849 blocks 692 blocks_logged 695
<...>-534 [001] 64547.069532: jbd2_checkpoint_stats: dev
sda1 tid 5434016 chp_time 0 forced_to_close 0 written 0 dropped 8
<...>-534 [001] 64547.114704: jbd2_checkpoint_stats: dev
sda1 tid 5434015 chp_time 0 forced_to_close 0 written 0 dropped 260
<...>-534 [001] 64547.114724: jbd2_checkpoint_stats: dev
sda1 tid 5434014 chp_time 0 forced_to_close 0 written 0 dropped 571
<...>-534 [001] 64547.114729: jbd2_checkpoint_stats: dev
sda1 tid 5434017 chp_time 0 forced_to_close 0 written 0 dropped 8
<...>-534 [001] 64547.114731: jbd2_run_stats: dev sda1
tid 5434018 wait 0 running 5603 locked 276 flushing 0 logging 45
handle_count 3851 blocks 281 blocks_logged 282
<...>-534 [001] 64553.974915: jbd2_run_stats: dev sda1
tid 5434019 wait 0 running 5593 locked 274 flushing 0 logging 48
handle_count 3849 blocks 302 blocks_logged 303
<...>-534 [001] 64559.979010: jbd2_run_stats: dev sda1
tid 5434020 wait 0 running 5005 locked 0 flushing 0 logging 56
handle_count 3847 blocks 669 blocks_logged 671
<...>-534 [001] 64565.952850: jbd2_run_stats: dev sda1
tid 5434021 wait 0 running 5400 locked 0 flushing 0 logging 52
handle_count 12933 blocks 559 blocks_logged 561

It would be awesome if you could shed some light on the counters and
the interactions.

Thanks again.

--Cuong

On Thu, Sep 12, 2013 at 12:02 PM, Theodore Ts'o <[email protected]> wrote:
> Are you absolutely certain your JVM attempting to write to any files
> in its GC thread? Say, to do some kind of logging? It might be worth
> stracing the JVM and correlating the GC stall with any syscalls that
> might have been issued from the JVM GC thread.
>
> Especially in the case of the FS Flush, the writeback thread isn't CPU
> bound. It will wait for the writeback to complete, but while it's
> waiting, other processes or threads will be allowed to run on the CPU.
>
> Now, if the GC thread tries to do some kind of fs operation which
> requires writing to the file system, and the file sytstem is trying to
> start a jbd transaction commit, file system operations can block until
> all of the jbd handles associated with the previous commit can
> complete. If you are storage devices are slow, or you are using a
> block cgroup to control how much I/O bandwidth a particular cgroup
> could use, this can end up causing a priority inversion where a low
> priority cgroup takes a while to complete, this can stall the jbd
> commit completion, and this can cause new ext4 operations can stall
> waiting to start a new jbd handle.
>
> So you could have a stall happening, if it's taking a long time for
> commits to complete, but it might be completely unrelated to a GC
> stall.
>
> If you enable the jbd2_run_stats tracepoint, you can get some
> interesting numbers about how long the various phases of the jbd2
> commit are taking.
>
> - Ted

2013-09-13 18:36:19

by Theodore Ts'o

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

On Fri, Sep 13, 2013 at 08:25:30AM -0700, Cuong Tran wrote:
> I did strace and some stalls can be correlated to GC thread writing to
> the log but others are not.

Something else which just came to mind --- depending on the kernel
version which you are using, you may be getting hit by stable page
writebacks. This is where a process tries to modify a mmap'ed page
while it is being written back to disk. With stable page writeback,
an attempt to modify a page which is being written back to disk will
cause a page fault, and that process will block until the page is
written back to disk.

This is a feature which is required for certain block devices such as
iSCSI, SAS drives with DIF and RAID 5, since if the page changes out
from under the writeback, the checksum (or parity strip in the case of
RAID 5) of the page will be broken. However, it can cause random
process stalls.

Some of us (Google and Taobao) have hacked kernels which forcibly
disable stable page writeback, because it causes latency problems.
(Since we weren't using any of the block devices that require this
feature, it was OK.) In the most recent kernels, there are some
changes which will automatically disable stable page writebacks for
those block devices that don't need it.

> I turned on the jbd2_run_stats trace point, and this is sample trace
> of creating and deleting 300 100MB files in a loop. (My desktop is
> running low on disk space!). No Java test is running. I also turned on
> jbd2_checkpoint_stats. Not sure if the count "running" in 5000s or
> frequency of jbd2_checkpoint_stats has any relevance.
>
> <...>-534 [003] 64044.506552: jbd2_checkpoint_stats: dev
> sda1 tid 5433920 chp_time 0 forced_to_close 0 written 0 dropped 6

The check point time is zero, so we don't have any problems here.
That's fine.

> <...>-534 [003] 64044.506555: jbd2_run_stats: dev sda1
> tid 5433921 wait 0 running 5003 locked 0 flushing 0 logging 13
> handle_count 1587 blocks 6 blocks_logged 7

The "run time" is the time between the previous transaction and when
the current transaction starts committing. Five seconds (all of the
times in jbd2_run_stats are in milliseconds) is the default commit
time. This implies that the transaction was committed because of the
commit timeout, and not because either (a) an fsync() or fdatasync()
or (b) the number of changed blocks was too big, such that we forced
the commit to start.

So for this test, all of the jbd2_run_stats don't show anything
interesting. We are spending 13-15ms writing a half-dozen or so
blocks to the journal. The handle count is the number of individual
file system operations in the commit, but all of these handles are
touching only 6 metadata blocks. I'm guessing this means you are
writing those 100MB files in relatively small chunks, which explains
why there are 1500+ handles started but so few metadata blocks
(primarily bitmap allocation blocks and inode table blocks) being
modified in those 5 seconds.


> <...>-534 [000] 64707.948801: jbd2_run_stats: dev sda1
> tid 5434034 wait 0 running 5002 locked 0 flushing 0 logging 73
> handle_count 8416 blocks 1475 blocks_logged 1480

OK, this is a bit more typical; in this five second range, we modified
1475 metadata blocks across 8416 file system operations. Note that
the commit operation only took a total of 73 milliseconds, though.
(That's because the write to the journal is a sequential write, which
tends to be fairly efficient for HDD's.)

The bottom line is that it doesn't look like the journal commits are
actually taking that long. I certainly don't see anything to explain
GC stalls in the one second range.

Regards,

- Ted

2013-09-14 18:47:10

by Cuong Tran

Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large file deletions

Sorry for the belated response; I am packing for vacation.

The point about stable page writeback is very interesting. I will keep
it in mind, because other apps use mmap, but this one does not.

Oh, I was not very clear in my earlier e-mails. The jbd2_run_stats
traces are for the tests that fallocate() large files (hence you did not
see many writes) and then delete them. The Java app is not running in
those traces. I will repeat the run, this time with the Java app, and
let you know.

I hope you don't mind the following questions:

1. We observed that ext4 mounted with "writeback" seems to be less
likely to stall than when mounted with "ordered". Is this because
"ordered" flushes the data blocks of an inode before the corresponding
journal buffers, and hence threads writing to the same inode are more
likely to block (and stall) during the journal commit? In our test, all
threads write to the same inode.

2. We also observed that the threads are more likely to stall when
writeback kicks in (dirty_background_ratio exceeded). Is this due to
block allocations when the FS has to flush buffers that have delayed
allocation pending?

3. Once the number of dirty buffers approaches the dirty_ratio
threshold, the threads stall even more frequently than in case 2 above;
presumably they are drafted to flush the dirty buffers themselves?

4. I am very puzzled, as I can correlate some but not all stalls with
GC writes to the gc.log file. And I can't correlate the stalls with the
deferred writes from the app threads. These writes are append-only and
can take up to 700 msec to return, but the strace timestamps of such
writes do not line up with when the GC stalls happen.

And it's also unclear when GC "parks" or "quiesces" the threads. AFAIK,
it's a cooperative model in which app threads check (probably via a
futex) whether they need to park. What if one thread is stuck in the
kernel while, in the meantime, Eden needs compaction? The other threads
then cannot proceed until the "stuck" thread checks back and parks? (I
would think the JVM would signal the threads to "recall" them, but that
does not seem to be the case.)

Thanks again, and have a good weekend.

--Cuong




On Fri, Sep 13, 2013 at 11:36 AM, Theodore Ts'o <[email protected]> wrote:
> On Fri, Sep 13, 2013 at 08:25:30AM -0700, Cuong Tran wrote:
>> I did strace and some stalls can be correlated to GC thread writing to
>> the log but others are not.
>
> Something else which just came to mind --- depending on the kernel
> version which you are using, you may be getting hit by stable page
> writebacks. This is where a process tries to modify a mmap'ed page
> while it is being written back to disk. With stable page writeback,
> the an attempt to modify a page which is being written back to disk
> will cause a page fault, and that process will block until the page is
> written back to disk.
>
> This is a feature which is required for certain block devices such as
> iSCSI, SAS drives with DIF and RAID 5, since if the page changes out
> from under the writeback, the checksum (or parity strip in the case of
> RAID 5) of the page will be broken. However, it can cause random
> process stalls.
>
> Some of us (Google and Tao Bao) have hacked kernels which forcibly
> disables stable page writeback, because it causes latency problems.
> (Since we weren't using any of the block devices that require this
> feature, it was OK.) In the most recent kernels, there are some
> changes which will automatically disable stable page writebacks for
> those block devices that don't need it.
>
>> I turned on the jbd2_run_stats trace point, and this is sample trace
>> of creating and deleting 300 100MB files in a loop. (My desktop is
>> running low on disk space!). No Java test is running. I also turned on
>> jbd2_checkpoint_stats. Not sure if the count "running" in 5000s or
>> frequency of jbd2_checkpoint_stats has any relevance.
>>
>> <...>-534 [003] 64044.506552: jbd2_checkpoint_stats: dev
>> sda1 tid 5433920 chp_time 0 forced_to_close 0 written 0 dropped 6
>
> The check point time is zero, so we don't have any problems here.
> That's fine.
>
>> <...>-534 [003] 64044.506555: jbd2_run_stats: dev sda1
>> tid 5433921 wait 0 running 5003 locked 0 flushing 0 logging 13
>> handle_count 1587 blocks 6 blocks_logged 7
>
> The "run time" is the time between the previous transaction and when
> the current transaction starts committing. Five seconds (all of the
> times in jbd2_run_stats are in milliseconds) is the default commit
> time. This implies that the transaction was committed because of the
> commit timeout, and not because either (a) an fsync() or fdatasync()
> or (b) the number of changed blocks was too big, such that we forced
> the commit to start.
>
> So for this test, all of the jbd2_run_stats don't show anything
> interesting. We are spending 13-15ms writing a half-dozen or so
> blocks to the journal. The handle count is the number of individual
> file system operations in the commit, but all of these handles are
> touching only 6 metadata blocks. I'm guessing this means you are
> writing those 100MB files in relatively small chunks, which explains
> why there are 1500+ handles started but so few metadata blocks
> (primarily bitmap allocation blocks and inode table blocks) being
> modified in those 5 seconds.
>
>
>> <...>-534 [000] 64707.948801: jbd2_run_stats: dev sda1
>> tid 5434034 wait 0 running 5002 locked 0 flushing 0 logging 73
>> handle_count 8416 blocks 1475 blocks_logged 1480
>
> OK, this is a bit more typical; in this five second range, we modified
> 1475 metadata blocks across 8416 file system operations. Note that
> the commit operation only took a total of 73 milliseconds, though.
> (That's because the write to the journal is a sequential write, which
> tends to be fairly efficient for HDD's.)
>
> The bottom line is that it doesn't look like the journal commits is
> actually taking that long. I certainly don't see anything to explain
> GC stalls in the one second range.
>
> Regards,
>
> - Ted