Hi,
we're using ext4 on satellite receivers (e.g. the XTrend et9200) with attached hard disks. The receiver runs Linux with kernel version 3.8.7. Some users (like me) have problems with the first recording right after boot. Most of them have big partitions (1-2 TB) and high disk usage (over 50%). In this case the application signals a buffer overflow (its buffer is 4 MB). We found out that one of the first writes after boot or remount is very slow, and I have debugged it. The test case was a umount, then a mount, and then a write of 64 MB of data to the disk.
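In shell terms, roughly (device and mount point as on my box):
umount /hdd
mount /dev/sda1 /hdd
dd if=/dev/zero of=/hdd/test bs=64M count=1   # this first write is the slow one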
The problem is the initialization of the mballoc buddy cache, which takes very long (ext4_mb_init_group, reached via the ext4_mb_good_group path). In my case it loads around 1300 groups per second (with the patch that avoids loading full groups). My disk is quite full at the beginning, so around 8200 groups have to be read before a "good" one is found, which takes over 6 seconds. Here is the output:
May 10 02:06:15 et9x00 user.info kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator start time: 4284161251798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before for group loop cr: 0
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 0 time: 4284161318465
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4284161355983
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4284161440835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4284167134687
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4284167180243
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4284167198909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4284167212835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 1 time: 4284167260724
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 2 time: 4284167276576
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 3 time: 4284167291205
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 4 time: 4284167305798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 5 time: 4284167320280
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 6 time: 4284167334835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 7 time: 4284167349317
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8 time: 4284167363909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 9 time: 4284167378391
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 10 time: 4284167392872
...
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8240 time: 4290297430389
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290297464612
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290297521464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310304019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290310346352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290310363908
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310377834
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8241 time: 4290310425945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8242 time: 4290310443241
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8243 time: 4290310458352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8244 time: 4290310473204
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8245 time: 4290310488056
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8246 time: 4290310503167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8247 time: 4290310517982
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8248 time: 4290310533093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8249 time: 4290310547945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8250 time: 4290310562797
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8251 time: 4290310577871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8252 time: 4290310592945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8253 time: 4290310608019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8254 time: 4290310622871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8255 time: 4290310637723
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290310668278
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290310739464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310979093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290311058093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290311077538
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290311091167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator after good group; group: 8255 time: 4290311137649
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator end time: 4290311168501
Don't be surprised: I couldn't activate kernel tracing on the satellite box, perhaps because of two closed-source kernel modules which we need to use. Instead I added some ext4_msg statements in mballoc.c.
If I understood it right, this can also happen hours after the first write: there might be some free space at the beginning of the hard disk, and once it is consumed the initialization of the buddy cache continues with later groups...
For debugging I tested with a 64 MB write, which took more than 7 seconds (subsequent writes were much faster):
root@et9x00:~# dd if=/dev/zero of=/hdd/test.6434 bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (64.0MB) copied, 7.379178 seconds, 8.7MB/s
The application writes 188 KB blocks to disk, and there we see the same pattern: after a few quick writes right after boot or mount, some writes take several seconds.
We have real-time data (up to 2 MB per second per recording) which must be written within at most 2 or 3 seconds (depending on the bitrate of the channel); at 2 MB/s the 4 MB application buffer gives only about 2 seconds of slack, so a single stalled write overflows it. We cannot use a very big buffer in the application because of limited resources, which we cannot change.
So my questions:
- What can we do to avoid this (ideally without reformatting)?
- Why do you throw away the buddy cache instead of storing it on disk during umount? Initializing the cache is quite awful for applications that need a certain write throughput.
- A workaround would be to read the whole /proc/.../mb_groups file right after every mount. Correct?
- I could try to add a mount option that initializes the cache at mount time. Would you be interested in such a patch?
- I can see in the debug output that the call to ext4_wait_block_bitmap in mballoc.c line 848 takes the longest time during cache initialization (around 1/100 of a second each). Can this be improved?
Regards,
Frank
Hi Frank,
Consider using the bigalloc feature (requires a reformat), preallocating
space with fallocate, and using O_DIRECT for reads/writes. However, 188k
writes are too small for good throughput with O_DIRECT. You might also
want to adjust max_sectors_kb to something larger than 512k.
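For illustration, the corresponding commands might look like this (device, file name, and sizes are examples only; the mkfs step destroys all data):
# reformat with bigalloc; -C sets the cluster size in bytes (64 KB here)
mkfs.ext4 -O bigalloc -C 65536 /dev/sda1
# preallocate the recording file before streaming into it
fallocate -l 4G /hdd/recording.ts
# allow larger requests to the disk; value is in KB, capped by the hardware
echo 1024 > /sys/block/sda/queue/max_sectors_kb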
We're handling 6 input and 6 output 20 Mbps streams just fine.
Regards,
Andrei.
On 17.05.2013 09:51, [email protected] wrote:
> [original message quoted in full; snipped]
On Fri, May 17, 2013 at 06:51:23PM +0200, [email protected] wrote:
> - Why do you throw away the buddy cache instead of storing it on disk during umount? Initializing the cache is quite awful for applications that need a certain write throughput.
> - A workaround would be to read the whole /proc/.../mb_groups file right after every mount. Correct?
Simply adding "cat /proc/fs/ext4/<dev>/mb_groups > /dev/null" to one of
the /etc/init.d scripts, or to /etc/rc.local, is probably the simplest
fix, yes.
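For example, assuming the filesystem is on sda1 (as in your logs):
cat /proc/fs/ext4/sda1/mb_groups > /dev/null   # populates the buddy cache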
> - I could try to add a mount option that initializes the cache at mount time. Would you be interested in such a patch?
Given the simple nature of the above workaround, it's not obvious to
me that making file system format changes, or even adding a new mount
option, is really worth it. This is especially true given that mount -a
is sequential, so if there are a large number of big file systems, doing
this as a mount option would slow down the boot significantly. It would
be better to do this in parallel, which you can do much more easily in
userspace using the "cat /proc/fs/ext4/<dev>/mb_groups" workaround.
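A sketch of what that could look like in a boot script (assumes every mounted ext4 filesystem shows up under /proc/fs/ext4):
# warm the mballoc buddy cache for all ext4 filesystems, in parallel
for d in /proc/fs/ext4/*; do
    cat "$d/mb_groups" > /dev/null &
done
wait   # optional: let the reads finish before recordings start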
> - I can see in the debug output that the call to ext4_wait_block_bitmap in mballoc.c line 848 takes the longest time during cache initialization (around 1/100 of a second each). Can this be improved?
The delay is caused purely by I/O latency, so short of replacing the HDD
with an SSD, not really....
Regards,
- Ted
On 2013-05-19, at 8:00, Theodore Ts'o <[email protected]> wrote:
> On Fri, May 17, 2013 at 06:51:23PM +0200, [email protected] wrote:
>> - Why do you throw away the buddy cache instead of storing it on disk during umount? Initializing the cache is quite awful for applications that need a certain write throughput.
>> - A workaround would be to read the whole /proc/.../mb_groups file right after every mount. Correct?
>
> Simply adding "cat /proc/fs/ext4/<dev>/mb_groups > /dev/null" to one of
> the /etc/init.d scripts, or to /etc/rc.local, is probably the simplest
> fix, yes.
>
>> - I could try to add a mount option that initializes the cache at mount time. Would you be interested in such a patch?
>
> Given the simple nature of the above workaround, it's not obvious to
> me that making file system format changes, or even adding a new mount
> option, is really worth it. This is especially true given that mount -a
> is sequential, so if there are a large number of big file systems, doing
> this as a mount option would slow down the boot significantly. It would
> be better to do this in parallel, which you can do much more easily in
> userspace using the "cat /proc/fs/ext4/<dev>/mb_groups" workaround.
Since we already start a thread at mount time for inode table zeroing,
it would also be possible to co-opt this thread to preload the group
metadata from the bitmaps.
>> - I can see in the debug output that the call to ext4_wait_block_bitmap in mballoc.c line 848 takes the longest time during cache initialization (around 1/100 of a second each). Can this be improved?
>
> The delay is caused purely by I/O latency, so short of replacing the HDD
> with an SSD, not really....
Well, with a larger flex_bg factor at format time more bitmaps are
allocated together on disk, so fewer seeks are needed to load them
after a fresh mount. We use a flex_bg factor of 256 for this reason
on our very large storage targets.
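For reference, the factor is chosen at format time, e.g. (destroys data; the device name is an example):
mkfs.ext4 -G 256 /dev/sda1   # -G: number of block groups per flex group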
Cheers, Andreas
On Mon, May 20, 2013 at 12:39:50AM -0600, Andreas Dilger wrote:
>
> Since we already have a thread starting at mount time to check the
> inode table zeroing, it would also be possible to co-opt this thread
> for preloading the group metadata from the bitmaps.
True. Since I wrote my earlier post, I've also been considering the
possibility that e2fsck or the kernel should simply issue readahead
requests for all of the bitmap blocks. The advantage of doing it in
e2fsck is that it happens earlier.
In fact, since in e2fsck the prereads can be done in parallel, I was
even thinking about a scheme where e2fsck would synchronously force
all of the allocation bitmaps into the buffer cache, and then in the
kernel we could have a loop which checks whether the bitmap blocks are
already in cache and, if they are, initializes the buddy bitmap pages.
That way, even if subsequent memory pressure were to push the buddy
bitmap pages and allocation bitmaps out of the cache, all of the
ext4_group_info structures would be initialized, and just having the
bb_largest_free_order information would help things very much.
On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
> From my point of view (as an end user) I would prefer a built-in
> solution. I'm also a programmer, and I can therefore understand why
> you don't want to change anything.
It's not that I don't want to change anything; it's that I'm very
hesitant to add new mount options or new code paths that then need more
testing, unless there's no other way of addressing a particular use
case. Another consideration is how to do it in such a way that it
doesn't degrade other users' performance.
Issuing readahead requests for the bitmap blocks might be a good
compromise; being low-priority readahead requests, they won't interfere
with anything else going on, and in practice, unless you start your
video recording **immediately** after the reboot, it should address
your concern. (Also note that for most people willing to hack a DVR,
adding a line to /etc/rc.local is usually considered easier than
building a new kernel from source, let alone making file system format
changes that require reformatting their data disk!)
So it's not that I'm against solutions that involve kernel changes or
file system format changes. It's just that I want to make sure we
explore the entire solution space, since some of the solutions suggested
so far carry costs in terms of testing, the need for a
backup-reformat-restore pass, and so on.
Regards,
- Ted
On 5/20/13 1:39 AM, Andreas Dilger wrote:
> On 2013-05-19, at 8:00, Theodore Ts'o <[email protected]> wrote:
>> [earlier messages quoted; snipped]
>
> Since we already start a thread at mount time for inode table zeroing,
> it would also be possible to co-opt this thread to preload the group
> metadata from the bitmaps.
Only up to a point, I hope; if the fs is so big that you start dropping
the first groups that were read, it'd be pointless. So it'd need some
nuance, at the very least.
How much memory are you willing to dedicate to this, and how much does
it really help long-term, given that it's not pinned in any way?
As long as we don't have efficiently searchable on-disk free-space
info, I'm afraid anything else is just a workaround.
-Eric