2013-02-26 20:39:44

by Theodore Ts'o

[permalink] [raw]
Subject: [GIT PULL] ext4 updates for 3.9


The following changes since commit 9931faca02c604c22335f5a935a501bb2ace6e20:

Linux 3.8-rc3 (2013-01-09 18:59:55 -0800)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 304e220f0879198b1f5309ad6f0be862b4009491:

ext4: fix free clusters calculation in bigalloc filesystem (2013-02-22 15:27:52 -0500)

----------------------------------------------------------------
The one new feature added in this patch series is the ability to use
the "punch hole" functionality for inodes that are not using extent
maps.

In the bug fix category, we fixed some races in the AIO and fstrim
code, and some potential NULL pointer dereferences and memory leaks in
error handling code paths.

In the optimization category, we fixed a performance regression in the
jbd2 layer introduced by commit d9b0193 (introduced in v3.0) which
shows up in the AIM7 benchmark. We also further optimized jbd2 by
minimize the amount of time that transaction handles are held active.

This patch series also features some additional enhancement of the
extent status tree, which is now used to cache extent information in a
more efficient/compact form than what we use on-disk.

----------------------------------------------------------------
Akria Fujita (1):
ext4: fix smatch warning in move_extent.c's mext_replace_branches()

Chen Gang (1):
ext4: fix memory leak when quota options are specified multiple times

Cong Ding (1):
ext4: reduce one "if" comparison in ext4_dirhash()

Eric Sandeen (1):
jbd2: don't wake kjournald unnecessarily

Eryu Guan (2):
ext4: check bh in ext4_read_block_bitmap()
ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()

Guo Chao (4):
ext4: release buffer when checksum failed
ext4: remove unused variable in add_dirent_to_buf()
ext4: remove useless assignment in dx_probe()
ext4: remove unnecessary NULL pointer check

Jan Kara (13):
ext4: Always use ext4_bio_write_page() for writeout
ext4: use redirty_page_for_writepage() in ext4_bio_write_page()
ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()
ext4: move work from io_end to inode
ext4: simplify list handling in ext4_do_flush_completed_IO()
ext4: dirty page has always buffers attached
ext4: simplify mpage_add_bh_to_extent()
ext4: Make ext4_bio_writepage() handle unprepared buffers
ext4: fix ext4_writepage() to achieve data=ordered guarantees
ext4: fix possible use-after-free with AIO
ext4: move several mount options to standard handling loop
ext4: make mount option parsing loop more logical
ext4: print error when argument of inode_readahead_blk is invalid

Julia Lawall (1):
ext4: use WARN in ext4_alloc_blocks

Lukas Czerner (4):
ext4: remove unused variable flags
ext4: remove explicit WARN_ON when ext4_map_blocks() fails
ext4: fix xattr block allocation/release with bigalloc
ext4: fix free clusters calculation in bigalloc filesystem

Niu Yawei (1):
ext4: fix race in ext4_mb_add_n_trim()

Theodore Ts'o (24):
ext4: return ENOMEM if sb_getblk() fails
ext4: trigger the lazy inode table initialization after resize
ext4: release sysfs kobject when failing to enable quotas on mount
quota: autoload the quota_v2 module for QFMT_VFS_V1 quota format
ext4: check incompatible mount options while mounting ext2/3
ext4: optimize mballoc for large allocations
jbd2: track request delay statistics
jbd2: revert "jbd2: add COW fields to struct jbd2_journal_handle"
jbd2: add tracepoints which provide per-handle statistics
ext4: move the jbd2 wrapper functions out of super.c
ext4: pass context information to jbd2__journal_start()
ext4: grab page before starting transaction handle in write_begin()
ext4: start handle at the last possible moment in ext4_unlink()
ext4: start handle at the last possible moment in ext4_rmdir()
ext4: fix the number of credits needed for ext4_ext_migrate()
ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()
ext4: fix the number of credits needed for acl ops with inline data
ext4: start handle at the last possible moment when creating inodes
ext4: use module parameters instead of debugfs for mballoc_debug
jbd2: use module parameters instead of debugfs for jbd_debug
ext4: use KERN_WARNING for warning messages
ext4: add debugging context for warning in ext4_da_update_reserve_space()
ext4: refactor code to read directory blocks into ext4_read_dirblock()
ext4: use ERR_PTR() abstraction for ext4_append()

Wang Shilong (1):
ext4: use unlikely to improve the efficiency of the kernel

Zheng Liu (11):
ext4: add tracepoint in punching hole
ext4: add punching hole support for non-extent-mapped files
ext4: refine extent status tree
ext4: add physical block and status member into extent status tree
ext4: rename and improbe ext4_es_find_extent()
ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
ext4: track all extent status in extent status tree
ext4: lookup block mapping in extent status tree
ext4: remove single extent cache
ext4: adjust some functions for reclaiming extents from extent status tree
ext4: reclaim extents from extent status tree

fs/ext4/acl.c | 7 +-
fs/ext4/balloc.c | 13 +-
fs/ext4/dir.c | 1 +
fs/ext4/ext4.h | 123 ++++++--
fs/ext4/ext4_extents.h | 6 -
fs/ext4/ext4_jbd2.c | 102 +++++++
fs/ext4/ext4_jbd2.h | 51 +++-
fs/ext4/extents.c | 312 ++++++++-------------
fs/ext4/extents_status.c | 631 ++++++++++++++++++++++++++++++-----------
fs/ext4/extents_status.h | 83 +++++-
fs/ext4/file.c | 16 +-
fs/ext4/hash.c | 6 +-
fs/ext4/ialloc.c | 29 +-
fs/ext4/indirect.c | 259 ++++++++++++++++-
fs/ext4/inline.c | 12 +-
fs/ext4/inode.c | 664 ++++++++++++++++++++------------------------
fs/ext4/ioctl.c | 13 +-
fs/ext4/mballoc.c | 69 ++---
fs/ext4/mballoc.h | 4 +-
fs/ext4/migrate.c | 15 +-
fs/ext4/mmp.c | 4 +-
fs/ext4/move_extent.c | 10 +-
fs/ext4/namei.c | 497 +++++++++++++++------------------
fs/ext4/page-io.c | 85 ++----
fs/ext4/resize.c | 36 +--
fs/ext4/super.c | 485 ++++++++++++++------------------
fs/ext4/xattr.c | 23 +-
fs/ext4/xattr.h | 68 -----
fs/jbd2/commit.c | 8 +
fs/jbd2/journal.c | 66 ++---
fs/jbd2/transaction.c | 29 +-
include/linux/jbd2.h | 44 ++-
include/linux/quota.h | 1 +
include/trace/events/ext4.h | 214 +++++++++++---
include/trace/events/jbd2.h | 106 ++++++-
35 files changed, 2408 insertions(+), 1684 deletions(-)


2013-02-27 12:47:27

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.26 at 15:39 -0500, Theodore Ts'o wrote:
>
> The following changes since commit 9931faca02c604c22335f5a935a501bb2ace6e20:
>
> Linux 3.8-rc3 (2013-01-09 18:59:55 -0800)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

Just booted todays Linux tree and got the following errors:

...
Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
...
Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120

/dev/sda ext4 1.4T 655G 651G 51% /var
/dev/sda on /var type ext4 (rw,noatime,data=ordered)

Running "fsck.ext4 -f /dev/sda" shows no problems.

--
Markus

2013-02-27 15:34:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
> Just booted todays Linux tree and got the following errors:
>
> ...
> Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
> ...
> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120

Is this reproducible? This looks like the in-memory copy of the
directory got corrupted. This could be caused by a hardware error, or
a wild pointer, or a bug in the buffer cache code, etc. Since there
are so many different possible causes of this kind of complaint, we
really need some kind of reproducible test case to do anything with this.

I did do a test compile of the ext4 tree with the latest linus.git
tree merged in, and ran a full set of repgression tests before I sent
my pull request. Now, the regression tests take over 14 hours to run,
and there is a delay between when a maintainer sends the pull request
to when Linus acts on it --- so Linus almost certainly pulled in some
other trees betewen when I did my final regression testing and when I
sent the pull request and he pulled it into my tree.

I'll see if I can reproduce this on my end, on Linus's tree after the
ext4 tree was merged in, but at least in the past, this is the sort of
thing that is almost certainly caused by a hardware failure or bug
somewhere in the device driver, mm, or buffer cache, given that the
directory looks completely insane and a subsequent e2fsck -f didn't
discover any problem.

Is there anything special about your system? How much memory do you
have? What kind of device is /dev/sda? What sort of workload did you
have running on your system before the failure?

Also, can you send us the output of "debugfs -R "stat <70647809>"
/dev/sda" so I can confirm that block 14164000 really is assigned to
inode 70647809? The one potential cause of this error I can think of
that might be related to recent changes in ext4 is if the extent
status tree had the wrong logical-to-physical mapping cached for the
directory inode.

Regards,

- Ted

2013-02-27 15:44:54

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
> > Just booted todays Linux tree and got the following errors:
> >
> > ...
> > Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
> > ...
> > Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
> > Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
>
> Is this reproducible?

Haven't checked yet, because I got scared.

> Is there anything special about your system? How much memory do you
> have? What kind of device is /dev/sda? What sort of workload did you
> have running on your system before the failure?

There is nothing special about my system. /dev/sda is a standard SATA
drive:
Model Family: Seagate Barracuda Green (AF)
Device Model: ST1500DL003-9VT16L
The error happens during boot.
I use ECC memory.

> Also, can you send us the output of "debugfs -R "stat <70647809>"
> /dev/sda" so I can confirm that block 14164000 really is assigned to
> inode 70647809? The one potential cause of this error I can think of
> that might be related to recent changes in ext4 is if the extent
> status tree had the wrong logical-to-physical mapping cached for the
> directory inode.

Inode: 70647809 Type: directory Mode: 0755 Flags: 0x80000
Generation: 4138624941 Version: 0x00000000:0000000d
User: 0 Group: 0 Size: 4096
File ACL: 0 Directory ACL: 0
Links: 13 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50ab80de:2c77d5bc -- Tue Nov 20 14:08:46 2012
atime: 0x50aa5519:c556886c -- Mon Nov 19 16:49:45 2012
mtime: 0x50ab80de:2c77d5bc -- Tue Nov 20 14:08:46 2012
crtime: 0x50aa4a81:06bc6fd0 -- Mon Nov 19 16:04:33 2012
Size of extra inode fields: 28
EXTENTS:
(0):282599456

--
Markus

2013-02-27 17:01:57

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.27 at 16:44 +0100, Markus Trippelsdorf wrote:
> On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
> > On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
> > > Just booted todays Linux tree and got the following errors:
> > >
> > > ...
> > > Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
> > > ...
> > > Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
> > > Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
> >
> > Is this reproducible?

The issue is caused by:

commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
Author: Zheng Liu <[email protected]>
Date: Mon Feb 18 00:29:59 2013 -0500

ext4: lookup block mapping in extent status tree

--
Markus

2013-02-27 17:10:15

by Zheng Liu

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

?? 2013-2-28??????1:01??Markus Trippelsdorf <[email protected]> д????

> On 2013.02.27 at 16:44 +0100, Markus Trippelsdorf wrote:
>> On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
>>> On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
>>>> Just booted todays Linux tree and got the following errors:
>>>>
>>>> ...
>>>> Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
>>>> ...
>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: ignore#70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
>>>
>>> Is this reproducible?
>
> The issue is caused by:
>
> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> Author: Zheng Liu <[email protected]>
> Date: Mon Feb 18 00:29:59 2013 -0500
>
> ext4: lookup block mapping in extent status tree

Hi Markus,

Thanks for the report. I am very sorry about that. Now I am trying to fix it.

Thanks,
- Zheng-

2013-02-27 17:22:56

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.28 at 01:10 +0800, gnehzuil.liu wrote:
> 在 2013-2-28,上午1:01,Markus Trippelsdorf <[email protected]> 写道:
>
> > On 2013.02.27 at 16:44 +0100, Markus Trippelsdorf wrote:
> >> On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
> >>> On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
> >>>> Just booted todays Linux tree and got the following errors:
> >>>>
> >>>> ...
> >>>> Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
> >>>> ...
> >>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
> >>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: ignore#70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
> >>>
> >>> Is this reproducible?
> >
> > The issue is caused by:
> >
> > commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> > Author: Zheng Liu <[email protected]>
> > Date: Mon Feb 18 00:29:59 2013 -0500
> >
> > ext4: lookup block mapping in extent status tree
>
> Thanks for the report. I am very sorry about that. Now I am trying to fix it.
>
Thank you.

Please note that if I run a kernel build with:
"git reset --hard f7fec032aa782d3fd7e51fbdf08aa3a296c01500"
(the commit before d100eef24) I get a different but similar error:

EXT4-fs (sda): error count: 4
EXT4-fs (sda): initial error at 1361983458: ext4_find_dest_de:1657: inode 70911401: block 15213579
EXT4-fs (sda): last error at 1361983663: ext4_find_dest_de:1658: inode 70647809: block 14164000

--
Markus

2013-02-27 17:38:13

by Zheng Liu

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

?? 2013-2-28??????1:22??Markus Trippelsdorf <[email protected]> д????

> On 2013.02.28 at 01:10 +0800, gnehzuil.liu wrote:
>> ?? 2013-2-28??????1:01??Markus Trippelsdorf <[email protected]> д????
>>
>>> On 2013.02.27 at 16:44 +0100, Markus Trippelsdorf wrote:
>>>> On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
>>>>> On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
>>>>>> Just booted todays Linux tree and got the following errors:
>>>>>>
>>>>>> ...
>>>>>> Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
>>>>>> ...
>>>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
>>>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: ignore#70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
>>>>>
>>>>> Is this reproducible?
>>>
>>> The issue is caused by:
>>>
>>> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
>>> Author: Zheng Liu <[email protected]>
>>> Date: Mon Feb 18 00:29:59 2013 -0500
>>>
>>> ext4: lookup block mapping in extent status tree
>>
>> Thanks for the report. I am very sorry about that. Now I am trying to fix it.
>>
> Thank you.
>
> Please note that if I run a kernel build with:
> "git reset --hard f7fec032aa782d3fd7e51fbdf08aa3a296c01500"
> (the commit before d100eef24) I get a different but similar error:
>
> EXT4-fs (sda): error count: 4
> EXT4-fs (sda): initial error at 1361983458: ext4_find_dest_de:1657: inode 70911401: block 15213579
> EXT4-fs (sda): last error at 1361983663: ext4_find_dest_de:1658: inode 70647809: block 14164000

Yup, thanks for the note. Could you please revert to this commit a25a4e1a5d5dc0f97dddbca44e695c532d8228c1 (the commit before f7fec032)? I suspect that commit f7fec032 is the root cause. Thanks for your help.

Regards,
- Zheng-

2013-02-27 17:45:53

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.28 at 01:38 +0800, gnehzuil.liu wrote:
> 在 2013-2-28,上午1:22,Markus Trippelsdorf <[email protected]> 写道:
>
> > On 2013.02.28 at 01:10 +0800, gnehzuil.liu wrote:
> >> 在 2013-2-28,上午1:01,Markus Trippelsdorf <[email protected]> 写道:
> >>
> >>> On 2013.02.27 at 16:44 +0100, Markus Trippelsdorf wrote:
> >>>> On 2013.02.27 at 10:34 -0500, Theodore Ts'o wrote:
> >>>>> On Wed, Feb 27, 2013 at 01:47:27PM +0100, Markus Trippelsdorf wrote:
> >>>>>> Just booted todays Linux tree and got the following errors:
> >>>>>>
> >>>>>> ...
> >>>>>> Feb 27 13:33:31 x4 kernel: EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
> >>>>>> ...
> >>>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: inode #70647809: block 14164000: comm cupsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2569822761, rec_len=3837, name_len=1
> >>>>>> Feb 27 13:33:32 x4 kernel: EXT4-fs error (device sda): ext4_find_dest_de:1657: ignore#70911401: block 15213579: comm pdnsd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=2000846358, rec_len=36782, name_len=120
> >>>>>
> >>>>> Is this reproducible?
> >>>
> >>> The issue is caused by:
> >>>
> >>> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> >>> Author: Zheng Liu <[email protected]>
> >>> Date: Mon Feb 18 00:29:59 2013 -0500
> >>>
> >>> ext4: lookup block mapping in extent status tree
> >>
> >> Thanks for the report. I am very sorry about that. Now I am trying to fix it.
> >>
> > Thank you.
> >
> > Please note that if I run a kernel build with:
> > "git reset --hard f7fec032aa782d3fd7e51fbdf08aa3a296c01500"
> > (the commit before d100eef24) I get a different but similar error:
> >
> > EXT4-fs (sda): error count: 4
> > EXT4-fs (sda): initial error at 1361983458: ext4_find_dest_de:1657: inode 70911401: block 15213579
> > EXT4-fs (sda): last error at 1361983663: ext4_find_dest_de:1658: inode 70647809: block 14164000
>
> Yup, thanks for the note. Could you please revert to this commit
> a25a4e1a5d5dc0f97dddbca44e695c532d8228c1 (the commit before f7fec032)?
> I suspect that commit f7fec032 is the root cause. Thanks for your
> help.

git revert 06b0c886214a223dde7b21cbfc3008fd20a8ce16..74cd15cd02708c7188581f279f33a98b2ae8d322
fixes all issues...

--
Markus

2013-02-27 17:52:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 9:45 AM, Markus Trippelsdorf
<[email protected]> wrote:
>
> git revert 06b0c886214a223dde7b21cbfc3008fd20a8ce16..74cd15cd02708c7188581f279f33a98b2ae8d322
> fixes all issues...

Hmm. I'm hoping this will have a quick resolution that doesn't mean
having to revert all that. But it's good to have the option.

Zheng, Ted, I'm holding up further merging (ugh, second time this
merge window) in the hope that we can get this sorted out quickly. I
hate continuing to merge stuff when there are known nasty issues
pending in my tree.

I can't hold for all that long, but it would be really nice if this
got sorted out quickly.

Linus

2013-02-27 18:49:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

Markus, Dave, can you confirm that this fixes your problem?

Thanks!!

(Sigh, this is a real brown paper bug; I'm embarassed I missed this in
my code review.)

- Ted

>From f47f0d11096ca5d9e1965d8a9f266aa13fe2b73b Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Wed, 27 Feb 2013 13:47:52 -0500
Subject: [PATCH] ext4: fix extent status tree regression for file systems >
512GB

This fixes a regression introduced by commit f7fec032aa782. The
problem was that the extents status flags caused us to mask out block
numbers smaller than 2**28 blocks. Since we didn't test with file
systems smaller than 512GB, we didn't notice this during the
development cycle.

A typical failure looks like this:

EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0

... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the
inode has block number 688923213. When viewed in hex, block number
152052301 (from the syslog) is 0x910224D, while block number 688923213
is 0x2910224D. Note the missing "0x20000000" in the block number.

Reported-by: Markus Trippelsdorf <[email protected]>
Reported-by: Dave Jones <[email protected]>
Cc: Zheng Liu <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents_status.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index cf83e77..c795ff6 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -20,10 +20,10 @@
#define es_debug(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-#define EXTENT_STATUS_WRITTEN 0x80000000 /* written extent */
-#define EXTENT_STATUS_UNWRITTEN 0x40000000 /* unwritten extent */
-#define EXTENT_STATUS_DELAYED 0x20000000 /* delayed extent */
-#define EXTENT_STATUS_HOLE 0x10000000 /* hole */
+#define EXTENT_STATUS_WRITTEN (((unsigned long long) 1) << 63)
+#define EXTENT_STATUS_UNWRITTEN (((unsigned long long) 1) << 62)
+#define EXTENT_STATUS_DELAYED (((unsigned long long) 1) << 61)
+#define EXTENT_STATUS_HOLE (((unsigned long long) 1) << 60)

#define EXTENT_STATUS_FLAGS (EXTENT_STATUS_WRITTEN | \
EXTENT_STATUS_UNWRITTEN | \
@@ -58,22 +58,22 @@ extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,

static inline int ext4_es_is_written(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_WRITTEN);
+ return (es->es_pblk & EXTENT_STATUS_WRITTEN) != 0;
}

static inline int ext4_es_is_unwritten(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
+ return (es->es_pblk & EXTENT_STATUS_UNWRITTEN) != 0;
}

static inline int ext4_es_is_delayed(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_DELAYED);
+ return (es->es_pblk & EXTENT_STATUS_DELAYED) != 0;
}

static inline int ext4_es_is_hole(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_HOLE);
+ return (es->es_pblk & EXTENT_STATUS_HOLE) != 0;
}

static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
--
1.7.12.rc0.22.gcdd159b

2013-02-27 18:56:25

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:

> Markus, Dave, can you confirm that this fixes your problem?

Yes, it fixes the issue.
Thanks Ted.

--
Markus

2013-02-27 18:57:55

by Dave Jones

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 01:49:12PM -0500, Theodore Ts'o wrote:
> Markus, Dave, can you confirm that this fixes your problem?
>
> Thanks!!
>
> (Sigh, this is a real brown paper bug; I'm embarassed I missed this in
> my code review.)

Building now. Can you confirm that nothing on-disk should be awry ?
Or will I need a new fsck to detect what happened ?

Dave

2013-02-27 18:59:00

by Zheng Liu

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

Hi Ted,

On 02/28/2013 02:49 AM, Theodore Ts'o wrote:
> Markus, Dave, can you confirm that this fixes your problem?
>
> Thanks!!
>
> (Sigh, this is a real brown paper bug; I'm embarassed I missed this in
> my code review.)

Sorry, I don't have a big disk in my hand now. So I could reproduce it.
But the patch looks good. Thanks for fixing it.

Regards,
- Zheng

>
> - Ted
>
> From f47f0d11096ca5d9e1965d8a9f266aa13fe2b73b Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Wed, 27 Feb 2013 13:47:52 -0500
> Subject: [PATCH] ext4: fix extent status tree regression for file systems >
> 512GB
>
> This fixes a regression introduced by commit f7fec032aa782. The
> problem was that the extents status flags caused us to mask out block
> numbers smaller than 2**28 blocks. Since we didn't test with file
> systems smaller than 512GB, we didn't notice this during the
> development cycle.
>
> A typical failure looks like this:
>
> EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
> 152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
> offset=0(0), inode=0, rec_len=0, name_len=0
>
> ... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the
> inode has block number 688923213. When viewed in hex, block number
> 152052301 (from the syslog) is 0x910224D, while block number 688923213
> is 0x2910224D. Note the missing "0x20000000" in the block number.
>
> Reported-by: Markus Trippelsdorf <[email protected]>
> Reported-by: Dave Jones <[email protected]>
> Cc: Zheng Liu <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/extents_status.h | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index cf83e77..c795ff6 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -20,10 +20,10 @@
> #define es_debug(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
> #endif
>
> -#define EXTENT_STATUS_WRITTEN 0x80000000 /* written extent */
> -#define EXTENT_STATUS_UNWRITTEN 0x40000000 /* unwritten extent */
> -#define EXTENT_STATUS_DELAYED 0x20000000 /* delayed extent */
> -#define EXTENT_STATUS_HOLE 0x10000000 /* hole */
> +#define EXTENT_STATUS_WRITTEN (((unsigned long long) 1) << 63)
> +#define EXTENT_STATUS_UNWRITTEN (((unsigned long long) 1) << 62)
> +#define EXTENT_STATUS_DELAYED (((unsigned long long) 1) << 61)
> +#define EXTENT_STATUS_HOLE (((unsigned long long) 1) << 60)
>
> #define EXTENT_STATUS_FLAGS (EXTENT_STATUS_WRITTEN | \
> EXTENT_STATUS_UNWRITTEN | \
> @@ -58,22 +58,22 @@ extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
>
> static inline int ext4_es_is_written(struct extent_status *es)
> {
> - return (es->es_pblk & EXTENT_STATUS_WRITTEN);
> + return (es->es_pblk & EXTENT_STATUS_WRITTEN) != 0;
> }
>
> static inline int ext4_es_is_unwritten(struct extent_status *es)
> {
> - return (es->es_pblk & EXTENT_STATUS_UNWRITTEN);
> + return (es->es_pblk & EXTENT_STATUS_UNWRITTEN) != 0;
> }
>
> static inline int ext4_es_is_delayed(struct extent_status *es)
> {
> - return (es->es_pblk & EXTENT_STATUS_DELAYED);
> + return (es->es_pblk & EXTENT_STATUS_DELAYED) != 0;
> }
>
> static inline int ext4_es_is_hole(struct extent_status *es)
> {
> - return (es->es_pblk & EXTENT_STATUS_HOLE);
> + return (es->es_pblk & EXTENT_STATUS_HOLE) != 0;
> }
>
> static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
>

2013-02-27 19:04:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 01:57:55PM -0500, Dave Jones wrote:
> Building now. Can you confirm that nothing on-disk should be awry ?
> Or will I need a new fsck to detect what happened ?
>

Well, it's possible that a read from data file which had blocks
located above 512GB might have gotten bogus information, or a block
allocated above 512GB might result in a write to the wrong place on
disk.

So if you are very cautious, running fsck just to make sure things are
OK is not a bad idea. But it's likely that the directory sanity
checks would have caught things quickly, or trying run an executable
would have caused a seg fault quickly enough. If your system crashed
very quickly during the boot process, you'll probably be OK.

- Ted

2013-02-27 19:06:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 01:49:12PM -0500, Theodore Ts'o wrote:
> -#define EXTENT_STATUS_WRITTEN 0x80000000 /* written extent */
> -#define EXTENT_STATUS_UNWRITTEN 0x40000000 /* unwritten extent */
> -#define EXTENT_STATUS_DELAYED 0x20000000 /* delayed extent */
> -#define EXTENT_STATUS_HOLE 0x10000000 /* hole */
> +#define EXTENT_STATUS_WRITTEN (((unsigned long long) 1) << 63)
> +#define EXTENT_STATUS_UNWRITTEN (((unsigned long long) 1) << 62)
> +#define EXTENT_STATUS_DELAYED (((unsigned long long) 1) << 61)
> +#define EXTENT_STATUS_HOLE (((unsigned long long) 1) << 60)

Just a nitpick:

1ULL << ...

is shorter and probably the standard way we do flags.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-02-27 19:11:15

by Dave Jones

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 02:04:19PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 01:57:55PM -0500, Dave Jones wrote:
> > Building now. Can you confirm that nothing on-disk should be awry ?
> > Or will I need a new fsck to detect what happened ?
> >
>
> Well, it's possible that a read from data file which had blocks
> located above 512GB might have gotten bogus information, or a block
> allocated above 512GB might result in a write to the wrong place on
> disk.
>
> So if you are very cautious, running fsck just to make sure things are
> OK is not a bad idea. But it's likely that the directory sanity
> checks would have caught things quickly, or trying run an executable
> would have caused a seg fault quickly enough. If your system crashed
> very quickly during the boot process, you'll probably be OK.

It had been up a few hours before I hit those checks.

fsck never found anything. I do have another disk (XFS formatted) with
a backup from a week ago. I'll run a --dry-run rsync to see if it
picks up any changes that I don't expect.

Dave

2013-02-27 19:19:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 02:11:15PM -0500, Dave Jones wrote:
>
> fsck never found anything. I do have another disk (XFS formatted) with
> a backup from a week ago. I'll run a --dry-run rsync to see if it
> picks up any changes that I don't expect.

Probably a good idea.

Given that Markus reports that this fixes the problem for him, and
regression smoke test has passed, I'll get a cleaned up patch ready
for Linus to pull ASAP.

My deep apologies that we didn't catch this earlier. i'm going to
think about how we can improve our testing regime to make sure
problems like this get caught earlier.

- Ted

2013-02-27 19:19:23

by Dave Jones

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 07:56:25PM +0100, Markus Trippelsdorf wrote:
> On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:
>
> > Markus, Dave, can you confirm that this fixes your problem?
>
> Yes, it fixes the issue.

Looks like it's fixed here too.

How did this make it through -next without anyone hitting it ?

I can't remember how many years ago I last bought a disk < 1TB,
and I can't be alone. Or is everyone all about SSDs these days?

Is anyone running xfstests or similar on linux-next regularly ?

Dave

2013-02-27 19:27:05

by Zheng Liu

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 02/28/2013 03:19 AM, Dave Jones wrote:
> On Wed, Feb 27, 2013 at 07:56:25PM +0100, Markus Trippelsdorf wrote:
> > On 2013.02.27 at 13:49 -0500, Theodore Ts'o wrote:
> >
> > > Markus, Dave, can you confirm that this fixes your problem?
> >
> > Yes, it fixes the issue.
>
> Looks like it's fixed here too.
>
> How did this make it through -next without anyone hitting it ?

Sorry, my apology.

>
> I can't remember how many years ago I last bought a disk < 1TB,
> and I can't be alone. Or is everyone all about SSDs these days?

I admit that I run xfstests on a SSD that has 128G.

>
> Is anyone running xfstests or similar on linux-next regularly ?

Ted has a dev branch for -next, and I work on this branch.

Regards,
- Zheng

2013-02-27 19:29:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
>
> Looks like it's fixed here too.
>
> How did this make it through -next without anyone hitting it ?
>
> I can't remember how many years ago I last bought a disk < 1TB,
> and I can't be alone. Or is everyone all about SSDs these days?

I use LVM, so I have a number of volues which are smaler than 512GB,
but very few which are actually larger than 1TB. And none on my test
boxes. I was running the bleeding edge ext4 code on my laptop as for
dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
come any bigger, alas).

> Is anyone running xfstests or similar on linux-next regularly ?

I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
before I submitted a pull request. The problem is that XFSTESTS is
S-L-O-W if you use large partitions, so typically I use a 5GB
partition sizes for my test runs. Normally we're worried about race
condition bugs, not something as bone-headed as a bitmasking problem,
so it makes sense to use a smaller disk for most of your testing.
(Some folks do their xfstests run on SSD's or tmpfs image files, again
for speed reasons, and it's unlikely they would be big enough.)

So what we probably need to do is to have a separate set of tests
using a loopback mount, and perhaps an artificially created file
system which has a large percentage of the blocks in the middle of the
file system busied out, to make efficient testing of these sorts of
bugs more efficient. As I said, I'm thinking about how's the best way
to improve our testing regime to catch such problems the next time around.

Cheers,

- Ted

2013-02-27 20:12:17

by Linus Torvalds

[permalink] [raw]
Subject: [GIT PULL URGENT] ext4 regression fix for 3.9

The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:

ext4: fix free clusters calculation in bigalloc filesystem (2013-02-22 15:27:52 -0500)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 8e919d13048cd5acaadb2b15b48acbfb8832d3c2:

ext4: fix extent status tree regression for file systems > 512GB (2013-02-27 14:54:37 -0500)

----------------------------------------------------------------
This fixes a real brown paper bag bug which causes ext4 to choke on
file systems larger than 512GB.

----------------------------------------------------------------
Theodore Ts'o (1):
ext4: fix extent status tree regression for file systems > 512GB

fs/ext4/extents_status.h | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

2013-02-27 20:14:37

by Theodore Ts'o

[permalink] [raw]
Subject: [GIT PULL URGENT] ext4 regression fix for 3.9

The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:

ext4: fix free clusters calculation in bigalloc filesystem (2013-02-22 15:27:52 -0500)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 8e919d13048cd5acaadb2b15b48acbfb8832d3c2:

ext4: fix extent status tree regression for file systems > 512GB (2013-02-27 14:54:37 -0500)

----------------------------------------------------------------
This fixes a real brown paper bag bug which causes ext4 to choke on
file systems larger than 512GB.

----------------------------------------------------------------
Theodore Ts'o (1):
ext4: fix extent status tree regression for file systems > 512GB

fs/ext4/extents_status.h | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

2013-02-27 20:15:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Wed, Feb 27, 2013 at 03:12:17PM -0500, Linus Torvalds wrote:
> The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:

Obviously, this wasn't from Linus. I screwed up on sending the pull
request. Sigh...

- Ted

2013-02-27 20:23:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Wed, Feb 27, 2013 at 12:15 PM, Theodore Ts'o <[email protected]> wrote:
> On Wed, Feb 27, 2013 at 03:12:17PM -0500, Linus Torvalds wrote:
>> The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:
>
> Obviously, this wasn't from Linus. I screwed up on sending the pull
> request. Sigh...

That looked really odd in my inbox for a while ;)

Linus "the *real* one" Torvalds

2013-02-27 20:41:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Wed, Feb 27, 2013 at 12:23:18PM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2013 at 12:15 PM, Theodore Ts'o <[email protected]> wrote:
> > On Wed, Feb 27, 2013 at 03:12:17PM -0500, Linus Torvalds wrote:
> >> The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:
> >
> > Obviously, this wasn't from Linus. I screwed up on sending the pull
> > request. Sigh...
>
> That looked really odd in my inbox for a while ;)
>
> Linus "the *real* one" Torvalds

He doesn't trust anyone - he sends pull requests to himself!

:-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-02-27 20:58:55

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, 27 Feb 2013 14:29:07 -0500, Theodore Ts'o <[email protected]> wrote:
> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> >
> > Looks like it's fixed here too.
> >
> > How did this make it through -next without anyone hitting it ?
> >
> > I can't remember how many years ago I last bought a disk < 1TB,
> > and I can't be alone. Or is everyone all about SSDs these days?
>
> I use LVM, so I have a number of volues which are smaler than 512GB,
> but very few which are actually larger than 1TB. And none on my test
> boxes. I was running the bleeding edge ext4 code on my laptop as for
> dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
> my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
> come any bigger, alas).
>
> > Is anyone running xfstests or similar on linux-next regularly ?
>
> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
> before I submitted a pull request. The problem is that XFSTESTS is
> S-L-O-W if you use large partitions, so typically I use a 5GB
Indeed. That's why i give-up rotated disks and run xfstest only on SSD
or brd module
> partition sizes for my test runs. Normally we're worried about race
> condition bugs, not something as bone-headed as a bitmasking problem,
> so it makes sense to use a smaller disk for most of your testing.
> (Some folks do their xfstests run on SSD's or tmpfs image files, again
> for speed reasons, and it's unlikely they would be big enough.)
>
> So what we probably need to do is to have a separate set of tests
> using a loopback mount, and perhaps an artificially created file
> system which has a large percentage of the blocks in the middle of the
> file system busied out, to make efficient testing of these sorts of
> bugs more efficient. As I said, I'm thinking about how's the best way
> to improve our testing regime to catch such problems the next time around.
Amazing idea. Something like:

#dd if=/dev/zero of=/tmp/fs.img bs=1M seek=2000000 count=1
#mkfs.ext4 -m0 -i4096000 /tmp/fs.img
#mount /tmp/fs.img /mnt/ -oloop
#for ((i=0; i < 2000; i++));do fallocate -l $((1024*1024*1024)) /mnt/f$i ;done
#for ((i=0; i < 2000; i++));do truncate -s $((1023*1024*1024)) /mnt/f$i ;done

As result file system image has 2gb of free space wich is fragmented to ~2000
chunks 1Mb each. But image itself is quite small
# df /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0 2047678076 2045679228 1998848 100% /mnt
# du -sch /tmp/fs.img
242M /tmp/fs.img
242M total

Later we can simply run xfstest/fio/fsx on this image.
I'll prepare new xfstest based on that idea. But the only disadvantage
is that loop dev has bottleneck, all requests will be serialized on i_mutex.
>
> Cheers,
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-02-27 21:30:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Thu, Feb 28, 2013 at 12:58:55AM +0400, Dmitry Monakhov wrote:
> Indeed. That's why i give-up rotated disks and run xfstest only on SSD
> or brd module

Note that some problems only show up on slower devices (because then
the race window is wider), and sometimes they only show up on fast
devices. This is why I've been doing runs on both HDD's and on tmpfs
kvm disk image files.

> Later we can simply run xfstest/fio/fsx on this image.
> I'll prepare new xfstest based on that idea. But the only disadvantage
> is that loop dev has bottleneck, all requests will be serialized on i_mutex.

If the image file is used as a disk file for KVM, that should avoid
the serialization bottleneck of using the loop device.

Regards,

- Ted

2013-03-01 03:30:05

by Dave Jones

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Wed, Feb 27, 2013 at 03:12:17PM -0500, Linus Torvalds wrote:
> The following changes since commit 304e220f0879198b1f5309ad6f0be862b4009491:
>
> ext4: fix free clusters calculation in bigalloc filesystem (2013-02-22 15:27:52 -0500)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus
>
> for you to fetch changes up to 8e919d13048cd5acaadb2b15b48acbfb8832d3c2:
>
> ext4: fix extent status tree regression for file systems > 512GB (2013-02-27 14:54:37 -0500)
>
> ----------------------------------------------------------------
> This fixes a real brown paper bag bug which causes ext4 to choke on
> file systems larger than 512GB.
>
> ----------------------------------------------------------------
> Theodore Ts'o (1):
> ext4: fix extent status tree regression for file systems > 512GB
>
> fs/ext4/extents_status.h | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)

This has fixed the problem I reported, but I notice now that my desktop is really
sluggish. perf top shows it's almost constantly spinning in ext4_es_reclaim_extents_count

Any ideas ?

Dave

2013-03-01 04:00:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Thu, Feb 28, 2013 at 10:30:05PM -0500, Dave Jones wrote:
>
> This has fixed the problem I reported, but I notice now that my
> desktop is really sluggish. perf top shows it's almost constantly
> spinning in ext4_es_reclaim_extents_count
>
> Any ideas ?

ext4_es_reclaim_extents_count() is getting called out of the slab
shrinker. It's getting called too often when there is significant
memory pressure. We can optimize this so we're not calculating it all
the time.

- Ted

2013-02-28 13:18:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On Wed, Feb 27, 2013 at 02:29:07PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
> >
> > Looks like it's fixed here too.
> >
> > How did this make it through -next without anyone hitting it ?
>
> > Is anyone running xfstests or similar on linux-next regularly ?
>
> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
> before I submitted a pull request. The problem is that XFSTESTS is
> S-L-O-W if you use large partitions, so typically I use a 5GB
> partition sizes for my test runs.

This isn't the case for XFS. I typically see 1TB scratch devices
only being ~10-20% slower than 10GB scratch devices, and 10TB only
being a little slower than 1TB scratch devices. I have to use sparse
devices and --large-fs for 100TB filesystem testing, so I can't
directly compare the speeds to those that I run on physical devices.
However I can say that it isn't significantly slower than using
small scratch devices...

> So what we probably need to do is to have a separate set of tests
> using a loopback mount, and perhaps an artificially created file
> system which has a large percentage of the blocks in the middle of the
> file system busied out, to make efficient testing of these sorts of

That's exactly what the --large-fs patch set I posted months ago does
for ext4 - it uses fallocate() to fill all but 50GB of the large
filesystem without actually writing any data and runs the standard
tests in the remaining unused space.

However, last time I tested ext4 with this patchset (when I posted
the patches months ago), multi-TB preallocation on ext4 was still too
slow to make it practical for testing on devices larger than 2-3TB.
Perhaps it would make testing 1-2TB ext4 filesystems fast enough
that you could do it regularly...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-03-01 05:00:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: optimize ext4_es_shrink()

When the system is under memory pressure, ext4_es_srhink() will get
called very often. So optimize returning the number of items in the
file system's extent status cache by keeping a per-filesystem count,
instead of calculating it each time by scanning all of the inodes in
the extent status cache.

Also rename the slab used for the extent status cache to be
"ext4_extent_status" so it's obviousl the slab in question is created
by ext4.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents_status.c | 39 +++++++++++++--------------------------
include/trace/events/ext4.h | 40 ++++++++++++----------------------------
3 files changed, 26 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6e16c18..96c1093 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1268,6 +1268,7 @@ struct ext4_sb_info {
atomic_t s_mb_preallocated;
atomic_t s_mb_discarded;
atomic_t s_lock_busy;
+ atomic_t s_extent_cache_cnt;

/* locality groups */
struct ext4_locality_group __percpu *s_locality_groups;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index f768f4a..27fcdd2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -147,11 +147,12 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
int nr_to_scan);
-static int ext4_es_reclaim_extents_count(struct super_block *sb);

int __init ext4_init_es(void)
{
- ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
+ ext4_es_cachep = kmem_cache_create("ext4_extent_status",
+ sizeof(struct extent_status),
+ 0, (SLAB_RECLAIM_ACCOUNT), NULL);
if (ext4_es_cachep == NULL)
return -ENOMEM;
return 0;
@@ -302,8 +303,10 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
/*
* We don't count delayed extent because we never try to reclaim them
*/
- if (!ext4_es_is_delayed(es))
+ if (!ext4_es_is_delayed(es)) {
EXT4_I(inode)->i_es_lru_nr++;
+ atomic_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ }

return es;
}
@@ -314,6 +317,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
if (!ext4_es_is_delayed(es)) {
BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
EXT4_I(inode)->i_es_lru_nr--;
+ atomic_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
}

kmem_cache_free(ext4_es_cachep, es);
@@ -674,10 +678,11 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk = 0;

- trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan);
+ ret = atomic_read(&sbi->s_extent_cache_cnt);
+ trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
- return ext4_es_reclaim_extents_count(sbi->s_sb);
+ return ret;

INIT_LIST_HEAD(&scanned);

@@ -705,9 +710,10 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
}
list_splice_tail(&scanned, &sbi->s_es_lru);
spin_unlock(&sbi->s_es_lru_lock);
- trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk);

- return ext4_es_reclaim_extents_count(sbi->s_sb);
+ ret = atomic_read(&sbi->s_extent_cache_cnt);
+ trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
+ return ret;
}

void ext4_es_register_shrinker(struct super_block *sb)
@@ -751,25 +757,6 @@ void ext4_es_lru_del(struct inode *inode)
spin_unlock(&sbi->s_es_lru_lock);
}

-static int ext4_es_reclaim_extents_count(struct super_block *sb)
-{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct ext4_inode_info *ei;
- struct list_head *cur;
- int nr_cached = 0;
-
- spin_lock(&sbi->s_es_lru_lock);
- list_for_each(cur, &sbi->s_es_lru) {
- ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
- read_lock(&ei->i_es_lock);
- nr_cached += ei->i_es_lru_nr;
- read_unlock(&ei->i_es_lock);
- }
- spin_unlock(&sbi->s_es_lru_lock);
- trace_ext4_es_reclaim_extents_count(sb, nr_cached);
- return nr_cached;
-}
-
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
int nr_to_scan)
{
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index c0457c0..4ee4710 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2255,64 +2255,48 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
__entry->found ? __entry->status : 0)
);

-TRACE_EVENT(ext4_es_reclaim_extents_count,
- TP_PROTO(struct super_block *sb, int nr_cached),
-
- TP_ARGS(sb, nr_cached),
-
- TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( int, nr_cached )
- ),
-
- TP_fast_assign(
- __entry->dev = sb->s_dev;
- __entry->nr_cached = nr_cached;
- ),
-
- TP_printk("dev %d,%d cached objects nr %d",
- MAJOR(__entry->dev), MINOR(__entry->dev),
- __entry->nr_cached)
-);
-
TRACE_EVENT(ext4_es_shrink_enter,
- TP_PROTO(struct super_block *sb, int nr_to_scan),
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),

- TP_ARGS(sb, nr_to_scan),
+ TP_ARGS(sb, nr_to_scan, cache_cnt),

TP_STRUCT__entry(
__field( dev_t, dev )
__field( int, nr_to_scan )
+ __field( int, cache_cnt )
),

TP_fast_assign(
__entry->dev = sb->s_dev;
__entry->nr_to_scan = nr_to_scan;
+ __entry->cache_cnt = cache_cnt;
),

- TP_printk("dev %d,%d nr to scan %d",
+ TP_printk("dev %d,%d nr_to_scan %d cache_cnt %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
- __entry->nr_to_scan)
+ __entry->nr_to_scan, __entry->cache_cnt)
);

TRACE_EVENT(ext4_es_shrink_exit,
- TP_PROTO(struct super_block *sb, int shrunk_nr),
+ TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),

- TP_ARGS(sb, shrunk_nr),
+ TP_ARGS(sb, shrunk_nr, cache_cnt),

TP_STRUCT__entry(
__field( dev_t, dev )
__field( int, shrunk_nr )
+ __field( int, cache_cnt )
),

TP_fast_assign(
__entry->dev = sb->s_dev;
__entry->shrunk_nr = shrunk_nr;
+ __entry->cache_cnt = cache_cnt;
),

- TP_printk("dev %d,%d nr to scan %d",
+ TP_printk("dev %d,%d shrunk_nr %d cache_cnt %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
- __entry->shrunk_nr)
+ __entry->shrunk_nr, __entry->cache_cnt)
);

#endif /* _TRACE_EXT4_H */
--
1.7.12.rc0.22.gcdd159b


2013-03-01 15:41:28

by Eric Sandeen

[permalink] [raw]
Subject: Re: [GIT PULL] ext4 updates for 3.9

On 2/27/13 2:58 PM, Dmitry Monakhov wrote:
> On Wed, 27 Feb 2013 14:29:07 -0500, Theodore Ts'o <[email protected]> wrote:
>> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
>>>
>>> Looks like it's fixed here too.
>>>
>>> How did this make it through -next without anyone hitting it ?
>>>
>>> I can't remember how many years ago I last bought a disk < 1TB,
>>> and I can't be alone. Or is everyone all about SSDs these days?
>>
>> I use LVM, so I have a number of volues which are smaler than 512GB,
>> but very few which are actually larger than 1TB. And none on my test
>> boxes. I was running the bleeding edge ext4 code on my laptop as for
>> dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
>> my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
>> come any bigger, alas).
>>
>>> Is anyone running xfstests or similar on linux-next regularly ?
>>
>> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
>> before I submitted a pull request. The problem is that XFSTESTS is
>> S-L-O-W if you use large partitions, so typically I use a 5GB
> Indeed. That's why i give-up rotated disks and run xfstest only on SSD
> or brd module
>> partition sizes for my test runs. Normally we're worried about race
>> condition bugs, not something as bone-headed as a bitmasking problem,
>> so it makes sense to use a smaller disk for most of your testing.
>> (Some folks do their xfstests run on SSD's or tmpfs image files, again
>> for speed reasons, and it's unlikely they would be big enough.)
>>
>> So what we probably need to do is to have a separate set of tests
>> using a loopback mount, and perhaps an artificially created file
>> system which has a large percentage of the blocks in the middle of the
>> file system busied out, to make efficient testing of these sorts of
>> bugs more efficient. As I said, I'm thinking about how's the best way
>> to improve our testing regime to catch such problems the next time around.
> Amazing idea. Something like:
>
> #dd if=/dev/zero of=/tmp/fs.img bs=1M seek=2000000 count=1
> #mkfs.ext4 -m0 -i4096000 /tmp/fs.img
> #mount /tmp/fs.img /mnt/ -oloop
> #for ((i=0; i < 2000; i++));do fallocate -l $((1024*1024*1024)) /mnt/f$i ;done
> #for ((i=0; i < 2000; i++));do truncate -s $((1023*1024*1024)) /mnt/f$i ;done
>
> As result file system image has 2gb of free space wich is fragmented to ~2000
> chunks 1Mb each. But image itself is quite small
> # df /mnt
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/loop0 2047678076 2045679228 1998848 100% /mnt
> # du -sch /tmp/fs.img
> 242M /tmp/fs.img
> 242M total
>
> Later we can simply run xfstest/fio/fsx on this image.
> I'll prepare new xfstest based on that idea. But the only disadvantage
> is that loop dev has bottleneck, all requests will be serialized on i_mutex.

Before anyone does too much work, it would be worth revisiting
dchinner's
[PATCH 0/10] xfstests: rework large filesystem testing
series from July 2012 to see if it meets the needs already.

It almost got all reviews, with one sticking point left, AFAICT.

-Eric

2013-03-01 16:11:41

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] ext4: optimize ext4_es_shrink()

On Fri, Mar 01, 2013 at 12:00:29AM -0500, Theodore Ts'o wrote:
> When the system is under memory pressure, ext4_es_srhink() will get
> called very often. So optimize returning the number of items in the
> file system's extent status cache by keeping a per-filesystem count,
> instead of calculating it each time by scanning all of the inodes in
> the extent status cache.
>
> Also rename the slab used for the extent status cache to be
> "ext4_extent_status" so it's obviousl the slab in question is created
> by ext4.

Seems to work with no ill effects afaics.

thanks,

Dave


2013-03-01 16:26:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: optimize ext4_es_shrink()

On Fri, Mar 01, 2013 at 11:11:30AM -0500, Dave Jones wrote:
> On Fri, Mar 01, 2013 at 12:00:29AM -0500, Theodore Ts'o wrote:
> > When the system is under memory pressure, ext4_es_srhink() will get
> > called very often. So optimize returning the number of items in the
> > file system's extent status cache by keeping a per-filesystem count,
> > instead of calculating it each time by scanning all of the inodes in
> > the extent status cache.
> >
> > Also rename the slab used for the extent status cache to be
> > "ext4_extent_status" so it's obviousl the slab in question is created
> > by ext4.
>
> Seems to work with no ill effects afaics.

Thanks for reporting the problem and testing the fix!

I'll add a Reported-by: and Tested-by: Dave Jones <[email protected]>
to the commit. (Unless of course you have an objection, in which case
let me know.)

- Ted

2013-03-01 16:40:43

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: optimize ext4_es_shrink()

On 2/28/13 11:00 PM, Theodore Ts'o wrote:
> When the system is under memory pressure, ext4_es_srhink() will get
> called very often. So optimize returning the number of items in the
> file system's extent status cache by keeping a per-filesystem count,
> instead of calculating it each time by scanning all of the inodes in
> the extent status cache.
>
> Also rename the slab used for the extent status cache to be
> "ext4_extent_status" so it's obviousl the slab in question is created
> by ext4.

Certainly better than walking an arbitrarily long list. :)
So:

Reviewed-by: Eric Sandeen <[email protected]>

I was wondering a couple things, though -

1) should this one be scaled by the vfs_cache_pressure sysctl?

2) Also, given that this is only for shrinker accounting, do we need the
precision of the atomic counter? I see that quota uses a per-cpu counter.
Would a percpu counter be any more efficient? I'll follow
w/ a patch.

-Eric

2013-03-01 16:40:54

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] ext4: optimize ext4_es_shrink()

On Fri, Mar 01, 2013 at 11:26:51AM -0500, Theodore Ts'o wrote:
> On Fri, Mar 01, 2013 at 11:11:30AM -0500, Dave Jones wrote:
> > On Fri, Mar 01, 2013 at 12:00:29AM -0500, Theodore Ts'o wrote:
> > > When the system is under memory pressure, ext4_es_srhink() will get
> > > called very often. So optimize returning the number of items in the
> > > file system's extent status cache by keeping a per-filesystem count,
> > > instead of calculating it each time by scanning all of the inodes in
> > > the extent status cache.
> > >
> > > Also rename the slab used for the extent status cache to be
> > > "ext4_extent_status" so it's obviousl the slab in question is created
> > > by ext4.
> >
> > Seems to work with no ill effects afaics.
>
> Thanks for reporting the problem and testing the fix!
>
> I'll add a Reported-by: and Tested-by: Dave Jones <[email protected]>
> to the commit. (Unless of course you have an objection, in which case
> let me know.)

Sure, that's fine.

Dave


2013-03-01 16:42:31

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH] ext4: use percpu counter for extent cache count

Use a percpu counter rather than atomic types for shrinker accounting.
There's no need for ultimate accuracy in the shrinker, so this
should come a little more cheaply. The percpu struct is somewhat
large, but there was a big gap before the cache-aligned
s_es_lru_lock anyway, and it fits nicely in there.

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 96c1093..4a01ba3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1268,7 +1268,6 @@ struct ext4_sb_info {
atomic_t s_mb_preallocated;
atomic_t s_mb_discarded;
atomic_t s_lock_busy;
- atomic_t s_extent_cache_cnt;

/* locality groups */
struct ext4_locality_group __percpu *s_locality_groups;
@@ -1310,6 +1309,7 @@ struct ext4_sb_info {
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
struct list_head s_es_lru;
+ struct percpu_counter s_extent_cache_cnt;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 27fcdd2..95796a1 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -305,7 +305,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
*/
if (!ext4_es_is_delayed(es)) {
EXT4_I(inode)->i_es_lru_nr++;
- atomic_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
}

return es;
@@ -317,7 +317,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
if (!ext4_es_is_delayed(es)) {
BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
EXT4_I(inode)->i_es_lru_nr--;
- atomic_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
}

kmem_cache_free(ext4_es_cachep, es);
@@ -678,7 +678,7 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk = 0;

- ret = atomic_read(&sbi->s_extent_cache_cnt);
+ ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
@@ -711,7 +711,7 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
list_splice_tail(&scanned, &sbi->s_es_lru);
spin_unlock(&sbi->s_es_lru_lock);

- ret = atomic_read(&sbi->s_extent_cache_cnt);
+ ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
return ret;
}


2013-03-01 18:00:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: use percpu counter for extent cache count

On Fri, Mar 01, 2013 at 10:42:25AM -0600, Eric Sandeen wrote:
> Use a percpu counter rather than atomic types for shrinker accounting.
> There's no need for ultimate accuracy in the shrinker, so this
> should come a little more cheaply. The percpu struct is somewhat
> large, but there was a big gap before the cache-aligned
> s_es_lru_lock anyway, and it fits nicely in there.

I thought about using percpu counters, but I was worried about the
size on really big machines. OTOH, it will be the really large NUMA
machines where atomic_t will really hurt, so maybe we should use
percpu countesr and not really worry about it. It's on a per file
system basis, so even if it is a few hundred bytes it shouldn't break
the bank.

- Ted

2013-03-01 18:02:42

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: use percpu counter for extent cache count

On 3/1/13 12:00 PM, Theodore Ts'o wrote:
> On Fri, Mar 01, 2013 at 10:42:25AM -0600, Eric Sandeen wrote:
>> Use a percpu counter rather than atomic types for shrinker accounting.
>> There's no need for ultimate accuracy in the shrinker, so this
>> should come a little more cheaply. The percpu struct is somewhat
>> large, but there was a big gap before the cache-aligned
>> s_es_lru_lock anyway, and it fits nicely in there.
>
> I thought about using percpu counters, but I was worried about the
> size on really big machines. OTOH, it will be the really large NUMA
> machines where atomic_t will really hurt, so maybe we should use
> percpu countesr and not really worry about it. It's on a per file
> system basis, so even if it is a few hundred bytes it shouldn't break
> the bank.
>
> - Ted
>

I was mostly keying off what quota felt was best, I guess.
I'm not wedded to either approach, it was just a thought.
So you can take it or leave it. :)

Thanks,
-Eric

2013-03-02 15:26:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: use percpu counter for extent cache count

On Fri, Mar 01, 2013 at 12:02:34PM -0600, Eric Sandeen wrote:
>
> I was mostly keying off what quota felt was best, I guess.
> I'm not wedded to either approach, it was just a thought.
> So you can take it or leave it. :)

I'll take it, but for the record, there's a reason why I label patches
with "I haven't tested this yet" disclaimers.

For future reference, calling percpu_counter_init() and
percpu_counter_destroy() is not optional --- without them, in your
original patch, the kernel will crash them the first time an ext4 file
system mount is attempted.

No worries though, no harm done --- but it's better if people don't
rely on me to catch bugs before they go into the tree. :-)

- Ted

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 373d46c..1ae5860 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -783,6 +783,7 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_counter_destroy(&sbi->s_extent_cache_cnt);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < MAXQUOTAS; i++)
@@ -3688,6 +3689,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!err) {
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0);
}
+ if (!err) {
+ err = percpu_counter_init(&sbi->s_extent_cache_cnt, 0);
+ }
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount3;
@@ -3993,6 +3997,7 @@ failed_mount3:
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
+ percpu_counter_destroy(&sbi->s_extent_cache_cnt);
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
failed_mount2:

2013-03-02 19:54:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Thu, Feb 28, 2013 at 8:00 PM, Theodore Ts'o <[email protected]> wrote:
>
> ext4_es_reclaim_extents_count() is getting called out of the slab
> shrinker. It's getting called too often when there is significant
> memory pressure. We can optimize this so we're not calculating it all
> the time.

This needs to be fixed or reverted. I traced back some user
interaction problems to this same issue. It literally gets so bad that
the whole system is choppy, and a profile shows that a lot of time is
being spent in the spin-lock protecting these data structures. Called
from ext4_es_reclaim_extents_count, ext4_es_lru_add and
ext4_es_shrink.

We're talking very user-noticeable pauses, so it's not just the
spinlock, there's some real badness that comes in with it too.
Possibly related to other (sleeping) locks being held while the data
structures are then refilled or something.

Linus

2013-03-02 23:15:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [GIT PULL URGENT] ext4 regression fix for 3.9

On Sat, Mar 02, 2013 at 11:54:37AM -0800, Linus Torvalds wrote:
>
> This needs to be fixed or reverted. I traced back some user
> interaction problems to this same issue. It literally gets so bad that
> the whole system is choppy, and a profile shows that a lot of time is
> being spent in the spin-lock protecting these data structures. Called
> from ext4_es_reclaim_extents_count, ext4_es_lru_add and
> ext4_es_shrink.

I'm doing final testing on a pull request that will fix these
problems. I'll send it to you later this evening.

- Ted

2013-03-03 16:24:40

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: use percpu counter for extent cache count

On Sat, Mar 02, 2013 at 10:26:44AM -0500, Theodore Ts'o wrote:
> On Fri, Mar 01, 2013 at 12:02:34PM -0600, Eric Sandeen wrote:
> >
> > I was mostly keying off what quota felt was best, I guess.
> > I'm not wedded to either approach, it was just a thought.
> > So you can take it or leave it. :)
>
> I'll take it, but for the record, there's a reason why I label patches
> with "I haven't tested this yet" disclaimers.
>
> For future reference, calling percpu_counter_init() and
> percpu_counter_destroy() is not optional --- without them, in your
> original patch, the kernel will crash them the first time an ext4 file
> system mount is attempted.
>
> No worries though, no harm done --- but it's better if people don't
> rely on me to catch bugs before they go into the tree. :-)

Really sorry for the delay reply.

Dave, thank you very much for your test and reporting this regression.

Ted, Eric, thank you very much for fixing it.

I really think I need to be responsible for reviewing and testing this
patch because I screw it up. My apology.

Thanks,
- Zheng

2013-03-04 16:11:27

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: use percpu counter for extent cache count

On 3/2/13 9:26 AM, Theodore Ts'o wrote:
> On Fri, Mar 01, 2013 at 12:02:34PM -0600, Eric Sandeen wrote:
>>
>> I was mostly keying off what quota felt was best, I guess.
>> I'm not wedded to either approach, it was just a thought.
>> So you can take it or leave it. :)
>
> I'll take it, but for the record, there's a reason why I label patches
> with "I haven't tested this yet" disclaimers.
>
> For future reference, calling percpu_counter_init() and
> percpu_counter_destroy() is not optional --- without them, in your
> original patch, the kernel will crash them the first time an ext4 file
> system mount is attempted.
>
> No worries though, no harm done --- but it's better if people don't
> rely on me to catch bugs before they go into the tree. :-)

Oh crud, I'm really sorry. I originally had the patch embedded
in a reply, and it did say "not tested" and was more of a "what do
you think about this" - then I elevated it to its own mail in the
thread and I lost the disclaimer. Entirely my fault. :(

That's two stupid-patch-demerits for me from Friday. Only one
one ext4. ;)

-Eric

> - Ted
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 373d46c..1ae5860 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -783,6 +783,7 @@ static void ext4_put_super(struct super_block *sb)
> percpu_counter_destroy(&sbi->s_freeinodes_counter);
> percpu_counter_destroy(&sbi->s_dirs_counter);
> percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
> + percpu_counter_destroy(&sbi->s_extent_cache_cnt);
> brelse(sbi->s_sbh);
> #ifdef CONFIG_QUOTA
> for (i = 0; i < MAXQUOTAS; i++)
> @@ -3688,6 +3689,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> if (!err) {
> err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0);
> }
> + if (!err) {
> + err = percpu_counter_init(&sbi->s_extent_cache_cnt, 0);
> + }
> if (err) {
> ext4_msg(sb, KERN_ERR, "insufficient memory");
> goto failed_mount3;
> @@ -3993,6 +3997,7 @@ failed_mount3:
> percpu_counter_destroy(&sbi->s_freeinodes_counter);
> percpu_counter_destroy(&sbi->s_dirs_counter);
> percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
> + percpu_counter_destroy(&sbi->s_extent_cache_cnt);
> if (sbi->s_mmp_tsk)
> kthread_stop(sbi->s_mmp_tsk);
> failed_mount2:
>