2017-01-12 03:49:49

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 0/4] ext4 inline cleanup patches

George Spelvin reported a deadlock problem when using a file system
with inline data enabled when it tries to expand the inode's extra
inode field space. The second patch is the critical one; the last two
are some cleanups.

Theodore Ts'o (4):
ext4: add debug_want_extra_isize mount option
ext4: fix deadlock between inline_data and
ext4_expand_extra_isize_ea()
ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores
ext4: propagate error values from ext4_inline_data_truncate()

fs/ext4/ext4.h | 2 +-
fs/ext4/inline.c | 111 +++++++++++++++++++++++++++----------------------------
fs/ext4/inode.c | 6 ++-
fs/ext4/super.c | 9 ++++-
fs/ext4/xattr.c | 30 ++++++---------
fs/ext4/xattr.h | 32 ++++++++++++++++
6 files changed, 110 insertions(+), 80 deletions(-)

--
2.11.0.rc0.7.gbe5a750



2017-01-12 03:49:49

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH -v2] overlay: stress test changes to top and bottom layers simultaneously

Introduce a test which runs fsstress on the top and bottom overlayfs
directories simultaneously to find potential races that plagued wrapfs
derived file systems.

Signed-off-by: Theodore Ts'o <[email protected]>
---
tests/overlay/019 | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
tests/overlay/019.out | 2 ++
tests/overlay/group | 1 +
3 files changed, 84 insertions(+)
create mode 100644 tests/overlay/019
create mode 100644 tests/overlay/019.out

diff --git a/tests/overlay/019 b/tests/overlay/019
new file mode 100644
index 00000000..1daf830f
--- /dev/null
+++ b/tests/overlay/019
@@ -0,0 +1,81 @@
+#! /bin/bash
+# FS QA Test 019
+#
+# Run fsstress on lower dir and top dir at the same time
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+ cd /
+ rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs overlay
+_supported_os Linux
+_require_scratch
+
+# Remove all files from previous tests
+_scratch_mkfs
+
+lowerdir=$SCRATCH_DEV/$OVERLAY_LOWER_DIR
+mkdir -p $lowerdir
+
+_scratch_mount
+
+echo "Silence is golden"
+
+d_low=$lowerdir/fsstress
+d_top=$SCRATCH_MNT/fsstress
+mkdir -p $d_low $d_top
+
+echo $FSSTRESS_PROG -s 42 -d $d_low -p 4 -n 1000 -l100 -v > $seqres.full.1
+$FSSTRESS_PROG -s 42 -d $d_low -p 4 -n 1000 -l100 -v >> $seqres.full.1 2>&1 &
+
+echo $FSSTRESS_PROG -s 42 -d $d_top -p 4 -n 1000 -l100 -v > $seqres.full.2
+$FSSTRESS_PROG -s 42 -d $d_top -p 4 -n 1000 -l100 -v >> $seqres.full.2 2>&1 &
+
+ret=0
+if ! wait %1
+then
+ echo "--------------------------------------" >>$seqres.full.1
+ echo "fsstress on lower directory returned $? - see $seqres.full.1"
+ echo "--------------------------------------" >>$seqres.full.1
+ ret=1
+fi
+
+if ! wait %2
+then
+ echo "--------------------------------------" >>$seqres.full.2
+ echo "fsstress on overlay directory returned $? - see $seqres.full.2"
+ echo "--------------------------------------" >>$seqres.full.2
+ ret=1
+fi
+
+cat $seqres.full.1 $seqres.full.2 > $seqres.full
+rm $seqres.full.1 $seqres.full.2
+
+if [ "$ret" -eq 1 ]
+then
+ status=1
+else
+ status=0
+fi
+
+exit $status
diff --git a/tests/overlay/019.out b/tests/overlay/019.out
new file mode 100644
index 00000000..163484bc
--- /dev/null
+++ b/tests/overlay/019.out
@@ -0,0 +1,2 @@
+QA output created by 019
+Silence is golden
diff --git a/tests/overlay/group b/tests/overlay/group
index 74822cbf..f8de84be 100644
--- a/tests/overlay/group
+++ b/tests/overlay/group
@@ -21,3 +21,4 @@
016 auto quick copyup
017 auto quick copyup
018 auto quick copyup
+019 quick stress
--
2.11.0.rc0.7.gbe5a750


2017-01-12 03:49:57

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 4/4] ext4: propagate error values from ext4_inline_data_truncate()

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/ext4.h | 2 +-
fs/ext4/inline.c | 40 +++++++++++++++++++++++-----------------
fs/ext4/inode.c | 4 +++-
3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6bcb9622fdf9..1cd077e02517 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3023,7 +3023,7 @@ extern int ext4_inline_data_fiemap(struct inode *inode,
extern int ext4_try_to_evict_inline_data(handle_t *handle,
struct inode *inode,
int needed);
-extern void ext4_inline_data_truncate(struct inode *inode, int *has_inline);
+extern int ext4_inline_data_truncate(struct inode *inode, int *has_inline);

extern int ext4_convert_inline_data(struct inode *inode);

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 31f98dd04e51..accd665c5dc0 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1896,10 +1896,10 @@ int ext4_try_to_evict_inline_data(handle_t *handle,
return error;
}

-void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
+int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
{
handle_t *handle;
- int inline_size, value_len, needed_blocks, no_expand;
+ int inline_size, value_len, needed_blocks, no_expand, err = 0;
size_t i_size;
void *value = NULL;
struct ext4_xattr_ibody_find is = {
@@ -1914,19 +1914,19 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
needed_blocks = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_INODE, needed_blocks);
if (IS_ERR(handle))
- return;
+ return PTR_ERR(handle);

ext4_write_lock_xattr(inode, &no_expand);
if (!ext4_has_inline_data(inode)) {
*has_inline = 0;
ext4_journal_stop(handle);
- return;
+ return 0;
}

- if (ext4_orphan_add(handle, inode))
+ if ((err = ext4_orphan_add(handle, inode)) != 0)
goto out;

- if (ext4_get_inode_loc(inode, &is.iloc))
+ if ((err = ext4_get_inode_loc(inode, &is.iloc)) != 0)
goto out;

down_write(&EXT4_I(inode)->i_data_sem);
@@ -1937,24 +1937,29 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
if (i_size < inline_size) {
/* Clear the content in the xattr space. */
if (inline_size > EXT4_MIN_INLINE_DATA_SIZE) {
- if (ext4_xattr_ibody_find(inode, &i, &is))
+ if ((err = ext4_xattr_ibody_find(inode, &i, &is)) != 0)
goto out_error;

BUG_ON(is.s.not_found);

value_len = le32_to_cpu(is.s.here->e_value_size);
value = kmalloc(value_len, GFP_NOFS);
- if (!value)
+ if (!value) {
+ err = -ENOMEM;
goto out_error;
+ }

- if (ext4_xattr_ibody_get(inode, i.name_index, i.name,
- value, value_len))
+ err = ext4_xattr_ibody_get(inode, i.name_index,
+ i.name, value, value_len);
+ if (err)
goto out_error;

i.value = value;
i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
- if (ext4_xattr_ibody_inline_set(handle, inode, &i, &is))
+ err = ext4_xattr_ibody_inline_set(handle, inode,
+ &i, &is);
+ if (err)
goto out_error;
}

@@ -1979,13 +1984,14 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
if (inode->i_nlink)
ext4_orphan_del(handle, inode);

- inode->i_mtime = inode->i_ctime = current_time(inode);
- ext4_mark_inode_dirty(handle, inode);
- if (IS_SYNC(inode))
- ext4_handle_sync(handle);
-
+ if (err == 0) {
+ inode->i_mtime = inode->i_ctime = current_time(inode);
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (IS_SYNC(inode))
+ ext4_handle_sync(handle);
+ }
ext4_journal_stop(handle);
- return;
+ return err;
}

int ext4_convert_inline_data(struct inode *inode)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 86dde0667ccc..1e2c881f102d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4222,7 +4222,9 @@ int ext4_truncate(struct inode *inode)
if (ext4_has_inline_data(inode)) {
int has_inline = 1;

- ext4_inline_data_truncate(inode, &has_inline);
+ err = ext4_inline_data_truncate(inode, &has_inline);
+ if (err)
+ return err;
if (has_inline)
return 0;
}
--
2.11.0.rc0.7.gbe5a750


2017-01-12 03:49:58

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/4] ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

The xattr_sem deadlock problems fixed in commit 2e81a4eeedca: "ext4:
avoid deadlock when expanding inode size" didn't include the use of
xattr_sem in fs/ext4/inline.c. With the addition of project quota
which added a new extra inode field, this exposed deadlocks in the
inline_data code similar to the ones fixed by 2e81a4eeedca.

The deadlock can be reproduced via:

dmesg -n 7
mke2fs -t ext4 -O inline_data -Fq -I 256 /dev/vdc 32768
mount -t ext4 -o debug_want_extra_isize=24 /dev/vdc /vdc
mkdir /vdc/a
umount /vdc
mount -t ext4 /dev/vdc /vdc
echo foo > /vdc/a/foo

and looks like this:

[ 11.158815]
[ 11.160276] =============================================
[ 11.161960] [ INFO: possible recursive locking detected ]
[ 11.161960] 4.10.0-rc3-00015-g011b30a8a3cf #160 Tainted: G W
[ 11.161960] ---------------------------------------------
[ 11.161960] bash/2519 is trying to acquire lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1225a4b>] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960]
[ 11.161960] but task is already holding lock:
[ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] other info that might help us debug this:
[ 11.161960] Possible unsafe locking scenario:
[ 11.161960]
[ 11.161960] CPU0
[ 11.161960] ----
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960] lock(&ei->xattr_sem);
[ 11.161960]
[ 11.161960] *** DEADLOCK ***
[ 11.161960]
[ 11.161960] May be due to missing lock nesting notation
[ 11.161960]
[ 11.161960] 4 locks held by bash/2519:
[ 11.161960] #0: (sb_writers#3){.+.+.+}, at: [<c11a2414>] mnt_want_write+0x1e/0x3e
[ 11.161960] #1: (&type->i_mutex_dir_key){++++++}, at: [<c119508b>] path_openat+0x338/0x67a
[ 11.161960] #2: (jbd2_handle){++++..}, at: [<c123314a>] start_this_handle+0x582/0x622
[ 11.161960] #3: (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[ 11.161960]
[ 11.161960] stack backtrace:
[ 11.161960] CPU: 0 PID: 2519 Comm: bash Tainted: G W 4.10.0-rc3-00015-g011b30a8a3cf #160
[ 11.161960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 11.161960] Call Trace:
[ 11.161960] dump_stack+0x72/0xa3
[ 11.161960] __lock_acquire+0xb7c/0xcb9
[ 11.161960] ? kvm_clock_read+0x1f/0x29
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] ? __lock_is_held+0x36/0x66
[ 11.161960] lock_acquire+0x106/0x18a
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] down_write+0x39/0x72
[ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ext4_expand_extra_isize_ea+0x3d/0x4cd
[ 11.161960] ? _raw_read_unlock+0x22/0x2c
[ 11.161960] ? jbd2_journal_extend+0x1e2/0x262
[ 11.161960] ? __ext4_journal_get_write_access+0x3d/0x60
[ 11.161960] ext4_mark_inode_dirty+0x17d/0x26d
[ 11.161960] ? ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[ 11.161960] ext4_try_add_inline_entry+0x69/0x152
[ 11.161960] ext4_add_entry+0xa3/0x848
[ 11.161960] ? __brelse+0x14/0x2f
[ 11.161960] ? _raw_spin_unlock_irqrestore+0x44/0x4f
[ 11.161960] ext4_add_nondir+0x17/0x5b
[ 11.161960] ext4_create+0xcf/0x133
[ 11.161960] ? ext4_mknod+0x12f/0x12f
[ 11.161960] lookup_open+0x39e/0x3fb
[ 11.161960] ? __wake_up+0x1a/0x40
[ 11.161960] ? lock_acquire+0x11e/0x18a
[ 11.161960] path_openat+0x35c/0x67a
[ 11.161960] ? sched_clock_cpu+0xd7/0xf2
[ 11.161960] do_filp_open+0x36/0x7c
[ 11.161960] ? _raw_spin_unlock+0x22/0x2c
[ 11.161960] ? __alloc_fd+0x169/0x173
[ 11.161960] do_sys_open+0x59/0xcc
[ 11.161960] SyS_open+0x1d/0x1f
[ 11.161960] do_int80_syscall_32+0x4f/0x61
[ 11.161960] entry_INT80_32+0x2f/0x2f
[ 11.161960] EIP: 0xb76ad469
[ 11.161960] EFLAGS: 00000286 CPU: 0
[ 11.161960] EAX: ffffffda EBX: 08168ac8 ECX: 00008241 EDX: 000001b6
[ 11.161960] ESI: b75e46bc EDI: b7755000 EBP: bfbdb108 ESP: bfbdafc0
[ 11.161960] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Cc: [email protected] # 3.10 (requires 2e81a4eeedca as a prereq)
Reported-by: George Spelvin <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inline.c | 66 ++++++++++++++++++++++++++------------------------------
fs/ext4/xattr.c | 30 +++++++++++---------------
fs/ext4/xattr.h | 32 +++++++++++++++++++++++++++
3 files changed, 74 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 437df6a1a841..99a5312ced52 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -381,7 +381,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
static int ext4_prepare_inline_data(handle_t *handle, struct inode *inode,
unsigned int len)
{
- int ret, size;
+ int ret, size, no_expand;
struct ext4_inode_info *ei = EXT4_I(inode);

if (!ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
@@ -391,15 +391,14 @@ static int ext4_prepare_inline_data(handle_t *handle, struct inode *inode,
if (size < len)
return -ENOSPC;

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);

if (ei->i_inline_off)
ret = ext4_update_inline_data(handle, inode, len);
else
ret = ext4_create_inline_data(handle, inode, len);

- up_write(&EXT4_I(inode)->xattr_sem);
-
+ ext4_write_unlock_xattr(inode, &no_expand);
return ret;
}

@@ -533,7 +532,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
struct inode *inode,
unsigned flags)
{
- int ret, needed_blocks;
+ int ret, needed_blocks, no_expand;
handle_t *handle = NULL;
int retries = 0, sem_held = 0;
struct page *page = NULL;
@@ -573,7 +572,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
goto out;
}

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);
sem_held = 1;
/* If some one has already done this for us, just exit. */
if (!ext4_has_inline_data(inode)) {
@@ -610,7 +609,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
put_page(page);
page = NULL;
ext4_orphan_add(handle, inode);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
sem_held = 0;
ext4_journal_stop(handle);
handle = NULL;
@@ -636,7 +635,7 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
put_page(page);
}
if (sem_held)
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
if (handle)
ext4_journal_stop(handle);
brelse(iloc.bh);
@@ -729,7 +728,7 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
unsigned copied, struct page *page)
{
- int ret;
+ int ret, no_expand;
void *kaddr;
struct ext4_iloc iloc;

@@ -747,7 +746,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
goto out;
}

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);
BUG_ON(!ext4_has_inline_data(inode));

kaddr = kmap_atomic(page);
@@ -757,7 +756,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
/* clear page dirty so that writepages wouldn't work for us. */
ClearPageDirty(page);

- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
brelse(iloc.bh);
out:
return copied;
@@ -768,7 +767,7 @@ ext4_journalled_write_inline_data(struct inode *inode,
unsigned len,
struct page *page)
{
- int ret;
+ int ret, no_expand;
void *kaddr;
struct ext4_iloc iloc;

@@ -778,11 +777,11 @@ ext4_journalled_write_inline_data(struct inode *inode,
return NULL;
}

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);
kaddr = kmap_atomic(page);
ext4_write_inline_data(inode, &iloc, kaddr, 0, len);
kunmap_atomic(kaddr);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);

return iloc.bh;
}
@@ -1259,7 +1258,7 @@ static int ext4_convert_inline_data_nolock(handle_t *handle,
int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname,
struct inode *dir, struct inode *inode)
{
- int ret, inline_size;
+ int ret, inline_size, no_expand;
void *inline_start;
struct ext4_iloc iloc;

@@ -1267,7 +1266,7 @@ int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname,
if (ret)
return ret;

- down_write(&EXT4_I(dir)->xattr_sem);
+ ext4_write_lock_xattr(dir, &no_expand);
if (!ext4_has_inline_data(dir))
goto out;

@@ -1313,7 +1312,7 @@ int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname,

out:
ext4_mark_inode_dirty(handle, dir);
- up_write(&EXT4_I(dir)->xattr_sem);
+ ext4_write_unlock_xattr(dir, &no_expand);
brelse(iloc.bh);
return ret;
}
@@ -1673,7 +1672,7 @@ int ext4_delete_inline_entry(handle_t *handle,
struct buffer_head *bh,
int *has_inline_data)
{
- int err, inline_size;
+ int err, inline_size, no_expand;
struct ext4_iloc iloc;
void *inline_start;

@@ -1681,7 +1680,7 @@ int ext4_delete_inline_entry(handle_t *handle,
if (err)
return err;

- down_write(&EXT4_I(dir)->xattr_sem);
+ ext4_write_lock_xattr(dir, &no_expand);
if (!ext4_has_inline_data(dir)) {
*has_inline_data = 0;
goto out;
@@ -1715,7 +1714,7 @@ int ext4_delete_inline_entry(handle_t *handle,

ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
out:
- up_write(&EXT4_I(dir)->xattr_sem);
+ ext4_write_unlock_xattr(dir, &no_expand);
brelse(iloc.bh);
if (err != -ENOENT)
ext4_std_error(dir->i_sb, err);
@@ -1814,11 +1813,11 @@ bool empty_inline_dir(struct inode *dir, int *has_inline_data)

int ext4_destroy_inline_data(handle_t *handle, struct inode *inode)
{
- int ret;
+ int ret, no_expand;

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);
ret = ext4_destroy_inline_data_nolock(handle, inode);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);

return ret;
}
@@ -1903,7 +1902,7 @@ int ext4_try_to_evict_inline_data(handle_t *handle,
void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
{
handle_t *handle;
- int inline_size, value_len, needed_blocks;
+ int inline_size, value_len, needed_blocks, no_expand;
size_t i_size;
void *value = NULL;
struct ext4_xattr_ibody_find is = {
@@ -1920,7 +1919,7 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
if (IS_ERR(handle))
return;

- down_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_lock_xattr(inode, &no_expand);
if (!ext4_has_inline_data(inode)) {
*has_inline = 0;
ext4_journal_stop(handle);
@@ -1978,7 +1977,7 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
up_write(&EXT4_I(inode)->i_data_sem);
out:
brelse(is.iloc.bh);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
kfree(value);
if (inode->i_nlink)
ext4_orphan_del(handle, inode);
@@ -1994,7 +1993,7 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)

int ext4_convert_inline_data(struct inode *inode)
{
- int error, needed_blocks;
+ int error, needed_blocks, no_expand;
handle_t *handle;
struct ext4_iloc iloc;

@@ -2016,15 +2015,10 @@ int ext4_convert_inline_data(struct inode *inode)
goto out_free;
}

- down_write(&EXT4_I(inode)->xattr_sem);
- if (!ext4_has_inline_data(inode)) {
- up_write(&EXT4_I(inode)->xattr_sem);
- goto out;
- }
-
- error = ext4_convert_inline_data_nolock(handle, inode, &iloc);
- up_write(&EXT4_I(inode)->xattr_sem);
-out:
+ ext4_write_lock_xattr(inode, &no_expand);
+ if (ext4_has_inline_data(inode))
+ error = ext4_convert_inline_data_nolock(handle, inode, &iloc);
+ ext4_write_unlock_xattr(inode, &no_expand);
ext4_journal_stop(handle);
out_free:
brelse(iloc.bh);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 5a94fa52b74f..c40bd55b6400 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1188,16 +1188,14 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
struct ext4_xattr_block_find bs = {
.s = { .not_found = -ENODATA, },
};
- unsigned long no_expand;
+ int no_expand;
int error;

if (!name)
return -EINVAL;
if (strlen(name) > 255)
return -ERANGE;
- down_write(&EXT4_I(inode)->xattr_sem);
- no_expand = ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND);
- ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ ext4_write_lock_xattr(inode, &no_expand);

error = ext4_reserve_inode_write(handle, inode, &is.iloc);
if (error)
@@ -1264,7 +1262,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
ext4_xattr_update_super_block(handle, inode->i_sb);
inode->i_ctime = current_time(inode);
if (!value)
- ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ no_expand = 0;
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
/*
* The bh is consumed by ext4_mark_iloc_dirty, even with
@@ -1278,9 +1276,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
cleanup:
brelse(is.iloc.bh);
brelse(bs.bh);
- if (no_expand == 0)
- ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
return error;
}

@@ -1497,12 +1493,11 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
int error = 0, tried_min_extra_isize = 0;
int s_min_extra_isize = le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_min_extra_isize);
int isize_diff; /* How much do we need to grow i_extra_isize */
+ int no_expand;
+
+ if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
+ return 0;

- down_write(&EXT4_I(inode)->xattr_sem);
- /*
- * Set EXT4_STATE_NO_EXPAND to avoid recursion when marking inode dirty
- */
- ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
retry:
isize_diff = new_extra_isize - EXT4_I(inode)->i_extra_isize;
if (EXT4_I(inode)->i_extra_isize >= new_extra_isize)
@@ -1584,17 +1579,16 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
EXT4_I(inode)->i_extra_isize = new_extra_isize;
brelse(bh);
out:
- ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
- up_write(&EXT4_I(inode)->xattr_sem);
+ ext4_write_unlock_xattr(inode, &no_expand);
return 0;

cleanup:
brelse(bh);
/*
- * We deliberately leave EXT4_STATE_NO_EXPAND set here since inode
- * size expansion failed.
+ * Inode size expansion failed; don't try again
*/
- up_write(&EXT4_I(inode)->xattr_sem);
+ no_expand = 1;
+ ext4_write_unlock_xattr(inode, &no_expand);
return error;
}

diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index a92e783fa057..099c8b670ef5 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -102,6 +102,38 @@ extern const struct xattr_handler ext4_xattr_security_handler;

#define EXT4_XATTR_NAME_ENCRYPTION_CONTEXT "c"

+/*
+ * The EXT4_STATE_NO_EXPAND is overloaded and used for two purposes.
+ * The first is to signal that there the inline xattrs and data are
+ * taking up so much space that we might as well not keep trying to
+ * expand it. The second is that xattr_sem is taken for writing, so
+ * we shouldn't try to recurse into the inode expansion. For this
+ * second case, we need to make sure that we take save and restore the
+ * NO_EXPAND state flag appropriately.
+ */
+static inline void ext4_write_lock_xattr(struct inode *inode, int *save)
+{
+ down_write(&EXT4_I(inode)->xattr_sem);
+ *save = ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
+}
+
+static inline int ext4_write_trylock_xattr(struct inode *inode, int *save)
+{
+ if (down_write_trylock(&EXT4_I(inode)->xattr_sem) == 0)
+ return 0;
+ *save = ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ return 1;
+}
+
+static inline void ext4_write_unlock_xattr(struct inode *inode, int *save)
+{
+ if (*save == 0)
+ ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
+ up_write(&EXT4_I(inode)->xattr_sem);
+}
+
extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);

extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
--
2.11.0.rc0.7.gbe5a750


2017-01-12 03:49:49

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/4] ext4: add debug_want_extra_isize mount option

In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/super.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9d15a6293124..829e4a7b59e4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1284,7 +1284,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version, Opt_dax,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
- Opt_lazytime, Opt_nolazytime,
+ Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1352,6 +1352,7 @@ static const match_table_t tokens = {
{Opt_delalloc, "delalloc"},
{Opt_lazytime, "lazytime"},
{Opt_nolazytime, "nolazytime"},
+ {Opt_debug_want_extra_isize, "debug_want_extra_isize=%u"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1557,6 +1558,7 @@ static const struct mount_opts {
#endif
{Opt_nouid32, EXT4_MOUNT_NO_UID32, MOPT_SET},
{Opt_debug, EXT4_MOUNT_DEBUG, MOPT_SET},
+ {Opt_debug_want_extra_isize, 0, MOPT_GTE0},
{Opt_quota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA, MOPT_SET | MOPT_Q},
{Opt_usrquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA,
MOPT_SET | MOPT_Q},
@@ -1670,6 +1672,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
if (arg == 0)
arg = JBD2_DEFAULT_MAX_COMMIT_AGE;
sbi->s_commit_interval = HZ * arg;
+ } else if (token == Opt_debug_want_extra_isize) {
+ sbi->s_want_extra_isize = arg;
} else if (token == Opt_max_batch_time) {
sbi->s_max_batch_time = arg;
} else if (token == Opt_min_batch_time) {
@@ -4081,7 +4085,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_flags |= MS_RDONLY;

/* determine the minimum size of new large inodes, if present */
- if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) {
+ if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
+ sbi->s_want_extra_isize == 0) {
sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
EXT4_GOOD_OLD_INODE_SIZE;
if (ext4_has_feature_extra_isize(sb)) {
--
2.11.0.rc0.7.gbe5a750


2017-01-12 03:49:57

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 3/4] ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores

There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.

Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/inline.c | 9 +++------
fs/ext4/inode.c | 2 +-
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 99a5312ced52..31f98dd04e51 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1042,7 +1042,6 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
dir->i_version++;
- ext4_mark_inode_dirty(handle, dir);
return 1;
}

@@ -1311,8 +1310,8 @@ int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname,
ret = ext4_convert_inline_data_nolock(handle, dir, &iloc);

out:
- ext4_mark_inode_dirty(handle, dir);
ext4_write_unlock_xattr(dir, &no_expand);
+ ext4_mark_inode_dirty(handle, dir);
brelse(iloc.bh);
return ret;
}
@@ -1708,13 +1707,11 @@ int ext4_delete_inline_entry(handle_t *handle,
if (err)
goto out;

- err = ext4_mark_inode_dirty(handle, dir);
- if (unlikely(err))
- goto out;
-
ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
out:
ext4_write_unlock_xattr(dir, &no_expand);
+ if (likely(err == 0))
+ err = ext4_mark_inode_dirty(handle, dir);
brelse(iloc.bh);
if (err != -ENOENT)
ext4_std_error(dir->i_sb, err);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..86dde0667ccc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2464,8 +2464,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
disksize = i_size;
if (disksize > EXT4_I(inode)->i_disksize)
EXT4_I(inode)->i_disksize = disksize;
- err2 = ext4_mark_inode_dirty(handle, inode);
up_write(&EXT4_I(inode)->i_data_sem);
+ err2 = ext4_mark_inode_dirty(handle, inode);
if (err2)
ext4_error(inode->i_sb,
"Failed to mark inode %lu dirty",
--
2.11.0.rc0.7.gbe5a750


2017-01-12 03:54:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -v2] overlay: stress test changes to top and bottom layers simultaneously

On Wed, Jan 11, 2017 at 10:49:35PM -0500, Theodore Ts'o wrote:
> Introduce a test which runs fsstress on the top and bottom overlayfs
> directories simultaneously to find potential races that plagued wrapfs
> derived file systems.
>
> Signed-off-by: Theodore Ts'o <[email protected]>

sigh, please ignore this patch. I had some trash left over in the
directory I use to send out pathces.

- Ted

2017-01-12 09:19:38

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext4 inline cleanup patches

It *seems* to work, although it doesn't apply to v4.9.x (because
ext4_truncate returns void in that version), and I ran into two
unrelared bugs in v4.10-rc3 before I was able to test.

I'll probably go back to v4.9.x and leave off patch 4/4.

[ 32.928704] BUG: unable to handle kernel paging request at ffffc90003ed5000
[ 32.928820] IP: memcpy_erms+0x6/0x10
[ 32.928926] PGD 42d41b067
[ 32.928926] PUD 42d41c067
[ 32.929264] PMD 428b6d067
[ 32.929634] PTE 0

[ 32.930000] Oops: 0000 [#1] SMP
[ 32.930000] Modules linked in: twofish_generic twofish_avx_x86_64 ablk_helper twofish_x86_64_3way twofish_x86_64 twofish_common cmac xcbc x86_pkg_temp_thermal crc32_pclmul crc32c_intel via_velocity
[ 32.930000] CPU: 2 PID: 2442 Comm: microcode_ctl Not tainted 4.10.0-rc3-00129-g9ce34823-dirty #590
[ 32.930000] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F7 03/19/2014
[ 32.930000] task: ffff8804226d4080 task.stack: ffffc90003ece000
[ 32.930000] RIP: 0010:memcpy_erms+0x6/0x10
[ 32.930000] RSP: 0018:ffffc90003ecfcc8 EFLAGS: 00010286
[ 32.930000] RAX: ffff880422700000 RBX: 0000000000005c00 RCX: 0000000000001c00
[ 32.930000] RDX: 0000000000005c00 RSI: ffffc90003ed5000 RDI: ffff880422704000
[ 32.930000] RBP: ffffc90003ecfce0 R08: 000000000001a358 R09: 0000000000000000
[ 32.930000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003ed1000
[ 32.930000] R13: ffffc90003ed1000 R14: 0000000000005c00 R15: ffffffff81c2ab20
[ 32.930000] FS: 0000000000000000(0000) GS:ffff88042f280000(0063) knlGS:00000000f77467c0
[ 32.930000] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 32.930000] CR2: ffffc90003ed5000 CR3: 0000000428f9f000 CR4: 00000000001406e0
[ 32.930000] Call Trace:
[ 32.930000] ? kmemdup+0x31/0x40
[ 32.930000] __alloc_microcode_buf+0x3a/0x60
[ 32.930000] save_microcode_patch+0xaa/0x100
[ 32.930000] generic_load_microcode+0x1b7/0x240
[ 32.930000] ? get_ucode_fw+0x10/0x10
[ 32.930000] request_microcode_user+0x10/0x20
[ 32.930000] microcode_write+0x8f/0xf0
[ 32.930000] __vfs_write+0x23/0x120
[ 32.930000] ? debug_check_no_locks_freed+0xb1/0x140
[ 32.930000] ? trace_hardirqs_on_caller+0xed/0x1b0
[ 32.930000] ? trace_hardirqs_on+0xd/0x10
[ 32.930000] vfs_write+0xc4/0x220
[ 32.930000] ? putname+0x4e/0x60
[ 32.930000] SyS_write+0x44/0xa0
[ 32.930000] do_fast_syscall_32+0x94/0x210
[ 32.930000] entry_SYSENTER_compat+0x51/0x60
[ 32.930000] RIP: 0023:0xf7749af9
[ 32.930000] RSP: 002b:00000000ffaa6268 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 32.930000] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000f7560008
[ 32.930000] RDX: 0000000000096c00 RSI: 0000000000040000 RDI: 0000000000096c00
[ 32.930000] RBP: 00000000f7560008 R08: 0000000000000000 R09: 0000000000000000
[ 32.930000] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 32.930000] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 32.930000] Code: 90 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
[ 32.930000] RIP: memcpy_erms+0x6/0x10 RSP: ffffc90003ecfcc8
[ 32.930000] CR2: ffffc90003ed5000
[ 32.930000] ---[ end trace aa4a28ed3683fb12 ]---
[ 32.930000] BUG: sleeping function called from invalid context at ./include/linux/sched.h:3153
[ 32.930000] in_atomic(): 0, irqs_disabled(): 1, pid: 2442, name: microcode_ctl
[ 32.930000] INFO: lockdep is turned off.
[ 32.930000] irq event stamp: 7732
[ 32.930000] hardirqs last enabled at (7731): [<ffffffff8112ffcb>] get_page_from_freelist+0x37b/0x910
[ 32.930000] hardirqs last disabled at (7732): [<ffffffff817b8b5f>] error_entry+0x6f/0xc0
[ 32.930000] softirqs last enabled at (7230): [<ffffffff810830b7>] __do_softirq+0x1f7/0x260
[ 32.930000] softirqs last disabled at (7215): [<ffffffff810832ae>] irq_exit+0xbe/0xd0
[ 32.930000] CPU: 2 PID: 2442 Comm: microcode_ctl Tainted: G D 4.10.0-rc3-00129-g9ce34823-dirty #590
[ 32.930000] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F7 03/19/2014
[ 32.930000] Call Trace:
[ 32.930000] dump_stack+0x68/0x93
[ 32.930000] ___might_sleep+0x16b/0x230
[ 32.930000] __might_sleep+0x45/0x80
[ 32.930000] exit_signals+0x1f/0x110
[ 32.930000] do_exit+0x9f/0xaa0
[ 32.930000] rewind_stack_do_exit+0x17/0x20
[ 32.930000] RIP: 0023:0xf7749af9
[ 32.930000] RSP: 002b:00000000ffaa6268 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 32.930000] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000f7560008
[ 32.930000] RDX: 0000000000096c00 RSI: 0000000000040000 RDI: 0000000000096c00
[ 32.930000] RBP: 00000000f7560008 R08: 0000000000000000 R09: 0000000000000000
[ 32.930000] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 32.930000] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

2017-01-12 14:47:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext4 inline cleanup patches

On Thu, Jan 12, 2017 at 04:12:54AM -0500, George Spelvin wrote:
> It *seems* to work, although it doesn't apply to v4.9.x (because
> ext4_truncate returns void in that version), and I ran into two
> unrelared bugs in v4.10-rc3 before I was able to test.
>
> I'll probably go back to v4.9.x and leave off patch 4/4.

Yes, 4/4 isn't needed to fix your issue. It's just that once i opened
up inline.c, I saw some other things I wanted to fix.

- Ted

2017-01-12 21:20:46

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/4] ext4: add debug_want_extra_isize mount option

On Jan 11, 2017, at 8:49 PM, Theodore Ts'o <[email protected]> wrote:
>
> In order to test the inode extra isize expansion code, it is useful to
> be able to easily create file systems that have inodes with extra
> isize values smaller than the current desired value.

Rather than adding all of the debug hooks as mount options, wouldn't it
make more sense to add tunables via debugfs or as an attr_extra_isize or
similar?

Cheers, Andreas

> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> fs/ext4/super.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9d15a6293124..829e4a7b59e4 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1284,7 +1284,7 @@ enum {
> Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version, Opt_dax,
> Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
> - Opt_lazytime, Opt_nolazytime,
> + Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
> Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
> Opt_inode_readahead_blks, Opt_journal_ioprio,
> Opt_dioread_nolock, Opt_dioread_lock,
> @@ -1352,6 +1352,7 @@ static const match_table_t tokens = {
> {Opt_delalloc, "delalloc"},
> {Opt_lazytime, "lazytime"},
> {Opt_nolazytime, "nolazytime"},
> + {Opt_debug_want_extra_isize, "debug_want_extra_isize=%u"},
> {Opt_nodelalloc, "nodelalloc"},
> {Opt_removed, "mblk_io_submit"},
> {Opt_removed, "nomblk_io_submit"},
> @@ -1557,6 +1558,7 @@ static const struct mount_opts {
> #endif
> {Opt_nouid32, EXT4_MOUNT_NO_UID32, MOPT_SET},
> {Opt_debug, EXT4_MOUNT_DEBUG, MOPT_SET},
> + {Opt_debug_want_extra_isize, 0, MOPT_GTE0},
> {Opt_quota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA, MOPT_SET | MOPT_Q},
> {Opt_usrquota, EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA,
> MOPT_SET | MOPT_Q},
> @@ -1670,6 +1672,8 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> if (arg == 0)
> arg = JBD2_DEFAULT_MAX_COMMIT_AGE;
> sbi->s_commit_interval = HZ * arg;
> + } else if (token == Opt_debug_want_extra_isize) {
> + sbi->s_want_extra_isize = arg;
> } else if (token == Opt_max_batch_time) {
> sbi->s_max_batch_time = arg;
> } else if (token == Opt_min_batch_time) {
> @@ -4081,7 +4085,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_flags |= MS_RDONLY;
>
> /* determine the minimum size of new large inodes, if present */
> - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) {
> + if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
> + sbi->s_want_extra_isize == 0) {
> sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> EXT4_GOOD_OLD_INODE_SIZE;
> if (ext4_has_feature_extra_isize(sb)) {
> --
> 2.11.0.rc0.7.gbe5a750

Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

2017-01-13 02:58:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] ext4: add debug_want_extra_isize mount option

On Thu, Jan 12, 2017 at 12:19:37PM -0700, Andreas Dilger wrote:
> On Jan 11, 2017, at 8:49 PM, Theodore Ts'o <[email protected]> wrote:
> >
> > In order to test the inode extra isize expansion code, it is useful to
> > be able to easily create file systems that have inodes with extra
> > isize values smaller than the current desired value.
>
> Rather than adding all of the debug hooks as mount options, wouldn't it
> make more sense to add tunables via debugfs or as an attr_extra_isize or
> similar?

Debugfs would be lot more complicated to implement, and especially if
we had to set up a separate directory hierarchy for each file system.
And both debugfs and setting it up in sysfs would mean more runtime
memory in use for each file system this is mounted. The mount option
is actually one of the simpler, lighter weight ways of adding the
debug hook.

What's the objection (or objections) you have to using mount options?

- Ted

2017-01-18 08:48:51

by George Spelvin

[permalink] [raw]
Subject: kernel BUG at fs/ext4/inline.c:1943!

I was trying to rmdir an empty directory in lost+found that accumulated
during my recent problems.

EXT4-fs (md3): mounted filesystem with writeback data mode. Opts: data=writeback,delalloc

$ cd /mountpoint/lost+found
$ rmdir \#1625089/

------------[ cut here ]------------
kernel BUG at fs/ext4/inline.c:1943!
invalid opcode: 0000 [#1] SMP
Modules linked in: nfsd lockd grace sunrpc ablk_helper via_velocity x86_pkg_temp_thermal crc32_pclmul crc32c_intel [last unloaded: twofish_common]
CPU: 0 PID: 16879 Comm: rmdir Not tainted 4.10.0-rc3-00322-g1aa1df43-dirty #591
Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F7 03/19/2014
task: ffff88040e4cab00 task.stack: ffffc9000dd5c000
RIP: 0010:ext4_inline_data_truncate+0x3db/0x400
RSP: 0018:ffffc9000dd5dcd8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88042d287ea0 RCX: 0000021810000000
RDX: 00000000ffffffc3 RSI: ffffc9000dd5dcf8 RDI: ffff8801a41771f0
RBP: ffffc9000dd5dd80 R08: ffff88006123e9c0 R09: ffff8803fe3d60a0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801a41771f0
R13: ffff8801a4177058 R14: ffff8801a41770f0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88042f200000(0063) knlGS:00000000f7787800
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f765fbe0 CR3: 0000000179a5e000 CR4: 00000000001406f0
Call Trace:
ext4_truncate+0x1ea/0x300
ext4_evict_inode+0x2c7/0x3d0
evict+0xc0/0x190
iput+0x14c/0x1e0
dentry_unlink_inode+0xbe/0x160
d_delete+0x9c/0xb0
vfs_rmdir+0x102/0x130
do_rmdir+0x1a3/0x1e0
SyS_rmdir+0x11/0x20
do_fast_syscall_32+0x94/0x210
entry_SYSENTER_compat+0x51/0x60
RIP: 0023:0xf778aaf9
RSP: 002b:00000000ffda9e0c EFLAGS: 00000292 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00000000ffdab332 RCX: 0000000000000000
RDX: 00000000565f4000 RSI: 00000000565efb38 RDI: 00000000ffdab332
RBP: 00000000ffda9e68 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Code: ff ff 89 c1 48 89 d7 48 c7 44 0a f8 00 00 00 00 8d 48 ff 31 c0 c1 e9 03 f3 48 ab e9 31 fe ff ff 41 bf f4 ff ff ff e9 3c fe ff ff <0f> 0b 89 c0 c7 02 00 00 00 00 c7 44 02 fc 00 00 00 00 e9 0f fe
RIP: ext4_inline_data_truncate+0x3db/0x400 RSP: ffffc9000dd5dcd8
---[ end trace caa17f858299cde5 ]---


That's the "BUG_ON(is.s.not_found);"

debugfs says the directory entry is gone, but I can:

debugfs: stat <1625089>
Inode: 1625089 Type: directory Mode: 0775 Flags: 0x10000000
Generation: 927350643 Version: 0x00000000:00000004
User: 1000 Group: 161 Project: 0 Size: 132
File ACL: 1664090185 Directory ACL: 0
Links: 0 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x587f2034:472c74ec -- Wed Jan 18 02:58:44 2017
atime: 0x56b9e2f8:b68a7658 -- Tue Feb 9 08:00:40 2016
mtime: 0x56c1bc4b:a7765de8 -- Mon Feb 15 06:53:47 2016
crtime: 0x56ba9eb4:a51d90ac -- Tue Feb 9 21:21:40 2016
Size of extra inode fields: 32
Extended attributes:
system.data (72)
Inode checksum: 0xe2f12fc5
Size of inline data: 132
debugfs: dump <1625089> /tmp/inode
$ xxd /tmp/inode
00000000: 0b00 0000 0000 0000 3800 0000 0000 0000 ........8.......
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 4800 3101 xxxx xxxx xxxx xxxx xxxx xxxx H.1.xxxxxxxxxxxx
00000050: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxx
00000060: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxx
00000070: xx2e 7064 6600 0000 0000 0000 0000 0000 x.pdf...........
00000080: 0000 0000 ....
debugfs: ea_get -f /tmp/ea <1625089> system.data
$ xxd /tmp/ea
00000000: 0000 0000 4800 3101 xxxx xxxx xxxx xxxx ....H.1.xxxxxxxx
00000010: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxx
00000020: xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxx
00000030: xxxx xxxx xx2e 7064 6600 0000 0000 0000 xxxxx.pdf.......
00000040: 0000 0000 0000 0000 ........

2017-01-19 17:51:02

by George Spelvin

[permalink] [raw]
Subject: ext4_iget:4740: inode #%ld: block 48: comm find: invalid block

This is part of the fallout of previous errors, so I don't know how
it happened, but repeated e2fsck runs (git master, e2fsck 1.43.3.1
(22-Dec-2016)) aren't fixing it, which is at least an error in e2fsck.

I think it's something to do with inline_data. The error is
} else if (!ext4_has_inline_data(inode)) {
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
if ((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
(S_ISLNK(inode->i_mode) &&
!ext4_inode_is_fast_symlink(inode))))
/* Validate extent which is part of inode */
ret = ext4_ext_check_inode(inode);
} else if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
(S_ISLNK(inode->i_mode) &&
!ext4_inode_is_fast_symlink(inode))) {
/* Validate block references which are part of inode */
4740--> ret = ext4_ind_check_inode(inode);
}

but I think the !ext4_has_inline_data() check is wrong. I.e. the inode
*does* have inline data but the test is not detecting it, leading
to -EFSCORRUPTED.

debugfs: stat <1171779>
Inode: 1171779 Type: directory Mode: 0775 Flags: 0x10000000
Generation: 2016668288 Version: 0x00000000:00000007
User: 1000 Group: 161 Project: 0 Size: 60
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x587eff35:6e19953c -- Wed Jan 18 00:37:57 2017
atime: 0x56d5943f:bb6e1344 -- Tue Mar 1 08:08:15 2016
mtime: 0x587eff35:6e19953c -- Wed Jan 18 00:37:57 2017
crtime: 0x56d388b6:533e9e7c -- Sun Feb 28 18:54:30 2016
Size of extra inode fields: 32
Inode checksum: 0x82a01dca
Size of inline data: 60

debugfs: ls <1171779>
1171779 (12) . 1171778 (12) .. 1171954 (12) 0 1171955 (12) 1
1171956 (12) 2 1171957 (20) 3

debugfs: ex <1171779>
<1171779>: does not uses extent block maps
debugfs: blocks <1171779>

(i.e. blank line)


It has no blocks allocated, but contains data... must be inline, no?
The EXT4_INODE_INLINE_DATA bit (bit 28) is set, so i_inline_off must
be zero. How could that arise?

Other info:

debugfs: ea_list <1171779>
debugfs: htree <1171779>
htree: Not a hash-indexed directory
debugfs: id <1171779>
0000 fd45 e803 3c00 0000 3f94 d556 35ff 7e58 .E..<...?..V5.~X
0020 35ff 7e58 0000 0000 a100 0200 0000 0000 5.~X............
0040 0000 0010 0700 0000 42e1 1100 f2e1 1100 ........B.......
0060 0c00 0101 3000 0000 f3e1 1100 0c00 0101 ....0...........
0100 3100 0000 f4e1 1100 0c00 0101 3200 0000 1...........2...
0120 f5e1 1100 1400 0101 3300 0000 0000 0000 ........3.......
0140 0000 0000 80ea 3378 0000 0000 0000 0000 ......3x........
0160 0000 0000 0000 0000 0000 0000 ca1d 0000 ................
0200 2000 a082 3c95 196e 3c95 196e 4413 6ebb ...<..n<..nD.n.
0220 b688 d356 7c9e 3e53 0000 0000 0000 0000 ...V|.>S........
0240 0000 0000 0000 0000 0000 0000 0000 0000 ................
*

debugfs: dump <1171779> /tmp/empty
$ xxd /tmp/empty
00000000: 42e1 1100 f2e1 1100 0c00 0101 3000 0000 B...........0...
00000010: f3e1 1100 0c00 0101 3100 0000 f4e1 1100 ........1.......
00000020: 0c00 0101 3200 0000 f5e1 1100 1400 0101 ....2...........
00000030: 3300 0000 0000 0000 0000 0000 3...........

That data is clearly visible in the i_block area of the inode
(starting at offset 0050).

2017-01-19 22:58:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inline.c:1943!

On Wed, Jan 18, 2017 at 03:21:28AM -0500, George Spelvin wrote:
> I was trying to rmdir an empty directory in lost+found that accumulated
> during my recent problems.
>
> EXT4-fs (md3): mounted filesystem with writeback data mode. Opts: data=writeback,delalloc
>
> $ cd /mountpoint/lost+found
> $ rmdir \#1625089/
>
> ------------[ cut here ]------------
> kernel BUG at fs/ext4/inline.c:1943!

>
> debugfs: stat <1625089>
> Inode: 1625089 Type: directory Mode: 0775 Flags: 0x10000000
> Generation: 927350643 Version: 0x00000000:00000004
> User: 1000 Group: 161 Project: 0 Size: 132
> File ACL: 1664090185 Directory ACL: 0
> Links: 0 Blockcount: 8
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x587f2034:472c74ec -- Wed Jan 18 02:58:44 2017
> atime: 0x56b9e2f8:b68a7658 -- Tue Feb 9 08:00:40 2016
> mtime: 0x56c1bc4b:a7765de8 -- Mon Feb 15 06:53:47 2016
> crtime: 0x56ba9eb4:a51d90ac -- Tue Feb 9 21:21:40 2016
> Size of extra inode fields: 32
> Extended attributes:
> system.data (72)
> Inode checksum: 0xe2f12fc5
> Size of inline data: 132

OK, so the problem seems the inode in question has the INLINE_DATA
flag set, but i_blocks is non-zero. And it looks like the data was
actually stored in an exernal data block, and the in-line xattr wasn't
present.

e2fsck should be checking and complaining if this is the case. If
not, it's a bug in e2fsck.

And the kernel certainy shouldn't be BUG'ing when it comes across
what is clearly a case of file system corruption.

Cheers,

- Ted

2017-01-20 13:37:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/4] ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores

On Wed 11-01-17 22:49:37, Ted Tso wrote:
> There is no need to call ext4_mark_inode_dirty while holding xattr_sem
> or i_data_sem, so where it's easy to avoid it, move it out from the
> critical region.
>
> Signed-off-by: Theodore Ts'o <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inline.c | 9 +++------
> fs/ext4/inode.c | 2 +-
> 2 files changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 99a5312ced52..31f98dd04e51 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1042,7 +1042,6 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
> dir->i_mtime = dir->i_ctime = current_time(dir);
> ext4_update_dx_flag(dir);
> dir->i_version++;
> - ext4_mark_inode_dirty(handle, dir);
> return 1;
> }
>
> @@ -1311,8 +1310,8 @@ int ext4_try_add_inline_entry(handle_t *handle, struct ext4_filename *fname,
> ret = ext4_convert_inline_data_nolock(handle, dir, &iloc);
>
> out:
> - ext4_mark_inode_dirty(handle, dir);
> ext4_write_unlock_xattr(dir, &no_expand);
> + ext4_mark_inode_dirty(handle, dir);
> brelse(iloc.bh);
> return ret;
> }
> @@ -1708,13 +1707,11 @@ int ext4_delete_inline_entry(handle_t *handle,
> if (err)
> goto out;
>
> - err = ext4_mark_inode_dirty(handle, dir);
> - if (unlikely(err))
> - goto out;
> -
> ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
> out:
> ext4_write_unlock_xattr(dir, &no_expand);
> + if (likely(err == 0))
> + err = ext4_mark_inode_dirty(handle, dir);
> brelse(iloc.bh);
> if (err != -ENOENT)
> ext4_std_error(dir->i_sb, err);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 88d57af1b516..86dde0667ccc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2464,8 +2464,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> disksize = i_size;
> if (disksize > EXT4_I(inode)->i_disksize)
> EXT4_I(inode)->i_disksize = disksize;
> - err2 = ext4_mark_inode_dirty(handle, inode);
> up_write(&EXT4_I(inode)->i_data_sem);
> + err2 = ext4_mark_inode_dirty(handle, inode);
> if (err2)
> ext4_error(inode->i_sb,
> "Failed to mark inode %lu dirty",
> --
> 2.11.0.rc0.7.gbe5a750
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-01-20 13:53:20

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

On Wed 11-01-17 22:49:36, Ted Tso wrote:
> The xattr_sem deadlock problems fixed in commit 2e81a4eeedca: "ext4:
> avoid deadlock when expanding inode size" didn't include the use of
> xattr_sem in fs/ext4/inline.c. With the addition of project quota
> which added a new extra inode field, this exposed deadlocks in the
> inline_data code similar to the ones fixed by 2e81a4eeedca.
>
> The deadlock can be reproduced via:
>
> dmesg -n 7
> mke2fs -t ext4 -O inline_data -Fq -I 256 /dev/vdc 32768
> mount -t ext4 -o debug_want_extra_isize=24 /dev/vdc /vdc
> mkdir /vdc/a
> umount /vdc
> mount -t ext4 /dev/vdc /vdc
> echo foo > /vdc/a/foo
>
> and looks like this:
>
> [ 11.158815]
> [ 11.160276] =============================================
> [ 11.161960] [ INFO: possible recursive locking detected ]
> [ 11.161960] 4.10.0-rc3-00015-g011b30a8a3cf #160 Tainted: G W
> [ 11.161960] ---------------------------------------------
> [ 11.161960] bash/2519 is trying to acquire lock:
> [ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1225a4b>] ext4_expand_extra_isize_ea+0x3d/0x4cd
> [ 11.161960]
> [ 11.161960] but task is already holding lock:
> [ 11.161960] (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
> [ 11.161960]
> [ 11.161960] other info that might help us debug this:
> [ 11.161960] Possible unsafe locking scenario:
> [ 11.161960]
> [ 11.161960] CPU0
> [ 11.161960] ----
> [ 11.161960] lock(&ei->xattr_sem);
> [ 11.161960] lock(&ei->xattr_sem);
> [ 11.161960]
> [ 11.161960] *** DEADLOCK ***
> [ 11.161960]
> [ 11.161960] May be due to missing lock nesting notation
> [ 11.161960]
> [ 11.161960] 4 locks held by bash/2519:
> [ 11.161960] #0: (sb_writers#3){.+.+.+}, at: [<c11a2414>] mnt_want_write+0x1e/0x3e
> [ 11.161960] #1: (&type->i_mutex_dir_key){++++++}, at: [<c119508b>] path_openat+0x338/0x67a
> [ 11.161960] #2: (jbd2_handle){++++..}, at: [<c123314a>] start_this_handle+0x582/0x622
> [ 11.161960] #3: (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
> [ 11.161960]
> [ 11.161960] stack backtrace:
> [ 11.161960] CPU: 0 PID: 2519 Comm: bash Tainted: G W 4.10.0-rc3-00015-g011b30a8a3cf #160
> [ 11.161960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> [ 11.161960] Call Trace:
> [ 11.161960] dump_stack+0x72/0xa3
> [ 11.161960] __lock_acquire+0xb7c/0xcb9
> [ 11.161960] ? kvm_clock_read+0x1f/0x29
> [ 11.161960] ? __lock_is_held+0x36/0x66
> [ 11.161960] ? __lock_is_held+0x36/0x66
> [ 11.161960] lock_acquire+0x106/0x18a
> [ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
> [ 11.161960] down_write+0x39/0x72
> [ 11.161960] ? ext4_expand_extra_isize_ea+0x3d/0x4cd
> [ 11.161960] ext4_expand_extra_isize_ea+0x3d/0x4cd
> [ 11.161960] ? _raw_read_unlock+0x22/0x2c
> [ 11.161960] ? jbd2_journal_extend+0x1e2/0x262
> [ 11.161960] ? __ext4_journal_get_write_access+0x3d/0x60
> [ 11.161960] ext4_mark_inode_dirty+0x17d/0x26d
> [ 11.161960] ? ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
> [ 11.161960] ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
> [ 11.161960] ext4_try_add_inline_entry+0x69/0x152
> [ 11.161960] ext4_add_entry+0xa3/0x848
> [ 11.161960] ? __brelse+0x14/0x2f
> [ 11.161960] ? _raw_spin_unlock_irqrestore+0x44/0x4f
> [ 11.161960] ext4_add_nondir+0x17/0x5b
> [ 11.161960] ext4_create+0xcf/0x133
> [ 11.161960] ? ext4_mknod+0x12f/0x12f
> [ 11.161960] lookup_open+0x39e/0x3fb
> [ 11.161960] ? __wake_up+0x1a/0x40
> [ 11.161960] ? lock_acquire+0x11e/0x18a
> [ 11.161960] path_openat+0x35c/0x67a
> [ 11.161960] ? sched_clock_cpu+0xd7/0xf2
> [ 11.161960] do_filp_open+0x36/0x7c
> [ 11.161960] ? _raw_spin_unlock+0x22/0x2c
> [ 11.161960] ? __alloc_fd+0x169/0x173
> [ 11.161960] do_sys_open+0x59/0xcc
> [ 11.161960] SyS_open+0x1d/0x1f
> [ 11.161960] do_int80_syscall_32+0x4f/0x61
> [ 11.161960] entry_INT80_32+0x2f/0x2f
> [ 11.161960] EIP: 0xb76ad469
> [ 11.161960] EFLAGS: 00000286 CPU: 0
> [ 11.161960] EAX: ffffffda EBX: 08168ac8 ECX: 00008241 EDX: 000001b6
> [ 11.161960] ESI: b75e46bc EDI: b7755000 EBP: bfbdb108 ESP: bfbdafc0
> [ 11.161960] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
>
> Cc: [email protected] # 3.10 (requires 2e81a4eeedca as a prereq)
> Reported-by: George Spelvin <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>

Looks mostly good, just two comments below.

> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 5a94fa52b74f..c40bd55b6400 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1188,16 +1188,14 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
> struct ext4_xattr_block_find bs = {
> .s = { .not_found = -ENODATA, },
> };
> - unsigned long no_expand;
> + int no_expand;
> int error;
>
> if (!name)
> return -EINVAL;
> if (strlen(name) > 255)
> return -ERANGE;
> - down_write(&EXT4_I(inode)->xattr_sem);
> - no_expand = ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND);
> - ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
> + ext4_write_lock_xattr(inode, &no_expand);
>
> error = ext4_reserve_inode_write(handle, inode, &is.iloc);
> if (error)
> @@ -1264,7 +1262,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
> ext4_xattr_update_super_block(handle, inode->i_sb);
> inode->i_ctime = current_time(inode);
> if (!value)
> - ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
> + no_expand = 0;

OK. I suppose you can use ext4_mark_inode_dirty() instead of
ext4_mark_iloc_dirty() because the deadlock on xattr_sem is now taken care
of by your changes, right?

> error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
> /*
> * The bh is consumed by ext4_mark_iloc_dirty, even with
> @@ -1278,9 +1276,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
> cleanup:
> brelse(is.iloc.bh);
> brelse(bs.bh);
> - if (no_expand == 0)
> - ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
> - up_write(&EXT4_I(inode)->xattr_sem);
> + ext4_write_unlock_xattr(inode, &no_expand);
> return error;
> }
>
> @@ -1497,12 +1493,11 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> int error = 0, tried_min_extra_isize = 0;
> int s_min_extra_isize = le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_min_extra_isize);
> int isize_diff; /* How much do we need to grow i_extra_isize */
> + int no_expand;
> +
> + if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
> + return 0;

Why do you play tricks with trylock here? ext4_mark_inode_dirty() checks
EXT4_STATE_NO_EXPAND and thus we should not ever get here if we already
hold xattr_sem...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-01-20 13:53:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: propagate error values from ext4_inline_data_truncate()

On Wed 11-01-17 22:49:38, Ted Tso wrote:
> Signed-off-by: Theodore Ts'o <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
> ---
> fs/ext4/ext4.h | 2 +-
> fs/ext4/inline.c | 40 +++++++++++++++++++++++-----------------
> fs/ext4/inode.c | 4 +++-
> 3 files changed, 27 insertions(+), 19 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6bcb9622fdf9..1cd077e02517 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3023,7 +3023,7 @@ extern int ext4_inline_data_fiemap(struct inode *inode,
> extern int ext4_try_to_evict_inline_data(handle_t *handle,
> struct inode *inode,
> int needed);
> -extern void ext4_inline_data_truncate(struct inode *inode, int *has_inline);
> +extern int ext4_inline_data_truncate(struct inode *inode, int *has_inline);
>
> extern int ext4_convert_inline_data(struct inode *inode);
>
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 31f98dd04e51..accd665c5dc0 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1896,10 +1896,10 @@ int ext4_try_to_evict_inline_data(handle_t *handle,
> return error;
> }
>
> -void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
> +int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
> {
> handle_t *handle;
> - int inline_size, value_len, needed_blocks, no_expand;
> + int inline_size, value_len, needed_blocks, no_expand, err = 0;
> size_t i_size;
> void *value = NULL;
> struct ext4_xattr_ibody_find is = {
> @@ -1914,19 +1914,19 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
> needed_blocks = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, EXT4_HT_INODE, needed_blocks);
> if (IS_ERR(handle))
> - return;
> + return PTR_ERR(handle);
>
> ext4_write_lock_xattr(inode, &no_expand);
> if (!ext4_has_inline_data(inode)) {
> *has_inline = 0;
> ext4_journal_stop(handle);
> - return;
> + return 0;
> }
>
> - if (ext4_orphan_add(handle, inode))
> + if ((err = ext4_orphan_add(handle, inode)) != 0)
> goto out;
>
> - if (ext4_get_inode_loc(inode, &is.iloc))
> + if ((err = ext4_get_inode_loc(inode, &is.iloc)) != 0)
> goto out;
>
> down_write(&EXT4_I(inode)->i_data_sem);
> @@ -1937,24 +1937,29 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
> if (i_size < inline_size) {
> /* Clear the content in the xattr space. */
> if (inline_size > EXT4_MIN_INLINE_DATA_SIZE) {
> - if (ext4_xattr_ibody_find(inode, &i, &is))
> + if ((err = ext4_xattr_ibody_find(inode, &i, &is)) != 0)
> goto out_error;
>
> BUG_ON(is.s.not_found);
>
> value_len = le32_to_cpu(is.s.here->e_value_size);
> value = kmalloc(value_len, GFP_NOFS);
> - if (!value)
> + if (!value) {
> + err = -ENOMEM;
> goto out_error;
> + }
>
> - if (ext4_xattr_ibody_get(inode, i.name_index, i.name,
> - value, value_len))
> + err = ext4_xattr_ibody_get(inode, i.name_index,
> + i.name, value, value_len);
> + if (err)
> goto out_error;
>
> i.value = value;
> i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
> i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
> - if (ext4_xattr_ibody_inline_set(handle, inode, &i, &is))
> + err = ext4_xattr_ibody_inline_set(handle, inode,
> + &i, &is);
> + if (err)
> goto out_error;
> }
>
> @@ -1979,13 +1984,14 @@ void ext4_inline_data_truncate(struct inode *inode, int *has_inline)
> if (inode->i_nlink)
> ext4_orphan_del(handle, inode);
>
> - inode->i_mtime = inode->i_ctime = current_time(inode);
> - ext4_mark_inode_dirty(handle, inode);
> - if (IS_SYNC(inode))
> - ext4_handle_sync(handle);
> -
> + if (err == 0) {
> + inode->i_mtime = inode->i_ctime = current_time(inode);
> + err = ext4_mark_inode_dirty(handle, inode);
> + if (IS_SYNC(inode))
> + ext4_handle_sync(handle);
> + }
> ext4_journal_stop(handle);
> - return;
> + return err;
> }
>
> int ext4_convert_inline_data(struct inode *inode)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 86dde0667ccc..1e2c881f102d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4222,7 +4222,9 @@ int ext4_truncate(struct inode *inode)
> if (ext4_has_inline_data(inode)) {
> int has_inline = 1;
>
> - ext4_inline_data_truncate(inode, &has_inline);
> + err = ext4_inline_data_truncate(inode, &has_inline);
> + if (err)
> + return err;
> if (has_inline)
> return 0;
> }
> --
> 2.11.0.rc0.7.gbe5a750
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-01-20 19:13:30

by Damien Guibouret

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inline.c:1943!

Theodore Ts'o wrote:
> On Wed, Jan 18, 2017 at 03:21:28AM -0500, George Spelvin wrote:
>
>>I was trying to rmdir an empty directory in lost+found that accumulated
>>during my recent problems.
>>
>>EXT4-fs (md3): mounted filesystem with writeback data mode. Opts: data=writeback,delalloc
>>
>>$ cd /mountpoint/lost+found
>>$ rmdir \#1625089/
>>
>>------------[ cut here ]------------
>>kernel BUG at fs/ext4/inline.c:1943!
>
>
>>debugfs: stat <1625089>
>>Inode: 1625089 Type: directory Mode: 0775 Flags: 0x10000000
>>Generation: 927350643 Version: 0x00000000:00000004
>>User: 1000 Group: 161 Project: 0 Size: 132
>>File ACL: 1664090185 Directory ACL: 0
>>Links: 0 Blockcount: 8
>>Fragment: Address: 0 Number: 0 Size: 0
>> ctime: 0x587f2034:472c74ec -- Wed Jan 18 02:58:44 2017
>> atime: 0x56b9e2f8:b68a7658 -- Tue Feb 9 08:00:40 2016
>> mtime: 0x56c1bc4b:a7765de8 -- Mon Feb 15 06:53:47 2016
>>crtime: 0x56ba9eb4:a51d90ac -- Tue Feb 9 21:21:40 2016
>>Size of extra inode fields: 32
>>Extended attributes:
>> system.data (72)
>>Inode checksum: 0xe2f12fc5
>>Size of inline data: 132
>
>
> OK, so the problem seems the inode in question has the INLINE_DATA
> flag set, but i_blocks is non-zero. And it looks like the data was
> actually stored in an exernal data block, and the in-line xattr wasn't
> present.
>
> e2fsck should be checking and complaining if this is the case. If
> not, it's a bug in e2fsck.
>
> And the kernel certainy shouldn't be BUG'ing when it comes across
> what is clearly a case of file system corruption.
>
> Cheers,
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
Hello,

Perhaps I am wrong, but as the inode has a file ACL, the blockcount should be
different from 0?
So it seems valid on this point. Or is there something that prevent inlined file
to have ACL?

Regards,

Damien

2017-01-20 19:18:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inline.c:1943!

On Jan 20, 2017, at 12:14 PM, Damien Guibouret <[email protected]> wrote:
> Theodore Ts'o wrote:
>> On Wed, Jan 18, 2017 at 03:21:28AM -0500, George Spelvin wrote:
>>> I was trying to rmdir an empty directory in lost+found that accumulated
>>> during my recent problems.
>>>
>>> EXT4-fs (md3): mounted filesystem with writeback data mode. Opts: data=writeback,delalloc
>>>
>>> $ cd /mountpoint/lost+found
>>> $ rmdir \#1625089/
>>>
>>> ------------[ cut here ]------------
>>> kernel BUG at fs/ext4/inline.c:1943!
>>> debugfs: stat <1625089>
>>> Inode: 1625089 Type: directory Mode: 0775 Flags: 0x10000000
>>> Generation: 927350643 Version: 0x00000000:00000004
>>> User: 1000 Group: 161 Project: 0 Size: 132
>>> File ACL: 1664090185 Directory ACL: 0
>>> Links: 0 Blockcount: 8
>>> Fragment: Address: 0 Number: 0 Size: 0
>>> ctime: 0x587f2034:472c74ec -- Wed Jan 18 02:58:44 2017
>>> atime: 0x56b9e2f8:b68a7658 -- Tue Feb 9 08:00:40 2016
>>> mtime: 0x56c1bc4b:a7765de8 -- Mon Feb 15 06:53:47 2016
>>> crtime: 0x56ba9eb4:a51d90ac -- Tue Feb 9 21:21:40 2016
>>> Size of extra inode fields: 32
>>> Extended attributes:
>>> system.data (72)
>>> Inode checksum: 0xe2f12fc5
>>> Size of inline data: 132
>> OK, so the problem seems the inode in question has the INLINE_DATA
>> flag set, but i_blocks is non-zero. And it looks like the data was
>> actually stored in an exernal data block, and the in-line xattr wasn't
>> present.
>> e2fsck should be checking and complaining if this is the case. If
>> not, it's a bug in e2fsck.
>> And the kernel certainy shouldn't be BUG'ing when it comes across
>> what is clearly a case of file system corruption.
>> Cheers,
>> - Ted
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Hello,
>
> Perhaps I am wrong, but as the inode has a file ACL, the blockcount should be
> different from 0? So it seems valid on this point. Or is there something that
> prevent inlined file to have ACL?

The "File ACL" label is misleading. ACLs are stored as xattrs, (along with other
xattrs like SELinux labels), and also inline data. However, unless the ACL is
very large it would also be stored inside the inode itself. If there isn't space
inside the inode to store both the inline data and other xattrs, the inline data
should be bumped to an external block first (i.e. stored as a regular file),
since there is no benefit to doing the reverse.


Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

2017-01-20 22:23:48

by Damien Guibouret

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inline.c:1943!

Andreas Dilger wrote:
> On Jan 20, 2017, at 12:14 PM, Damien Guibouret <[email protected]> wrote:
>
>>Theodore Ts'o wrote:
>>
>>>On Wed, Jan 18, 2017 at 03:21:28AM -0500, George Spelvin wrote:
>>>
>>>>I was trying to rmdir an empty directory in lost+found that accumulated
>>>>during my recent problems.
>>>>
>>>>EXT4-fs (md3): mounted filesystem with writeback data mode. Opts: data=writeback,delalloc
>>>>
>>>>$ cd /mountpoint/lost+found
>>>>$ rmdir \#1625089/
>>>>
>>>>------------[ cut here ]------------
>>>>kernel BUG at fs/ext4/inline.c:1943!
>>>>debugfs: stat <1625089>
>>>>Inode: 1625089 Type: directory Mode: 0775 Flags: 0x10000000
>>>>Generation: 927350643 Version: 0x00000000:00000004
>>>>User: 1000 Group: 161 Project: 0 Size: 132
>>>>File ACL: 1664090185 Directory ACL: 0
>>>>Links: 0 Blockcount: 8
>>>>Fragment: Address: 0 Number: 0 Size: 0
>>>>ctime: 0x587f2034:472c74ec -- Wed Jan 18 02:58:44 2017
>>>>atime: 0x56b9e2f8:b68a7658 -- Tue Feb 9 08:00:40 2016
>>>>mtime: 0x56c1bc4b:a7765de8 -- Mon Feb 15 06:53:47 2016
>>>>crtime: 0x56ba9eb4:a51d90ac -- Tue Feb 9 21:21:40 2016
>>>>Size of extra inode fields: 32
>>>>Extended attributes:
>>>>system.data (72)
>>>>Inode checksum: 0xe2f12fc5
>>>>Size of inline data: 132
>>>
>>>OK, so the problem seems the inode in question has the INLINE_DATA
>>>flag set, but i_blocks is non-zero. And it looks like the data was
>>>actually stored in an exernal data block, and the in-line xattr wasn't
>>>present.
>>>e2fsck should be checking and complaining if this is the case. If
>>>not, it's a bug in e2fsck.
>>>And the kernel certainy shouldn't be BUG'ing when it comes across
>>>what is clearly a case of file system corruption.
>>>Cheers,
>>> - Ted
>>>--
>>>To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>the body of a message to [email protected]
>>>More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>Hello,
>>
>>Perhaps I am wrong, but as the inode has a file ACL, the blockcount should be
>>different from 0? So it seems valid on this point. Or is there something that
>>prevent inlined file to have ACL?
>
>
> The "File ACL" label is misleading. ACLs are stored as xattrs, (along with other
> xattrs like SELinux labels), and also inline data. However, unless the ACL is
> very large it would also be stored inside the inode itself. If there isn't space
> inside the inode to store both the inline data and other xattrs, the inline data
> should be bumped to an external block first (i.e. stored as a regular file),
> since there is no benefit to doing the reverse.
>
>
> Cheers, Andreas
>

I certainly do not have a complete knowledge on how it works, but the
ext4_xattr_set_handle function does not try to move inline data to a block
before allocating a new block for the extended attribute (or reusing a block if
there is one with similar content). And some of its call tree does not clear
inline data before.
A case could be some directory creation that inherits big posix ACL from its
parent (ACL goes to a block, potentially the same than the one of its parent)
and the directory is created with inline data.

In the ext4_getattr function there is also some comment that says it can happen
"If there is inline data in the inode, the inode will normally not have data
blocks allocated (it may have an external xattr block)."

Regards,

Damien

2017-01-22 22:25:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

On Fri, Jan 20, 2017 at 02:53:17PM +0100, Jan Kara wrote:
> > @@ -1264,7 +1262,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
> > ext4_xattr_update_super_block(handle, inode->i_sb);
> > inode->i_ctime = current_time(inode);
> > if (!value)
> > - ext4_clear_inode_state(inode, EXT4_STATE_NO_EXPAND);
> > + no_expand = 0;
>
> OK. I suppose you can use ext4_mark_inode_dirty() instead of
> ext4_mark_iloc_dirty() because the deadlock on xattr_sem is now taken care
> of by your changes, right?

Correct.

> > @@ -1497,12 +1493,11 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> > int error = 0, tried_min_extra_isize = 0;
> > int s_min_extra_isize = le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_min_extra_isize);
> > int isize_diff; /* How much do we need to grow i_extra_isize */
> > + int no_expand;
> > +
> > + if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
> > + return 0;
>
> Why do you play tricks with trylock here? ext4_mark_inode_dirty() checks
> EXT4_STATE_NO_EXPAND and thus we should not ever get here if we already
> hold xattr_sem...

The problem is still a lock inversion in the truncate code path. The
simplest way of dealing with it to simply avoiding doing the
expand_isize operation on truncates. In the case where this is
happening on the deletion of an inode, doing the expansion is
pointless anyway.

- Ted

inline run xfstest ext4/307
[ 248.556548] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity
[ 248.602530]
[ 248.604177] ======================================================
[ 248.610565] [ INFO: possible circular locking dependency detected ]
[ 248.617163] 4.10.0-rc3-ext4-00017-ge2a392e80d24-dirty #210 Not tainted
[ 248.623999] -------------------------------------------------------
[ 248.630404] fsstress/26316 is trying to acquire lock:
[ 248.635577] (&ei->i_data_sem){++++..}, at: [<ffffffff812a1b6f>] ext4_map_blocks+0xf7/0x4f5
[ 248.644066]
[ 248.644066] but task is already holding lock:
[ 248.650023] (&ei->xattr_sem){++++..}, at: [<ffffffff812e357c>] ext4_try_add_inline_entry+0x4c/0x1db
[ 248.659291]
[ 248.659291] which lock already depends on the new lock.
[ 248.659291]
[ 248.668022]
[ 248.668022] the existing dependency chain (in reverse order) is:
[ 248.675720]
[ 248.675720] -> #1 (&ei->xattr_sem){++++..}:
[ 248.681604]
[ 248.681614] [<ffffffff8111f550>] __lock_acquire+0x5e0/0x68a
[ 248.689524]
[ 248.689531] [<ffffffff81120364>] lock_acquire+0x122/0x1bd
[ 248.697263]
[ 248.697272] [<ffffffff818464d4>] down_write+0x39/0x99
[ 248.704680]
[ 248.704689] [<ffffffff812e1279>] ext4_expand_extra_isize_ea+0x50/0x42a
[ 248.713637]
[ 248.713645] [<ffffffff812a4d92>] ext4_mark_inode_dirty+0x170/0x27b
[ 248.722156]
[ 248.722166] [<ffffffff812cd6c9>] ext4_ext_truncate+0x27/0x8c
[ 248.730165]
[ 248.730173] [<ffffffff812a65f9>] ext4_truncate+0x26e/0x47d
[ 248.737989]
[ 248.737996] [<ffffffff812a91d8>] ext4_setattr+0x70b/0x85d
[ 248.745732]
[ 248.745739] [<ffffffff8121af50>] notify_change+0x236/0x37b
[ 248.753562]
[ 248.753569] [<ffffffff811fc72a>] do_truncate+0x6f/0x8e
[ 248.761066]
[ 248.761070] [<ffffffff8120c8e5>] do_last+0x58f/0x608
[ 248.768365]
[ 248.768369] [<ffffffff8120cbe3>] path_openat+0x285/0x30c
[ 248.777907]
[ 248.777911] [<ffffffff8120ccb7>] do_filp_open+0x4d/0xa3
[ 248.785541]
[ 248.785548] [<ffffffff811fd97e>] do_sys_open+0x13c/0x1cb
[ 248.793178]
[ 248.793183] [<ffffffff811fda2b>] SyS_open+0x1e/0x20
[ 248.800494]
[ 248.800502] [<ffffffff81848a01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[ 248.809230]
[ 248.809230] -> #0 (&ei->i_data_sem){++++..}:
[ 248.815091]
[ 248.815098] [<ffffffff8111ca13>] validate_chain.isra.20+0xaf7/0x1109
[ 248.823951]
[ 248.823956] [<ffffffff8111f550>] __lock_acquire+0x5e0/0x68a
[ 248.831880]
[ 248.831885] [<ffffffff81120364>] lock_acquire+0x122/0x1bd
[ 248.843958]
[ 248.843964] [<ffffffff81846444>] down_read+0x30/0x87
[ 248.851379]
[ 248.851390] [<ffffffff812a1b6f>] ext4_map_blocks+0xf7/0x4f5
[ 248.859409]
[ 248.859416] [<ffffffff812e21dc>] ext4_convert_inline_data_nolock+0xf0/0x33d
[ 248.868712]
[ 248.868720] [<ffffffff812e36b7>] ext4_try_add_inline_entry+0x187/0x1db
[ 248.877775]
[ 248.877780] [<ffffffff812af8bc>] ext4_add_entry+0xccc/0xd55
[ 248.886237]
[ 248.886240] [<ffffffff812af963>] ext4_add_nondir+0x1e/0x6b
[ 248.894163]
[ 248.894168] [<ffffffff812b1be1>] ext4_symlink+0x37b/0x41c
[ 248.902228]
[ 248.902233] [<ffffffff8120db52>] vfs_symlink+0xd9/0x134
[ 248.909877]
[ 248.909880] [<ffffffff8120dc2b>] SyS_symlinkat+0x7e/0xc3
[ 248.917592]
[ 248.917600] [<ffffffff8120dc86>] SyS_symlink+0x16/0x18
[ 248.925587]
[ 248.925599] [<ffffffff81848a01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[ 248.934482]
[ 248.934482] other info that might help us debug this:
[ 248.934482]
[ 248.942591] Possible unsafe locking scenario:
[ 248.942591]
[ 248.948629] CPU0 CPU1
[ 248.953271] ---- ----
[ 248.957906] lock(&ei->xattr_sem);
[ 248.961507] lock(&ei->i_data_sem);
[ 248.967719] lock(&ei->xattr_sem);
[ 248.973835] lock(&ei->i_data_sem);
[ 248.977578]
[ 248.977578] *** DEADLOCK ***
[ 248.977578]
[ 248.983621] 4 locks held by fsstress/26316:
[ 248.987911] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff8121e549>] mnt_want_write+0x21/0x45
[ 248.996377] #1: (&type->i_mutex_dir_key/1){+.+.+.}, at: [<ffffffff8120ac5a>] filename_create+0x72/0x11c
[ 249.006145] #2: (jbd2_handle){++++..}, at: [<ffffffff812f02ec>] start_this_handle+0x33d/0x3bf
[ 249.014958] #3: (&ei->xattr_sem){++++..}, at: [<ffffffff812e357c>] ext4_try_add_inline_entry+0x4c/0x1db
[ 249.024650]
[ 249.024650] stack backtrace:
[ 249.029143] CPU: 1 PID: 26316 Comm: fsstress Not tainted 4.10.0-rc3-ext4-00017-ge2a392e80d24-dirty #210
[ 249.038654] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 249.048007] Call Trace:
[ 249.050573] dump_stack+0x85/0xbe
[ 249.054016] print_circular_bug+0x290/0x29e
[ 249.058315] validate_chain.isra.20+0xaf7/0x1109
[ 249.063057] __lock_acquire+0x5e0/0x68a
[ 249.067002] ? __lock_acquire+0x5e0/0x68a
[ 249.071126] lock_acquire+0x122/0x1bd
[ 249.074899] ? ext4_map_blocks+0xf7/0x4f5
[ 249.079022] down_read+0x30/0x87
[ 249.082367] ? ext4_map_blocks+0xf7/0x4f5
[ 249.086483] ext4_map_blocks+0xf7/0x4f5
[ 249.090428] ext4_convert_inline_data_nolock+0xf0/0x33d
[ 249.095763] ext4_try_add_inline_entry+0x187/0x1db
[ 249.100666] ext4_add_entry+0xccc/0xd55
[ 249.104616] ? __ext4_handle_dirty_metadata+0xca/0x1af
[ 249.109868] ? ext4_mark_iloc_dirty+0x660/0x6a9
[ 249.114509] ext4_add_nondir+0x1e/0x6b
[ 249.118366] ext4_symlink+0x37b/0x41c
[ 249.122151] ? __inode_permission+0x81/0x88
[ 249.126457] vfs_symlink+0xd9/0x134
[ 249.130056] SyS_symlinkat+0x7e/0xc3
[ 249.133751] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 249.138488] SyS_symlink+0x16/0x18
[ 249.142002] entry_SYSCALL_64_fastpath+0x1f/0xc2
[ 249.146737] RIP: 0033:0x7f02be12f237
[ 249.150425] RSP: 002b:00007ffca0afdca8 EFLAGS: 00000206 ORIG_RAX: 0000000000000058
[ 249.158119] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f02be12f237
[ 249.165492] RDX: 000000000000006c RSI: 00007f02b80008c0 RDI: 00007f02b8000d80
[ 249.173094] RBP: 0000000000000032 R08: 0000000000000003 R09: 00007f02b8000070
[ 249.180339] R10: 00007f02be3f6760 R11: 0000000000000206 R12: 0000000051eb851f
[ 249.187581] R13: 0000000000405020 R14: 0000000000000000 R15: 0000000000000000

2017-01-24 12:27:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

On Sun 22-01-17 17:25:27, Ted Tso wrote:
> > > @@ -1497,12 +1493,11 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> > > int error = 0, tried_min_extra_isize = 0;
> > > int s_min_extra_isize = le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_min_extra_isize);
> > > int isize_diff; /* How much do we need to grow i_extra_isize */
> > > + int no_expand;
> > > +
> > > + if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
> > > + return 0;
> >
> > Why do you play tricks with trylock here? ext4_mark_inode_dirty() checks
> > EXT4_STATE_NO_EXPAND and thus we should not ever get here if we already
> > hold xattr_sem...
>
> The problem is still a lock inversion in the truncate code path. The
> simplest way of dealing with it to simply avoiding doing the
> expand_isize operation on truncates. In the case where this is
> happening on the deletion of an inode, doing the expansion is
> pointless anyway.

I see, thanks for explanation. Well seeing all these problems with
ext4_expand_extra_isize() wouldn't we be better off by not calling it from
ext4_mark_inode_dirty() but rather explicitely from several well-defined
places? Because this implicit calling looks like it causes us too much
trouble.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-01-25 01:32:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()

On Tue, Jan 24, 2017 at 01:27:22PM +0100, Jan Kara wrote:
> I see, thanks for explanation. Well seeing all these problems with
> ext4_expand_extra_isize() wouldn't we be better off by not calling it from
> ext4_mark_inode_dirty() but rather explicitely from several well-defined
> places? Because this implicit calling looks like it causes us too much
> trouble.

Yeah, I suppose that might be a better way to go. We could only do it
on a file open, perhaps. It might not be as important it is to expand
the extra_isize on, say, a chmod, for example.

It's certainly worth looking to see whether it would simplify things
to go that way.

- Ted