2014-10-16 05:57:18

by Dave Jones

[permalink] [raw]
Subject: kernel BUG at fs/ext4/inode.c:2982!

Just hit this on Linus' current tree while running my fuzz-tester.
(No logs unfortunatly, so no idea what actually happened).

kernel BUG at fs/ext4/inode.c:2982!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: hidp rfcomm af_key llc2 can_bcm sctp libcrc32c can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 p8022 psnap llc ax25 nouveau cfg80211 rfkill kvm_intel kvm video backlight mxm_wmi wmi i2c_algo_bit drm_kms_helper ttm drm microcode tg3 serio_raw pcspkr ptp pps_core libphy i2c_core lpc_ich mfd_core rtc_cmos shpchp nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc raid0 floppy
CPU: 3 PID: 24261 Comm: trinity-c10 Not tainted 3.17.0+ #5
Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
task: ffff8802094ccb40 ti: ffff8800bc168000 task.ti: ffff8800bc168000
RIP: 0010:[<ffffffff9a27cf83>] [<ffffffff9a27cf83>] ext4_direct_IO+0x713/0x750
RSP: 0018:ffff8800bc16ba78 EFLAGS: 00010246
RAX: 0000000000020000 RBX: 0000000000000001 RCX: 000000000000000f
RDX: 0000000000000008 RSI: ffff880033e368d0 RDI: ffff8802094cd3b8
RBP: ffff8800bc16baf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800bc16bd40
R13: ffff880033e368d0 R14: ffff8800bc16bb30 R15: 000000000000001f
FS: 00007f8cc4e8f740(0000) GS:ffff880226400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000001 CR3: 00000000b7747000 CR4: 00000000000007e0
DR0: 0000000001c16000 DR1: 000000000160a000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffffea000560a600 ffffea00060dc480 ffffea000503d880 ffffea0005cbfc80
ffffea00056e6500 ffffea00049b1780 ffff880033e368d0 ffffea0005da7980
0000000000010000 0000000000010000 ffff8800bc16baf8 ffff880033e36ae0
Call Trace:
[<ffffffff9a1838d9>] generic_file_direct_write+0xa9/0x170
[<ffffffff9a183c4c>] __generic_file_write_iter+0x2ac/0x350
[<ffffffff9a275df9>] ext4_file_write_iter+0x109/0x3f0
[<ffffffff9a1d8adc>] ? __kmalloc+0x39c/0x420
[<ffffffff9a0a89e8>] ? sched_clock_cpu+0xa8/0xd0
[<ffffffff9a227881>] ? iter_file_splice_write+0x91/0x450
[<ffffffff9a0a8a66>] ? local_clock+0x16/0x30
[<ffffffff9a227a53>] iter_file_splice_write+0x263/0x450
[<ffffffff9a226d06>] direct_splice_actor+0x36/0x40
[<ffffffff9a2272d3>] splice_direct_to_actor+0xc3/0x1f0
[<ffffffff9a226cd0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff9a229032>] do_splice_direct+0x82/0xb0
[<ffffffff9a1f454f>] do_sendfile+0x1af/0x3a0
[<ffffffff9a1f533a>] SyS_sendfile64+0x8a/0xa0
[<ffffffff9a6ea82a>] ? tracesys_phase2+0x75/0xd9
[<ffffffff9a6ea889>] tracesys_phase2+0xd4/0xd9
Code: e8 83 57 e4 ff 85 c0 0f 85 a0 fc ff ff e9 47 ff ff ff 48 c7 c7 e0 f4 c3 9a e8 6a 57 e4 ff 85 c0 0f 85 e7 fc ff ff e9 6c ff ff ff <0f> 0b be fe 0b 00 00 48 c7 c7 f9 4d a2 9a e8 7a 3b df ff e9 c8
RIP [<ffffffff9a27cf83>] ext4_direct_IO+0x713/0x750
RSP <ffff8800bc16ba78>
---[ end trace d80209ec68bf10b8 ]---


That BUG_ON is..

2982 BUG_ON(iocb->private == NULL);

I'll try and reproduce it in the morning.

Dave


2014-10-16 09:31:51

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:2982!

Dave Jones <[email protected]> writes:

> Just hit this on Linus' current tree while running my fuzz-tester.
> (No logs unfortunatly, so no idea what actually happened).
>
> kernel BUG at fs/ext4/inode.c:2982!
Looks
familiar.http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8086
Are you playing with fcntl?
Try this patch http://www.spinics.net/lists/linux-ext4/msg45683.html
> invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> Modules linked in: hidp rfcomm af_key llc2 can_bcm sctp libcrc32c can_raw nfc caif_socket caif af_802154 ieee802154 phonet af_rxrpc bluetooth can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 p8022 psnap llc ax25 nouveau cfg80211 rfkill kvm_intel kvm video backlight mxm_wmi wmi i2c_algo_bit drm_kms_helper ttm drm microcode tg3 serio_raw pcspkr ptp pps_core libphy i2c_core lpc_ich mfd_core rtc_cmos shpchp nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc raid0 floppy
> CPU: 3 PID: 24261 Comm: trinity-c10 Not tainted 3.17.0+ #5
> Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
> task: ffff8802094ccb40 ti: ffff8800bc168000 task.ti: ffff8800bc168000
> RIP: 0010:[<ffffffff9a27cf83>] [<ffffffff9a27cf83>] ext4_direct_IO+0x713/0x750
> RSP: 0018:ffff8800bc16ba78 EFLAGS: 00010246
> RAX: 0000000000020000 RBX: 0000000000000001 RCX: 000000000000000f
> RDX: 0000000000000008 RSI: ffff880033e368d0 RDI: ffff8802094cd3b8
> RBP: ffff8800bc16baf8 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800bc16bd40
> R13: ffff880033e368d0 R14: ffff8800bc16bb30 R15: 000000000000001f
> FS: 00007f8cc4e8f740(0000) GS:ffff880226400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000001 CR3: 00000000b7747000 CR4: 00000000000007e0
> DR0: 0000000001c16000 DR1: 000000000160a000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
> ffffea000560a600 ffffea00060dc480 ffffea000503d880 ffffea0005cbfc80
> ffffea00056e6500 ffffea00049b1780 ffff880033e368d0 ffffea0005da7980
> 0000000000010000 0000000000010000 ffff8800bc16baf8 ffff880033e36ae0
> Call Trace:
> [<ffffffff9a1838d9>] generic_file_direct_write+0xa9/0x170
> [<ffffffff9a183c4c>] __generic_file_write_iter+0x2ac/0x350
> [<ffffffff9a275df9>] ext4_file_write_iter+0x109/0x3f0
> [<ffffffff9a1d8adc>] ? __kmalloc+0x39c/0x420
> [<ffffffff9a0a89e8>] ? sched_clock_cpu+0xa8/0xd0
> [<ffffffff9a227881>] ? iter_file_splice_write+0x91/0x450
> [<ffffffff9a0a8a66>] ? local_clock+0x16/0x30
> [<ffffffff9a227a53>] iter_file_splice_write+0x263/0x450
> [<ffffffff9a226d06>] direct_splice_actor+0x36/0x40
> [<ffffffff9a2272d3>] splice_direct_to_actor+0xc3/0x1f0
> [<ffffffff9a226cd0>] ? generic_pipe_buf_nosteal+0x10/0x10
> [<ffffffff9a229032>] do_splice_direct+0x82/0xb0
> [<ffffffff9a1f454f>] do_sendfile+0x1af/0x3a0
> [<ffffffff9a1f533a>] SyS_sendfile64+0x8a/0xa0
> [<ffffffff9a6ea82a>] ? tracesys_phase2+0x75/0xd9
> [<ffffffff9a6ea889>] tracesys_phase2+0xd4/0xd9
> Code: e8 83 57 e4 ff 85 c0 0f 85 a0 fc ff ff e9 47 ff ff ff 48 c7 c7 e0 f4 c3 9a e8 6a 57 e4 ff 85 c0 0f 85 e7 fc ff ff e9 6c ff ff ff <0f> 0b be fe 0b 00 00 48 c7 c7 f9 4d a2 9a e8 7a 3b df ff e9 c8
> RIP [<ffffffff9a27cf83>] ext4_direct_IO+0x713/0x750
> RSP <ffff8800bc16ba78>
> ---[ end trace d80209ec68bf10b8 ]---
>
>
> That BUG_ON is..
>
> 2982 BUG_ON(iocb->private == NULL);
>
> I'll try and reproduce it in the morning.
>
> Dave
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Attachments:
(No filename) (818.00 B)

2014-10-16 14:33:46

by Dave Jones

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:2982!

On Thu, Oct 16, 2014 at 01:31:51PM +0400, Dmitry Monakhov wrote:
> Dave Jones <[email protected]> writes:
>
> > Just hit this on Linus' current tree while running my fuzz-tester.
> > (No logs unfortunatly, so no idea what actually happened).
> >
> > kernel BUG at fs/ext4/inode.c:2982!
> Looks
> familiar.http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-8086
> Are you playing with fcntl?

Actually, I was mistaken, there was a partial log.

Pid that oopsed was 24261. Last two calls of that pid was..

[child10:24261] [125] sendfile(out_fd=672, in_fd=672, offset=0x1, count=0xce89) = -1 (Bad address)
[child10:24261] [126] sendfile(out_fd=672, in_fd=672, offset=0x0, count=0xa1f000)

The second call was the one that oopsed.
No idea what fd 672 was unfortunatly. I'll have to add better logging of fd types.

At the same time that was running, other pids were also doing stuff with that same fd.

[child6:24355] [124] writev(fd=672, vec=0x1d2ec60, vlen=206)
[child8:24515] [1498] lseek(fd=672, offset=-1, whence=0x2)

Note that those calls also didn't complete, meaning they probably blocked on
some lock that the oopsing pid was holding.

So no recent fcntl's unless they were done before the last 2 syscalls each process did.

> Try this patch http://www.spinics.net/lists/linux-ext4/msg45683.html

I'll give it a try if I can get it reproducing easily, otherwise I cna't say
either way whether the patch is doing any good.

Dave

2014-10-16 20:15:04

by Dave Jones

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:2982!

On Thu, Oct 16, 2014 at 10:33:46AM -0400, Dave Jones wrote:

> > Try this patch http://www.spinics.net/lists/linux-ext4/msg45683.html
>
> I'll give it a try if I can get it reproducing easily, otherwise I cna't say
> either way whether the patch is doing any good.

ok, I managed to reproduce it a few times, and then tried again with
your patch. I'm not sure if it's related, but I'm now seeing lockups
in ext4. The process that gets stuck looks like this..

trinity-c21 R running task 13232 9781 831 0x10000004
0000000000000000 0000000000000001 0000000000000000 0000000000000000
ffff8800c8d333d8 0000000000000000 0000000000000002 00000000cacaa650
ffff880217cf7db8 0000000000000000 0000000000000000 ffffffffa929ff8c
Call Trace:
[<ffffffffa929ff8c>] ? ext4_map_blocks+0x31c/0x560
[<ffffffffa929fe75>] ? ext4_map_blocks+0x205/0x560
[<ffffffffa92e689c>] ? ext4_es_find_delayed_extent_range+0x48c/0x4e0
[<ffffffffa9298fe1>] ? ext4_llseek+0x261/0x3f0
[<ffffffffa922e6e9>] ? __fdget_pos+0x49/0x50
[<ffffffffa90e484e>] ? rcu_read_lock_held+0x6e/0x80
[<ffffffffa920ca44>] ? SyS_lseek+0x94/0xc0

I've got to run right now, but I'll look into tracing it some more when I get back.

Dave

2014-10-16 22:03:45

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:2982!

Dave Jones <[email protected]> writes:

> On Thu, Oct 16, 2014 at 10:33:46AM -0400, Dave Jones wrote:
>
> > > Try this patch http://www.spinics.net/lists/linux-ext4/msg45683.html
> >
> > I'll give it a try if I can get it reproducing easily, otherwise I cna't say
> > either way whether the patch is doing any good.
>
> ok, I managed to reproduce it a few times, and then tried again with
> your patch. I'm not sure if it's related, but I'm now seeing lockups
> in ext4. The process that gets stuck looks like this..
99.99% this one is not related. This is just one more uncovered one
by trinity magics. Looks like I've got what is wrong
lseek SEEK_HOLE/SEEK_DATA try to find block allocated one by one
instead of using ext4_ext_next_allocated_block() COMMIT c8c0df24
Plese look an my test below. I'm too tired for today (especially
after Gone girl the movie). I'll send patch tomorrow.
>
> trinity-c21 R running task 13232 9781 831 0x10000004
> 0000000000000000 0000000000000001 0000000000000000 0000000000000000
> ffff8800c8d333d8 0000000000000000 0000000000000002 00000000cacaa650
> ffff880217cf7db8 0000000000000000 0000000000000000 ffffffffa929ff8c
> Call Trace:
> [<ffffffffa929ff8c>] ? ext4_map_blocks+0x31c/0x560
> [<ffffffffa929fe75>] ? ext4_map_blocks+0x205/0x560
> [<ffffffffa92e689c>] ? ext4_es_find_delayed_extent_range+0x48c/0x4e0
> [<ffffffffa9298fe1>] ? ext4_llseek+0x261/0x3f0
> [<ffffffffa922e6e9>] ? __fdget_pos+0x49/0x50
> [<ffffffffa90e484e>] ? rcu_read_lock_held+0x6e/0x80
> [<ffffffffa920ca44>] ? SyS_lseek+0x94/0xc0
>
> I've got to run right now, but I'll look into tracing it some more when I get back.
>
> Dave


Attachments:
(No filename) (818.00 B)
lseek.c (616.00 B)
Download all attachments

2014-10-17 13:25:34

by Dmitry Monakhov

[permalink] [raw]
Subject: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

It is rediculus practice to scan inode block by block, this technique
applicable only for old indirect files. This takes signifficant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);


Original report: https://lkml.org/lkml/2014/10/16/620

##################################
BTW: Why do we need i_mutex here?

Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/ext4/extents.c | 4 +-
fs/ext4/file.c | 220 +++++++++++++++++++++++++---------------------------
2 files changed, 108 insertions(+), 116 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 37043d0..11cee53 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5155,8 +5155,8 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,

/* fallback to generic here if not in extents fmt */
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
- return generic_block_fiemap(inode, fieinfo, start, len,
- ext4_get_block);
+ return __generic_block_fiemap(inode, fieinfo, start, len,
+ ext4_get_block);

if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS))
return -EBADR;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..12cbffd 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -273,24 +273,19 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
* we determine this extent as a data or a hole according to whether the
* page cache has data or not.
*/
-static int ext4_find_unwritten_pgoff(struct inode *inode,
- int whence,
- struct ext4_map_blocks *map,
- loff_t *offset)
+static int ext4_find_unwritten_pgoff(struct inode *inode, int whence,
+ loff_t endoff, loff_t *offset)
{
struct pagevec pvec;
- unsigned int blkbits;
pgoff_t index;
pgoff_t end;
- loff_t endoff;
loff_t startoff;
loff_t lastoff;
int found = 0;

- blkbits = inode->i_sb->s_blocksize_bits;
startoff = *offset;
lastoff = startoff;
- endoff = (loff_t)(map->m_lblk + map->m_len) << blkbits;
+

index = startoff >> PAGE_CACHE_SHIFT;
end = endoff >> PAGE_CACHE_SHIFT;
@@ -408,147 +403,144 @@ out:
static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
{
struct inode *inode = file->f_mapping->host;
- struct ext4_map_blocks map;
- struct extent_status es;
- ext4_lblk_t start, last, end;
- loff_t dataoff, isize;
- int blkbits;
- int ret = 0;
+ struct fiemap_extent_info fie;
+ struct fiemap_extent ext[2];
+ loff_t next;
+ int i, ret = 0;

mutex_lock(&inode->i_mutex);
-
- isize = i_size_read(inode);
- if (offset >= isize) {
+ if (offset >= inode->i_size) {
mutex_unlock(&inode->i_mutex);
return -ENXIO;
}
-
- blkbits = inode->i_sb->s_blocksize_bits;
- start = offset >> blkbits;
- last = start;
- end = isize >> blkbits;
- dataoff = offset;
-
- do {
- map.m_lblk = last;
- map.m_len = end - last + 1;
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
- if (last != start)
- dataoff = (loff_t)last << blkbits;
+ fie.fi_flags = 0;
+ fie.fi_extents_max = 2;
+ fie.fi_extents_start = (struct fiemap_extent __user *) &ext;
+ while (1) {
+ mm_segment_t old_fs = get_fs();
+
+ fie.fi_extents_mapped = 0;
+ memset(ext, 0, sizeof(*ext) * fie.fi_extents_max);
+
+ set_fs(get_ds());
+ ret = ext4_fiemap(inode, &fie, offset, maxsize - offset);
+ set_fs(old_fs);
+ if (ret)
break;
- }

- /*
- * If there is a delay extent at this offset,
- * it will be as a data.
- */
- ext4_es_find_delayed_extent_range(inode, last, last, &es);
- if (es.es_len != 0 && in_range(last, es.es_lblk, es.es_len)) {
- if (last != start)
- dataoff = (loff_t)last << blkbits;
+ /* No extents found, EOF */
+ if (!fie.fi_extents_mapped) {
+ ret = -ENXIO;
break;
}
+ for (i = 0; i < fie.fi_extents_mapped; i++) {
+ next = (loff_t)(ext[i].fe_length + ext[i].fe_logical);

- /*
- * If there is a unwritten extent at this offset,
- * it will be as a data or a hole according to page
- * cache that has data or not.
- */
- if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- int unwritten;
- unwritten = ext4_find_unwritten_pgoff(inode, SEEK_DATA,
- &map, &dataoff);
- if (unwritten)
- break;
- }
+ if (offset < (loff_t)ext[i].fe_logical)
+ offset = (loff_t)ext[i].fe_logical;
+ /*
+ * If extent is not unwritten, then it contains valid
+ * data, mapped or delayed.
+ */
+ if (!(ext[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN))
+ goto out;

- last++;
- dataoff = (loff_t)last << blkbits;
- } while (last <= end);
+ /*
+ * If there is a unwritten extent at this offset,
+ * it will be as a data or a hole according to page
+ * cache that has data or not.
+ */
+ if (ext4_find_unwritten_pgoff(inode, SEEK_DATA,
+ next, &offset))
+ goto out;

+ if (ext[i].fe_flags & FIEMAP_EXTENT_LAST) {
+ ret = -ENXIO;
+ goto out;
+ }
+ offset = next;
+ }
+ }
+ if (offset > inode->i_size)
+ offset = inode->i_size;
+out:
mutex_unlock(&inode->i_mutex);
+ if (ret)
+ return ret;

- if (dataoff > isize)
- return -ENXIO;
-
- return vfs_setpos(file, dataoff, maxsize);
+ return vfs_setpos(file, offset, maxsize);
}

/*
- * ext4_seek_hole() retrieves the offset for SEEK_HOLE.
+ * ext4_seek_hole() retrieves the offset for SEEK_HOLE
*/
static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
{
struct inode *inode = file->f_mapping->host;
- struct ext4_map_blocks map;
- struct extent_status es;
- ext4_lblk_t start, last, end;
- loff_t holeoff, isize;
- int blkbits;
- int ret = 0;
+ struct fiemap_extent_info fie;
+ struct fiemap_extent ext[2];
+ loff_t next;
+ int i, ret = 0;

mutex_lock(&inode->i_mutex);
-
- isize = i_size_read(inode);
- if (offset >= isize) {
+ if (offset >= inode->i_size) {
mutex_unlock(&inode->i_mutex);
return -ENXIO;
}

- blkbits = inode->i_sb->s_blocksize_bits;
- start = offset >> blkbits;
- last = start;
- end = isize >> blkbits;
- holeoff = offset;
+ fie.fi_flags = 0;
+ fie.fi_extents_max = 2;
+ fie.fi_extents_start = (struct fiemap_extent __user *)&ext;
+ while (1) {
+ mm_segment_t old_fs = get_fs();

- do {
- map.m_lblk = last;
- map.m_len = end - last + 1;
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
- last += ret;
- holeoff = (loff_t)last << blkbits;
- continue;
- }
+ fie.fi_extents_mapped = 0;
+ memset(ext, 0, sizeof(*ext));

- /*
- * If there is a delay extent at this offset,
- * we will skip this extent.
- */
- ext4_es_find_delayed_extent_range(inode, last, last, &es);
- if (es.es_len != 0 && in_range(last, es.es_lblk, es.es_len)) {
- last = es.es_lblk + es.es_len;
- holeoff = (loff_t)last << blkbits;
- continue;
- }
+ set_fs(get_ds());
+ ret = ext4_fiemap(inode, &fie, offset, maxsize - offset);
+ set_fs(old_fs);
+ if (ret)
+ break;

- /*
- * If there is a unwritten extent at this offset,
- * it will be as a data or a hole according to page
- * cache that has data or not.
- */
- if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- int unwritten;
- unwritten = ext4_find_unwritten_pgoff(inode, SEEK_HOLE,
- &map, &holeoff);
- if (!unwritten) {
- last += ret;
- holeoff = (loff_t)last << blkbits;
+ /* No extents found */
+ if (!fie.fi_extents_mapped)
+ break;
+
+ for (i = 0; i < fie.fi_extents_mapped; i++) {
+ next = (loff_t)(ext[i].fe_logical + ext[i].fe_length);
+ /*
+ * If extent is not unwritten, then it contains valid
+ * data, mapped or delayed.
+ */
+ if (!(ext[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN)) {
+ if (offset < (loff_t)ext[i].fe_logical)
+ goto out;
+ offset = next;
continue;
}
- }
-
- /* find a hole */
- break;
- } while (last <= end);
+ /*
+ * If there is a unwritten extent at this offset,
+ * it will be as a data or a hole according to page
+ * cache that has data or not.
+ */
+ if (ext4_find_unwritten_pgoff(inode, SEEK_HOLE,
+ next, &offset))
+ goto out;

+ offset = next;
+ if (ext[i].fe_flags & FIEMAP_EXTENT_LAST)
+ goto out;
+ }
+ }
+ if (offset > inode->i_size)
+ offset = inode->i_size;
+out:
mutex_unlock(&inode->i_mutex);
+ if (ret)
+ return ret;

- if (holeoff > isize)
- holeoff = isize;
-
- return vfs_setpos(file, holeoff, maxsize);
+ return vfs_setpos(file, offset, maxsize);
}

/*
--
1.7.1

2014-10-17 17:27:37

by Dave Jones

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:2982!

On Fri, Oct 17, 2014 at 02:03:45AM +0400, Dmitry Monakhov wrote:
> Dave Jones <[email protected]> writes:
>
> > On Thu, Oct 16, 2014 at 10:33:46AM -0400, Dave Jones wrote:
> >
> > > > Try this patch http://www.spinics.net/lists/linux-ext4/msg45683.html
> > >
> > > I'll give it a try if I can get it reproducing easily, otherwise I cna't say
> > > either way whether the patch is doing any good.
> >
> > ok, I managed to reproduce it a few times, and then tried again with
> > your patch. I'm not sure if it's related, but I'm now seeing lockups
> > in ext4. The process that gets stuck looks like this..
>
> 99.99% this one is not related. This is just one more uncovered one
> by trinity magics. Looks like I've got what is wrong
> lseek SEEK_HOLE/SEEK_DATA try to find block allocated one by one
> instead of using ext4_ext_next_allocated_block() COMMIT c8c0df24
> Plese look an my test below. I'm too tired for today (especially
> after Gone girl the movie). I'll send patch tomorrow.

So the good news is that I'm only seeing this 2nd bug now, so it appears
your patch referenced above does fix it, so it is a good band-aid until
a more generic solution is found.

thanks,

Dave

2014-11-25 21:14:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

On Fri, Oct 17, 2014 at 05:25:34PM +0400, Dmitry Monakhov wrote:
> It is rediculus practice to scan inode block by block, this technique
> applicable only for old indirect files. This takes signifficant amount
> of time for really large files. Let's reuse ext4_fiemap which already
> traverse inode-tree in most optimal meaner.
>
> TESTCASE:
> ftruncate64(fd, 0);
> ftruncate64(fd, 1ULL << 40);
> /* lseek will spin very long time */
> lseek64(fd, 0, SEEK_DATA);
> lseek64(fd, 0, SEEK_HOLE);
>
>
> Original report: https://lkml.org/lkml/2014/10/16/620
>
> ##################################
> BTW: Why do we need i_mutex here?
>
> Signed-off-by: Dmitry Monakhov <[email protected]>

Note: this patch causes generic/285 to loop forever in inline-data
mode. My guess is in the special case handling of inline data in
ext4_fiemap not playing well with this change, but I haven't had a
chance to look deeply into this yet.

- Ted

2014-11-27 14:49:15

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

Theodore Ts'o <[email protected]> writes:

> On Fri, Oct 17, 2014 at 05:25:34PM +0400, Dmitry Monakhov wrote:
>> It is rediculus practice to scan inode block by block, this technique
>> applicable only for old indirect files. This takes signifficant amount
>> of time for really large files. Let's reuse ext4_fiemap which already
>> traverse inode-tree in most optimal meaner.
>>
>> TESTCASE:
>> ftruncate64(fd, 0);
>> ftruncate64(fd, 1ULL << 40);
>> /* lseek will spin very long time */
>> lseek64(fd, 0, SEEK_DATA);
>> lseek64(fd, 0, SEEK_HOLE);
>>
>>
>> Original report: https://lkml.org/lkml/2014/10/16/620
>>
>> ##################################
>> BTW: Why do we need i_mutex here?
>>
>> Signed-off-by: Dmitry Monakhov <[email protected]>
>
> Note: this patch causes generic/285 to loop forever in inline-data
> mode. My guess is in the special case handling of inline data in
> ext4_fiemap not playing well with this change, but I haven't had a
> chance to look deeply into this yet.
Correct, it is appeared I've missed inline case. Will fix.
BTW: this happens because xfstests-bld contains obsolete e2fsprogs-lib 1.41.14
>
> - Ted


Attachments:
signature.asc (472.00 B)

2014-11-28 15:02:02

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

Theodore Ts'o <[email protected]> writes:

> On Fri, Oct 17, 2014 at 05:25:34PM +0400, Dmitry Monakhov wrote:
>> It is rediculus practice to scan inode block by block, this technique
>> applicable only for old indirect files. This takes signifficant amount
>> of time for really large files. Let's reuse ext4_fiemap which already
>> traverse inode-tree in most optimal meaner.
>>
>> TESTCASE:
>> ftruncate64(fd, 0);
>> ftruncate64(fd, 1ULL << 40);
>> /* lseek will spin very long time */
>> lseek64(fd, 0, SEEK_DATA);
>> lseek64(fd, 0, SEEK_HOLE);
>>
>>
>> Original report: https://lkml.org/lkml/2014/10/16/620
>>
>> ##################################
>> BTW: Why do we need i_mutex here?
>>
>> Signed-off-by: Dmitry Monakhov <[email protected]>
>
> Note: this patch causes generic/285 to loop forever in inline-data
> mode. My guess is in the special case handling of inline data in
> ext4_fiemap not playing well with this change, but I haven't had a
> chance to look deeply into this yet.
FYI: inline_data feature is fatally broken

1) incorrect lock order (journal_start vs grab_cache_page_write_begin )
ext4_da_write_inline_data_begin
->start_journal
->grab_cache_page_write_begin
->pagecache_get_page -> FS_REENTRANCE->DEADLOCK

2) ext4_inline_data_fiemap(0 ignores start and len arguments from ext4_filemap()
which obviously result in endless loop for anyone who want to use
fiemap for inline files (which my patch try to do)

I'll send patches soon.
>
> - Ted


Attachments:
signature.asc (472.00 B)

2014-11-29 17:53:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

On Fri, Nov 28, 2014 at 06:02:02PM +0300, Dmitry Monakhov wrote:
> FYI: inline_data feature is fatally broken

Thanks for finding the bugs!

> 2) ext4_inline_data_fiemap(0 ignores start and len arguments from ext4_filemap()
> which obviously result in endless loop for anyone who want to use
> fiemap for inline files (which my patch try to do)

Would this have been triggered using filefrag? I just checked and
noticed we don't have a filefrag test in xfstests; of course it would
be impossible to test other than the filefrag -v command successfully
completed, but maybe it might be worth adding such a test.

- Ted

2014-12-01 11:25:46

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix suboptimal seek_{data,hole} extents traversial

Theodore Ts'o <[email protected]> writes:

> On Fri, Nov 28, 2014 at 06:02:02PM +0300, Dmitry Monakhov wrote:
>> FYI: inline_data feature is fatally broken
>
> Thanks for finding the bugs!
>
>> 2) ext4_inline_data_fiemap(0 ignores start and len arguments from ext4_filemap()
>> which obviously result in endless loop for anyone who want to use
>> fiemap for inline files (which my patch try to do)
>
> Would this have been triggered using filefrag? I just checked and
> noticed we don't have a filefrag test in xfstests; of course it would
> be impossible to test other than the filefrag -v command successfully
> completed, but maybe it might be worth adding such a test.
This works only because filefrag use big buffer. If one want to call
fiemap with start != 0 it will loop. So we definitely need a regression
test for fiemap. I'll do it.

BTW bug can be spotted even in case of filefrag -v:
#### Create small file with inlined data
# echo test > /mnt/f1
#### Now boost it's i_size
# truncate --size 4096000 /mnt/f1
# filefrag -v /mnt/f1
Filesystem cylinder groups approximately 39
File size of /mnt/f1 is 4096000 (1000 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected:
flags:
0: 0.. 4095999: 2762792.. 6858791: 4096000:last,not_aligned,inline,eof
/mnt/f1: 1 extent found
###According to fiemap we are inlined 4Mb data. :)

At this moment I'm testing my patches, I'll send it today.
>
> - Ted


Attachments:
signature.asc (472.00 B)