Subject: ext4 out of order when use cfq scheduler

Hi EXT4 developers,

I am doing power-failure testing on a storage device. We are running an ARM embedded system.


Test environment:
Kernel: 3.10.63
Test type: power failure
File create type: O_SYNC flag enabled
Mount option: data=ordered, journal_checksum,barrier=1
IO scheduler: CFQ

Debug method: the kernel prints the block number of every write request, whether it is a journal block, and whether it is (REQ_FUA | REQ_META | write).
After system startup, the journal blocks are backed up before the test partition is mounted.
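
The instrumentation is roughly of the following form; this is only a hedged sketch, not the exact patch, and the partition offset, journal range, and block-size constants are assumptions taken from the analysis below:

#include <linux/blkdev.h>

#define PART_START_BLK          0       /* hypothetical partition offset in fs blocks */
#define FS_BLOCK_SIZE           2048
#define JOURNAL_START_BLK       32768   /* first fs block of the journal */
#define JOURNAL_LEN_BLK         4096

static void debug_log_write(struct request *rq)
{
	/* device-absolute and partition-relative fs block numbers */
	sector_t blk_abs = blk_rq_pos(rq) / (FS_BLOCK_SIZE / 512);
	sector_t fsblk   = blk_abs - PART_START_BLK;

	if (rq_data_dir(rq) != WRITE)
		return;

	printk(KERN_INFO "%llu %llu %s%s\n",
	       (unsigned long long)blk_abs,
	       (unsigned long long)fsblk,
	       (fsblk >= JOURNAL_START_BLK &&
	        fsblk <  JOURNAL_START_BLK + JOURNAL_LEN_BLK) ? "J " : "",
	       (rq->cmd_flags & REQ_FUA) ? "S" : "");
}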

Below is the analysis in which I believe an out-of-order write shows up:
hp0000017aMhWY3i0vMv is the file whose checksum is wrong. Only half of the file contents are correct.


[root@SiRFatlas6 ~]# debugfs /dev/nandblk0p3
debugfs 1.42.9 (28-Dec-2013)
debugfs: imap test/hp0000017aMhWY3i0vMv
Inode 390 is part of block group 0
located at block 141, offset 0x0280

00000280 80 81 00 00 10 00 04 00 c8 09 00 00 66 0a 00 00 |............f...|
00000290 66 0a 00 00 00 00 00 00 00 00 01 00 04 02 00 00 |f...............|
000002a0 00 00 08 00 01 00 00 00 0a f3 02 00 04 00 00 00 |................|
000002b0 00 00 00 00 00 00 00 00 80 00 00 00 00 2c 01 00 |.............,..| ==> the file contents start at block 0x00012c00
000002c0 80 00 00 00 01 00 00 00 8e 26 01 00 00 00 00 00 |.........&......|


Search for this block number in the backed-up journal blocks:

[root@SiRFatlas6 ~]# hexdump j.bin -C | grep "00 2c 01 00"
00039ab0 00 00 00 00 00 00 00 00 80 00 00 00 00 2c 01 00 |.............,..|

Search the journal blocks for the name of the file whose checksum is wrong:

[root@SiRFatlas6 ~]# hexdump j.bin -C | grep "3i0vMv" -B1
00030c60 86 01 00 00 1c 00 14 01 68 70 30 30 30 30 30 31 |........hp000001|
00030c70 37 61 4d 68 57 59 33 69 30 76 4d 76 88 01 00 00 |7aMhWY3i0vMv....|


List all journal header blocks (magic c0 3b 39 98) to check which transaction records it:

[root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
00000000 c0 3b 39 98 00 00 00 04 00 00 00 00 00 00 08 00 |.;9.............|
00000800 c0 3b 39 98 00 00 00 05 00 00 00 6f 00 00 00 24 |.;9........o...$|
00001000 c0 3b 39 98 00 00 00 01 00 00 00 6f 00 00 00 75 |.;9........o...u|
0000c800 c0 3b 39 98 00 00 00 02 00 00 00 6f 00 00 00 00 |.;9........o....|
0000d000 c0 3b 39 98 00 00 00 01 00 00 00 70 00 00 00 65 |.;9........p...e|
00016000 c0 3b 39 98 00 00 00 02 00 00 00 70 00 00 00 00 |.;9........p....|
00016800 c0 3b 39 98 00 00 00 01 00 00 00 71 00 00 00 7c |.;9........q...||
00021000 c0 3b 39 98 00 00 00 02 00 00 00 71 00 00 00 00 |.;9........q....|
00021800 c0 3b 39 98 00 00 00 01 00 00 00 72 00 00 00 82 |.;9........r....|
0002d000 c0 3b 39 98 00 00 00 02 00 00 00 72 00 00 00 00 |.;9........r....|
0002d800 c0 3b 39 98 00 00 00 01 00 00 00 73 00 00 00 88 |.;9........s....| ==> offset 00039ab0 falls inside this last transaction, so the file name and the start block number are both recorded in the journal.
0003a000 c0 3b 39 98 00 00 00 02 00 00 00 73 00 00 00 00 |.;9........s....|


Back to the kernel log, which prints all the block numbers:

...
...
[ 46.222671] 244109 75277
[ 46.222693]
[ 46.272438] 244352 75520
[ 46.272460]
[ 46.348417] 238443 69611
[ 46.348438]
[ 46.349811] 244480 75648
[ 46.352287]
[ 46.404904] 244609 75777
[ 46.404926]
[ 46.454698] 244738 75906
[ 46.454719]
[ 46.505439] 244992 76160
[ 46.505459]
[ 46.557783] 245120 76288
[ 46.557804]
[ 46.610075] 245249 76417
[ 46.610096]
[ 46.660196] 245378 76546
[ 46.660219]
[ 46.709906] 201691 32859 ==> the journal starts at block 32768, so the offset is 91 blocks; with a block size of 2048 the offset in j.bin is 0x2d800
[ 46.709928] J [ 46.711233]
[ 46.740635] drop to 9v
[ 46.749540] 201716 32884
[ 46.749560] J S
[ 46.751039]
[ 46.753151] 245632 76800 ==> 76800 is 0x012c00 in hex; it is the start block of the file whose checksum is wrong.
[ 46.755284]
nanddisk idle -> 1.
[ 46.800227] 6v irq-2


At j.bin offset 0x2d800:

0002d800 c0 3b 39 98 00 00 00 01 00 00 00 73 00 00 00 88 |.;9........s....| ==> offset 00039ab0 falls inside this last transaction
0003a000 c0 3b 39 98 00 00 00 02 00 00 00 73 00 00 00 00 |.;9........s....|
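
The offset arithmetic above, written out as a tiny illustration (the numbers are the ones from the log, with a 2048-byte block size):

#include <stdio.h>

int main(void)
{
	/* numbers from the log above: the journal starts at fs block 32768,
	 * the fs block size is 2048 bytes, and the journal write was
	 * reported at block 32859 */
	unsigned long journal_start = 32768;
	unsigned long blk           = 32859;
	unsigned long blksize       = 2048;
	unsigned long off           = (blk - journal_start) * blksize;

	printf("offset in j.bin: 0x%lx\n", off);  /* prints 0x2d800 */
	return 0;
}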


Normally, ext4 first writes the file contents, then the journal blocks and the journal commit, and only after some delay writes the metadata in place.
So a journal transaction should only contain metadata for file contents that have already been written.
But from the analysis above, journal sequence 0x73 already contains the file name (hp0000017aMhWY3i0vMv) and the start block number (76800), while in the kernel log the write to block 76800 appears AFTER the journal blocks, not before them.
It looks like an out-of-order write happened.

Could you please help check this issue, or give an explanation for it?
Many thanks.

Best regards

Weller HUANG




2015-12-22 15:00:39

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

Hello,

On Tue 22-12-15 06:24:29, HUANG Weller (CM/EPF1-CN) wrote:
> I am doing the power failure on a storage device. We are arm embedded
> system.
>
> Test environment:
> Kernel: 3.10.63
> Test type: power failure
> File create type: O_SYNC flag enabled
> Mount option: data=ordered, journal_checksum,barrier=1
> IO scheduler: CFQ
>
> Debug method: kernel print every block number of write request. print
> whether it is journal block. print whether it is (REQ_FUA | REQ_META |
> write) After system startup, backup journal blocks before mount the test
> partition.

Do you know blktrace / blkparse tools? They would give you this information
without the need of patching the kernel...

> Below is analysis which I think there is a out of order:
> hp0000017aMhWY3i0vMv is the file which checksum is error. Only half of
> the file contents is correct.

Can you post the test program you are using? What is in the incorrect
block? Previous content of the file or some random garbage? If you are just
overwriting the file and see old contents, that could happen and is legal (we
just guarantee newly allocated blocks are first written out).

Is the problem reproducible?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-01-05 15:30:46

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Mon 04-01-16 05:25:15, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2015 11:01 PM
> > To: HUANG Weller (CM/EPF1-CN) <[email protected]>
> > Cc: [email protected]; Li, Michael <[email protected]>
> > Subject: Re: ext4 out of order when use cfq scheduler
> >
> > Hello,
> >
> > On Tue 22-12-15 06:24:29, HUANG Weller (CM/EPF1-CN) wrote:
> > > I am doing the power failure on a storage device. We are arm
> > > embedded system.
> > >
> > > Test environment:
> > > Kernel: 3.10.63
> > > Test type: power failure
> > > File create type: O_SYNC flag enabled Mount option: data=ordered,
> > > journal_checksum,barrier=1 IO scheduler:
> > > CFQ
> > >
> > > Debug method: kernel print every block number of write request.
> > > print whether it is journal block. print whether it is (REQ_FUA |
> > > REQ_META
> > > |
> > > write) After system startup, backup journal blocks before mount the
> > > test partition.
> >
> > Do you know blktrace / blkparse tools? They would give you this
> > information without the need of patching the kernel...
>
> Yes, I know that tool. Normally I use it, but in this case I patched the kernel for the following reasons:
> 1. This is a power failure test, and blktrace might not capture the trace entries close to the power failure.
> 2. I added some calculation when printing these block numbers in the kernel patch, so I can use the printed values directly;
> blktrace gives me a lot of output and I would have to convert the block numbers one by one.

OK, fair enough.

> > > Below is analysis which I think there is a out of order:
> > > hp0000017aMhWY3i0vMv is the file which checksum is error. Only half
> > > of the file contents is correct.
> >
> > Can you post the test program you are using? What is in the incorrect block?
> > Previous content of the file or some random garbage? If you are just
> > overwriting the file and see old contents, that could happen and is
> > legal (we just guarantee newly allocated blocks are first written out).
> >
> > Is the problem reproducible?
>
> The test tool (attached to this mail) does not rewrite a file; it creates many
> files after deleting the old test files in every new test round. The test
> tool creates files in this way: filename format: hp[HEX NUMBER][RANDOM
> STRING]. For example: hp0000017aMhWY3i0vMv, hex number: 0000017a, random
> string: MhWY3i0vMv
>
> The file contents are related to the file name: every 512
> bytes has its own checksum and sequence number, and also contains the file
> name and the random string. So we can easily trace the file contents (see
> the attached picture).

I see.

> >What is in the incorrect block?
>
> Since I cleaned the test partition with /dev/zero, what I saw is that
> half of the file is OK, but the rest is 0. !!!! Another piece of information
> is this:!!!! hp0000017aMhWY3i0vMv is not the last file. From the number
> in the file name we know there are still 3 files created after it, and
> these files all have correct checksums, except that the last file has zero length.
>
> This issue is very easy to reproduce. Even when we test with eMMC we
> see it. If you need me to do any more tests, just let me know.

So you are running in 'ws' mode of your tool, am I right? Just looking into
the sources you've sent me I've noticed that although you set O_SYNC in
openflg when mode == MODE_WS, you do not use openflg at all. So file won't
be synced at all. That would well explain why you see that not all file
contents is written. So did you just send me a different version of the
source or is your test program really buggy?
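
For illustration, the pattern described above would look roughly like this (a hypothetical reconstruction based on the description, not the actual test source):

#include <fcntl.h>

#define MODE_WS 1  /* hypothetical constant standing in for the tool's 'ws' mode */

int open_test_file(const char *path, int mode)
{
	int openflg = O_CREAT | O_WRONLY;

	if (mode == MODE_WS)
		openflg |= O_SYNC;              /* computed ... */

	/* buggy: openflg is never used, so O_SYNC is silently dropped */
	return open(path, O_CREAT | O_WRONLY, 0600);
	/* intended: return open(path, openflg, 0600); */
}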

Honza




--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: RE: ext4 out of order when use cfq scheduler

> So you are running in 'ws' mode of your tool, am I right? Just looking into the
> sources you've sent me I've noticed that although you set O_SYNC in openflg
> when mode == MODE_WS, you do not use openflg at all. So file won't be synced
> at all. That would well explain why you see that not all file contents is written. So
> did you just send me a different version of the source or is your test program
> really buggy?
>

Yes, it is a bug in the test code, so the test tool actually creates files without the O_SYNC flag.
But even in this case, is the out-of-order write acceptable, or is it normal?



2016-01-06 19:17:51

by Andreas Dilger

Subject: Re: ext4 out of order when use cfq scheduler


> On Jan 5, 2016, at 7:39 PM, HUANG Weller (CM/ESW12-CN) <[email protected]> wrote:
>
>> So you are running in 'ws' mode of your tool, am I right? Just looking into the
>> sources you've sent me I've noticed that although you set O_SYNC in openflg
>> when mode == MODE_WS, you do not use openflg at all. So file won't be synced
>> at all. That would well explain why you see that not all file contents is written. So
>> did you just send me a different version of the source or is your test program
>> really buggy?
>>
>
> Yes, it is a bug of the test code. So the test tool create files without O_SYNC flag actually.
> But , even in this case, is the out of order acceptable ? or is it normal ?

Without O_SYNC there is no ordering guarantee between non-overlapping writes. They may be reordered by the filesystem or the elevator or the storage device.
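
A minimal sketch of what a writer has to do to get that guarantee (plain POSIX, error paths trimmed; not the test tool itself):

#include <fcntl.h>
#include <unistd.h>

/* Write a file so that, once this returns 0, its contents are on stable
 * storage: either open with O_SYNC, or fsync() before closing. */
static int write_file_durably(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	if (fsync(fd)) {        /* redundant if the file was opened with O_SYNC */
		close(fd);
		return -1;
	}
	return close(fd);
}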

Cheers, Andreas







Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail
Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Andreas Dilger [mailto:[email protected]]
> Sent: Thursday, January 07, 2016 3:18 AM
> To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> Cc: Jan Kara <[email protected]>; [email protected]
> Subject: Re: ext4 out of order when use cfq scheduler
>
>
> > On Jan 5, 2016, at 7:39 PM, HUANG Weller (CM/ESW12-CN)
> <[email protected]> wrote:
> >
> >> So you are running in 'ws' mode of your tool, am I right? Just
> >> looking into the sources you've sent me I've noticed that although
> >> you set O_SYNC in openflg when mode == MODE_WS, you do not use
> >> openflg at all. So file won't be synced at all. That would well
> >> explain why you see that not all file contents is written. So did you
> >> just send me a different version of the source or is your test program really
> buggy?
> >>
> >
> > Yes, it is a bug of the test code. So the test tool create files without O_SYNC
> flag actually.
> > But , even in this case, is the out of order acceptable ? or is it normal ?
>
> Without O_SYNC there is no ordering guarantee between non-overlapping writes.
> They may be reordered by the filesystem or the elevator or the storage device.
>

The out-of-order writes happen before the block requests are sent to the storage device firmware; I printed the log from the device block driver.
I remember that 2 years ago I tested on a Freescale platform with kernel 3.8 and the same test tools, and I never saw such an issue.
The journal blocks already contain the inode information of a file, but the file contents were written after the journal commit. The problem is that the file length described in the inode is non-zero before the real data has been written.
I think that is a problem. Do you agree?




2016-01-07 10:24:12

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2016 6:06 PM
> > To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> > Subject: Re: ext4 out of order when use cfq scheduler
> >
> > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > looking into the sources you've sent me I've noticed that although
> > > > you set O_SYNC in openflg when mode == MODE_WS, you do not use
> > > > openflg at all. So file won't be synced at all. That would well
> > > > explain why you see that not all file contents is written. So did
> > > > you just send me a different version of the source or is your test program
> > really buggy?
> > > >
> > >
> > > Yes, it is a bug of the test code. So the test tool create files
> > > without O_SYNC flag actually. But , even in this case, is the out of
> > > order acceptable ? or is it normal ?
> >
> > Without fsync(2) or O_SYNC, it is perfectly possible that some files are written and
> > others are not since nobody guarantees order of writeback of inodes. OTOH you
> > shouldn't ever see uninitialized data in the inode (but so far it isn't clear to me
> > whether you really see uninitialized data or whether we really wrote zeros to those
> > blocks - ext4 can sometimes decide to do so). Your traces and disk contents
> > show that the problematic inode has extent of length 128 blocks starting at block
> > 0x12c00 and then extent of length 1 block starting at block 0x1268e. What is the
> > block size of the filesystem? Because inode size is only 0x40010.
> >
> > Some suggestions to try:
> > 1) Print also length of a write request in addition to the starting
> > block so that we can see how much actually got written
>
> Please see below failure analysis.
>
> > 2) Initialize the device to 0xff so that we can distinguish
> > uninitialized blocks from zeroed-out blocks.
>
> Yes, i Initialize the device to 0xff this time.
>
> > 3) Report exactly for which 512-byte blocks checksum matches and for
> > which it is wrong.
> The wrong contents are old file contents which were created in a previous
> test round. It is caused by the "wrong" sequence of the inode data (in the
> journal) and the file contents, so the file contents were not updated.

So this confuses me somewhat. You previously said that you always remove
files after each test round and then new ones are created. Is it still the
case? So the old file contents you speak about above is just some random
contents that happened to be in disk blocks we freshly allocated to the
file, am I right?

> > 4) Try newer kernel if possible.
> >
>
> Please see my analysis below, with the length of each write request printed

Thanks for the detailed analysis!

> Ext4 information:
> ----------------------------
> [root@SiRFatlas6 ~]# dumpe2fs /dev/nandblk0p3
> Filesystem revision #: 1 (dynamic)
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize
> Filesystem flags: unsigned_directory_hash
> Default mount options: user_xattr acl
> Filesystem state: clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 42816
> Block count: 85632
> Reserved block count: 4281
> Free blocks: 41789
> Free inodes: 42519
> First block: 0
> Block size: 2048
> Fragment size: 2048
> Reserved GDT blocks: 83
> Blocks per group: 16384
> Fragments per group: 16384
> Inodes per group: 7136
> Inode blocks per group: 446
> Flex block group size: 16
> Filesystem created: Thu Jan 1 00:42:42 1970
> Last mount time: Thu Jan 1 00:30:58 1970
> Last write time: Thu Jan 1 00:30:58 1970
> Mount count: 21
> Maximum mount count: -1
> Last checked: Thu Jan 1 00:42:42 1970
> Check interval: 0 (<none>)
> Lifetime writes: 10 MB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group root)
> First inode: 11
> Inode size: 128
> Journal inode: 8
> Default directory hash: half_md4
> Directory Hash Seed: e9c1cdf4-bc4c-40cb-930f-5f3923e68a33
> Journal backup: inode blocks
> Journal features: journal_incompat_revoke
> Journal size: 8M
> Journal length: 4096
> Journal sequence: 0x0000005b
> Journal start: 1
>
> Last four files are contents error:
> Failure files:
> -rw------- 1 root root 262160 Jan 1 00:44 /mnt/test/hp0000011aHHlIpPbjK6
> -rw------- 1 root root 262160 Jan 1 00:44 /mnt/test/hp0000011bjwWXnrY2jn
> -rw------- 1 root root 262160 Jan 1 00:44 /mnt/test/hp0000011c7HYSBFSTAk
> -rw------- 1 root root 135168 Jan 1 00:44 /mnt/test/hp0000011dliQ04nIiLB
>
> Analysis /mnt/test/hp0000011aHHlIpPbjK6
>
> debugfs: imap test/hp0000011aHHlIpPbjK6
> Inode 294 is part of block group 0
> located at block 135, offset 0x0280
>
> 00000280 80 81 00 00 10 00 04 00 aa 07 00 00 72 0a 00 00
> 00000290 72 0a 00 00 00 00 00 00 00 00 01 00 04 02 00 00
> 000002a0 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00
> 000002b0 00 00 00 00 00 00 00 00 81 00 00 00 00 dd 00 00 ==> 0xdd00
>
> Find the block number of the file contents: j.bin is the journal blocks which I backup before journal replay.
> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "00 dd 00 00"
> 0002eab0 00 00 00 00 00 00 00 00 81 00 00 00 00 dd 00 00
>
> Check the block number in which journal commit, the commit ID is 0x58.
> [root@SiRFatlas6 ~]# hexdump j.bin -C | grep "c0 3b 39 98"
> 00000000 c0 3b 39 98 00 00 00 04 00 00 00 00 00 00 08 00 |.;9.............|
> 00000800 c0 3b 39 98 00 00 00 05 00 00 00 55 00 00 00 30 |.;9........U...0|
> 00001000 c0 3b 39 98 00 00 00 01 00 00 00 55 00 00 00 75 |.;9........U...u|
> 0000f800 c0 3b 39 98 00 00 00 02 00 00 00 55 00 00 00 00 |.;9........U....|
> 00010000 c0 3b 39 98 00 00 00 01 00 00 00 56 00 00 00 65 |.;9........V...e|
> 00019000 c0 3b 39 98 00 00 00 02 00 00 00 56 00 00 00 00 |.;9........V....|
> 00019800 c0 3b 39 98 00 00 00 01 00 00 00 57 00 00 00 7c |.;9........W...||
> 00024000 c0 3b 39 98 00 00 00 02 00 00 00 57 00 00 00 00 |.;9........W....|
> 00024800 c0 3b 39 98 00 00 00 01 00 00 00 58 00 00 00 82 |.;9........X....|
> 0002f000 c0 3b 39 98 00 00 00 02 00 00 00 58 00 00 00 00 |.;9........X....|
> 00039000 c0 3b 39 98 00 00 00 02 00 00 00 1a 00 00 00 00 |.;9.............|
> 00039800 c0 3b 39 98 00 00 00 01 00 00 00 1b 00 00 00 8c |.;9.............|
> 00046800 c0 3b 39 98 00 00 00 02 00 00 00 1b 00 00 00 00 |.;9.............|
>
>
>
>
> Kernel log:
>
> [ 43.801072] 223746 54914(0xd682) 0x40800 bytes
> [ 43.802580]
> [ 43.853328] 224000 55168(0xd780) 0x40000 bytes
> [ 43.854876]
> [ 43.962590] 224128 55296(0xd800) 0x40800 bytes
> [ 43.964097]
> [ 44.016428] 221062 52230(0xcc06) 0x1000 bytes
> [ 44.017844]
> [ 44.019594] 224257 55425(0xd881) 0x40800 bytes
> [ 44.023832]
> [ 44.100933] 224386 55554(0xd902) 0x40800 bytes
> [ 44.102439]
> [ 44.165200] 224640 55808(0xda00) 0x40000 bytes
> [ 44.166727]
> [ 44.216643] 224768 55936(0xda80) 0x40800 bytes
> [ 44.218144]
> [ 44.270599] 224897 56065(0xdb01) 0x40800 bytes
> [ 44.272108]
> [ 44.348213] 201673 32841(0x8049) 0xa800 bytes ===> (32841-32768)*block_size = 0x24800 , it is the start block of journal ID 0x58
> [ 44.349659] J [ 44.351275]
> [ 44.360542] 201694 32862(0x805e) 0x800 bytes
> [ 44.361876] J S
> [ 44.363695]
> [ 44.365843] 225026 56194(0xdb82) 0x40800 bytes
> [ 44.369716]
> [ 44.417490] drop to 9v
> [ 44.422435] 225280 56448(0xdc80) 0x40000 bytes ==> contents of file hp00000119py87lXtVHG. This file's checksum is correct, but its data was written after the journal commit.
> [ 44.423938]
> [ 44.474334] 225408 56576(0xdd00) 0x40800 bytes ===> file hp0000011aHHlIpPbjK6's data written after journal commit!!!
> [ 44.475904]
> nanddisk idle -> 1.
> [ 44.479310] 6v irq-2

OK, so I was looking into the code and indeed, reality is correct and my
mental model was wrong! ;) I thought that inode gets added to the list of
inodes for which we need to wait for data IO completion during transaction
commit during block allocation. And I was wrong. It used to happen in
mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
ext4_jbd2_file_inode() from delalloc write path) where it got removed. And
that was wrong because although we submit data writes before dropping
handle for allocating transaction and updating i_size, nobody guarantees
that data IO is not delayed in the block layer until transaction commit.
Which seems to happen in your case. I'll send a fix. Thanks for your report
and persistence!
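
Conceptually, the missing piece looks roughly like this (a sketch against the 3.x source, heavily simplified and not the actual patch that was sent):

/*
 * Rough sketch only, not the real fix: in data=ordered mode the delalloc
 * writeback path has to tell jbd2 to make the committing transaction wait
 * for the data IO it just submitted.  In the 3.x tree that hook was
 * ext4_jbd2_file_inode(); commit f3b59291a69d dropped the call, which is
 * the window described above.  (Fragment lives inside fs/ext4 and uses
 * its internal headers.)
 */
static int map_and_submit_sketch(handle_t *handle, struct inode *inode)
{
	int err = 0;

	/* ... blocks were allocated and the data pages submitted here ... */

	if (ext4_should_order_data(inode))
		err = ext4_jbd2_file_inode(handle, inode);

	return err;
}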

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Jan Kara [mailto:[email protected]]
> Sent: Thursday, January 07, 2016 6:24 PM
> To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> Cc: Jan Kara <[email protected]>; [email protected]
> Subject: Re: ext4 out of order when use cfq scheduler
>
>
> So this confuses me somewhat. You previously said that you always remove files
> after each test round and then new ones are created. Is it still the case? So the old
> file contents you speak about above is just some random contents that happened
> to be in disk blocks we freshly allocated to the file, am I right?

Yes, you are right.
The "old file contents" means disk blocks whose contents were generated in the last test round and which are now allocated to a new file in the new test round.

>
> OK, so I was looking into the code and indeed, reality is correct and my mental
> model was wrong! ;) I thought that inode gets added to the list of inodes for which
> we need to wait for data IO completion during transaction commit during block
> allocation. And I was wrong. It used to happen in
> mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> wrong because although we submit data writes before dropping handle for
> allocating transaction and updating i_size, nobody guarantees that data IO is not
> delayed in the block layer until transaction commit.
> Which seems to happen in your case. I'll send a fix. Thanks for your report and
> persistence!
>

Thanks a lot for your feedback :-)
Because I am not familiar with the details of the ext4 internal code, I will try to understand the explanation you gave above and have a look at the related functions.
Could you send the fix in this mail thread?
And does kernel 3.14 also have this issue?

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2016-01-07 11:47:28

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:[email protected]]
> > Sent: Thursday, January 07, 2016 6:24 PM
> > To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> > Cc: Jan Kara <[email protected]>; [email protected]
> > Subject: Re: ext4 out of order when use cfq scheduler
> >
>
> Thanks a lot for your feedback :-)
> Because I am not familiar with the detail of the ext4 internal code. I will try to understand your explanation which you describe above. And have a look on related funcations.
> Could you send the fix in this mail ?
> And whether the kernel 3.14 also have such issue, right ?

The problem is in all kernels starting with 3.8. Attached is a patch which
should fix the issue. Can you test whether it fixes the problem for you?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext4-Fix-data-exposure-after-a-crash.patch (3.03 kB)
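For readers following the thread: the durability contract discussed above only exists if the test program asks for it, with either O_SYNC or an explicit fsync(2). A minimal sketch of a writer that does (a generic illustration, not the reporter's tool; the file name is made up):

/* Write a file so that its contents and its directory entry survive a
 * power cut.  Error handling is abbreviated. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "payload";
        /* O_SYNC makes every write(2) synchronous; alternatively a normal
         * open() followed by fsync(fd) after the last write is enough. */
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

        if (fd < 0)
                return 1;
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                return 1;
        fsync(fd);              /* flush remaining data and metadata */
        close(fd);

        /* The new name must be durable too, so sync the parent directory. */
        fd = open(".", O_RDONLY | O_DIRECTORY);
        if (fd >= 0) {
                fsync(fd);
                close(fd);
        }
        return 0;
}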

2016-01-07 12:19:00

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Thu 07-01-16 12:47:36, Jan Kara wrote:
> On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > [...]
>
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Oh, I have realized the patch is on top of the current ext4 development tree
and it won't compile on a current vanilla kernel because of the
EXT4_GET_BLOCKS_ZERO check. Just remove that line when you get a compilation
failure.

> +        if (map->m_flags & EXT4_MAP_NEW &&
> +            !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> +            !(flags & EXT4_GET_BLOCKS_ZERO) &&

Just remove the above line and things should work for older kernels as
well.

> +            ext4_should_order_data(inode)) {
> +                ret = ext4_jbd2_file_inode(handle, inode);
> +                if (ret)
> +                        return ret;
> +        }
>     }
>     return retval;
> }
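So on an older kernel the adapted check would end up looking roughly like this (a sketch of the edit Jan describes, with the EXT4_GET_BLOCKS_ZERO line dropped; not a separately posted patch):

        if (map->m_flags & EXT4_MAP_NEW &&
            !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
            ext4_should_order_data(inode)) {
                /* put the inode on the transaction's ordered-data list */
                ret = ext4_jbd2_file_inode(handle, inode);
                if (ret)
                        return ret;
        }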

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: RE: ext4 out of order when use cfq scheduler

> > [...]
>
> The problem is in all kernels starting with 3.8. Attached is a patch which should fix
> the issue. Can you test whether it fixes the problem for you?
>
> Honza
> --

Yes, of course I will redo the test with the patch and give you feedback.
Thanks.
Huang Weller

> Jan Kara <[email protected]>
> SUSE Labs, CR

Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Jan Kara [mailto:[email protected]]
> Sent: Thursday, January 07, 2016 8:19 PM
> To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> Cc: Jan Kara <[email protected]>; [email protected]; Li, Michael
> <[email protected]>
> Subject: Re: ext4 out of order when use cfq scheduler
>
> On Thu 07-01-16 12:47:36, Jan Kara wrote:
> > On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > [...]
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
>
> Oh, I have realized the patch is on top of current ext4 development tree and it
> won't compile for current vanilla kernel because of EXT4_GET_BLOCKS_ZERO
> check. Just remove that line when you get compilation failure.
>
> > + if (map->m_flags & EXT4_MAP_NEW &&
> > + !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> > + !(flags & EXT4_GET_BLOCKS_ZERO) &&
>
> Just remove the above line and things should work for older kernels as well.
>
> > + ext4_should_order_data(inode)) {
> > + ret = ext4_jbd2_file_inode(handle, inode);
> > + if (ret)
> > + return ret;
> > + }
> > }
> > return retval;
> > }
>

Just to confirm with you, because the patch tool didn't find:

"out_sem:
	ret = check_block_validity(inode, map);"

in my kernel. After checking the code, I added the modification to the end of
the function ext4_map_blocks. Below is the diff; please help to double-check it.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10b71e4..d29a1d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -753,6 +753,11 @@ has_zeroout:
         int ret = check_block_validity(inode, map);
         if (ret != 0)
             return ret;
+        if (ext4_should_order_data(inode)) {
+            ret = ext4_jbd2_file_inode(handle, inode);
+            if (ret)
+                return ret;
+        }
     }
     return retval;
 }
@@ -1113,15 +1118,6 @@ static int ext4_write_end(struct file *file,
     int i_size_changed = 0;
 
     trace_ext4_write_end(inode, pos, len, copied);
-    if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-        ret = ext4_jbd2_file_inode(handle, inode);
-        if (ret) {
-            unlock_page(page);
-            page_cache_release(page);
-            goto errout;
-        }
-    }

Subject: RE: ext4 out of order when use cfq scheduler


> -----Original Message-----
> From: HUANG Weller (CM/ESW12-CN)
> Sent: Friday, January 08, 2016 8:47 AM
> To: 'Jan Kara' <[email protected]>
> Cc: [email protected]; Li, Michael <[email protected]>
> Subject: RE: ext4 out of order when use cfq scheduler
>
> > > [...]
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> >
> > Honza
> > --
>
> Yes, of course I will redo the test with the patch and give you feedback.

Test result:
Tested on 2 targets with the kernel with your patch applied. Both of them are OK after 5000 power-failure tests; our target is 10,000 test cycles.
By the way, since your original patch can't be applied to a 3.x kernel, the attached one is based on yours and can be applied to an old kernel (mine is 3.10) directly.

>
> > Jan Kara <[email protected]>
> > SUSE Labs, CR


Attachments:
0001-ext4-Fix-data-exposure-after-a-crash.patch (3.11 kB)

2016-01-11 10:20:51

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Mon 11-01-16 09:05:20, HUANG Weller (CM/ESW12-CN) wrote:
>
> > -----Original Message-----
> > From: HUANG Weller (CM/ESW12-CN)
> > Sent: Friday, January 08, 2016 8:47 AM
> > To: 'Jan Kara' <[email protected]>
> > Cc: [email protected]; Li, Michael <[email protected]>
> > Subject: RE: ext4 out of order when use cfq scheduler
> >
> > > [...]
>
> Test result:
> Tested on 2 targets with the kernel with your patch applied. Both of them
> are OK after 5000 power-failure tests; our target is 10,000 test cycles.
> By the way, since your original patch can't be applied to a 3.x kernel,
> the attached one is based on yours and can be applied to an old kernel
> (mine is 3.10) directly.

Thanks for testing and the port! Once the patch is merged upstream, I'll
backport it and push it to all active stable kernels.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-01-28 08:02:54

by Murphy Zhou

Subject: Re: ext4 out of order when use cfq scheduler

On Tue, Dec 22, 2015 at 2:24 PM, HUANG Weller (CM/EPF1-CN)
<[email protected]> wrote:
> Hi EXT4 developers,
>
> I am doing the power failure on a storage device. We are arm embedded system.

Is there any chance to make this test an xfstests case?

Thanks,
Xiong

> [...]

Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Xiong Zhou [mailto:[email protected]]
> Sent: Thursday, January 28, 2016 4:03 PM
> To: HUANG Weller (CM/ESW12-CN) <[email protected]>
> Cc: [email protected]; Li, Michael <[email protected]>
> Subject: Re: ext4 out of order when use cfq scheduler
>
> On Tue, Dec 22, 2015 at 2:24 PM, HUANG Weller (CM/EPF1-CN)
> <[email protected]> wrote:
> > Hi EXT4 developers,
> >
> > I am doing the power failure on a storage device. We are arm embedded system.
>
> Is there any chance to make this test an xfstests case?
>

Do you mean using the xfstests tool to do the write test? Is there any difference or advantage in using xfstests?

> > [...]

2016-03-13 04:27:32

by Theodore Ts'o

Subject: Re: ext4 out of order when use cfq scheduler

On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
>
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Sorry, I missed this patch because it was attached to a discussion
thread.

> The problem is that although for delayed allocated blocks we write their
> contents immediately after allocating them, there is no guarantee that
> the IO scheduler or device doesn't reorder things

I don't think that's the problem. In the commit thread when we call
blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
reorder writes after that point, which is before we write the commit
block. Instead, I believe the problem is in ext4_writepages:

	ext4_journal_stop(handle);
	/* Submit prepared bio */
	ext4_io_submit(&mpd.io_submit);

Once we release the handle, the commit can start --- *before* we have
a chance to submit the I/O. Oops.

I believe if we swap these two calls, it should fix the problem Huang
was seeing.

Jan, do you agree?

- Ted
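For concreteness, the swap Ted proposes here (which Jan argues below is not sufficient on its own) would look roughly like this in ext4_writepages():

	/* Submit the prepared bio while the handle still pins the
	 * transaction, so the commit cannot run ahead of the data
	 * submission. */
	ext4_io_submit(&mpd.io_submit);
	ext4_journal_stop(handle);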


Subject: RE: ext4 out of order when use cfq scheduler



> -----Original Message-----
> From: Theodore Ts'o [mailto:[email protected]]
> Sent: Sunday, March 13, 2016 12:27 PM
> To: Jan Kara <[email protected]>
> Cc: HUANG Weller (CM/ESW12-CN) <[email protected]>; linux-
> [email protected]; Li, Michael <[email protected]>
> Subject: Re: ext4 out of order when use cfq scheduler
>
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
>
> Sorry, I missed this patch because it was attached to a discussion thread.
>
> > The problem is that although for delayed allocated blocks we write
> > their contents immediately after allocating them, there is no
> > guarantee that the IO scheduler or device doesn't reorder things
>
> I don't think that's the problem. In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't reorder writes
> after that point, which is before we write the commit block. Instead, I believe the
> problem is in ext4_writepages:
>
> ext4_journal_stop(handle);
> /* Submit prepared bio */
> ext4_io_submit(&mpd.io_submit);
>
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O. Oops.
>
> I believe if we swap these two calls, it should fix the problem Huang was seeing.
>
> Jan, do you agree?
>
> - Ted

Hi Ted and Jan,

You can give me a patch and I can redo the verification on my kernel and hardware.

I also looked into the code. In my test case I use the data=ordered option without sync, so the writes will go through ext4_da_writepages(), right?
My kernel version is 3.10.63, and as far as I can see the io_submit and journal_stop sequence is already in that order:

	while (!ret && wbc->nr_to_write > 0) {
		ext4_journal_start
		write_cache_pages_da
		mpage_da_map_and_submit ==>
		ext4_journal_stop
	}
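In other words (an annotated sketch of the 3.10 loop listed above, simplified from fs/ext4/inode.c rather than a literal excerpt):

	/* ext4_da_writepages(), 3.10-era delalloc writeback, simplified */
	while (!ret && wbc->nr_to_write > 0) {
		handle = ext4_journal_start(inode, ...);	/* handle held from here */
		ret = write_cache_pages_da(handle, mapping, wbc, &mpd, &done_index);
		mpage_da_map_and_submit(&mpd);	/* allocates blocks and submits the data bio */
		ext4_journal_stop(handle);	/* handle dropped only after the submit */
	}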

Thanks.

2016-03-14 07:39:23

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Sat 12-03-16 23:27:23, Ted Tso wrote:
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> > The problem is in all kernels starting with 3.8. Attached is a patch which
> > should fix the issue. Can you test whether it fixes the problem for you?
>
> Sorry, I missed this patch because it was attached to a discussion
> thread.

I have actually sent this patch in a standalone thread on January 11
(http://lists.openwall.net/linux-ext4/2016/01/11/3) together with one more
cleanup.

> > The problem is that although for delayed allocated blocks we write their
> > contents immediately after allocating them, there is no guarantee that
> > the IO scheduler or device doesn't reorder things
>
> I don't think that's the problem. In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
> reorder writes after that point, which is before we write the commit
> block. Instead, I believe the problem is in ext4_writepages:
>
> ext4_journal_stop(handle);
> /* Submit prepared bio */
> ext4_io_submit(&mpd.io_submit);
>
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O. Oops.
>
> I believe if we swap these two calls, it should fix the problem Huang
> was seeing.

No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
anything to IOs which have not reported completion before
blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
in its internal RB tree, following flush request completely bypasses this
tree and goes directly to the disk where it flushes caches. And only later
CFQ decides to schedule async writeback from the flusher thread which is
queued in the RB tree...

Note that the behavior has changed to be like this with the flushing
machinery rewrite. Before that, IO scheduler had to drain all the
outstanding IO requests (IO cache flush behaved like IO barrier). So your
patch would be enough with the old flushing machinery but is not enough
since 3.0 or so...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-03-14 14:36:43

by Theodore Ts'o

Subject: Re: ext4 out of order when use cfq scheduler

On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> anything to IOs which have not reported completion before
> blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> in its internal RB tree, following flush request completely bypasses this
> tree and goes directly to the disk where it flushes caches. And only later
> CFQ decides to schedule async writeback from the flusher thread which is
> queued in the RB tree...

Oh, right. I was forgetting about the flushing machinery rewrite.
Thanks for pointing that out.

But what we *could* do is to swap those two calls and then, in the case
where delalloc is enabled, maintain a list of inodes for which we
only need to call filemap_fdatawait(), and not initiate writeback for
any dirty pages which had been caused by non-allocating writes.

- Ted
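A rough sketch of what Ted is suggesting, on the jbd2 side of a commit (purely illustrative; the JI_WAIT_DATA flag and the loop below are hypothetical, not code from any posted patch):

	/* During commit: for inodes whose freshly allocated blocks were
	 * already submitted at allocation time, only wait for that IO;
	 * for everything else keep writing the dirty pages as before. */
	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
		struct address_space *mapping = jinode->i_vfs_inode->i_mapping;

		if (jinode->i_flags & JI_WAIT_DATA)	/* hypothetical flag */
			err = filemap_fdatawait(mapping);
		else
			err = filemap_fdatawrite(mapping);
	}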

2016-03-15 10:46:29

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Mon 14-03-16 10:36:35, Ted Tso wrote:
> On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > anything to IOs which have not reported completion before
> > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > in its internal RB tree, following flush request completely bypasses this
> > tree and goes directly to the disk where it flushes caches. And only later
> > CFQ decides to schedule async writeback from the flusher thread which is
> > queued in the RB tree...
>
> > Oh, right. I was forgetting about the flushing machinery rewrite.
> Thanks for pointing that out.
>
> But what we *could* do is to swap those two calls and then in the case
> where delalloc is enabled, could maintain a list of inodes where we
> only need to call filemap_fdatawait(), and not initiate writeback for
> any dirty pages which had been caused by non-allocating writes.

We actually don't need to swap those two calls - page is already marked as
under writeback in

mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page

which gets called while we still hold the transaction handle. I agree
calling filemap_fdatawait() from JBD2 during commit should be enough to fix
issues with delalloc writeback. I'm just somewhat afraid that it will be
more fragile: If we add inode to transaction's list in ext4_map_blocks(),
we are pretty sure there's no way to allocate block to an inode without
introducing data exposure issues (which are then very hard to spot). If we
depend on callers of ext4_map_blocks() to properly add inode to appropriate
transaction list, we have much more places to check. I'll think whether we
could make this more robust.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-03-15 14:46:11

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Tue 15-03-16 11:46:34, Jan Kara wrote:
> On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > [...]
>
> We actually don't need to swap those two calls - page is already marked as
> under writeback in
>
> mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page
>
> which gets called while we still hold the transaction handle. I agree
> calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> issues with delalloc writeback. I'm just somewhat afraid that it will be
> more fragile: If we add inode to transaction's list in ext4_map_blocks(),
> we are pretty sure there's no way to allocate block to an inode without
> introducing data exposure issues (which are then very hard to spot). If we
> depend on callers of ext4_map_blocks() to properly add inode to appropriate
> transaction list, we have much more places to check. I'll think whether we
> could make this more robust.

OK, I have something - Huang, can you check whether the attached patches
also fix your data exposure issues please? The first patch is the original
fix, patch two is a cleanup, patches 3 and 4 implement the speedup
suggested by Ted. Patches are only lightly tested so far. I'll run more
comprehensive tests later and in particular I want to check whether the
additional complexity actually brings us some advantage at least for
workloads which redirty pages in addition to writing some new ones using
delayed allocation.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext4-Fix-data-exposure-after-a-crash.patch (3.22 kB)
0002-ext4-Remove-EXT4_STATE_ORDERED_MODE.patch (1.77 kB)
0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch (7.67 kB)
0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch (4.89 kB)

2016-03-15 20:09:30

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Tue 15-03-16 15:46:33, Jan Kara wrote:
> On Tue 15-03-16 11:46:34, Jan Kara wrote:
> > On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > [...]
>
> OK, I have something - Huang, can you check whether the attached patches
> also fix your data exposure issues please? The first patch is the original
> fix, patch two is a cleanup, patches 3 and 4 implement the speedup
> suggested by Ted. Patches are only lightly tested so far. I'll run more
> comprehensive tests later and in particular I want to check whether the
> additional complexity actually brings us some advantage at least for
> workloads which redirty pages in addition to writing some new ones using
> delayed allocation.

OK, there was a bug in patch 3. Attached is a new version of patches 3 and
4.
Honza


Attachments:
0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch (7.30 kB)
0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch (4.17 kB)
Subject: RE: ext4 out of order when use cfq scheduler

>
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far. I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR


Sure, I will try to merge the patches and do the test on my kernel.

Huang weller

Subject: RE: ext4 out of order when use cfq scheduler


> >
> > OK, I have something - Huang, can you check whether the attached
> > patches also fix your data exposure issues please? The first patch is
> > the original fix, patch two is a cleanup, patches 3 and 4 implement
> > the speedup suggested by Ted. Patches are only lightly tested so far.
> > I'll run more comprehensive tests later and in particular I want to
> > check whether the additional complexity actually brings us some
> > advantage at least for workloads which redirty pages in addition to
> > writing some new ones using delayed allocation.
>
> OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
>


Hi Kara,
Patches are applied on my kernel. Basically it is working.
The power-loss test is running now; it will need some days to give you a test result.
At the same time, some conflicts need your confirmation:

Target kernel: 3.10.63.

Patch 0004 conflicts:
----------------------------------
- My kernel doesn't have the function mpage_map_one_extent.
So please review this segment of code from patch 0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch:

@@ -1697,7 +1700,8 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
	 * So use reserved blocks to allocate metadata if possible.
	 */
	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
	if (ext4_should_dioread_nolock(mpd->inode))
		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
	if (mpd->b_state & (1 << BH_Delay))


- My kernel doesn't have the function __ext4_block_zero_page_range, so I ignored the modification to that function.


Patch 0001 conflicts:
----------------------------------
- The macro EXT4_GET_BLOCKS_ZERO is not available on kernel 3.10.63, so the patch looks like this; please help to review:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e48bd5a..03017a9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -751,6 +751,19 @@ has_zeroout:
         int ret = check_block_validity(inode, map);
         if (ret != 0)
             return ret;
+
+        /*
+         * Inodes with freshly allocated blocks where contents will be
+         * visible after transaction commit must be on transaction's
+         * ordered data list.
+         */
+        if (map->m_flags & EXT4_MAP_NEW &&
+            !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+            ext4_should_order_data(inode)) {
+            ret = ext4_jbd2_file_inode(handle, inode);
+            if (ret)
+                return ret;
+        }
     }
     return retval;
 }


2016-03-18 09:19:41

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Wed 16-03-16 02:30:13, HUANG Weller (CM/ESW12-CN) wrote:
>
> > >
> > > OK, I have something - Huang, can you check whether the attached
> > > patches also fix your data exposure issues please? The first patch is
> > > the original fix, patch two is a cleanup, patches 3 and 4 implement
> > > the speedup suggested by Ted. Patches are only lightly tested so far.
> > > I'll run more comprehensive tests later and in particular I want to
> > > check whether the additional complexity actually brings us some
> > > advantage at least for workloads which redirty pages in addition to
> > > writing some new ones using delayed allocation.
> >
> > OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
> >
>
> Patches are applied on my kernel. Basically it is working.
> The power-loss test is running now; it will need some days to give you a test result.

Thank you!

> At the same time, some conflicts need your confirmation:

The resolution of the conflicts looks good.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: RE: ext4 out of order when use cfq scheduler

>
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far. I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
>

Test done.
Both targets (kernel 3.10.63) PASS the power-loss test with 10,000 cycles, tested with the CFQ I/O scheduler.

2016-03-24 12:17:05

by Jan Kara

Subject: Re: ext4 out of order when use cfq scheduler

On Thu 24-03-16 10:16:05, HUANG Weller (CM/ESW12-CN) wrote:
> >
> > OK, I have something - Huang, can you check whether the attached patches also
> > fix your data exposure issues please? The first patch is the original fix, patch two
> > is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> > are only lightly tested so far. I'll run more comprehensive tests later and in
> > particular I want to check whether the additional complexity actually brings us
> > some advantage at least for workloads which redirty pages in addition to writing
> > some new ones using delayed allocation.
> >
>
> Test done.
> Both targets (kernel 3.10.63) PASS the power-loss test with 10,000 cycles, tested with the CFQ I/O scheduler.

Thanks for testing! I'll submit those patches once I verify there is some
performance gain in only waiting...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: FW: ext4 out of order when use cfq scheduler

Hi Kara,

I saw the patch "fix data exposure after a crash" in the kernel 4.6.2 release.
I remember you provided 2 more patches for performance optimization. Could you tell me whether they are necessary?

Thanks.
Best regards

Weller HUANG



-----Original Message-----
From: Jan Kara [mailto:[email protected]]
Sent: Wednesday, March 16, 2016 4:10 AM
To: Theodore Ts'o <[email protected]>
Cc: Jan Kara <[email protected]>; HUANG Weller (CM/ESW12-CN) <[email protected]>; [email protected]; Li, Michael <[email protected]>
Subject: Re: ext4 out of order when use cfq scheduler

[...]
OK, there was a bug in patch 3. Attached is a new version of patches 3 and 4.
Honza


Attachments:
0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch (7.47 kB)
0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch (4.29 kB)

2016-06-22 13:09:26

by Jan Kara

Subject: Re: FW: ext4 out of order when use cfq scheduler

Hi,

On Wed 22-06-16 11:55:12, HUANG Weller (CM/ESW12-CN) wrote:
> I saw the patch "fix data exposure after a crash" in the kernel 4.6.2
> release. I remember you provided 2 more patches for performance
> optimization. Could you tell me whether they are necessary?

They are not necessary. They may just improve performance in some cases...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR