LinuxLists.cc - [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Dieter Nützel wrote:

>On Thursday, 31 October 2002 22:05, Jeff Garzik wrote:
>
>
>>Hans Reiser wrote:
>>
>>
>>
>>>If you want to talk about 2.6 then you should talk about reiser4 not
>>>reiserfs v3, and reiser4 is 7.6 times the write performance of ext3
>>>for 30 copies of the linux kernel source code using modern IDE drives
>>>and modern processors on a dual-CPU box, so I don't think any amount
>>>of improved scalability will make ext3 competitive with reiser4 for
>>>performance usages.
>>>
>>>
>>What is the read performance like?
>>
>>
>
>From the paper he mentioned, http://www.namesys.com/v4/fast_reiser4.html, it is
>more than doubled compared to ext3 and ReiserFS v3.
>
>To be fair, he should still explain whether it was compared against the
>latest ext3 (htree) work.
>
>It looks truly impressive.
>
>Regards,
> Dieter
>
>
Unfortunately that was an older version of reiser4, and we are still
analyzing why it has higher read performance than what we are shipping
today. Give me a week, and I'll have a better answer for you. What we
shipped has higher read performance than ext3, but something is not what
it should be and needs fixing.

Green and Zam and Umka, on Monday please start work on seriously
analyzing how the block allocation differs between the new and the old
kernel, now that you can finally reproduce the benchmark on the old kernel.

--
Hans

2002-11-01 01:24:17

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Andrew Morton wrote:

>Hans Reiser wrote:
>
>
>>Green and Zam and Umka, on Monday please start work on seriously
>>analyzing how the block allocation differs between the new and the old
>>kernel, now that you can finally reproduce the benchmark on the old kernel.
>>
>>
>
>I just sent the Orlov allocator patch to Linus. It will double or
>triple ext2 performance in that test, so please make sure you compare
>against the latest. There's a copy at
>http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.45/shpte-stuff/broken-out/orlov-allocator.patch
>
>We can expect similar gains for ext3, when that's done.
>
>(The 2x-3x is on an 8meg filesystem. Larger filesystems should
>gain more)
>
>
>
>
Well, if we are only 2.5 times as fast for writes as ext3 after your
patch is applied, I'll still feel good.;-)

Better benchmarks will be conducted during the next 3 months, the ones
we have are still a bit raw....

--
Hans

2002-11-01 01:50:01

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Andrew Morton wrote:

>Hans Reiser wrote:
>
>
>>Well, if we are only 2.5 times as fast for writes as ext3 after your
>>patch is applied, I'll still feel good.;-)
>>
>>
>>
>
>whupping ext3's butt on write performance isn't very hard, really ;)
>
>But it should be done based on "feature equivalency". By default,
>ext3 uses ordered data writes. Data is written to disk before
>the metadata to which that data refers is committed to journal.
>
>It would be questionable to compare a metadata-only journalling
>approach to ext3 with data=journal or data=ordered.
>
>
>
>
>
The atomic transactions that reiser4 offers are a much higher level of
data security than data journaling. Really, you should read the 17 page
papers I send you URLs to;-).....
(http://www.namesys.com/v4/fast_reiser4.html).

--
Hans

2002-11-01 10:17:08

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> The atomic transactions that reiser4 offers are a much higher level of
> data security than data journaling. Really, you should read the 17 page
> papers I send you URLs to;-).....
> (http://www.namesys.com/v4/fast_reiser4.html).

Am I to assume the following is expected behavior then?

# mkfs.reiser4 /dev/sda2
mkfs.reiser4, 0.1.0
Information: Reiser4 is going to be created on /dev/sda2.
(Yes/No): y
Creating reiser4 on /dev/sda2 with default40 profile...done
Synchronizing /dev/sda2...done
# mount /dev/sda2 /ap
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 136 1490196 1% /ap
# (cd /ap && tar xzf /usr/src/linux-2.5.45.tgz)
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 200508 1289824 14% /ap
# sync
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 200468 1289864 14% /ap
# rm -rf /ap/linux-2.5.45
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 255436 1234896 18% /ap
# # wtf is going on here?
# sync
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 85848 1404484 6% /ap
# umount /ap
# mount /dev/sda2 /ap
# df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 54532 1435800 4% /ap
# # and here?

T.

2002-11-01 17:17:31

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> > The atomic transactions that reiser4 offers are a much higher level of
> > data security than data journaling. Really, you should read the 17 page
> > papers I send you URLs to;-).....
> > (http://www.namesys.com/v4/fast_reiser4.html).
>
> Am I to assume the following is expected behavior then?
>
> # mkfs.reiser4 /dev/sda2
> mkfs.reiser4, 0.1.0
> Information: Reiser4 is going to be created on /dev/sda2.
> (Yes/No): y
> Creating reiser4 on /dev/sda2 with default40 profile...done
> Synchronizing /dev/sda2...done
> # mount /dev/sda2 /ap
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 136 1490196 1% /ap
> # (cd /ap && tar xzf /usr/src/linux-2.5.45.tgz)
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 200508 1289824 14% /ap
> # sync
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 200468 1289864 14% /ap
> # rm -rf /ap/linux-2.5.45
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 255436 1234896 18% /ap
> # # wtf is going on here?
> # sync
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 85848 1404484 6% /ap
> # umount /ap
> # mount /dev/sda2 /ap
> # df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 54532 1435800 4% /ap
> # # and here?

This should help:

diff -Nru a/txnmgr.c b/txnmgr.c
--- a/txnmgr.c Wed Oct 30 18:58:09 2002
+++ b/txnmgr.c Fri Nov 1 20:13:27 2002
@@ -1917,7 +1917,7 @@
return;
}

- if (!jnode_is_unformatted) {
+ if (jnode_is_znode(node)) {
if ( /**jnode_get_block(node) &&*/
!blocknr_is_fake(jnode_get_block(node))) {
/* jnode has assigned real disk block. Put it into

>
> T.

Thank you for the report.
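
A note on why the one-line change in the patch matters, assuming jnode_is_unformatted is an ordinary C function: written without parentheses, the name evaluates to the function's address, which is never null, so the old test was always false and the guarded branch never ran. The sketch below reproduces the effect with hypothetical stand-ins (struct node, is_unformatted, buggy_check and called_check are illustrations, not reiser4 code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type; only the flag matters for the illustration. */
struct node {
    bool formatted;
};

/* Hypothetical predicate standing in for the real one. */
bool is_unformatted(const struct node *n)
{
    assert(n != NULL);
    return !n->formatted;
}

/* Buggy form: the predicate is named but never called. */
bool buggy_check(const struct node *n)
{
    /* A bare function name decays to a pointer to the function. */
    bool (*pred)(const struct node *) = is_unformatted;

    (void)n;      /* the node is never examined */
    return !pred; /* a valid function pointer is non-null: always false */
}

/* Intended form: actually call the predicate on the node. */
bool called_check(const struct node *n)
{
    return !is_unformatted(n);
}
```

The actual patch goes further than adding the call: it tests jnode_is_znode(node) instead.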

--
Alex.

2002-11-02 13:17:59

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> This should help:
>
> diff -Nru a/txnmgr.c b/txnmgr.c
> --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> @@ -1917,7 +1917,7 @@
> return;
> }
>
> - if (!jnode_is_unformatted) {
> + if (jnode_is_znode(node)) {
> if ( /**jnode_get_block(node) &&*/
> !blocknr_is_fake(jnode_get_block(node))) {
> /* jnode has assigned real disk block. Put it into

Jup, this fixes the leak, but free space still isn't reported accurately
until after sync gets called, which I believe is a bug too.

Compare:
[reiser3]
$ pwd
/tmp
$ dd if=/dev/zero of=testfile bs=16k count=64
64+0 records in
64+0 records out
$ df /
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda1 526296 330696 195600 63% /
$ rm testfile
$ df /
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda1 526296 329672 196624 63% /
$ sync
$ df /
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda1 526296 329672 196624 63% /

[reiser4]
$ pwd
/ap/tmp
$ dd if=/dev/zero of=testfile bs=16k count=64
64+0 records in
64+0 records out
$ df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 1152 1489180 1% /ap
$ rm testfile
$ df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 1160 1489172 1% /ap
$ sync
$ df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 128 1490204 1% /ap

T.

2002-11-02 13:32:05

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Hi,

Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:

reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
WARNING: Flush raced against extent->tail
reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
WARNING: flush failed: -11
jnode_flush failed with err = -11
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 128
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 256
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 512
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 1024
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 2048
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 4096
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 8192
reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
WARNING: Too many iterations: 16384
reiser4[fixdep(952)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
WARNING: Partial conversion of 105116: 1 of 2
reiser4[cc1(957)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
WARNING: Partial conversion of 105116: 0 of 2
[snip]

... after which r4 crashes completely --
Starts to hog all cpu time and umount() never goes through.

T.

2002-11-04 10:53:47

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> > This should help:
> >
> > diff -Nru a/txnmgr.c b/txnmgr.c
> > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> > @@ -1917,7 +1917,7 @@
> > return;
> > }
> >
> > - if (!jnode_is_unformatted) {
> > + if (jnode_is_znode(node)) {
> > if ( /**jnode_get_block(node) &&*/
> > !blocknr_is_fake(jnode_get_block(node))) {
> > /* jnode has assigned real disk block. Put it into
>
>
> Jup, this fixes the leak, but free space still isn't reported accurately
> until after sync gets called, which I believe is a bug too.

In reiser4, allocation of disk space is delayed until transaction commit.
It is not possible to estimate precisely the amount of disk space that
will be allocated during commit, and hence statfs(2) results are not
updated until one does sync(2) (forcing a commit) or the transaction is
committed due to age (10 minutes by default).
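
The effect described above can be observed from user space. A minimal sketch, assuming a POSIX system: statvfs(3) is used here as the portable interface to the same counters that statfs(2) reports, and the helper names free_kib and report_around_sync are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Free space on the filesystem holding `path`, in KiB, or -1 on error. */
long long free_kib(const char *path)
{
    struct statvfs s;

    assert(path != NULL);
    if (statvfs(path, &s) != 0)
        return -1;

    /* f_bfree is counted in units of f_frsize bytes. */
    unsigned long long bytes =
        (unsigned long long)s.f_bfree * (unsigned long long)s.f_frsize;
    return (long long)(bytes / 1024);
}

/* Print the free space before and after sync(2); 0 on success, -1 on error. */
int report_around_sync(const char *path)
{
    long long before = free_kib(path);
    if (before < 0)
        return -1;

    sync(); /* ask the kernel to write out pending changes */

    long long after = free_kib(path);
    if (after < 0)
        return -1;

    printf("%s: %lld KiB free before sync, %lld KiB after\n",
           path, before, after);
    return 0;
}
```

Calling report_around_sync("/ap") right after creating or deleting files would show whether the two figures differ on a given filesystem; on one that updates its counters immediately they should match.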

>
> Compare:
> [reiser3]
> $ pwd
> /tmp
> $ dd if=/dev/zero of=testfile bs=16k count=64
> 64+0 records in
> 64+0 records out
> $ df /
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda1 526296 330696 195600 63% /
> $ rm testfile
> $ df /
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda1 526296 329672 196624 63% /
> $ sync
> $ df /

[...]

Nikita.

2002-11-04 11:56:22

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> Hi,
>
> Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
>
> reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
> WARNING: Flush raced against extent->tail
> reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
> WARNING: flush failed: -11
> jnode_flush failed with err = -11

Can you please try the following patch to the fs/reiser4/flush.c:
----------------------------------------------------------------------
--- /tmp/flush.c Mon Nov 4 14:32:21 2002
+++ flush.c Mon Nov 4 14:32:32 2002
@@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
only. Will be removed. */
warning("nikita-2732",
"Flush raced against extent->tail");
- ret = -EAGAIN;
+ scan->stop = 1;
+ ret = 0;
goto exit;
}
assert("jmacd-1230", item_is_extent(&scan->parent_coord));
----------------------------------------------------------------------

> reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
> WARNING: Flush raced against extent->tail

[...]

> WARNING: Too many iterations: 8192
> reiser4[fixdep(841)]: traverse_tree (fs/reiser4/search.c:465)[nikita-1481]:
> WARNING: Too many iterations: 16384
> reiser4[fixdep(952)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
> WARNING: Partial conversion of 105116: 1 of 2
> reiser4[cc1(957)]: extent2tail (fs/reiser4/plugin/file/tail_conversion.c:476)[nikita-2282]:
> WARNING: Partial conversion of 105116: 0 of 2
> [snip]
>
> ... after which r4 crashes completely --
> Starts to hog all cpu time and umount() never goes through.

Try to wait a bit more and check whether any more "WARNING: Too many
iterations" appear, OK?

>
> T.

Nikita.

2002-11-04 17:04:26

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> > Hi,
> >
> > Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
> >
> > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
> > WARNING: Flush raced against extent->tail
> > reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
> > WARNING: flush failed: -11
> > jnode_flush failed with err = -11
>
> Can you please try the following patch to the fs/reiser4/flush.c:
> ----------------------------------------------------------------------
> --- /tmp/flush.c Mon Nov 4 14:32:21 2002
> +++ flush.c Mon Nov 4 14:32:32 2002
> @@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
> only. Will be removed. */
> warning("nikita-2732",
> "Flush raced against extent->tail");
> - ret = -EAGAIN;
> + scan->stop = 1;
> + ret = 0;
> goto exit;
> }
> assert("jmacd-1230", item_is_extent(&scan->parent_coord));

Seems to fix the flush errors, however, I can still see the race warnings.
Worse though, at one point I stumbled upon the following:

$ df /ap
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 1490332 -73786976294838198272 1498808 101% /ap

This was right after I hit the reset button while compiling the kernel
off a reiser4 mountpoint, went on to finish the build after reboot and
then "rm -rf"'d the whole source tree (i.e. there was nothing on the
filesystem again).

reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
your patch above.
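
One possible reading of the negative figure, offered as an assumption rather than a diagnosis: if the filesystem reports more free blocks than total blocks after the unclean shutdown, then "used = total - free" wraps around in unsigned arithmetic. With 4 KiB blocks (also an assumption), a free count exceeding the total by 2048 blocks gives 2^64 - 2048 used blocks, and scaling that to 1k-blocks gives 2^66 - 8192 = 73786976294838198272, which is the magnitude df printed. The helper names below are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blocks in use as a naive tool might compute them: total minus free,
 * in unsigned 64-bit arithmetic. Wraps around if free_blocks > total. */
uint64_t used_blocks(uint64_t total, uint64_t free_blocks)
{
    return total - free_blocks;
}

/* Same computation, but reports inconsistent counters instead of wrapping.
 * Returns 0 and stores the result in *used, or -1 if free_blocks > total. */
int used_blocks_checked(uint64_t total, uint64_t free_blocks, uint64_t *used)
{
    assert(used != NULL);
    if (free_blocks > total)
        return -1;
    *used = total - free_blocks;
    return 0;
}
```

For the filesystem above, 1490332 1k-blocks correspond to 372583 blocks of 4 KiB, so used_blocks(372583, 372583 + 2048) yields 2^64 - 2048, while the checked variant flags the inconsistency.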

> > ... after which r4 crashes completely --
> > Starts to hog all cpu time and umount() never goes through.
>
> Try to wait a bit more and check whether any more "WARNING: Too many
> iterations" appear, OK?

Jup, now all I get is the race warnings.

--
tomas szepe

2002-11-04 17:46:49

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> > > Hi,
> > >
> > > Another one: trying to build 2.5.45 off a reiser4 mountpoint, I get:
> > >
> > > reiser4[pdflush(7)]: flush_scan_extent (fs/reiser4/flush.c:3127)[nikita-2732]:
> > > WARNING: Flush raced against extent->tail
> > > reiser4[pdflush(7)]: jnode_flush (fs/reiser4/flush.c:1024)[jmacd-16739]:
> > > WARNING: flush failed: -11
> > > jnode_flush failed with err = -11
> >
> > Can you please try the following patch to the fs/reiser4/flush.c:
> > ----------------------------------------------------------------------
> > --- /tmp/flush.c Mon Nov 4 14:32:21 2002
> > +++ flush.c Mon Nov 4 14:32:32 2002
> > @@ -3149,7 +3149,8 @@ flush_scan_extent(flush_scan * scan, int
> > only. Will be removed. */
> > warning("nikita-2732",
> > "Flush raced against extent->tail");
> > - ret = -EAGAIN;
> > + scan->stop = 1;
> > + ret = 0;
> > goto exit;
> > }
> > assert("jmacd-1230", item_is_extent(&scan->parent_coord));
>
> Seems to fix the flush errors, however, I can still see the race warnings.

Good. The warning was left there for debugging. I shall remove it.

> Worse though, at one point I stumbled upon the following:
>
> $ df /ap
> Filesystem 1k-blocks Used Available Use% Mounted on
> /dev/sda2 1490332 -73786976294838198272 1498808 101% /ap
>
> This was right after I hit the reset button while compiling the kernel
> off a reiser4 mountpoint, went on to finish the build after reboot and
> then "rm -rf"'d the whole source tree (i.e. there was nothing on the
> filesystem again).
>
> reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
> your patch above.

Do you have debugging on?

>
> > > ... after which r4 crashes completely --
> > > Starts to hog all cpu time and umount() never goes through.
> >
> > Try to wait a bit more and check whether any more "WARNING: Too many
> > iterations" appear, OK?
>
> Jup, now all I get is the race warnings.
>
> --
> tomas szepe

Nikita.

2002-11-04 18:04:06

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> > Worse though, at one point I stumbled upon the following:
> >
> > $ df /ap
> > Filesystem 1k-blocks Used Available Use% Mounted on
> > /dev/sda2 1490332 -73786976294838198272 1498808 101% /ap
> >
> > This was right after I hit the reset button while compiling the kernel
> > off a reiser4 mountpoint, went on to finish the build after reboot and
> > then "rm -rf"'d the whole source tree (i.e. there was nothing on the
> > filesystem again).
> >
> > reiser4.o is 20021031 plus the rmdir leak fix from this thread plus
> > your patch above.
>
> Do you have debugging on?

Nope.

--
tomas szepe

2002-11-04 19:50:15

by Andreas Dilger

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

On Nov 04, 2002 14:00 +0300, Nikita Danilov wrote:
> > Jup, this fixes the leak, but free space still isn't reported accurately
> > until after sync gets called, which I believe is a bug too.
>
> In reiser4 allocation of disk space is delayed to transaction commit. It
> is not possible to estimate precisely amount of disk space that will be
> allocated during commit, and hence statfs(2) results are not updated
> until one does sync(2) (forcing commit) or transaction is committed due
> to age (10 minutes by default).

I find this more than a bit frightening, and it could obviously be a
huge source of reiser4's dramatic performance improvements - nothing is
being written to disk until long after a benchmark is complete (provided
you have enough RAM) if it isn't explicitly syncing before completing
the test (benchmarks like dbench and iozone don't necessarily sync).

Even more importantly, people losing 10 minutes of work is pretty
unacceptable, IMHO. The default flush interval is 30 seconds for a
reason, and in realistic scenarios files don't grow over a 10 minute
period, and even if they do you would want to start flushing that to
disk long before you have a few GB of outstanding changes. Also, this
would be a real source of problems (as I previously read was hinted at
in another reiser4 email) with filesystem full conditions.

At the very least, you need to reserve blocks in the filesystem for writes
that are under delayed allocation. Overestimating space requirements
(i.e. reserving a full block for each file, regardless of whether it will
be packed in the future or not) is far preferable to underestimating and
having a write which already "completed" suddenly find itself out of
space. If you get close to filling the filesystem, then you can always
flush the transaction to disk to "solidify your estimates" before
returning a needless ENOSPC. This will also make your "statfs" space
reporting fairly consistent, because you will return the "reserved"
stats even if they are only slightly off.

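The reservation scheme described above can be sketched as a small accounting model. This is an illustration under stated assumptions, not reiser4 code: the struct, the function names, and the idea of passing the eventual actual usage alongside the worst-case estimate (to simulate what commit discovers) are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-filesystem space accounting, all values in blocks. */
struct space_acct {
    uint64_t total;    /* blocks on the device */
    uint64_t used;     /* blocks allocated by committed transactions */
    uint64_t reserved; /* worst-case blocks promised to uncommitted writes */
    uint64_t pending;  /* blocks the open transaction will really need */
};

/* What a statfs-style query would report as free: reservations count as used. */
uint64_t acct_free(const struct space_acct *a)
{
    return a->total - a->used - a->reserved;
}

/* Commit: replace the worst-case reservation with the actual usage. */
void acct_commit(struct space_acct *a)
{
    assert(a->pending <= a->reserved);
    a->used += a->pending;
    a->reserved = 0;
    a->pending = 0;
}

/* Reserve space for a write. Returns 0, -EINVAL, or -ENOSPC. */
int acct_reserve(struct space_acct *a, uint64_t worst_case, uint64_t actual)
{
    if (actual > worst_case)
        return -EINVAL;    /* the estimate must be an upper bound */
    if (worst_case > acct_free(a))
        acct_commit(a);    /* solidify the estimates before giving up */
    if (worst_case > acct_free(a))
        return -ENOSPC;    /* genuinely out of space */
    a->reserved += worst_case;
    a->pending += actual;
    return 0;
}
```

In this model a write fails with ENOSPC only after a forced commit has turned every outstanding estimate into a real figure, and the free count never overstates what is available.
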
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2002-11-05 07:24:33

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Nikita Danilov wrote:

>Tomas Szepe writes:
> > > This should help:
> > >
> > > diff -Nru a/txnmgr.c b/txnmgr.c
> > > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> > > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> > > @@ -1917,7 +1917,7 @@
> > > return;
> > > }
> > >
> > > - if (!jnode_is_unformatted) {
> > > + if (jnode_is_znode(node)) {
> > > if ( /**jnode_get_block(node) &&*/
> > > !blocknr_is_fake(jnode_get_block(node))) {
> > > /* jnode has assigned real disk block. Put it into
> >
> >
> > Jup, this fixes the leak, but free space still isn't reported accurately
> > until after sync gets called, which I believe is a bug too.
>
>In reiser4 allocation of disk space is delayed to transaction commit. It
>is not possible to estimate precisely amount of disk space that will be
>allocated during commit, and hence statfs(2) results are not updated
>until one does sync(2) (forcing commit) or transaction is committed due
>to age (10 minutes by default).
>
>
>
The above is badly phrased, and the behavior complained of is indeed a
bug not a feature. Please fix.

statfs should be updated immediately in accordance with estimates used
by the space reservation code, and then adjusted at commit time in
accordance with actual usage.

Andreas, the performance advantage is achieved using much more than the
amount of RAM available on the computer, and is therefore mostly
independent of max transaction age. The appropriate setting of
transaction max age depends on the user. The setting we chose is
appropriate for software developers doing compiles. It is not clear to
me yet what the right setting is. Perhaps 3 minutes is more
appropriate. I was probably overly influenced by Drew Roselli's
statistics on how long the cycle is between rewrites. Her statistics are
probably skewed by having lots of CS students using the machines she got
her data from. 5 seconds is too short to perform good layout
optimization for subsequent reads.

Hans

2002-11-05 08:26:53

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

reiser writes:
> Nikita Danilov wrote:
>
> >Tomas Szepe writes:
> > > > This should help:
> > > >
> > > > diff -Nru a/txnmgr.c b/txnmgr.c
> > > > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> > > > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> > > > @@ -1917,7 +1917,7 @@
> > > > return;
> > > > }
> > > >
> > > > - if (!jnode_is_unformatted) {
> > > > + if (jnode_is_znode(node)) {
> > > > if ( /**jnode_get_block(node) &&*/
> > > > !blocknr_is_fake(jnode_get_block(node))) {
> > > > /* jnode has assigned real disk block. Put it into
> > >
> > >
> > > Jup, this fixes the leak, but free space still isn't reported accurately
> > > until after sync gets called, which I believe is a bug too.
> >
> >In reiser4 allocation of disk space is delayed to transaction commit. It
> >is not possible to estimate precisely amount of disk space that will be
> >allocated during commit, and hence statfs(2) results are not updated
> >until one does sync(2) (forcing commit) or transaction is committed due
> >to age (10 minutes by default).
> >
> >
> >
> The above is badly phrased, and the behavior complained of is indeed a
> bug not a feature. Please fix.
>
> statfs should be updated immediately in accordance with estimates used
> by the space reservation code, and then adjusted at commit time in
> accordance with actual usage.

We should not do that unless we implement forced commits for
out-of-free-space situations.

>
> Andreas, the performance advantage is achieved using much more than the
> amount of RAM available on the computer, and is therefore mostly
> independent of max transaction age. The appropriate setting of
> transaction max age depends on the user. The setting we chose is
> appropriate for software developers doing compiles. It is not clear to
> me yet what the right setting is. Perhaps 3 minutes is more
> appropriate. I was probably overly influenced by Drew Roselli's
> statistics on how long the cycle is between rewrites. Her statistics are
> probably skewed by having lots of CS students using the machines she got
> her data from. 5 seconds is too short to perform good layout
> optimization for subsequent reads.
>
> Hans
>

--
Alex.

2002-11-05 08:37:54

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Alexander Zarochentcev wrote:

> > >
> > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > >is not possible to estimate precisely amount of disk space that will be
> > >allocated during commit, and hence statfs(2) results are not updated
> > >until one does sync(2) (forcing commit) or transaction is committed due
> > >to age (10 minutes by default).
> > >
> > >
> > >
> > The above is badly phrased, and the behavior complained of is indeed a
> > bug not a feature. Please fix.
> >
> > statfs should be updated immediately in accordance with estimates used
> > by the space reservation code, and then adjusted at commit time in
> > accordance with actual usage.
>
>We should not do that unless we implement forcing of commits at out of free
>space situation.
>
I thought we had agreed to do forcing of commits at out of free space
quite some time ago? In any event, we should do forcing of commits at
out of free space. Yes?

2002-11-05 08:47:22

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

reiser writes:
> Alexander Zarochentcev wrote:
>
> > > >
> > > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > > >is not possible to estimate precisely amount of disk space that will be
> > > >allocated during commit, and hence statfs(2) results are not updated
> > > >until one does sync(2) (forcing commit) or transaction is committed due
> > > >to age (10 minutes by default).
> > > >
> > > >
> > > >
> > > The above is badly phrased, and the behavior complained of is indeed a
> > > bug not a feature. Please fix.
> > >
> > > statfs should be updated immediately in accordance with estimates used
> > > by the space reservation code, and then adjusted at commit time in
> > > accordance with actual usage.
> >
> >We should not do that unless we implement forcing of commits at out of free
> >space situation.
> >
> I thought we had agreed to do forcing of commits at out of free space
> quite some time ago? In any event, we should do forcing of commits at
> out of free space. Yes?

We will control this with a block allocator flag, which we set when we can
close the current transaction. I think it will be set in most cases.

--
Alex.

2002-11-05 09:23:14

by Andreas Dilger

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

On Nov 04, 2002 23:30 -0800, reiser wrote:
> The appropriate setting of
> transaction max age depends on the user. The setting we chose is
> appropriate for software developers doing compiles. It is not clear to
> me yet what the right setting is. Perhaps 3 minutes is more
> appropriate. I was probably overly influenced by Drew Roselli's
> statistics on how long the cycle is between rewrites. Her statistics are
> probably skewed by having lots of CS students using the machines she got
> her data from. 5 seconds is too short to perform good layout
> optimization for subsequent reads.

I think the bdflush defaults are (were?) something like 5 seconds for
metadata, and 30 seconds for file data. reiser4 should (if it doesn't
already) use the parameters set by sys_bdflush() to tune the writeout
intervals.

I would think that either:
a) A file was completely written in under 30 seconds (e.g. untar or gcc
or whatever else you are doing), so deferring allocation and writing
to disk does not help you at all.
b) A file is continuing to be written for more than 30 seconds that
has a very large amount of outstanding data which can be committed
to disk with (probably) the same read optimization quality as any
larger amount of data.
c) A file is continuing to be written for more than 30 seconds that
is growing slowly and no matter how long you defer the write you
will only get an incremental read layout. Presumably you could do
something to pre-allocate/reserve a bunch of space at the end of this
file as it continues to grow.

So, except for the very unusual case of files with lifespans between 30
seconds and 300 seconds, or files that are written to between those
intervals, I would guess that you are not gaining much extra benefit by
deferring the writes another 270 seconds.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2002-11-05 09:52:40

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> >> > This should help:
> >> >
> >> > diff -Nru a/txnmgr.c b/txnmgr.c
> >> > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> >> > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> >> > @@ -1917,7 +1917,7 @@
> >> > return;
> >> > }
> >> >
> >> > - if (!jnode_is_unformatted) {
> >> > + if (jnode_is_znode(node)) {
> >> > if ( /**jnode_get_block(node) &&*/
> >> > !blocknr_is_fake(jnode_get_block(node))) {
> >> > /* jnode has assigned real disk block. Put it into
> >>
> >>
> >> Jup, this fixes the leak, but free space still isn't reported accurately
> >> until after sync gets called, which I believe is a bug too.
> >
> >In reiser4 allocation of disk space is delayed to transaction commit. It
> >is not possible to estimate precisely amount of disk space that will be
> >allocated during commit, and hence statfs(2) results are not updated
> >until one does sync(2) (forcing commit) or transaction is committed due
> >to age (10 minutes by default).
> >
> The above is badly phrased, and the behavior complained of is indeed
> a bug not a feature. Please fix.

I just noticed the file
http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
had changed, the difference from the original 20021031 snapshot being:

--- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
+++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
@@ -46903,7 +46903,7 @@
+#if REISER4_USER_LEVEL_SIMULATION
+# define check_spin_is_locked(s) spin_is_locked(s)
+# define check_spin_is_not_locked(s) spin_is_not_locked(s)
-+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
+# define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
+# define spin_is_not_locked(s) ( ( s ) -> owner == NULL )
+# define check_spin_is_locked(s) ( ( s ) -> owner == get_current() )

So either someone is messing about with your webserver or you want multiple
versions of the supposedly same diff floating around (not exactly suitable
for gathering bugreports, is it?). If you're short on disk space, how about
gzipping the fs diff? Squeezes down to ~500k from almost 2MB.

--
Tomas Szepe <[email protected]>

2002-11-05 10:06:49

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> > >> > This should help:
> > >> >
> > >> > diff -Nru a/txnmgr.c b/txnmgr.c
> > >> > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> > >> > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> > >> > @@ -1917,7 +1917,7 @@
> > >> > return;
> > >> > }
> > >> >
> > >> > - if (!jnode_is_unformatted) {
> > >> > + if (jnode_is_znode(node)) {
> > >> > if ( /**jnode_get_block(node) &&*/
> > >> > !blocknr_is_fake(jnode_get_block(node))) {
> > >> > /* jnode has assigned real disk block. Put it into
> > >>
> > >>
> > >> Jup, this fixes the leak, but free space still isn't reported accurately
> > >> until after sync gets called, which I believe is a bug too.
> > >
> > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > >is not possible to estimate precisely amount of disk space that will be
> > >allocated during commit, and hence statfs(2) results are not updated
> > >until one does sync(2) (forcing commit) or transaction is committed due
> > >to age (10 minutes by default).
> > >
> > The above is badly phrased, and the behavior complained of is indeed
> > a bug not a feature. Please fix.
>
> I just noticed the file
> http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
> had changed, the difference from the original 20021031 snapshot being:
>
> --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
> +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
> @@ -46903,7 +46903,7 @@
> +#if REISER4_USER_LEVEL_SIMULATION
> +# define check_spin_is_locked(s) spin_is_locked(s)
> +# define check_spin_is_not_locked(s) spin_is_not_locked(s)
> -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> +# define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
> +# define spin_is_not_locked(s) ( ( s ) -> owner == NULL )
> +# define check_spin_is_locked(s) ( ( s ) -> owner == get_current() )
>
> So either someone is messing about with your webserver or you want multiple
> versions of the supposedly same diff floating around (not exactly suitable
> for gathering bugreports, is it?). If you're short on disk space, how about
> gzipping the fs diff? Squeezes down to ~500k from almost 2MB.

done for 2002.10.31 snapshot.

>
> --
> Tomas Szepe <[email protected]>

--
Alex.

2002-11-05 10:16:56

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

> > I just noticed the file
> > http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
> > had changed, the difference from the original 20021031 snapshot being:
> >
> > --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
> > +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
> > @@ -46903,7 +46903,7 @@
> > +#if REISER4_USER_LEVEL_SIMULATION
> > +# define check_spin_is_locked(s) spin_is_locked(s)
> > +# define check_spin_is_not_locked(s) spin_is_not_locked(s)
> > -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> > ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> > +# define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
> > +# define spin_is_not_locked(s) ( ( s ) -> owner == NULL )
> > +# define check_spin_is_locked(s) ( ( s ) -> owner == get_current() )
> >
> > So either someone is messing about with your webserver or you want multiple
> > versions of the supposedly same diff floating around (not exactly suitable
> > for gathering bugreports, is it?). If you're short on disk space, how about
> > gzipping the fs diff? Squeezes down to ~500k from almost 2MB.
>
> done for 2002.10.31 snapshot.

Well, the point is: could you create a new directory each time you update
the current snapshot?

Here's export-pagevec_deactivate_inactive.diff for 2.5.46:

diff -urN linux-2.5.46/mm/Makefile linux-2.5.46r4/mm/Makefile
--- linux-2.5.46/mm/Makefile 2002-11-05 11:07:21.000000000 +0100
+++ linux-2.5.46.1/mm/Makefile 2002-11-05 11:13:11.000000000 +0100
@@ -2,7 +2,7 @@
# Makefile for the linux memory manager.
#

-export-objs := shmem.o filemap.o mempool.o page_alloc.o page-writeback.o
+export-objs := shmem.o filemap.o mempool.o page_alloc.o page-writeback.o swap.o

obj-y := memory.o mmap.o filemap.o fremap.o mprotect.o mlock.o mremap.o \
vmalloc.o slab.o bootmem.o swap.o vmscan.o page_alloc.o \
diff -urN linux-2.5.46/mm/swap.c linux-2.5.46.1/mm/swap.c
--- linux-2.5.46/mm/swap.c 2002-11-05 11:07:21.000000000 +0100
+++ linux-2.5.46.1/mm/swap.c 2002-11-05 11:13:35.000000000 +0100
@@ -23,6 +23,7 @@
#include <linux/buffer_head.h>
#include <linux/prefetch.h>
#include <linux/percpu.h>
+#include <linux/module.h>

/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -227,6 +228,7 @@
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(pvec);
}
+EXPORT_SYMBOL(pagevec_deactivate_inactive);

/*
* Add the passed pages to the LRU, then drop the caller's refcount

2002-11-05 10:39:39

Subject: Re: [BK][PATCH] Reiser4, will double Linux FS performance, please apply

Tomas Szepe writes:
> > >> > This should help:
> > >> >
> > >> > diff -Nru a/txnmgr.c b/txnmgr.c
> > >> > --- a/txnmgr.c Wed Oct 30 18:58:09 2002
> > >> > +++ b/txnmgr.c Fri Nov 1 20:13:27 2002
> > >> > @@ -1917,7 +1917,7 @@
> > >> > return;
> > >> > }
> > >> >
> > >> > - if (!jnode_is_unformatted) {
> > >> > + if (jnode_is_znode(node)) {
> > >> > if ( /**jnode_get_block(node) &&*/
> > >> > !blocknr_is_fake(jnode_get_block(node))) {
> > >> > /* jnode has assigned real disk block. Put it into
> > >>
> > >>
> > >> Jup, this fixes the leak, but free space still isn't reported accurately
> > >> until after sync gets called, which I believe is a bug too.
> > >
> > >In reiser4 allocation of disk space is delayed to transaction commit. It
> > >is not possible to estimate precisely amount of disk space that will be
> > >allocated during commit, and hence statfs(2) results are not updated
> > >until one does sync(2) (forcing commit) or transaction is committed due
> > >to age (10 minutes by default).
> > >
> > The above is badly phrased, and the behavior complained of is indeed
> > a bug not a feature. Please fix.
>
> I just noticed the file
> http://thebsh.namesys.com/snapshots/2002.10.31/reiser4.diff
> had changed, the difference from the original 20021031 snapshot being:
>
> --- fs_reiser4.diff.old 2002-10-31 14:11:50.000000000 +0100
> +++ fs_reiser4.diff.new 2002-11-04 16:57:46.000000000 +0100
> @@ -46903,7 +46903,7 @@
> +#if REISER4_USER_LEVEL_SIMULATION
> +# define check_spin_is_locked(s) spin_is_locked(s)
> +# define check_spin_is_not_locked(s) spin_is_not_locked(s)
> -+#elif defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> ++#elif 0 && defined( CONFIG_DEBUG_SPINLOCK ) && defined( CONFIG_SMP )
> +# define check_spin_is_not_locked(s) ( ( s ) -> owner != get_current() )
> +# define spin_is_not_locked(s) ( ( s ) -> owner == NULL )
> +# define check_spin_is_locked(s) ( ( s ) -> owner == get_current() )
>
> So either someone is messing about with your webserver or you want multiple
> versions of the supposedly same diff floating around (not exactly suitable

Looks like you managed to download an early, buggy version of the diff
that existed on the server only for a short time and was later
overwritten in place (yes, a silly thing to do).

> for gathering bugreports, is it?). If you're short on disk space, how about
> gzipping the fs diff? Squeezes down to ~500k from almost 2MB.

OK.

>

Nikita.

2002-11-05 21:01:58