Greetings,
This post is dedicated to Ted, who raised doubts a while back about
whether Tux3 can ever have a fast fsync:
https://lkml.org/lkml/2013/5/11/128
"Re: Tux3 Report: Faster than tmpfs, what?"
Ted suggested that Tux3's inherently asynchronous design would be a
limitation when it comes to synchronous operations like fsync. As he
put it, "any advantage of decoupling the front/back end is nullified"
because of temporal coupling. We found the opposite to be true: our
asynchronous design works as well for synchronous operations as it does
for any other, and Tux3 now has a high performance fsync to prove it.
Until now, Tux3 handled fsync as a full delta commit equivalent to
syncfs, just like Ext3. This is semantically correct but sometimes
commits more data than necessary and creates a bottleneck by serializing
all fsyncs. To optimize it, we added a mechanism for committing any
subset of dirty inodes separately from a full delta.
Like our normal delta commit, the design is asynchronous: front end
tasks produce fsync work and a backend task commits it. We now have two
backends, one to commit fsyncs and another to commit deltas, serialized
so the combination is single threaded and mostly lockless. Each fsync
moves an inode's dirty blocks from the front delta to the back delta,
then queues the inode for the backend. The backend spins away committing
fsyncs, opportunistically batching them up into group commits. The fsync
backend gives priority to the delta backend: whenever a full delta flush
is needed because of cache pressure or cache age, it finishes its fsync
work and gets out of the way.
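To make that concrete, here is a toy sketch in plain C of the front/back
handoff with opportunistic group commit. This is not Tux3 code and the
names are invented for the illustration; it only shows the pattern: front
end tasks queue fsync work and sleep, while a single backend drains
whatever has accumulated and covers the whole batch with one commit.
/*
 * groupsync.c - illustration only, not Tux3 code.
 *
 * To build: cc -pthread groupsync.c -o groupsync
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;  /* backend waits here */
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;  /* frontends wait here */
static int queued;          /* fsyncs waiting for the next batch */
static long queue_gen;      /* batch currently accepting work */
static long done_gen = -1;  /* last batch known durable */
static int finished;

static void commit_block_write(void)
{
        struct timespec ts = { 0, 16500000 };  /* pretend 16.5 ms commit latency */
        nanosleep(&ts, NULL);
}

static void *backend(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!finished) {
                while (!queued && !finished)
                        pthread_cond_wait(&more, &lock);
                if (!queued)
                        break;
                int batch = queued;
                long gen = queue_gen;  /* everything queued so far rides in this batch */
                queued = 0;
                queue_gen++;           /* later arrivals go in the next batch */
                pthread_mutex_unlock(&lock);
                commit_block_write();  /* one commit covers the whole batch */
                printf("batch %ld: committed %d fsyncs\n", gen, batch);
                pthread_mutex_lock(&lock);
                done_gen = gen;
                pthread_cond_broadcast(&done);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void frontend_fsync(void)
{
        pthread_mutex_lock(&lock);
        long target = queue_gen;         /* our work rides in this batch */
        queued++;
        pthread_cond_signal(&more);
        while (done_gen < target)        /* block until that batch is durable */
                pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
}

static void *frontend(void *arg)
{
        (void)arg;
        for (int i = 0; i < 3; i++)
                frontend_fsync();
        return NULL;
}

int main(void)
{
        pthread_t back, front[4];
        pthread_create(&back, NULL, backend, NULL);
        for (int t = 0; t < 4; t++)
                pthread_create(&front[t], NULL, frontend, NULL);
        for (int t = 0; t < 4; t++)
                pthread_join(front[t], NULL);
        pthread_mutex_lock(&lock);
        finished = 1;
        pthread_cond_signal(&more);
        pthread_mutex_unlock(&lock);
        pthread_join(back, NULL);
        return 0;
}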
This approach required only minimal changes to our existing delta commit
mechanism, mainly to support crash consistency. In particular, our delta
model did not need to be changed at all, and block forking protects
fsync commits in the same way as normal delta commits. That is, an inode
can be freely updated (even via mmap) while it is being fsynced. The
fsync data is isolated from those changes and the frontend does not
stall.
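The forking idea itself is simple enough to sketch in a few lines of C.
These are not the real Tux3 structures, just an illustration of the
principle: if a cached block is frozen for an in-flight commit, the front
end clones it and redirects the write to the clone, so the committed
image stays stable and the writer never waits.
/* Toy illustration of block forking, not the real Tux3 structures. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCKSIZE = 16 };

struct block {
        char data[BLOCKSIZE];
        int committing;          /* set while the backend writes this block out */
};

/* Return the block the front end may scribble on. */
static struct block *fork_for_write(struct block **slot)
{
        struct block *block = *slot;
        if (!block->committing)
                return block;            /* not under IO: modify in place */
        struct block *clone = malloc(sizeof *clone);
        memcpy(clone->data, block->data, BLOCKSIZE);
        clone->committing = 0;
        *slot = clone;                   /* cache now points at the clone */
        return clone;                    /* old copy stays stable for the commit */
}

int main(void)
{
        struct block *cached = calloc(1, sizeof *cached);
        strcpy(cached->data, "version 1");

        struct block *frozen = cached;   /* backend snapshots this pointer... */
        frozen->committing = 1;          /* ...and marks it in flight */

        struct block *writable = fork_for_write(&cached);
        strcpy(writable->data, "version 2");       /* front end updates freely */

        printf("committing: %s\n", frozen->data);  /* still "version 1" */
        printf("cached:     %s\n", cached->data);  /* now "version 2" */
        return 0;
}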
I measured fsync performance using a 7200 RPM disk as a virtual drive
under KVM, configured with cache=none so that asynchronous writes are
cached and synchronous writes translate into direct writes to the
block device. To focus purely on fsync, I wrote a small utility (at the
end of this post) that forks a number of tasks, each of which
continuously appends to and fsyncs its own file. For a single task doing
1,000 fsyncs of 1K each, we have:
Ext4: 34.34s
XFS: 23.63s
Btrfs: 34.84s
Tux3: 17.24s
Tux3 has a nice advantage for isolated fsyncs. This is mainly due to
writing a small number of blocks per commit, currently just five blocks
for a small file. As with a normal delta commit, we do not update bitmap
blocks or index nodes, but instead log logical changes, flushing out
the primary metadata with occasional "unify" deltas to control the log
size and keep replay fast. The 1,000 fsync test typically does about
three unifies, so we do in fact update all our metadata and pay that
cost, just not on every commit.
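For readers unfamiliar with logical logging, here is a toy model in C.
It is not the Tux3 on-disk format and the names are made up; it just
shows the trade: each commit appends tiny "what changed" records, and an
occasional unify, which is also what replay does after a crash, folds
them into the primary metadata in one pass.
/* Toy model of logical logging, not the Tux3 on-disk format. */
#include <stdio.h>

enum { LOG_ALLOC, LOG_FREE };

struct logrec { int op; long block; };      /* a few bytes per change */

static unsigned char bitmap[1 << 10];       /* primary metadata, written rarely */
static struct logrec changelog[256];        /* logical log, written every commit */
static int logtail;

static void log_change(int op, long block)
{
        changelog[logtail++] = (struct logrec){ op, block };
}

static void unify(void)                     /* also what replay does after a crash */
{
        for (int i = 0; i < logtail; i++) {
                long b = changelog[i].block;
                if (changelog[i].op == LOG_ALLOC)
                        bitmap[b >> 3] |= 1 << (b & 7);
                else
                        bitmap[b >> 3] &= ~(1 << (b & 7));
        }
        logtail = 0;                        /* bitmap now current; log starts over */
}

int main(void)
{
        log_change(LOG_ALLOC, 42);          /* ordinary commits write only log records */
        log_change(LOG_ALLOC, 43);
        log_change(LOG_FREE, 42);
        unify();                            /* occasional unify updates the real bitmap */
        printf("block 42 allocated: %d\n", !!(bitmap[42 >> 3] & (1 << (42 & 7))));
        printf("block 43 allocated: %d\n", !!(bitmap[43 >> 3] & (1 << (43 & 7))));
        return 0;
}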
The win is bigger than it appears at first glance, because writing the
commit block takes about 16.5 ms, or roughly 16.5 seconds over the 1,000
commits. That cost would be similar for all the filesystems tested, so
factoring it out leaves about 0.7 seconds of work for Tux3 versus roughly
7 seconds for XFS, 18 for Ext4 and 18 for Btrfs: Tux3 must be doing 10
times less work than XFS, 24 times less than Ext4 and 25 times less than
Btrfs. Over time, that gap should widen as we reduce our small file
commit overhead towards just two blocks.
In this single threaded test, we pay the full price for "communication
delays" with our asynchronous backend, which evidently does not amount
to much. Given that a context switch is on the order of microseconds
while writing the commit block is on the order of milliseconds, it is
unsurprising that two extra context switches just disappear in the
noise.
Things get more interesting with parallel fsyncs. In this test, each
task does ten fsyncs and task count scales from ten to ten thousand. We
see that all tested filesystems are able to combine fsyncs into group
commits, with varying degrees of success:
Tasks: 10 100 1,000 10,000
Ext4: 0.79s 0.98s 4.62s 61.45s
XFS: 0.75s 1.68s 20.97s 238.23s
Btrfs: 0.53s 0.78s 3.80s 84.34s
Tux3: 0.27s 0.34s 1.00s 6.86s
(lower is better)
We expect sub-linear scaling with respect to tasks as opportunities to
combine commits increase, then linear scaling as total write traffic
begins to dominate. Tux3 still shows sub-linear scaling even at 10,000
tasks. XFS scales poorly, and also suffers from read starvation at the
high end, sometimes taking tens of seconds to cat a file or minutes to
list a directory. Ext4 and Tux3 exhibit no such issues, remaining
completely responsive under all tested loads. The bottom line for this
test is, Tux3 is twice as fast at fsync with a modest task count, and
the gap widens to nine times faster than its nearest competitor as task
count increases.
Is there any practical use for fast parallel fsync of tens of thousands
of tasks? This could be useful for a scalable transaction server that
sits directly on the filesystem instead of a database, as is the
fashion for big data these days. It certainly can't hurt to know that
if you need that kind of scaling, Tux3 will do it.
Of course, a pure fsync load could be viewed as somewhat unnatural. We
also need to know what happens under a realistic load with buffered
operations mixed with fsyncs. We turn to an old friend, dbench:
Dbench -t10
Tasks: 8 16 32
Ext4: 35.32 MB/s 34.08 MB/s 39.71 MB/s
XFS: 32.12 MB/s 25.08 MB/s 30.12 MB/s
Btrfs: 54.40 MB/s 75.09 MB/s 102.81 MB/s
Tux3: 85.82 MB/s 133.69 MB/s 159.78 MB/s
(higher is better)
Tux3 and Btrfs scale well and are way ahead of Ext4 and XFS, which
scale poorly or even negatively. Tux3 is the leader by a wide margin,
beating XFS by more than a factor of 5 at the high end.
Dbench -t10 -s (all file operations synchronous)
Tasks: 8 16 32
Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
(higher is better)
With a pure synchronous load (O_SYNC) the ranking is not changed but the
gaps widen, and Tux3 outperforms XFS by a factor of 7.5.
At the risk of overgeneralizing, a trend seems to be emerging: the new,
write-anywhere designs run synchronous operations faster, combine
synchronous and asynchronous operations more efficiently, and scale
better to high task counts than the traditional journaling designs. If
there is to be an Ext5, it might be worth considering the merit of
abandoning the journal in favor of something along the lines of Tux3's
redirect-on-write and logging combination.
Getting back to Ted's question, perhaps an asynchronous design really is
a better idea all round, even for synchronous operations, and perhaps
there really is such a thing as an improved design that is not just a
different set of tradeoffs.
In the full disclosure department, Tux3 is still not properly optimized
in some areas. One of them is fragmentation: it is not very hard to
make Tux3 slow down by running long tests. Our current allocation
algorithm is completely naive - it just allocates the next available
block and wraps around at the top of the volume. After a few wraps, it
makes a big mess. So today we are not claiming victory in the benchmark
department; we still have some work to do. Today is just about fsync,
for which it is fair to say that Tux3 sets a new performance standard.
Regards,
Daniel
Footnote: While I was doing this work, Hirofumi set about improving our
full filesystem sync to support merging parallel full syncs into single
delta commits. That one change improved fsync performance so much that
I almost abandoned my separate fsync work. However, a true, separate
fsync with aggressive group commit eventually proved superior, so now we
have both: a high performance fsync and a full filesystem sync that is
nearly as fast under many loads.
/*
* syncs.c
*
* D.R. Phillips, 2015
*
* To build: c99 -Wall syncs.c -o syncs
* To run: ./syncs [<filename> [<syncs> [<tasks>]]]
*/
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/stat.h>
char text[1024] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
        const char *basename = argc < 2 ? "foo" : argv[1];
        char name[100];
        int steps = argc < 3 ? 1 : atoi(argv[2]);
        int tasks = argc < 4 ? 1 : atoi(argv[3]);
        int err, fd;

        /* Parent: spawn one child per task, each with its own file. */
        for (int t = 0; t < tasks; t++) {
                snprintf(name, sizeof name, "%s%i", basename, t);
                if (!fork())
                        goto child;
        }
        for (int t = 0; t < tasks; t++)
                wait(&err);
        return 0;

child:
        /* Child: append to its own file and fsync after each write. */
        fd = creat(name, S_IRWXU);
        for (int i = 0; i < steps; i++) {
                write(fd, text, sizeof text);
                fsync(fd);
        }
        return 0;
}
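For reference, the argument order is <filename> <syncs> <tasks>, so the
single task test above corresponds to "./syncs foo 1000 1" and the largest
parallel test to "./syncs foo 10 10000". Each task appends to its own
numbered file (foo0, foo1, and so on).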
Where does tux3 live? What I found looked abandoned.
-Mike
On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> Where does tux3 live? What I found looked abandoned.
Current work is here:
https://github.com/OGAWAHirofumi/linux-tux3
Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
optimized syncfs is already in there, which isn't a lot slower.
Regards,
Daniel
On Wed, Apr 29, 2015 at 8:01 AM, Daniel Phillips <[email protected]> wrote:
> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
>>
>> Where does tux3 live? What I found looked abandoned.
>
>
> Current work is here:
>
> https://github.com/OGAWAHirofumi/linux-tux3
>
> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> optimized syncfs is already in there, which isn't a lot slower.
Where can I find the fsync code?
IOW how to reproduce your results? :)
--
Thanks,
//richard
On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> > Where does tux3 live? What I found looked abandoned.
>
> Current work is here:
>
> https://github.com/OGAWAHirofumi/linux-tux3
>
> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> optimized syncfs is already in there, which isn't a lot slower.
Ah, I did find the right spot, it's just been idle a while. Where does
one find mkfs.tux3?
-Mike
On Tuesday, April 28, 2015 11:20:08 PM PDT, Richard Weinberger wrote:
> On Wed, Apr 29, 2015 at 8:01 AM, Daniel Phillips <[email protected]> wrote:
>> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote: ...
>
> Where can I find the fsync code?
> IOW how to reproduce your results? :)
Hi Richard,
If you can bear with us, the latest code needs to make it through
Hirofumi's QA before it appears on github. If you are impatient, the
fsync in current head benchmarks pretty well too; I don't think I need
to apologize for it.
In any case, you build the userspace tools from the hirofumi-user branch
by running make in fs/tux3/user. This builds the tux3 command, and you
make a filesystem with "tux3 mkfs <volume>".
You can build the kernel including Tux3 from the hirofumi branch or the
hirofumi-user branch, provided you run "make clean SUBDIRS=fs/tux3"
before building the kernel, or make clean in fs/tux3/user before building
the user space, so user and kernel .o files don't collide. A little
awkward indeed, but still it is pretty cool that we can even build that
code for user space.
The wiki might be helpful:
https://github.com/OGAWAHirofumi/linux-tux3/wiki
https://github.com/OGAWAHirofumi/linux-tux3/wiki/Compile
This is current, except you want to build from hirofumi and hirofumi-user
rather than master and user because the latter is a bit old.
Regards,
Daniel
On Tuesday, April 28, 2015 11:33:33 PM PDT, Mike Galbraith wrote:
> On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
>> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
>>> Where does tux3 live? What I found looked abandoned.
>>
>> Current work is here:
>>
>> https://github.com/OGAWAHirofumi/linux-tux3
>>
>> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
>> optimized syncfs is already in there, which isn't a lot slower.
>
> Ah, I did find the right spot, it's just been idle a while. Where does
> one find mkfs.tux3?
Hi Mike,
See my reply to Richard. You are right, we have been developing on
Hirofumi's branch and master is getting old. Short version:
checkout hirofumi-user
cd fs/tux3/user
make
./tux3 mkfs <volume>
Regards,
Daniel
On Wed, 2015-04-29 at 00:23 -0700, Daniel Phillips wrote:
> On Tuesday, April 28, 2015 11:33:33 PM PDT, Mike Galbraith wrote:
> > On Tue, 2015-04-28 at 23:01 -0700, Daniel Phillips wrote:
> >> On Tuesday, April 28, 2015 7:21:11 PM PDT, Mike Galbraith wrote:
> >>> Where does tux3 live? What I found looked abandoned.
> >>
> >> Current work is here:
> >>
> >> https://github.com/OGAWAHirofumi/linux-tux3
> >>
> >> Note, the new fsync code isn't pushed to that tree yet, however Hirofumi's
> >> optimized syncfs is already in there, which isn't a lot slower.
> >
> > Ah, I did find the right spot, it's just been idle a while. Where does
> > one find mkfs.tux3?
>
> Hi Mike,
>
> See my reply to Richard. You are right, we have been developing on
> Hirofumi's
> branch and master is getting old. Short version:
>
> checkout hirofumi-user
> cd fs/tux3/user
> make
> ./tux3 mkfs <volume>
Ok, thanks.
I was curious about horrible looking plain ole dbench numbers you
posted, as when I used to play with it, default looked like a kinda
silly non-io test most frequently used to pile threads on a box to see
when the axles started bending. Seems default load has changed.
With dbench v4.00, tux3 seems to be king of the max_latency hill, but
btrfs took throughput on my box. With v3.04, tux3 took 1st place at
splashing about in pagecache, but last place at dbench -S.
Hohum, curiosity satisfied.
/usr/local/bin/dbench -t 30 (version 4.00)
ext4 Throughput 31.6148 MB/sec 8 clients 8 procs max_latency=1696.854 ms
xfs Throughput 26.4005 MB/sec 8 clients 8 procs max_latency=1508.581 ms
btrfs Throughput 82.3654 MB/sec 8 clients 8 procs max_latency=1274.960 ms
tux3 Throughput 93.0047 MB/sec 8 clients 8 procs max_latency=99.712 ms
ext4 Throughput 49.9795 MB/sec 16 clients 16 procs max_latency=2180.108 ms
xfs Throughput 35.038 MB/sec 16 clients 16 procs max_latency=3107.321 ms
btrfs Throughput 148.894 MB/sec 16 clients 16 procs max_latency=618.070 ms
tux3 Throughput 130.532 MB/sec 16 clients 16 procs max_latency=141.743 ms
ext4 Throughput 69.2642 MB/sec 32 clients 32 procs max_latency=3166.374 ms
xfs Throughput 55.3805 MB/sec 32 clients 32 procs max_latency=4921.660 ms
btrfs Throughput 230.488 MB/sec 32 clients 32 procs max_latency=3673.387 ms
tux3 Throughput 179.473 MB/sec 32 clients 32 procs max_latency=194.046 ms
/usr/local/bin/dbench -B fileio -t 30 (version 4.00)
ext4 Throughput 84.7361 MB/sec 32 clients 32 procs max_latency=1401.683 ms
xfs Throughput 57.9369 MB/sec 32 clients 32 procs max_latency=1397.910 ms
btrfs Throughput 268.738 MB/sec 32 clients 32 procs max_latency=639.411 ms
tux3 Throughput 186.172 MB/sec 32 clients 32 procs max_latency=167.389 ms
/usr/bin/dbench -t 30 32 (version 3.04)
ext4 Throughput 7920.95 MB/sec 32 procs
xfs Throughput 674.993 MB/sec 32 procs
btrfs Throughput 1910.63 MB/sec 32 procs
tux3 Throughput 8262.68 MB/sec 32 procs
/usr/bin/dbench -S -t 30 32 (version 3.04)
ext4 Throughput 87.2774 MB/sec (sync dirs) 32 procs
xfs Throughput 89.3977 MB/sec (sync dirs) 32 procs
btrfs Throughput 101.888 MB/sec (sync dirs) 32 procs
tux3 Throughput 78.7463 MB/sec (sync dirs) 32 procs
Here's something that _might_ interest xfs folks.
cd git (source repository of git itself)
make clean
echo 3 > /proc/sys/vm/drop_caches
time make -j8 test
ext4 2m20.721s
xfs 6m41.887s <-- ick
btrfs 1m32.038s
tux3 1m30.262s
Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
really hate whatever git selftests are doing this much?
-Mike
On 2015-04-29 15:05, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4 2m20.721s
> xfs 6m41.887s <-- ick
> btrfs 1m32.038s
> tux3 1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
>
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
>
> -Mike
>
I've been using the defaults for it and have been perfectly happy,
although I do use a few non-default mount options (like noatime and
noquota). It may just be a factor of what exactly the tests are doing.
Based on my experience, xfs _is_ better performance-wise with a few
large files instead of a lot of small ones when used with the default
mkfs options. Of course, my uses for it are more focused on stability
and reliability than performance (my primary use for XFS is /boot, and I
use BTRFS for pretty much everything else).
On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
>
> [dbench bakeoff]
>
> With dbench v4.00, tux3 seems to be king of the max_latency hill, but
> btrfs took throughput on my box. With v3.04, tux3 took 1st place at
> splashing about in pagecache, but last place at dbench -S.
>
> Hohum, curiosity satisfied.
Hi Mike,
Thanks for that. Please keep in mind, that was our B team, it does a
full fs sync for every fsync. Maybe a rematch when the shiny new one
lands? Also, hardware? It looks like a single 7200 RPM disk, but it
would be nice to know. And it seems, not all dbench 4.0 are equal.
Mine doesn't have a -B option.
That order of magnitude latency difference is striking. It sounds
good, but what does it mean? I see a smaller difference here, maybe
because of running under KVM.
Your results seem to confirm the gap I noticed between Ext4 and XFS
on the one hand and Btrfs and Tux3 on the other, with the caveat that
the anomalous dbench -S result is probably about running with the
older fsync code. Of course, this is just dbench, but maybe something
to keep an eye on.
Regards,
Daniel
On Wednesday, April 29, 2015 12:05:26 PM PDT, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4 2m20.721s
> xfs 6m41.887s <-- ick
> btrfs 1m32.038s
> tux3 1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
>
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
I'm more interested in the fact that we eked out a win :)
Btrfs appears to optimize tiny files by storing them in its big btree,
the equivalent of our itree, and Tux3 doesn't do that yet, so we are a
bit hobbled for a make load. Eventually, that gap should widen.
The pattern I noticed where the write-anywhere designs are beating the
journal designs seems to continue here. I am sure there are exceptions,
but maybe it is a real thing.
Regards,
Daniel
Daniel Phillips <[email protected]> writes:
> On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
>>
>> [dbench bakeoff]
>>
>> With dbench v4.00, tux3 seems to be king of the max_latency hill, but
>> btrfs took throughput on my box. With v3.04, tux3 took 1st place at
>> splashing about in pagecache, but last place at dbench -S.
>>
>> Hohum, curiosity satisfied.
>
> Thanks for that. Please keep in mind, that was our B team, it does a
> full fs sync for every fsync. Maybe a rematch when the shiny new one
> lands? Also, hardware? It looks like a single 7200 RPM disk, but it
> would be nice to know. And it seems, not all dbench 4.0 are equal.
> Mine doesn't have a -B option.
Yeah, I also want to know the hardware. Also, what size of partition? And
was each test done on a fresh FS (i.e. right after mkfs), or was the same
FS used through all the tests?
My "hirofumi" branch in the public repo still has a bug that leaves empty
blocks for inodes on repeated create and unlink, and this bug fragments
the FS very quickly. (This bug is what I'm fixing now.)
If the same FS was used, your test might have hit this bug.
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4 2m20.721s
> xfs 6m41.887s <-- ick
> btrfs 1m32.038s
> tux3 1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
using defaults:
real user sys
xfs 3m16.138s 7m8.341s 14m32.462s
ext4 3m18.045s 7m7.840s 14m32.994s
btrfs 3m45.149s 7m10.184s 16m30.498s
What you are seeing is physical seek distances impacting read
performance. XFS does not optimise for minimal physical seek
distance, and hence is slower than filesytsems that do optimise for
minimal seek distance. This shows up especially well on slow single
spindles.
XFS is *adequate* for the use on slow single drives, but it is
really designed for best performance on storage hardware that is not
seek distance sensitive.
IOWS, XFS just hates your disk. Spend $50 and buy a cheap SSD and
the problem goes away. :)
----
And now in more detail.
It's easy to be fast on empty filesystems. XFS does not aim to be
fast in such situations - it aims to have consistent performance
across the life of the filesystem.
In this case, ext4, btrfs and tux3 have optimal allocation filling
from the outside of the disk, while XFS is spreading the files
across (at least) 4 separate regions of the whole disk. Hence XFS is
seeing seek times on read are much larger than the other filesystems
when the filesystem is empty as it is doing full disk seeks rather
than being confined to the outer edges of spindle.
Thing is, once you've abused those filesytsems for a couple of
months, the files in ext4, btrfs and tux3 are not going to be laid
out perfectly on the outer edge of the disk. They'll be spread all
over the place and so all the filesystems will be seeing large seeks
on read. The thing is, XFS will have roughly the same performance as
when the filesystem is empty because the spreading of the allocation
allows it to maintain better locality and separation and hence
doesn't fragment free space nearly as badly as the other filesystems.
Free space fragmentation is what leads to performance degradation in
filesystems, and all the other filesystem will have degraded to be
*much worse* than XFS.
Put simply: empty filesystem benchmarking does not show the real
performance of the filesystem under sustained production workloads.
Hence benchmarks like this - while interesting from a theoretical
point of view and widely used for bragging about who's got the
fastest - are mostly irrelevant to determining how the filesystem
will perform in production environments.
We can also look at this algorithm in a different way: take a large
filesystem (say a few hundred TB) across a few tens of disks in a
linear concat. ext4, btrfs and tux3 will only hit the first disk in
the concat, and so go no faster because they are still bound by
physical seek times. XFS, however, will spread the load across many
(if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO. Then you'll
see that application level IO concurrency becomes the performance
limitation, not the physical seek time of the hardware.
IOWs, what you don't see here is that the XFS algorithms that make
your test slow will keep *lots* of disks busy. i.e. testing empty
filesystem performance on a single, slow disk demonstrates that an
algorithm designed for scalability isn't designed to achieve
physical seek distance minimisation. Hence your storage makes XFS
look particularly poor in comparison to filesystems that are being
designed and optimised for the limitations of single slow spindles...
To further demonstrate that it is physical seek distance that is the
issue here, lets take the seek time out of the equation (e.g. use a
SSD). Doing that will result in basically no difference in
performance between all 4 filesystems as performance will now be
determined by application level concurrency and that is the same for
all tests.
e.g. on a 16p, 16GB RAM VM with storage on SSDs, a "make -j 8"
compile test on a kernel source tree (using my normal test machine
.config) gives:
real user sys
xfs: 4m6.723s 26m21.087s 2m49.426s
ext4: 4m11.415s 26m21.122s 2m49.786s
btrfs: 4m8.118s 26m26.440s 2m50.357s
i.e. take seek times out of the picture, and XFS is just as fast as
any of the other filesystems.
Just about everyone I know uses SSDs in their laptops and machines
that build kernels these days, and spinning disks are rapidly
disappearing from enterprise and HPC environments which also happens
to be the target markets for XFS. Hence filesystem performance on
slow single spindles is the furthest thing away from what we really
need to optimise XFS for.
Indeed, I'll point you to where we are going with fsync optimisation
- it's completely the other end of the scale:
http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
i.e. being able to scale effectively to tens of thousands of fsync
calls every second because that's what applications like ceph and
gluster really need from XFS....
> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?
It just hates your disk. Spend $50 and buy a cheap SSD and the
problem goes away. :)
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Apr 28, 2015 at 04:13:18PM -0700, Daniel Phillips wrote:
> Greetings,
>
> This post is dedicated to Ted, who raised doubts a while back about
> whether Tux3 can ever have a fast fsync:
>
> https://lkml.org/lkml/2013/5/11/128
> "Re: Tux3 Report: Faster than tmpfs, what?"
[snip]
> I measured fsync performance using a 7200 RPM disk as a virtual
> drive under KVM, configured with cache=none so that asynchronous
> writes are cached and synchronous writes translate into direct
> writes to the block device.
Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that "wins"
will be the filesystem that minimises fsync seek latency above all
other considerations.
http://www.spinics.net/lists/kernel/msg1978216.html
So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes. I didn't
test tux3, you don't make it easy to get or build.
> To focus purely on fsync, I wrote a
> small utility (at the end of this post) that forks a number of
> tasks, each of which continuously appends to and fsyncs its own
> file. For a single task doing 1,000 fsyncs of 1K each, we have:
>
> Ext4: 34.34s
> XFS: 23.63s
> Btrfs: 34.84s
> Tux3: 17.24s
Ext4: 1.94s
XFS: 2.06s
Btrfs: 2.06s
All equally fast, so I can't see how tux3 would be much faster here.
> Things get more interesting with parallel fsyncs. In this test, each
> task does ten fsyncs and task count scales from ten to ten thousand.
> We see that all tested filesystems are able to combine fsyncs into
> group commits, with varying degrees of success:
>
> Tasks: 10 100 1,000 10,000
> Ext4: 0.79s 0.98s 4.62s 61.45s
> XFS: 0.75s 1.68s 20.97s 238.23s
> Btrfs 0.53s 0.78s 3.80s 84.34s
> Tux3: 0.27s 0.34s 1.00s 6.86s
Tasks: 10 100 1,000 10,000
Ext4: 0.05s 0.12s 0.48s 3.99s
XFS: 0.25s 0.41s 0.96s 4.07s
Btrfs 0.22s 0.50s 2.86s 161.04s
(lower is better)
Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.
FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
and a half minutes in that 10000 fork test so wasn't IO bound at
all.
> Is there any practical use for fast parallel fsync of tens of thousands
> of tasks? This could be useful for a scalable transaction server
> that sits directly on the filesystem instead of a database, as is
> the fashion for big data these days. It certainly can't hurt to know
> that if you need that kind of scaling, Tux3 will do it.
Ext4 and XFS already do that just fine, too, when you use storage
suited to such a workload and you have a sane interface for
submitting tens of thousands of concurrent fsync operations. e.g
http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> Of course, a pure fsync load could be viewed as somewhat unnatural. We
> also need to know what happens under a realistic load with buffered
> operations mixed with fsyncs. We turn to an old friend, dbench:
>
> Dbench -t10
>
> Tasks: 8 16 32
> Ext4: 35.32 MB/s 34.08 MB/s 39.71 MB/s
> XFS: 32.12 MB/s 25.08 MB/s 30.12 MB/s
> Btrfs: 54.40 MB/s 75.09 MB/s 102.81 MB/s
> Tux3: 85.82 MB/s 133.69 MB/s 159.78 MB/s
> (higher is better)
On a SSD (256GB samsung 840 EVO), running 4.0.0:
Tasks: 8 16 32
Ext4: 598.27 MB/s 981.13 MB/s 1233.77 MB/s
XFS: 884.62 MB/s 1328.21 MB/s 1373.66 MB/s
Btrfs: 201.64 MB/s 137.55 MB/s 108.56 MB/s
dbench looks *very different* when there is no seek latency,
doesn't it?
> Dbench -t10 -s (all file operations synchronous)
>
> Tasks: 8 16 32
> Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
> XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
> Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
> Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
> (higher is better)
Ext4: 173.54 MB/s 294.41 MB/s 424.11 MB/s
XFS: 172.98 MB/s 342.78 MB/s 458.87 MB/s
Btrfs: 36.92 MB/s 34.52 MB/s 55.19 MB/s
Again, the numbers are completely the other way around on a SSD,
with the conventional filesystems being 5-10x faster than the
WA/COW style filesystem.
....
> In the full disclosure department, Tux3 is still not properly
> optimized in some areas. One of them is fragmentation: it is not
> very hard to make Tux3 slow down by running long tests. Our current
Oh, that still hasn't been fixed?
Until you sort out how you are going to scale allocation to tens of
TB and not fragment free space over time, fsync performance of the
filesystem is pretty much irrelevant. Changing the allocation
algorithms will fundamentally alter the IO patterns and so all these
benchmarks are essentially meaningless.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, 2015-04-30 at 10:20 +1000, Dave Chinner wrote:
> IOWS, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> the problem goes away. :)
I'd love to. Too bad sorry sack of sh*t MB manufacturer only applied
_connectors_ to 4 of 6 available ports, and they're all in use :)
> ----
>
> And now in more detail.
Thanks for those details, made perfect sense.
-Mike
On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 9:42:43 AM PDT, Mike Galbraith wrote:
> >
> > [dbench bakeoff]
> >
> > With dbench v4.00, tux3 seems to be king of the max_latency hill, but
> > btrfs took throughput on my box. With v3.04, tux3 took 1st place at
> > splashing about in pagecache, but last place at dbench -S.
> >
> > Hohum, curiosity satisfied.
>
> Hi Mike,
>
> Thanks for that. Please keep in mind, that was our B team, it does a
> full fs sync for every fsync. Maybe a rematch when the shiny new one
> lands? Also, hardware? It looks like a single 7200 RPM disk, but it
> would be nice to know. And it seems, not all dbench 4.0 are equal.
> Mine doesn't have a -B option.
Hm, mine came from git://git.samba.org/sahlberg/dbench.git. The thing
has all kinds of cool options I have no clue how to use.
Yeah, the box is a modern plain jane, loads of CPU, cheap a$$ spinning
rust IO. It has an SSD, but that's currently occupied by games OS.
I'll eventually either buy a bigger one or steal it from winders. The
only thing stopping me is my inherent mistrust of storage media that has
no moving parts, but wears out anyway, and with no bearings whining to
warn you :)
> That order of magnitude latency difference is striking. It sounds
> good, but what does it mean? I see a smaller difference here, maybe
> because of running under KVM.
That max_latency thing is flush.
-Mike
On Thu, 2015-04-30 at 07:06 +0900, OGAWA Hirofumi wrote:
> Yeah, I also want to know hardware. Also, what size of partition? And
> each test was done by fresh FS (i.e. after mkfs), or same FS was used
> through all tests?
1TB rust bucket, with new fs each test.
-Mike
On Wed, 2015-04-29 at 14:12 -0700, Daniel Phillips wrote:
> Btrfs appears to optimize tiny files by storing them in its big btree,
> the equivalent of our itree, and Tux3 doesn't do that yet, so we are a
> bit hobbled for a make load.
That's not a build load, it's a git load. btrfs beat all others at the
various git/quilt things I tried (since that's what I do lots of in real
life), but tux3 looked quite good too.
As Dave noted though, an orchard produces oodles of apples over its
lifetime, these shiny new apples may lose luster over time ;-)
-Mike
Am Donnerstag, 30. April 2015, 10:20:08 schrieb Dave Chinner:
> On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> > Here's something that _might_ interest xfs folks.
> >
> > cd git (source repository of git itself)
> > make clean
> > echo 3 > /proc/sys/vm/drop_caches
> > time make -j8 test
> >
> > ext4 2m20.721s
> > xfs 6m41.887s <-- ick
> > btrfs 1m32.038s
> > tux3 1m30.262s
> >
> > Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
>
> TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
> with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
> using defaults:
>
> real user sys
> xfs 3m16.138s 7m8.341s 14m32.462s
> ext4 3m18.045s 7m7.840s 14m32.994s
> btrfs 3m45.149s 7m10.184s 16m30.498s
>
> What you are seeing is physical seek distances impacting read
> performance. XFS does not optimise for minimal physical seek
> distance, and hence is slower than filesystems that do optimise for
> minimal seek distance. This shows up especially well on slow single
> spindles.
>
> XFS is *adequate* for the use on slow single drives, but it is
> really designed for best performance on storage hardware that is not
> seek distance sensitive.
>
> IOWS, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> the problem goes away. :)
I am quite surprised that a traditional filesystem that was created in the
age of rotating media does not like this kind of media, and even seems to
excel over BTRFS on the new non-rotating media available.
But…
> ----
>
> And now in more detail.
>
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.
… this is a quite important addition.
> Thing is, once you've abused those filesytsems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystem will have degraded to be
> *much worse* than XFS.
I still see hangs on what I take to be free space fragmentation in
BTRFS. My /home on a dual (!) BTRFS SSD setup can basically stall to a
halt once it has reserved all the space of the device for chunks. So this
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 129.48GiB
devid 1 size 170.00GiB used 146.03GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 146.03GiB path /dev/mapper/sata-home
Btrfs v3.18
merkaba:~> btrfs fi df /home
Data, RAID1: total=142.00GiB, used=126.72GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.00GiB, used=2.76GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
is safe, but once I have size 170 GiB used 170 GiB, even if there is
enough free space inside the chunks to allocate from, enough as in 30-40
GiB, it can happen that writes stall to the point that applications on
the desktop freeze and I see hung task messages in the kernel log.
This is the case up to kernel 4.0. I have seen Chris Mason fixing some
write stalls for big Facebook setups; maybe that will help here, but until
this issue is fixed, I think BTRFS is not yet fully production ready,
unless you leave a *huge* amount of free space, as in: for 200 GiB of data
you want to write, make a 400 GiB volume.
> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and widely used for bragging about who's got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
>
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat. ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times. XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.
Those are the allocation groups. I always wondered how it could be
beneficial to spread the allocations onto 4 areas of one partition on
expensive-seek media. Now that makes better sense to me. I always had the
gut impression that XFS may not be the fastest in all cases, but that it
is one of the filesystems with the most consistent performance over time;
I was just never able to fully explain why that is.
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
>> I measured fsync performance using a 7200 RPM disk as a virtual
>> drive under KVM, configured with cache=none so that asynchronous
>> writes are cached and synchronous writes translate into direct
>> writes to the block device.
>
> Yup, a slow single spindle, so fsync performance is determined by
> seek latency of the filesystem. Hence the filesystem that "wins"
> will be the filesystem that minimises fsync seek latency above all
> other considerations.
>
> http://www.spinics.net/lists/kernel/msg1978216.html
If you want to declare that XFS only works well on solid state disks
and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather
disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop
the bug reports by bluster.
> So, to demonstrate, I'll run the same tests but using a 256GB
> samsung 840 EVO SSD and show how much the picture changes.
I will go you one better, I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.
> I didn't test tux3, you don't make it easy to get or build.
There is no need to apologize for not testing Tux3, however, it is
unseemly to throw mud at the same time. Remember, you are the person
who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it then it is hard to be
really sympathetic. Mike apparently did not find it very hard.
>> To focus purely on fsync, I wrote a
>> small utility (at the end of this post) that forks a number of
>> tasks, each of which continuously appends to and fsyncs its own
>> file. For a single task doing 1,000 fsyncs of 1K each, we have:
>>
>> Ext4: 34.34s
>> XFS: 23.63s
>> Btrfs: 34.84s
>> Tux3: 17.24s
>
> Ext4: 1.94s
> XFS: 2.06s
> Btrfs: 2.06s
>
> All equally fast, so I can't see how tux3 would be much faster here.
Running the same thing on tmpfs, Tux3 is significantly faster:
Ext4: 1.40s
XFS: 1.10s
Btrfs: 1.56s
Tux3: 1.07s
> Tasks: 10 100 1,000 10,000
> Ext4: 0.05s 0.12s 0.48s 3.99s
> XFS: 0.25s 0.41s 0.96s 4.07s
> Btrfs 0.22s 0.50s 2.86s 161.04s
> (lower is better)
>
> Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> very much faster as most of the elapsed time in the test is from
> forking the processes that do the IO and fsyncs.
You wish. In fact, Tux3 is a lot faster. You must have made a mistake in
estimating your fork overhead. It is easy to check, just run "syncs foo
0 10000". I get 0.23 seconds to fork 10,0000 proceses, create the files
and exit. Here are my results on tmpfs, triple checked and reproducible:
Tasks: 10 100 1,000 10,000
Ext4: 0.05 0.14 1.53 26.56
XFS: 0.05 0.16 2.10 29.76
Btrfs: 0.08 0.37 3.18 34.54
Tux3: 0.02 0.05 0.18 2.16
Note: you should recheck your final number for Btrfs. I have seen Btrfs
fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues, I don't think they deny it.
Unlike you, Chris Mason is a gentleman when faced with issues. Instead
of insulting his colleagues and hurling around the sort of abuse that
has gained LKML its current unenviable reputation, he gets down to work
and fixes things.
You should do that too, your own house is not in order. XFS has major
issues. One easily reproducible one is a denial of service during the
10,000 task test where it takes multiple seconds to cat small files. I
saw XFS do this on both spinning disk and tmpfs, and I have seen it
hang for minutes trying to list a directory. I looked a bit into it, and
I see that you are blocking for aeons trying to acquire a lock in open.
Here is an example. While doing "sync6 fs/foo 10 10000":
time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
real 0m2.282s
user 0m0.000s
sys 0m0.000s
You and I both know the truth: Ext4 is the only really reliable general
purpose filesystem on Linux at the moment. XFS is definitely not, I
have seen ample evidence with my own eyes. What you need is people
helping you fix your issues instead of making your colleagues angry at
you with your incessant attacks.
> FWIW, btrfs shows its horrible fsync implementation here, burning
> huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
> and a half minutes in that 10000 fork test so wasn't IO bound at
> all.
Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high
task counts. It is actually amazing the progress Btrfs has made in
performance. I for one appreciate the work they are doing and I admire
the way Chris conducts both himself and his project. I wish you were
more like Chris, and I wish I was for that matter.
I agree that Btrfs uses too much CPU, but there is no need to be rude
about it. I think the Btrfs team knows how to use a profiler.
>> Is there any practical use for fast parallel fsync of tens of thousands
>> of tasks? This could be useful for a scalable transaction server
>> that sits directly on the filesystem instead of a database, as is
>> the fashion for big data these days. It certainly can't hurt to know
>> that if you need that kind of scaling, Tux3 will do it.
>
> Ext4 and XFS already do that just fine, too, when you use storage
> suited to such a workload and you have a sane interface for
> submitting tens of thousands of concurrent fsync operations. e.g
>
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
Tux3 turns in really great performance with an ordinary, cheap spinning
disk using standard Posix ops. It is not for you to tell people they
don't care about that, and it is wrong for you to imply that we only
perform well on spinning disk - you don't know that, and it's not true.
By the way, I like your asynchronous fsync, nice work. It by no means
obviates the need for a fast implementation of the standard operation.
> On a SSD (256GB samsung 840 EVO), running 4.0.0:
>
> Tasks: 8 16 32
> Ext4: 598.27 MB/s 981.13 MB/s 1233.77 MB/s
> XFS: 884.62 MB/s 1328.21 MB/s 1373.66 MB/s
> Btrfs: 201.64 MB/s 137.55 MB/s 108.56 MB/s
>
> dbench looks *very different* when there is no seek latency,
> doesn't it?
It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
for me earlier this evening. It is rare but it happens. I rebooted and
got sane numbers. Running dbench -t10 on tmpfs I get:
Tasks: 8 16 32
Ext4: 660.69 MB/s 708.81 MB/s 720.12 MB/s
XFS: 692.01 MB/s 388.53 MB/s 134.84 MB/s
Btrfs: 229.66 MB/s 341.27 MB/s 377.97 MB/s
Tux3: 1147.12 MB/s 1401.61 MB/s 1283.74 MB/s
Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
that one many times because I don't want to give you an inaccurate
report.
Tux3 turned in a great performance. I am not pleased with the negative
scaling at 32 threads, but it still finishes way ahead.
>> Dbench -t10 -s (all file operations synchronous)
>>
>> Tasks: 8 16 32
>> Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
>> XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
>> Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
>> Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
>> (higher is better)
>
> Ext4: 173.54 MB/s 294.41 MB/s 424.11 MB/s
> XFS: 172.98 MB/s 342.78 MB/s 458.87 MB/s
> Btrfs: 36.92 MB/s 34.52 MB/s 55.19 MB/s
>
> Again, the numbers are completely the other way around on a SSD,
> with the conventional filesystems being 5-10x faster than the
> WA/COW style filesystem.
I wouldn't be so sure about that...
Tasks: 8 16 32
Ext4: 93.06 MB/s 98.67 MB/s 102.16 MB/s
XFS: 81.10 MB/s 79.66 MB/s 73.27 MB/s
Btrfs: 43.77 MB/s 64.81 MB/s 90.35 MB/s
Tux3: 198.49 MB/s 279.00 MB/s 318.41 MB/s
>> In the full disclosure department, Tux3 is still not properly
>> optimized in some areas. One of them is fragmentation: it is not
>> very hard to make Tux3 slow down by running long tests. Our current
>
> Oh, that still hasn't been fixed?
Count your blessings while you can.
> Until you sort out how you are going to scale allocation to tens of
> TB and not fragment free space over time, fsync performance of the
> filesystem is pretty much irrelevant. Changing the allocation
> algorithms will fundamentally alter the IO patterns and so all these
> benchmarks are essentially meaningless.
Ahem, are you the same person for whom fsync was the most important
issue in the world last time the topic came up, to the extent of
spreading around FUD and entirely ignoring the great work we had
accomplished for regular file operations? I said then that when we got
around to a proper fsync it would be competitive. Now here it is, so you
want to change the topic. I understand.
Honestly, you would be a lot better off investigating why our fsync
algorithm is so good.
Regards,
Daniel
On Wednesday, April 29, 2015 8:50:57 PM PDT, Mike Galbraith wrote:
> On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:
>>
>> That order of magnitude latency difference is striking. It sounds
>> good, but what does it mean? I see a smaller difference here, maybe
>> because of running under KVM.
>
> That max_latency thing is flush.
Right, it is just the max run time of all operations, including flush
(dbench's name for fsync I think) which would most probably be the longest
running one. I would like to know how we manage to pull that off. Now
that you mention it, I see a factor of two or so latency win here, not
the order of magnitude that you saw. Maybe KVM introduces some fuzz
for me.
I checked whether fsync = sync is the reason, and no. Well, that goes
on the back burner, we will no doubt figure it out in due course.
Regards,
Daniel
On Wednesday, April 29, 2015 5:20:08 PM PDT, Dave Chinner wrote:
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.
>
> In this case, ext4, btrfs and tux3 have optimal allocation filling
> from the outside of the disk, while XFS is spreading the files
> across (at least) 4 separate regions of the whole disk. Hence XFS is
> seeing seek times on read are much larger than the other filesystems
> when the filesystem is empty as it is doing full disk seeks rather
> than being confined to the outer edges of spindle.
>
> Thing is, once you've abused those filesytsems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystem will have degraded to be
> *much worse* than XFS.
>
> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and are widely used for bragging about whose got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
>
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat. ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times. XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.
>
> IOWs, what you don't see here is that the XFS algorithms that make
> your test slow will keep *lots* of disks busy. i.e. testing empty
> filesystem performance on a single, slow disk demonstrates that an
> algorithm designed for scalability isn't designed to achieve
> physical seek distance minimisation. Hence your storage makes XFS
> look particularly poor in comparison to filesystems that are being
> designed and optimised for the limitations of single slow spindles...
>
> To further demonstrate that it is physical seek distance that is the
> issue here, lets take the seek time out of the equation (e.g. use a
> SSD). Doing that will result in basically no difference in
> performance between all 4 filesystems as performance will now be
> determined by application level concurrency and that is the same for
> all tests.
Lovely sounding argument, but it is wrong because Tux3 still beats XFS
even with seek time factored out of the equation.
Even with SSD, if you just go splattering files all over the disk you
will pay for it in latency and lifetime when the disk goes into
continuous erase and your messy layout causes write multiplication.
But of course you can design your filesystem any way you want. Tux3
is designed to be fast on the hardware that people actually have.
Regards,
Daniel
On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> even with seek time factored out of the equation.
Hm. Do you have big-storage comparison numbers to back that? I'm no
storage guy (waiting for holographic crystal arrays to obsolete all this
crap;), but Dave's big-storage guy words made sense to me.
-Mike
On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>
>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>> even with seek time factored out of the equation.
>
> Hm. Do you have big-storage comparison numbers to back that? I'm no
> storage guy (waiting for holographic crystal arrays to obsolete all this
> crap;), but Dave's big-storage guy words made sense to me.
This has nothing to do with big storage. The proposition was that seek
time is the reason for Tux3's fsync performance. That claim was easily
falsified by removing the seek time.
Dave's big storage words are there to draw attention away from the fact
that XFS ran the Git tests four times slower than Tux3 and three times
slower than Ext4. Whatever the big storage excuse is for that, the fact
is, XFS obviously sucks at little storage.
He also posted nonsense: "XFS, however, will spread the load across
many (if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO." False. No matter
how big an array of spinning disks you have, seek latency and
synchronous write latency stay the same. It is just an attempt to
bamboozle you. If instead he had talked about throughput, he would have
a point. But he didn't, because he knows that does not help his
argument. If fsync sucks on one disk, it will suck just as much on
a thousand disks.
The talk about filling up from the outside of the disk is disingenuous.
Dave should know that Ext4 does not do that; it spreads out allocations
exactly to give good aging, and it does deliver that - Ext4's aging
performance is second to none. What XFS does is just stupid, and
instead of admitting that and fixing it, Dave claims it would be great
if the disk was an array or an SSD instead of what it actually is.
Regards,
Daniel
On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> > On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> >
> >> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> >> even with seek time factored out of the equation.
> >
> > Hm. Do you have big-storage comparison numbers to back that? I'm no
> > storage guy (waiting for holographic crystal arrays to obsolete all this
> > crap;), but Dave's big-storage guy words made sense to me.
>
> This has nothing to do with big storage. The proposition was that seek
> time is the reason for Tux3's fsync performance. That claim was easily
> falsified by removing the seek time.
>
> Dave's big storage words are there to draw attention away from the fact
> that XFS ran the Git tests four times slower than Tux3 and three times
> slower than Ext4. Whatever the big storage excuse is for that, the fact
> is, XFS obviously sucks at little storage.
If you allocate spanning the disk from start of life, you're going to
eat seeks that others don't until later. That seemed rather obvious and
straight forward. He flat stated that xfs has passable performance on
single bit of rust, and openly explained why. I see no misdirection,
only some evidence of bad blood between you two.
No, I won't be switching to xfs any time soon, but then it would take a
hell of a lot of evidence to get me to move away from ext4. I trust
ext[n] deeply because it has proven many times over the years that it
can take one hell of a lot (of self inflicted wounds;).
-Mike
On 04/30/2015 06:48 AM, Mike Galbraith wrote:
> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>
>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>> even with seek time factored out of the equation.
>>>
>>> Hm. Do you have big-storage comparison numbers to back that? I'm no
>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>> crap;), but Dave's big-storage guy words made sense to me.
>>
>> This has nothing to do with big storage. The proposition was that seek
>> time is the reason for Tux3's fsync performance. That claim was easily
>> falsified by removing the seek time.
>>
>> Dave's big storage words are there to draw attention away from the fact
>> that XFS ran the Git tests four times slower than Tux3 and three times
>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>> is, XFS obviously sucks at little storage.
>
> If you allocate spanning the disk from start of life, you're going to
> eat seeks that others don't until later. That seemed rather obvious and
> straight forward.
It is a logical fallacy. It mixes a grain of truth (spreading all over
the disk causes extra seeks) with an obvious falsehood (that spreading is
the only possible way to avoid long term fragmentation).
> He flat stated that xfs has passable performance on
> single bit of rust, and openly explained why. I see no misdirection,
> only some evidence of bad blood between you two.
Raising the spectre of theoretical fragmentation issues when we have not
even begun that work is a straw man and intellectually dishonest. You have
to wonder why he does it. It is destructive to our community image and
harmful to progress.
> No, I won't be switching to xfs any time soon, but then it would take a
> hell of a lot of evidence to get me to move away from ext4. I trust
> ext[n] deeply because it has proven many times over the years that it
> can take one hell of a lot (of self inflicted wounds;).
Regards,
Daniel
Daniel Phillips wrote:
>
>
> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>
>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>> even with seek time factored out of the equation.
>>>>
>>>> Hm. Do you have big-storage comparison numbers to back that? I'm no
>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>
>>> This has nothing to do with big storage. The proposition was that seek
>>> time is the reason for Tux3's fsync performance. That claim was easily
>>> falsified by removing the seek time.
>>>
>>> Dave's big storage words are there to draw attention away from the fact
>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>> is, XFS obviously sucks at little storage.
>>
>> If you allocate spanning the disk from start of life, you're going to
>> eat seeks that others don't until later. That seemed rather obvious and
>> straight forward.
>
> It is a logical fallacy. It mixes a grain of truth (spreading all over the
> disk causes extra seeks) with an obvious falsehood (it is not necessarily
> the only possible way to avoid long term fragmentation).
You're reading into it what isn't there. Spreading over the disk isn't
(just) about avoiding fragmentation - it's about delivering consistent
and predictable latency. It is undeniable that if you start by only
allocating from the fastest portion of the platter, you are going to see
performance slow down over time. If you start by spreading allocations
across the entire platter, you make the worst-case and average-case
latency equal, which is exactly what a lot of folks are looking for.
>> He flat stated that xfs has passable performance on
>> single bit of rust, and openly explained why. I see no misdirection,
>> only some evidence of bad blood between you two.
>
> Raising the spectre of theoretical fragmentation issues when we have not
> even begun that work is a straw man and intellectually dishonest. You have
> to wonder why he does it. It is destructive to our community image and
> harmful to progress.
It is a fact of life that when you change one aspect of an intimately
interconnected system, something else will change as well. You have
naive/nonexistent free space management now; when you design something
workable there it is going to impact everything else you've already
done. It's an easy bet that the impact will be negative, the only
question is to what degree.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On Thu, 2015-04-30 at 07:07 -0700, Daniel Phillips wrote:
>
> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
> > On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
> >> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
> >>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
> >>>
> >>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
> >>>> even with seek time factored out of the equation.
> >>>
> >>> Hm. Do you have big-storage comparison numbers to back that? I'm no
> >>> storage guy (waiting for holographic crystal arrays to obsolete all this
> >>> crap;), but Dave's big-storage guy words made sense to me.
> >>
> >> This has nothing to do with big storage. The proposition was that seek
> >> time is the reason for Tux3's fsync performance. That claim was easily
> >> falsified by removing the seek time.
> >>
> >> Dave's big storage words are there to draw attention away from the fact
> >> that XFS ran the Git tests four times slower than Tux3 and three times
> >> slower than Ext4. Whatever the big storage excuse is for that, the fact
> >> is, XFS obviously sucks at little storage.
> >
> > If you allocate spanning the disk from start of life, you're going to
> > eat seeks that others don't until later. That seemed rather obvious and
> > straight forward.
>
> It is a logical fallacy. It mixes a grain of truth (spreading all over the
> disk causes extra seeks) with an obvious falsehood (it is not necessarily
> the only possible way to avoid long term fragmentation).
Shrug, but seems it is a solution, and more importantly, an implemented
solution. What I gleaned up as a layman reader is that xfs has no
fragmentation issue, but tux3 still does. It doesn't seem right to slam
xfs for a conscious design decision unless tux3 can proudly display its
superior solution, which I gathered doesn't yet exist.
> > He flat stated that xfs has passable performance on
> > single bit of rust, and openly explained why. I see no misdirection,
> > only some evidence of bad blood between you two.
>
> Raising the spectre of theoretical fragmentation issues when we have not
> even begun that work is a straw man and intellectually dishonest. You have
> to wonder why he does it. It is destructive to our community image and
> harmful to progress.
Well ok, let's forget bad blood, straw men... and answering my question
too I suppose. Not having any sexy IO gizmos in my little desktop box,
I don't care deeply which stomps the other flat on beastly boxen.
-Mike
On Thu, Apr 30, 2015 at 11:00:05AM +0200, Martin Steigerwald wrote:
> > IOWS, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> > the problem goes away. :)
>
> I am quite surprised that a traditional filesystem that was created in the
> age of rotating media does not like this kind of media and even seems to
> excel on BTRFS on the new non rotating media available.
You shouldn't be surprised; XFS was designed in an era where RAID was
extremely important. To this day, on very large RAID arrays, I'm
pretty sure none of the other file systems will come close to touching
XFS, because it was optimized by some really, really good file system
engineers for that hardware. And while RAID systems are certainly not
identical to SSD, the fact that you have multiple disk heads means
that a good file system will optimize for that parallelism, and that's
how SSDs get their speed (individual SSD channels aren't really all
that fast; it's the fact that you can be reading or writing large
numbers of them in parallel that gives high end flash its really great
performance numbers.)
> > Thing is, once you've abused those filesytsems for a couple of
> > months, the files in ext4, btrfs and tux3 are not going to be laid
> > out perfectly on the outer edge of the disk. They'll be spread all
> > over the place and so all the filesystems will be seeing large seeks
> > on read. The thing is, XFS will have roughly the same performance as
> > when the filesystem is empty because the spreading of the allocation
> > allows it to maintain better locality and separation and hence
> > doesn't fragment free space nearly as badly as the other filesystems.
> > Free space fragmentation is what leads to performance degradation in
> > filesystems, and all the other filesystem will have degraded to be
> > *much worse* than XFS.
In fact, ext4 doesn't actually lay out things perfectly on the outer
edge of the disk either, because we try to do spreading as well.
Worse, we use a random algorithm to do the spreading, which means that
results from run to run on an empty file system will show a lot more
variation. I won't claim that we're best in class in either our
spreading techniques or our ability to manage free space
fragmentation, although we do put a lot of work into the latter.
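As a purely illustrative sketch (not ext4's actual allocator code; the
free-space check and the linear probe from a random origin are assumptions),
spreading "with a random algorithm" can be as simple as picking a random
block group that has enough free space:

#include <stdio.h>
#include <stdlib.h>

struct blkgroup { unsigned free_blocks; };

static int pick_spread_group(struct blkgroup *g, int ngroups, unsigned want)
{
	int start = rand() % ngroups;            /* random starting point */
	for (int i = 0; i < ngroups; i++) {
		int grp = (start + i) % ngroups;
		if (g[grp].free_blocks >= want)
			return grp;              /* first fit from a random origin */
	}
	return -1;                               /* no group has enough space */
}

int main(void)
{
	struct blkgroup groups[16];
	for (int i = 0; i < 16; i++)
		groups[i].free_blocks = 1000;
	printf("picked group %d\n", pick_spread_group(groups, 16, 8));
	return 0;
}

Starting the search at a random group is what spreads new allocations across
the disk, and it is also why results vary from run to run even on an empty
filesystem.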
One of the problems is that it's *hard* to get good benchmarking
numbers that take into account file system aging and measure how well
the free space has been fragmented over time. Most of the benchmark
results that I've seen do a really lousy job at this, and the vast
majority don't even try.
This is one of the reasons why I find head-to-head "competitions"
between file systems to be not very helpful for anything other than
benchmarketing. It's almost certain that the benchmark won't be
"fair" in some way, and it doesn't really matter whether the person
doing the benchmark was doing it with malice aforethought, or was just
incompetent and didn't understand the issues --- or did understand the
issues and didn't really care, because what they _really_ wanted to do
was to market their file system.
And even if the benchmark is fair, it might not match up with the end
user's hardware, or their use case. There will always be some use
case where file system A is better than file system B, for pretty much
any file system. Don't get me wrong --- I will do comparisons between
file systems, but only so I can figure out ways of making _my_ file
system better. And more often than not, it's comparisons of the same
file system before and after adding some new feature which is the most
interesting.
> Those are the allocation groups. I always wondered how it could be beneficial
> to spread the allocations onto 4 areas of one partition on expensive seek
> media. Now that makes better sense to me. I always had the gut impression
> that XFS may not be the fastest in all cases, but it is one of the
> filesystems with the most consistent performance over time, but never was
> able to fully explain why that is.
Yep, pretty much all of the traditional update-in-place file systems
since the BSD FFS have done this, and for the same reason. For COW
file systems which are constantly moving data and metadata blocks
around, they will need different strategies for trying to avoid the
free space fragmentation problem as the file system ages.
Cheers,
- Ted
On 04/30/2015 07:28 AM, Howard Chu wrote:
> Daniel Phillips wrote:
>>
>>
>> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>>
>>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>>> even with seek time factored out of the equation.
>>>>>
>>>>> Hm. Do you have big-storage comparison numbers to back that? I'm no
>>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>>
>>>> This has nothing to do with big storage. The proposition was that seek
>>>> time is the reason for Tux3's fsync performance. That claim was easily
>>>> falsified by removing the seek time.
>>>>
>>>> Dave's big storage words are there to draw attention away from the fact
>>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>>> is, XFS obviously sucks at little storage.
>>>
>>> If you allocate spanning the disk from start of life, you're going to
>>> eat seeks that others don't until later. That seemed rather obvious and
>>> straight forward.
>>
>> It is a logical fallacy. It mixes a grain of truth (spreading all over the
>> disk causes extra seeks) with an obvious falsehood (it is not necessarily
>> the only possible way to avoid long term fragmentation).
>
> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
> you start by only allocating from the fastest portion of the platter, you are going to see
> performance slow down over time. If you start by spreading allocations across the entire platter,
> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
> looking for.
Another fallacy: intentionally running slower than necessary is not necessarily
the only way to deliver consistent and predictable latency. Not only that, but
intentionally running slower than necessary does not necessarily guarantee
performing better than some alternate strategy later.
Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
slower with no guarantee of any benefit in the future, please raise your hand.
>>> He flat stated that xfs has passable performance on
>>> single bit of rust, and openly explained why. I see no misdirection,
>>> only some evidence of bad blood between you two.
>>
>> Raising the spectre of theoretical fragmentation issues when we have not
>> even begun that work is a straw man and intellectually dishonest. You have
>> to wonder why he does it. It is destructive to our community image and
>> harmful to progress.
>
> It is a fact of life that when you change one aspect of an intimately interconnected system,
> something else will change as well. You have naive/nonexistent free space management now; when you
> design something workable there it is going to impact everything else you've already done. It's an
> easy bet that the impact will be negative, the only question is to what degree.
You might lose that bet. For example, suppose we do strictly linear allocation
each delta, and just leave nice big gaps between the deltas for future
expansion. Clearly, we run at similar or identical speed to the current naive
strategy until we must start filling in the gaps, and at that point our layout
is not any worse than XFS, which started bad and stayed that way.
Now here is where you lose the bet: we already know that linear allocation
with wrap ends horribly right? However, as above, we start linear, without
compromise, but because of the gaps we leave, we are able to switch to a
slower strategy, but not nearly as slow as the ugly tangle we get with
simple wrap. So impact over the lifetime of the filesystem is positive, not
negative, and what seemed to be self evident to you turns out to be wrong.
In short, we would rather deliver as much performance as possible, all the
time. I really don't need to think about it very hard to know that is what I
want, and what most users want.
I will make you a bet in return: when we get to doing that part properly, the
quality of the work will be just as high as everything else we have completed
so far. Why would we suddenly get lazy?
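To make the gap idea concrete, here is a minimal sketch; it is not Tux3 code,
and the single cursor and the fixed per-delta gap are illustrative assumptions
only:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t block_t;

struct alloc_cursor {
	block_t next;        /* next free block in the current linear run */
	block_t volume_size; /* total blocks in the volume */
	block_t gap;         /* blocks deliberately skipped after each delta */
};

/* Allocate 'count' contiguous blocks for the current delta, strictly linearly. */
static block_t delta_alloc(struct alloc_cursor *c, block_t count)
{
	if (c->next + count > c->volume_size)
		return (block_t)-1;  /* here the gap-filling fallback would kick in */
	block_t start = c->next;
	c->next += count;
	return start;
}

/* At delta commit, skip ahead so later rewrites have room to land nearby. */
static void delta_commit_done(struct alloc_cursor *c)
{
	c->next += c->gap;
}

int main(void)
{
	struct alloc_cursor c = { .next = 0, .volume_size = 1 << 20, .gap = 4096 };
	block_t a = delta_alloc(&c, 100);
	delta_commit_done(&c);
	block_t b = delta_alloc(&c, 100);
	printf("delta 1 at block %llu, delta 2 at block %llu\n",
	       (unsigned long long)a, (unsigned long long)b);
	return 0;
}

Allocation stays strictly linear within a delta; the gap left at commit time
is what later gives rewrites somewhere nearby to land.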
Regards,
Daniel
On 04/30/2015 07:33 AM, Mike Galbraith wrote:
> Well ok, let's forget bad blood, straw men... and answering my question
> too I suppose. Not having any sexy IO gizmos in my little desktop box,
> I don't care deeply which stomps the other flat on beastly boxen.
I'm with you, especially the forget bad blood part. I did my time in
big storage and I will no doubt do it again, but right now, what I care
about is bringing truth and beauty to small storage, which includes
that spinning rust of yours and also the cheap SSD you are about to
run out and buy.
I hope you caught the bit about how Tux3 is doing really well running
in tmpfs? According to my calculations, that means good things for SSD
performance.
Regards,
Daniel
Hi Ted,
On 04/30/2015 07:57 AM, Theodore Ts'o wrote:
> This is one of the reasons why I find head-to-head "competitions"
> between file systems to be not very helpful for anything other than
> benchmarketing. It's almost certain that the benchmark won't be
> "fair" in some way, and it doesn't really matter whether the person
> doing the benchmark was doing it with malice aforethought, or was just
> incompetent and didn't understand the issues --- or did understand the
> issues and didn't really care, because what they _really_ wanted to do
> was to market their file system.
Your proposition, as I understand it, is that nobody should ever do
benchmarks because any benchmark must be one of: 1) malicious; 2)
incompetent; or 3) careless. In fact, a benchmark may be perfectly
honest, competently done, and informative.
> And even if the benchmark is fair, it might not match up with the end
> user's hardware, or their use case. There will always be some use
> case where file system A is better than file system B, for pretty much
> any file system. Don't get me wrong --- I will do comparisons between
> file systems, but only so I can figure out ways of making _my_ file
> system better. And more often than not, it's comparisons of the same
> file system before and after adding some new feature which is the most
> interesting.
I cordially invite you to replicate our fsync benchmarks, or invent
your own. I am confident that you will find that the numbers are
accurate, that the test cases were well chosen, that the results are
informative, and that there is no sleight of hand.
As for whether or not people should "market" their filesystems as you
put it, that is easy for you to disparage when you are the incumbent.
If we don't tell people what is great about Tux3 then how will they
ever find out? Sure, it might be "advertising", but the important
question is, is it _truthful_ advertising? Surely you remember how
Linus got started... that was really blatant, and I am glad he did it.
>> Those are the allocation groups. I always wondered how it could be beneficial
>> to spread the allocations onto 4 areas of one partition on expensive seek
>> media. Now that makes better sense to me. I always had the gut impression
>> that XFS may not be the fastest in all cases, but it is one of the
>> filesystems with the most consistent performance over time, but never was
>> able to fully explain why that is.
>
> Yep, pretty much all of the traditional update-in-place file systems
> since the BSD FFS have done this, and for the same reason. For COW
> file systems which are constantly moving data and metadata blocks
> around, they will need different strategies for trying to avoid the
> free space fragmentation problem as the file system ages.
Right, different problems, but I have a pretty good idea how to go
about it now. I made a failed attempt a while back and learned a lot:
my mistake was to try to give every object a fixed home position based
on where it was first written, and the result was worse for both read
and write. Now the interesting thing is, naive linear allocation is
great for both read and write, so my effort now is directed towards
ways of doing naive linear allocation while choosing carefully which
order we do the allocation in. I will keep you posted on how that
progresses, of course.
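To give a flavour of "choosing the order carefully", here is an illustrative
sketch only; the locality key (parent directory, then inode, then offset) is
an assumption made for the example, not the actual plan:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct dirty_extent {
	uint64_t dir_ino;   /* parent directory inode */
	uint64_t ino;       /* file inode */
	uint64_t offset;    /* logical offset within the file */
	uint64_t count;     /* blocks to write */
};

static int by_locality(const void *a, const void *b)
{
	const struct dirty_extent *x = a, *y = b;
	if (x->dir_ino != y->dir_ino)
		return x->dir_ino < y->dir_ino ? -1 : 1;
	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	return x->offset < y->offset ? -1 : x->offset > y->offset;
}

static void order_then_allocate(struct dirty_extent *v, size_t n)
{
	/* Sort by locality, then hand the result to a strictly linear allocator. */
	qsort(v, n, sizeof(*v), by_locality);
}

int main(void)
{
	struct dirty_extent v[] = {
		{ .dir_ino = 2, .ino = 10, .offset = 0, .count = 8 },
		{ .dir_ino = 1, .ino = 11, .offset = 4, .count = 8 },
		{ .dir_ino = 1, .ino = 11, .offset = 0, .count = 4 },
	};
	order_then_allocate(v, 3);
	for (int i = 0; i < 3; i++)
		printf("dir %llu ino %llu offset %llu\n",
		       (unsigned long long)v[i].dir_ino,
		       (unsigned long long)v[i].ino,
		       (unsigned long long)v[i].offset);
	return 0;
}

The allocation itself stays naive and linear; only the order in which dirty
data is fed to it changes.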
Anyway, how did we get onto allocation? I thought my post was about
fsync, and after all, you are the guest of honor.
Regards,
Daniel
Daniel Phillips wrote:
> On 04/30/2015 07:28 AM, Howard Chu wrote:
>> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
>> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
>> you start by only allocating from the fastest portion of the platter, you are going to see
>> performance slow down over time. If you start by spreading allocations across the entire platter,
>> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
>> looking for.
>
> Another fallacy: intentionally running slower than necessary is not necessarily
> the only way to deliver consistent and predictable latency.
Totally agree with you there.
> Not only that, but
> intentionally running slower than necessary does not necessarily guarantee
> performing better than some alternate strategy later.
True, it's a question of algorithmic efficiency - does the performance
decay linearly or logarithmically.
> Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
> slower with no guarantee of any benefit in the future, please raise your hand.
git is an important workload for us as developers, but I don't think
that's the only workload that's important for us.
>>>> He flat stated that xfs has passable performance on
>>>> single bit of rust, and openly explained why. I see no misdirection,
>>>> only some evidence of bad blood between you two.
>>>
>>> Raising the spectre of theoretical fragmentation issues when we have not
>>> even begun that work is a straw man and intellectually dishonest. You have
>>> to wonder why he does it. It is destructive to our community image and
>>> harmful to progress.
>>
>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>> something else will change as well. You have naive/nonexistent free space management now; when you
>> design something workable there it is going to impact everything else you've already done. It's an
>> easy bet that the impact will be negative, the only question is to what degree.
>
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
>
> Now here is where you lose the bet: we already know that linear allocation
> with wrap ends horribly right? However, as above, we start linear, without
> compromise, but because of the gaps we leave, we are able to switch to a
> slower strategy, but not nearly as slow as the ugly tangle we get with
> simple wrap. So impact over the lifetime of the filesystem is positive, not
> negative, and what seemed to be self evident to you turns out to be wrong.
>
> In short, we would rather deliver as much performance as possible, all the
> time. I really don't need to think about it very hard to know that is what I
> want, and what most users want.
>
> I will make you a bet in return: when we get to doing that part properly, the
> quality of the work will be just as high as everything else we have completed
> so far. Why would we suddenly get lazy?
I never said anything about getting lazy. You're working in a closed
system though. If you run today's version on a system, and then you run
your future version on that same hardware, you're doing more CPU work
and probably more I/O work to do the more complex space management. It's
not quite zero-sum but close enough, when you're talking about highly
optimized designs.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Am Donnerstag, 30. April 2015, 10:57:10 schrieb Theodore Ts'o:
> One of the problems is that it's *hard* to get good benchmarking
> numbers that take into account file system aging and measure how well
> the free space has been fragmented over time. Most of the benchmark
> results that I've seen do a really lousy job at this, and the vast
> majority don't even try.
>
> This is one of the reasons why I find head-to-head "competitions"
> between file systems to be not very helpful for anything other than
> benchmarketing. It's almost certain that the benchmark won't be
> "fair" in some way, and it doesn't really matter whether the person
> doing the benchmark was doing it with malice aforethought, or was just
> incompetent and didn't understand the issues --- or did understand the
> issues and didn't really care, because what they _really_ wanted to do
> was to market their file system.
I agree with that.
A benchmark measures one thing, and if it is run on a fresh filesystem,
then a fresh filesystem is what it measures.
Benchmarks that aim to test an aged filesystem are much more expensive
in time and resources, unless one reuses an aged filesystem image again
and again.
Thanks for your explanations, Ted,
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On the 30th of April 2015 17:14, Daniel Phillips wrote:
Hallo hardcore coders
> On 04/30/2015 07:28 AM, Howard Chu wrote:
>> Daniel Phillips wrote:
>>>
>>> On 04/30/2015 06:48 AM, Mike Galbraith wrote:
>>>> On Thu, 2015-04-30 at 05:58 -0700, Daniel Phillips wrote:
>>>>> On Thursday, April 30, 2015 5:07:21 AM PDT, Mike Galbraith wrote:
>>>>>> On Thu, 2015-04-30 at 04:14 -0700, Daniel Phillips wrote:
>>>>>>
>>>>>>> Lovely sounding argument, but it is wrong because Tux3 still beats XFS
>>>>>>> even with seek time factored out of the equation.
>>>>>> Hm. Do you have big-storage comparison numbers to back that? I'm no
>>>>>> storage guy (waiting for holographic crystal arrays to obsolete all this
>>>>>> crap;), but Dave's big-storage guy words made sense to me.
>>>>> This has nothing to do with big storage. The proposition was that seek
>>>>> time is the reason for Tux3's fsync performance. That claim was easily
>>>>> falsified by removing the seek time.
>>>>>
>>>>> Dave's big storage words are there to draw attention away from the fact
>>>>> that XFS ran the Git tests four times slower than Tux3 and three times
>>>>> slower than Ext4. Whatever the big storage excuse is for that, the fact
>>>>> is, XFS obviously sucks at little storage.
>>>> If you allocate spanning the disk from start of life, you're going to
>>>> eat seeks that others don't until later. That seemed rather obvious and
>>>> straight forward.
>>> It is a logical fallacy. It mixes a grain of truth (spreading all over the
>>> disk causes extra seeks) with an obvious falsehood (it is not necessarily
>>> the only possible way to avoid long term fragmentation).
>> You're reading into it what isn't there. Spreading over the disk isn't (just) about avoiding
>> fragmentation - it's about delivering consistent and predictable latency. It is undeniable that if
>> you start by only allocating from the fastest portion of the platter, you are going to see
>> performance slow down over time. If you start by spreading allocations across the entire platter,
>> you make the worst-case and average-case latency equal, which is exactly what a lot of folks are
>> looking for.
> Another fallacy: intentionally running slower than necessary is not necessarily
> the only way to deliver consistent and predictable latency. Not only that, but
> intentionally running slower than necessary does not necessarily guarantee
> performing better than some alternate strategy later.
>
> Anyway, let's not be silly. Everybody in the room who wants Git to run 4 times
> slower with no guarantee of any benefit in the future, please raise your hand.
>
>>>> He flat stated that xfs has passable performance on
>>>> single bit of rust, and openly explained why. I see no misdirection,
>>>> only some evidence of bad blood between you two.
>>> Raising the spectre of theoretical fragmentation issues when we have not
>>> even begun that work is a straw man and intellectually dishonest. You have
>>> to wonder why he does it. It is destructive to our community image and
>>> harmful to progress.
>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>> something else will change as well. You have naive/nonexistent free space management now; when you
>> design something workable there it is going to impact everything else you've already done. It's an
>> easy bet that the impact will be negative, the only question is to what degree.
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
>
> Now here is where you lose the bet: we already know that linear allocation
> with wrap ends horribly right? However, as above, we start linear, without
> compromise, but because of the gaps we leave, we are able to switch to a
> slower strategy, but not nearly as slow as the ugly tangle we get with
> simple wrap. So impact over the lifetime of the filesystem is positive, not
> negative, and what seemed to be self evident to you turns out to be wrong.
>
> In short, we would rather deliver as much performance as possible, all the
> time. I really don't need to think about it very hard to know that is what I
> want, and what most users want.
>
> I will make you a bet in return: when we get to doing that part properly, the
> quality of the work will be just as high as everything else we have completed
> so far. Why would we suddenly get lazy?
>
> Regards,
>
> Daniel
> --
>
How?
Maybe this can be explained and discussed in a new thread about allocation.
Thanks
Best Regards
Have fun
C.S.
On Thu, Apr 30, 2015 at 03:28:13AM -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
> >>I measured fsync performance using a 7200 RPM disk as a virtual
> >>drive under KVM, configured with cache=none so that asynchronous
> >>writes are cached and synchronous writes translate into direct
> >>writes to the block device.
> >
> >Yup, a slow single spindle, so fsync performance is determined by
> >seek latency of the filesystem. Hence the filesystem that "wins"
> >will be the filesystem that minimises fsync seek latency above
> >all other considerations.
> >
> >http://www.spinics.net/lists/kernel/msg1978216.html
>
> If you want to declare that XFS only works well on solid state
> disks and big storage arrays, that is your business. But if you
> do, you can no longer call XFS a general purpose filesystem. And
Well, yes - I never claimed XFS is a general purpose filesystem. It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...
> >So, to demonstrate, I'll run the same tests but using a 256GB
> >samsung 840 EVO SSD and show how much the picture changes.
>
> I will go you one better, I ran a series of fsync tests using
> tmpfs, and I now have a very clear picture of how the picture
> changes. The executive summary is: Tux3 is still way faster, and
> still scales way better to large numbers of tasks. I have every
> confidence that the same is true of SSD.
/dev/ramX can't be compared to an SSD. Yes, they both have low
seek/IO latency but they have very different dispatch and IO
concurrency models. One is synchronous, the other is fully
asynchronous.
This is an important distinction, as we'll see later on....
> >I didn't test tux3, you don't make it easy to get or build.
>
> There is no need to apologize for not testing Tux3, however, it is
> unseemly to throw mud at the same time. Remember, you are the
These trees:
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
have not been updated for 11 months. I thought tux3 had died long
ago.
You should keep them up to date, and send patches for xfstests to
support tux3, and then you'll get a lot more people running,
testing and breaking tux3....
> >>To focus purely on fsync, I wrote a
> >>small utility (at the end of this post) that forks a number of
> >>tasks, each of which continuously appends to and fsyncs its own
> >>file. For a single task doing 1,000 fsyncs of 1K each, we have:
.....
> >All equally fast, so I can't see how tux3 would be much faster here.
>
> Running the same thing on tmpfs, Tux3 is significantly faster:
>
> Ext4: 1.40s
> XFS: 1.10s
> Btrfs: 1.56s
> Tux3: 1.07s
3% is not "significantly faster". It's within run to run variation!
> > Tasks: 10 100 1,000 10,000
> > Ext4: 0.05s 0.12s 0.48s 3.99s
> > XFS: 0.25s 0.41s 0.96s 4.07s
> > Btrfs 0.22s 0.50s 2.86s 161.04s
> > (lower is better)
> >
> >Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> >very much faster as most of the elapsed time in the test is from
> >forking the processes that do the IO and fsyncs.
>
> You wish. In fact, Tux3 is a lot faster.
Yes, it's easy to be fast when you have simple, naive algorithms and
an empty filesystem.
> triple checked and reproducible:
>
> Tasks: 10 100 1,000 10,000
> Ext4: 0.05 0.14 1.53 26.56
> XFS: 0.05 0.16 2.10 29.76
> Btrfs: 0.08 0.37 3.18 34.54
> Tux3: 0.02 0.05 0.18 2.16
Yet I can't reproduce those XFS or ext4 numbers you are quoting
there. eg. XFS on a 4GB ram disk:
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done
real 0m0.030s
user 0m0.000s
sys 0m0.014s
real 0m0.031s
user 0m0.008s
sys 0m0.157s
real 0m0.305s
user 0m0.029s
sys 0m1.555s
real 0m3.624s
user 0m0.219s
sys 0m17.631s
$
That's roughly 10x faster than your numbers. Can you describe your
test setup in detail? e.g. post the full log from block device
creation to benchmark completion so I can reproduce what you are
doing exactly?
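For reference while comparing setups, a fork-and-fsync tester in the spirit of
the test-fsync runs above can be sketched like this; the argument order (path
prefix, fsyncs per task, number of tasks) and the fixed 1K write size are
assumptions for illustration, not the original source:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <prefix> <fsyncs> <tasks>\n", argv[0]);
		return 1;
	}
	int fsyncs = atoi(argv[2]), tasks = atoi(argv[3]);
	char buf[1024];
	memset(buf, 'x', sizeof buf);

	for (int t = 0; t < tasks; t++) {
		if (fork() == 0) {
			char name[256];
			snprintf(name, sizeof name, "%s%d", argv[1], t);
			int fd = open(name, O_CREAT | O_WRONLY | O_APPEND, 0644);
			if (fd < 0)
				_exit(1);
			for (int i = 0; i < fsyncs; i++) {
				if (write(fd, buf, sizeof buf) != sizeof buf)
					_exit(1);
				fsync(fd);	/* one synchronous commit per append */
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)	/* parent waits for all children */
		;
	return 0;
}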
> Note: you should recheck your final number for Btrfs. I have seen
> Btrfs fall off the rails and take wildly longer on some tests just
> like that.
Completely reproducable:
$ sudo mkfs.btrfs -f /dev/vdc
Btrfs v3.16.2
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdc
nodesize 16384 leafsize 16384 sectorsize 4096 size 500.00TiB
$ sudo mount /dev/vdc /mnt/test
$ sudo chmod 777 /mnt/test
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done
real 0m0.068s
user 0m0.000s
sys 0m0.061s
real 0m0.563s
user 0m0.001s
sys 0m2.047s
real 0m2.851s
user 0m0.040s
sys 0m24.503s
real 2m38.713s
user 0m0.533s
sys 38m34.831s
Same result - ~160s burning all 16 CPUs, as can be seen by the
system time.
And even on a 4GB ram disk, the 10000 process test comes in at:
real 0m35.567s
user 0m0.707s
sys 6m1.922s
That's the same wall time as your test, but the CPU burn on my
machine is still clearly evident. You indicated that it's not doing
this on your machine, so I don't think we can really use btrfs
numbers for comparison purposes if it is behaving so differently on
different machines....
[snip]
> One easily reproducible one is a denial of service
> during the 10,000 task test where it takes multiple seconds to cat
> small files. I saw XFS do this on both spinning disk and tmpfs, and
> I have seen it hang for minutes trying to list a directory. I looked
> a bit into it, and I see that you are blocking for aeons trying to
> acquire a lock in open.
Yes, that's the usual case when XFS is waiting on buffer readahead
IO completion. The latency of which is completely determined by
block layer queuing and scheduling behaviour. And the block device
queue is being dominated by the 10,000 concurrent write processes
you just ran.....
"Doctor, it hurts when I do this!"
[snip]
> You and I both know the truth: Ext4 is the only really reliable
> general purpose filesystem on Linux at the moment.
BWAHAHAHAHAHAHAH-*choke*
*cough*
*cough*
/me wipes tears from his eyes
That's the funniest thing I've read in a long time :)
[snip]
> >On a SSD (256GB samsung 840 EVO), running 4.0.0:
> >
> > Tasks: 8 16 32
> > Ext4: 598.27 MB/s 981.13 MB/s 1233.77 MB/s
> > XFS: 884.62 MB/s 1328.21 MB/s 1373.66 MB/s
> > Btrfs: 201.64 MB/s 137.55 MB/s 108.56 MB/s
> >
> >dbench looks *very different* when there is no seek latency,
> >doesn't it?
>
> It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
> for me earlier this evening. It is rare but it happens. I rebooted
> and got sane numbers. Running dbench -t10 on tmpfs I get:
>
> Tasks: 8 16 32
> Ext4: 660.69 MB/s 708.81 MB/s 720.12 MB/s
> XFS: 692.01 MB/s 388.53 MB/s 134.84 MB/s
> Btrfs: 229.66 MB/s 341.27 MB/s 377.97 MB/s
> Tux3: 1147.12 MB/s 1401.61 MB/s 1283.74 MB/s
>
> Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
> that one many times because I don't want to give you an inaccurate
> report.
I can't reproduce those numbers, either. On /dev/ram0:
Tasks: 8 16 32
Ext4: 1416.11 MB/s 1585.81 MB/s 1406.18 MB/s
XFS: 2580.58 MB/s 1367.48 MB/s 994.46 MB/s
Btrfs: 151.89 MB/s 84.88 MB/s 73.16 MB/s
Still, that negative XFS scalability shouldn't be occurring - it
should level off and be much flatter if everything is working
correctly.
<ding>
Ah.
Ram disks and synchronous IO.....
The XFS journal is a completely asynchronous IO engine, and the
synchronous IO done by the ram disk really screws with the
concurrency model. There are journal write aggregation optimisations
that are based on the "buffer under IO" state detection, which is
completely skipped when journal IO is synchronous and completed in
the submission context. This problem doesn't occur on actual storage
devices where IO is asynchronous.
So, yes, dbench can trigger an interesting behaviour in XFS, but
it's well understood and doesn't actually affect normal storage
devices. If you need a volatile filesystem for performance
reasons then tmpfs is what you want, not XFS....
[
Feel free to skip the detail:
Let's go back to that SSD, which does asynchronous IO and so
the journal operates fully asynchronously:
$ for i in 8 16 32 64 128 256; do dbench -t10 $i -D /mnt/test; done
Throughput 811.806 MB/sec 8 clients 8 procs max_latency=12.152 ms
Throughput 1285.47 MB/sec 16 clients 16 procs max_latency=22.880 ms
Throughput 1516.22 MB/sec 32 clients 32 procs max_latency=73.381 ms
Throughput 1724.57 MB/sec 64 clients 64 procs max_latency=256.681 ms
Throughput 2046.91 MB/sec 128 clients 128 procs max_latency=1068.169 ms
Throughput 1895.4 MB/sec 256 clients 256 procs max_latency=3157.738 ms
So performance improves out to 128 processes and then the
SSD runs out of capacity - it's doing >400MB/s write IO at
128 clients. That makes latency blow out as we add more
load, so it doesn't go any faster and we start to back up on
the log. Hence we slowly start to go backwards as client
count continues to increase and contention builds up on
global wait queues.
Now, XFS has 8 log buffers and so can issue 8 concurrent
journal writes. Let's run dbench with fewer processes on a
ram disk, and see what happens as we increase the number of
processes doing IO and hence triggering journal writes:
$ for i in 1 2 4 6 8; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 653.163 MB/sec 1 clients 1 procs max_latency=0.355 ms
Throughput 1273.65 MB/sec 2 clients 2 procs max_latency=3.947 ms
Throughput 2189.19 MB/sec 4 clients 4 procs max_latency=7.582 ms
Throughput 2318.33 MB/sec 6 clients 6 procs max_latency=8.023 ms
Throughput 2212.85 MB/sec 8 clients 8 procs max_latency=9.120 ms
Yeah, ok, we scale out to 4 processes, then level off.
That's going to be limited by allocation concurrency during
writes, not the journal (the default is 4 AGs on a
filesystem so small). Let's make 16 AGs, cause seeks don't
matter on a ram disk.
$ sudo mkfs.xfs -f -d agcount=16 /dev/ram0
....
$ for i in 1 2 4 6 8; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 656.189 MB/sec 1 clients 1 procs max_latency=0.565 ms
Throughput 1277.25 MB/sec 2 clients 2 procs max_latency=3.739 ms
Throughput 2350.73 MB/sec 4 clients 4 procs max_latency=5.126 ms
Throughput 2754.3 MB/sec 6 clients 6 procs max_latency=8.063 ms
Throughput 3135.11 MB/sec 8 clients 8 procs max_latency=6.746 ms
Yup, as expected, we continue to increase performance out
to 8 processes now that there isn't an allocation
concurrency limit being hit.
What happens as we pass 8 processes now?
$ for i in 4 8 12 16; do dbench -t10 $i -D /mnt/test |grep Throughput; done
Throughput 2277.53 MB/sec 4 clients 4 procs max_latency=5.778 ms
Throughput 3070.3 MB/sec 8 clients 8 procs max_latency=7.808 ms
Throughput 2555.29 MB/sec 12 clients 12 procs max_latency=8.518 ms
Throughput 1868.96 MB/sec 16 clients 16 procs max_latency=14.193 ms
$
As expected, past 8 processes performance tails off, because the
journal state machine does not schedule after dispatching the
journal IO, and hence does not let other threads aggregate
journal writes into the next active log buffer; there is no
"under IO" stage in the state machine to trigger the log write
aggregation delays.
I'd completely forgotten about this - I discovered it 3 or 4
years ago, and then simply stopped using ramdisks for
performance testing because I could get better performance
from XFS on highly concurrent workloads from real storage.
]
> >>Dbench -t10 -s (all file operations synchronous)
> >>
> >> Tasks: 8 16 32
> >> Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
> >> XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
> >> Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
> >> Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
> >> (higher is better)
> >
> > Ext4: 173.54 MB/s 294.41 MB/s 424.11 MB/s
> > XFS: 172.98 MB/s 342.78 MB/s 458.87 MB/s
> > Btrfs: 36.92 MB/s 34.52 MB/s 55.19 MB/s
> >
> >Again, the numbers are completely the other way around on a SSD,
> >with the conventional filesystems being 5-10x faster than the
> >WA/COW style filesystem.
>
> I wouldn't be so sure about that...
>
> Tasks: 8 16 32
> Ext4: 93.06 MB/s 98.67 MB/s 102.16 MB/s
> XFS: 81.10 MB/s 79.66 MB/s 73.27 MB/s
> Btrfs: 43.77 MB/s 64.81 MB/s 90.35 MB/s
> Tux3: 198.49 MB/s 279.00 MB/s 318.41 MB/s
Ext4: 807.21 MB/s 1089.89 MB/s 867.55 MB/s
XFS: 997.77 MB/s 1011.51 MB/s 876.49 MB/s
Btrfs: 55.66 MB/s 56.77 MB/s 60.30 MB/s
Numbers are again very different for XFS and ext4 on /dev/ramX on my
system. Need to work out why yours are so low....
> >Until you sort of how you are going to scale allocation to tens of
> >TB and not fragment free space over time, fsync performance of the
> >filesystem is pretty much irrelevant. Changing the allocation
> >algorithms will fundamentally alter the IO patterns and so all these
> >benchmarks are essentially meaningless.
>
> Ahem, are you the same person for whom fsync was the most important
> issue in the world last time the topic came up, to the extent of
> spreading around FUD and entirely ignoring the great work we had
> accomplished for regular file operations?
Actually, I don't remember any discussions about fsync.
Things I remember that needed addressing are:
- the lack of ENOSPC detection
- the writeback integration issues
- the code cleanliness issues (ifdef mess, etc)
- the page forking design problems
- the lack of scalable inode and space allocation
algorithms.
Those are the things I remember, and fsync performance pales in
comparison to those.
> I said then that when we
> got around to a proper fsync it would be competitive. Now here it
> is, so you want to change the topic. I understand.
I haven't changed the topic, just the storage medium. The simple
fact is that the world is moving away from slow sata storage at a
pretty rapid pace and it's mostly going solid state. Spinning disks
are also changing - they are going to ZBC based SMR, which is a
completely different problem space which doesn't even appear to be
on the tux3 radar....
So where does tux3 fit into a storage future of byte addressable
persistent memory and ZBC based SMR devices?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>
> Well, yes - I never claimed XFS is a general purpose filesystem. It
> is a high performance filesystem. It is also becoming more relevant
> to general purpose systems as low cost storage gains capabilities
> that used to be considered the domain of high performance storage...
OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.
>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>> samsung 840 EVO SSD and show how much the picture changes.
>>
>> I will go you one better, I ran a series of fsync tests using
>> tmpfs, and I now have a very clear picture of how the picture
>> changes. The executive summary is: Tux3 is still way faster, and
>> still scales way better to large numbers of tasks. I have every
>> confidence that the same is true of SSD.
>
> /dev/ramX can't be compared to an SSD. Yes, they both have low
> seek/IO latency but they have very different dispatch and IO
> concurrency models. One is synchronous, the other is fully
> asynchronous.
I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.
> This is an important distinction, as we'll see later on....
I regard it as predictive of Tux3 performance on NVM.
> These trees:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
>
> have not been updated for 11 months. I thought tux3 had died long
> ago.
>
> You should keep them up to date, and send patches for xfstests to
> support tux3, and then you'll get a lot more people running,
> testing and breaking tux3....
People are starting to show up to do testing now, pretty much the first
time, so we must do some housecleaning. It is gratifying that Tux3 never
broke for Mike, but of course it will assert just by running out of
space at the moment. As you rightly point out, that fix is urgent and is
my current project.
>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>
>> Ext4: 1.40s
>> XFS: 1.10s
>> Btrfs: 1.56s
>> Tux3: 1.07s
>
> 3% is not "significantly faster". It's within run to run variation!
You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:
Ext4: 1.59s
XFS: 1.11s
Btrfs: 1.70s
Tux3: 1.11s
A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.
>> You wish. In fact, Tux3 is a lot faster. ...
>
> Yes, it's easy to be fast when you have simple, naive algorithms and
> an empty filesystem.
No it isn't or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.
There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.
>> triple checked and reproducible:
>>
>> Tasks: 10 100 1,000 10,000
>> Ext4: 0.05 0.14 1.53 26.56
>> XFS: 0.05 0.16 2.10 29.76
>> Btrfs: 0.08 0.37 3.18 34.54
>> Tux3: 0.02 0.05 0.18 2.16
>
> Yet I can't reproduce those XFS or ext4 numbers you are quoting
> there. eg. XFS on a 4GB ram disk:
>
> $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
> ./test-fsync /mnt/test/foo 10 $i; done
>
> real 0m0.030s
> user 0m0.000s
> sys 0m0.014s
>
> real 0m0.031s
> user 0m0.008s
> sys 0m0.157s
>
> real 0m0.305s
> user 0m0.029s
> sys 0m1.555s
>
> real 0m3.624s
> user 0m0.219s
> sys 0m17.631s
> $
>
> That's roughly 10x faster than your numbers. Can you describe your
> test setup in detail? e.g. post the full log from block device
> creation to benchmark completion so I can reproduce what you are
> doing exactly?
Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.
Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will need to take my word for that for now. I
promise that the beer is on me should you not find that reproducible.
The repository delay is just about not bothering Hirofumi for a merge
while he finishes up his inode table anti-fragmentation work.
>> Note: you should recheck your final number for Btrfs. I have seen
>> Btrfs fall off the rails and take wildly longer on some tests just
>> like that.
>
> Completely reproducible...
I believe you. I found that Btrfs does that way too much. So does XFS
from time to time, when it gets up into lots of tasks. Read starvation
on XFS is much worse than Btrfs, and XFS also exhibits some very
undesirable behavior with initial file create. Note: Ext4 and Tux3 have
roughly zero read starvation in any of these tests, which pretty much
proves it is not just a block scheduler thing. I don't think this is
something you should dismiss.
>> One easily reproducible one is a denial of service
>> during the 10,000 task test where it takes multiple seconds to cat
>> small files. I saw XFS do this on both spinning disk and tmpfs, and
>> I have seen it hang for minutes trying to list a directory. I looked
>> a bit into it, and I see that you are blocking for aeons trying to
>> acquire a lock in open.
>
> Yes, that's the usual case when XFS is waiting on buffer readahead
> IO completion. The latency of which is completely determined by
> block layer queuing and scheduling behaviour. And the block device
> queue is being dominated by the 10,000 concurrent write processes
> you just ran.....
>
> "Doctor, it hurts when I do this!"
It only hurts XFS (and sometimes Btrfs) when you do that. I believe
your theory is wrong about the cause, or at least Ext4 and Tux3 skirt
that issue somehow. We definitely did not do anything special to avoid
it.
>> You and I both know the truth: Ext4 is the only really reliable
>> general purpose filesystem on Linux at the moment.
>
> That's the funniest thing I've read in a long time :)
I'm glad I could lighten your day, but I remain uncomfortable with the
read starvation issues and the massive long lock holds I see. Perhaps
XFS is stable if you don't push too many tasks at it.
[snipped the interesting ramdisk performance bug hunt]
OK, fair enough, you get a return match on SSD when I get hold of one.
>> I wouldn't be so sure about that...
>>
>> Tasks: 8 16 32
>> Ext4: 93.06 MB/s 98.67 MB/s 102.16 MB/s
>> XFS: 81.10 MB/s 79.66 MB/s 73.27 MB/s
>> Btrfs: 43.77 MB/s 64.81 MB/s 90.35 MB/s ...
>
> Ext4: 807.21 MB/s 1089.89 MB/s 867.55 MB/s
> XFS: 997.77 MB/s 1011.51 MB/s 876.49 MB/s
> Btrfs: 55.66 MB/s 56.77 MB/s 60.30 MB/s
>
> Numbers are again very different for XFS and ext4 on /dev/ramX on my
> system. Need to work out why yours are so low....
Your machine makes mine look like a PCjr.
>> Ahem, are you the same person for whom fsync was the most important
>> issue in the world last time the topic came up, to the extent of
>> spreading around FUD and entirely ignoring the great work we had
>> accomplished for regular file operations? ...
>
> Actually, I don't remember any discussions about fsync.
Here:
http://www.spinics.net/lists/linux-fsdevel/msg64825.html
(Re: Tux3 Report: Faster than tmpfs, what?)
It still rankles that you took my innocent omission of the detail that
Hirofumi had removed the fsyncs from dbench and turned it into a major
FUD attack, casting aspersions on our integrity. We removed the fsyncs
because we weren't interested in measuring something we had not
implemented yet, it is that simple.
That, plus Ted's silly pronouncements that I could not answer at the
time, is what motivated me to design and implement an fsync that would
not just be competitive, but would righteously kick the tails of XFS
and Ext4, which is done. If I were you, I would wait for the code drop,
verify it, and then give credit where credit is due. Then I would
figure out how to make XFS work like that.
> Things I remember that needed addressing are:
> - the lack of ENOSPC detection
> - the writeback integration issues
> - the code cleanliness issues (ifdef mess, etc)
> - the page forking design problems
> - the lack of scalable inode and space allocation
> algorithms.
>
> Those are the things I remember, and fsync performance pales in
> comparison to those.
With the exception of "page forking design", it is the same list as
ours, with progress on all of them. I freely admit that optimized fsync
was not on the critical path, but you made it an issue so I addressed
it. Anyway, I needed to hone my kernel debugging skills and that worked
out well.
>> I said then that when we
>> got around to a proper fsync it would be competitive. Now here it
>> is, so you want to change the topic. I understand.
>
> I haven't changed the topic, just the storage medium. The simple
> fact is that the world is moving away from slow sata storage at a
> pretty rapid pace and it's mostly going solid state. Spinning disks
> are also changing - they are going to ZBC based SMR, which is a
> completely different problem space which doesn't even appear to be
> on the tux3 radar....
>
> So where does tux3 fit into a storage future of byte addressable
> persistent memory and ZBC based SMR devices?
You won't convince us to abandon spinning rust, it's going to be around
a lot longer than you think. Obviously, we care about SSD and I believe
you will find that Tux3 is more than competitive there. We lay things
out in a very erase block friendly way. We need to address the volume
wrap issue of course, and that is in progress. This is much easier than
spinning disk.
Tux3's redirect-on-write[1] is obviously a natural for SMR, however
I will not get excited about it unless a vendor waves money.
Regards,
Daniel
[1] Copy-on-write is a misnomer because there is no copy. The proper
term is "redirect-on-write".
On Fri, 1 May 2015, Daniel Phillips wrote:
> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>
>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>> is a high performance filesystem. It is also becoming more relevant
>> to general purpose systems as low cost storage gains capabilities
>> that used to be considered the domain of high performance storage...
>
> OK. Well, Tux3 is general purpose and that means we care about single
> spinning disk and small systems.
keep in mind that if you optimize only for the small systems you may not scale
as well to the larger ones.
>>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>>> samsung 840 EVO SSD and show how much the picture changes.
>>>
>>> I will go you one better, I ran a series of fsync tests using
>>> tmpfs, and I now have a very clear picture of how the picture
>>> changes. The executive summary is: Tux3 is still way faster, and
>>> still scales way better to large numbers of tasks. I have every
>>> confidence that the same is true of SSD.
>>
>> /dev/ramX can't be compared to an SSD. Yes, they both have low
>> seek/IO latency but they have very different dispatch and IO
>> concurrency models. One is synchronous, the other is fully
>> asynchronous.
>
> I had ram available and no SSD handy to abuse. I was interested in
> measuring the filesystem overhead with the device factored out. I
> mounted loopback on a tmpfs file, which seems to be about the same as
> /dev/ram, maybe slightly faster, but much easier to configure. I ran
> some tests on a ramdisk just now and was mortified to find that I have
> to reboot to empty the disk. It would take a compelling reason before
> I do that again.
>
>> This is an important distinction, as we'll see later on....
>
> I regard it as predictive of Tux3 performance on NVM.
Per the ramdisk point, possibly not as relevant as you may think. This is why
it's good to test on as many different systems as you can. As you run into
different types of performance you can then pick ones to keep and test all the
time.
Single spinning disk is interesting now, but will be less interesting later.
Multiple spinning disks in an array of some sort are going to remain very
interesting for quite a while.
now, some things take a lot more work to test than others. Getting time on a
system with a high performance, high capacity RAID is hard, but getting hold of
an SSD from Fry's is much easier. If it's a budget item, ping me directly and I
can donate one for testing (the cost of a drive is within my unallocated budget
and using that to improve Linux is worthwhile)
>>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>>
>>> Ext4: 1.40s
>>> XFS: 1.10s
>>> Btrfs: 1.56s
>>> Tux3: 1.07s
>>
>> 3% is not "significantly faster". It's within run to run variation!
>
> You are right, XFS and Tux3 are within experimental error for single
> syncs on the ram disk, while Ext4 and Btrfs are way slower:
>
> Ext4: 1.59s
> XFS: 1.11s
> Btrfs: 1.70s
> Tux3: 1.11s
>
> A distinct performance gap appears between Tux3 and XFS as parallel
> tasks increase.
It will be interesting to see if this continues to be true on more systems. I
hope it does.
>>> You wish. In fact, Tux3 is a lot faster. ...
>>
>> Yes, it's easy to be fast when you have simple, naive algorithms and
>> an empty filesystem.
>
> No it isn't or the others would be fast too. In any case our algorithms
> are far from naive, except for allocation. You can rest assured that
> when allocation is brought up to a respectable standard in the fullness
> of time, it will be competitive and will not harm our clean filesystem
> performance at all.
>
> There is no call for you to disparage our current achievements, which
> are significant. I do not mind some healthy skepticism about the
> allocation work, you know as well as anyone how hard it is. However your
> denial of our current result is irritating and creates the impression
> that you have an agenda. If you want to complain about something real,
> complain that our current code drop is not done yet. I will humbly
> apologize, and the same for enospc.
As I'm reading Dave's comments, he isn't attacking you the way you seem to think
he is. He is pointing out that there are problems with your data, but he's also
taking a lot of time to explain what's happening (and yes, some of this is
probably because your simple tests with XFS made it look so bad)
the other filesystems don't use naive algorithms, they use something more
complex, and while your current numbers are interesting, they are only
preliminary until you add something to handle fragmentation. That can cause very
significant problems. Remember how fabulous btrfs looked in the initial reports?
and then corner cases were found that caused real problems and as the algorithms
have been changed to prevent those corner cases from being so easy to hit, the
common case has suffered somewhat. This isn't an attack on Tux3 or btrfs, it's
just a reality of programming. If you are not accounting for all the corner
cases, everything is easier, and faster.
>> That's roughly 10x faster than your numbers. Can you describe your
>> test setup in detail? e.g. post the full log from block device
>> creation to benchmark completion so I can reproduce what you are
>> doing exactly?
>
> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
> more substantial, so I can't compare my numbers directly to yours.
If you are doing tests with a 4G ramdisk on a machine with only 4G of RAM, it
seems like you end up testing a lot more than just the filesystem. Testing in
such low memory situations can identify significant issues, but it is
questionable as a 'which filesystem is better' benchmark.
> Clearly the curve is the same: your numbers increase 10x going from 100
> to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
> significantly flatter and starts from a lower base, so it ends with a
> really wide gap. You will need to take my word for that for now. I
> promise that the beer is on me should you not find that reproducible.
>
> The repository delay is just about not bothering Hirofumi for a merge
> while he finishes up his inode table anti-fragmentation work.
Just a suggestion, but before you do a huge post about how great your filesystem
is performing, making the code available so that others can test it when
prompted by your post is probably a very good idea. If it means that you have to
send out your post a week later, it's a very small cost for the benefit of
having other people able to easily try it on hardware that you don't have access
to.
If there is a reason to post without the code being in the main, publicised
repo, then your post should point people at what code they can use to duplicate
it.
but really, 11 months without updating the main repo?? This is Open Source
development, publish early and often.
>>> Note: you should recheck your final number for Btrfs. I have seen
>>> Btrfs fall off the rails and take wildly longer on some tests just
>>> like that.
>>
>> Completely reproducible...
>
> I believe you. I found that Btrfs does that way too much. So does XFS
> from time to time, when it gets up into lots of tasks. Read starvation
> on XFS is much worse than Btrfs, and XFS also exhibits some very
> undesirable behavior with initial file create. Note: Ext4 and Tux3 have
> roughly zero read starvation in any of these tests, which pretty much
> proves it is not just a block scheduler thing. I don't think this is
> something you should dismiss.
something to investigate, but I have seen problems on ext* in the past. ext4 may
have fixed this, or it may just have moved the point where it triggers.
>>> I wouldn't be so sure about that...
>>>
>>> Tasks: 8 16 32
>>> Ext4: 93.06 MB/s 98.67 MB/s 102.16 MB/s
>>> XFS: 81.10 MB/s 79.66 MB/s 73.27 MB/s
>>> Btrfs: 43.77 MB/s 64.81 MB/s 90.35 MB/s ...
>>
>> Ext4: 807.21 MB/s 1089.89 MB/s 867.55 MB/s
>> XFS: 997.77 MB/s 1011.51 MB/s 876.49 MB/s
>> Btrfs: 55.66 MB/s 56.77 MB/s 60.30 MB/s
>>
>> Numbers are again very different for XFS and ext4 on /dev/ramX on my
>> system. Need to work out why yours are so low....
>
> Your machine makes mine look like a PCjr.
The interesting thing here is that on the faster machine btrfs didn't speed up
significantly while ext4 and xfs did. It will be interesting to see what the
results are for tux3
and both of you need to remember that while servers are getting faster, we are
also seeing much lower power, weaker servers showing up as well. And while these
smaller servers are not trying to do the 10000 thread fsync workload, they are
using flash based storage more frequently than they are spinning rust
(frequently through the bottleneck of an SD card) so continuing tests on low end
devices is good.
>>> I said then that when we
>>> got around to a proper fsync it would be competitive. Now here it
>>> is, so you want to change the topic. I understand.
>>
>> I haven't changed the topic, just the storage medium. The simple
>> fact is that the world is moving away from slow sata storage at a
>> pretty rapid pace and it's mostly going solid state. Spinning disks
>> are also changing - they are going to ZBC based SMR, which is a
>> completely different problem space which doesn't even appear to be
>> on the tux3 radar....
>>
>> So where does tux3 fit into a storage future of byte addressable
>> persistent memory and ZBC based SMR devices?
>
> You won't convince us to abandon spinning rust, it's going to be around
> a lot longer than you think. Obviously, we care about SSD and I believe
> you will find that Tux3 is more than competitive there. We lay things
> out in a very erase block friendly way. We need to address the volume
> wrap issue of course, and that is in progress. This is much easier than
> spinning disk.
>
> Tux3's redirect-on-write[1] is obviously a natural for SMR, however
> I will not get excited about it unless a vendor waves money.
what drives are available now? see if you can get a couple (either directly or
donated)
David Lang
On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
> On Fri, 1 May 2015, Daniel Phillips wrote:
>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>
>>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>> is a high performance filesystem. It is also becoming more relevant
>>> to general purpose systems as low cost storage gains capabilities
>>> that used to be considered the domain of high performance storage...
>>
>> OK. Well, Tux3 is general purpose and that means we care about single
>> spinning disk and small systems.
>
> keep in mind that if you optimize only for the small systems
> you may not scale as well to the larger ones.
Tux3 is designed to scale, and it will when the time comes. I look
forward to putting Shardmap through its billion file test in due course.
However, right now it would be wise to stay focused on basic
functionality suited to a workstation because volunteer devs tend to
have those. After that, phones are a natural direction, where hard core
ACID commit and really smooth file ops are particularly attractive.
> Per the ramdisk: possibly not as relevant as you may think.
> This is why it's good to test on as many different systems as
> you can. As you run into different types of performance behavior
> you can then pick the tests to keep and run all the time.
I keep being surprised how well it works for things we never tested
before.
> Single spinning disk is interesting now, but will be less
> interesting later. multiple spinning disks in an array of some
> sort is going to remain very interesting for quite a while.
The way to do md well is to integrate it into the block layer like
FreeBSD does (GEOM) and expose a richer interface for the filesystem.
That is how I think Tux3 should work with big iron raid. I hope to be
able to tackle that sometime before the stars start winking out.
> now, some things take a lot more work to test than others.
> Getting time on a system with a high performance, high capacity
> RAID is hard, but getting hold of an SSD from Fry's is much
> easier. If it's a budget item, ping me directly and I can donate
> one for testing (the cost of a drive is within my unallocated
> budget and using that to improve Linux is worthwhile)
Thanks.
> As I'm reading Dave's comments, he isn't attacking you the way
> you seem to think he is. He is pointing out that there are
> problems with your data, but he's also taking a lot of time to
> explain what's happening (and yes, some of this is probably
> because your simple tests with XFS made it look so bad)
I hope the lightening up is a trend.
> the other filesystems don't use naive algorithms, they use
> something more complex, and while your current numbers are
> interesting, they are only preliminary until you add something
> to handle fragmentation. That can cause very significant
> problems.
Fsync is pretty much agnostic to fragmentation, so those results are
unlikely to change substantially even if we happen to do a lousy job on
allocation policy, which I naturally consider unlikely. In fact, Tux3
fsync is going to get faster over time for a couple of reasons: the
minimum blocks per commit will be reduced, and we will get rid of most
of the seeks to beginning of volume that we currently suffer per commit.
> Remember how fabulous btrfs looked in the initial
> reports? and then corner cases were found that caused real
> problems and as the algorithms have been changed to prevent
> those corner cases from being so easy to hit, the common case
> has suffered somewhat. This isn't an attack on Tux3 or btrfs,
> it's just a reality of programming. If you are not accounting
> for all the corner cases, everything is easier, and faster.
>> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
>> more substantial, so I can't compare my numbers directly to yours.
>
> If you are doing tests with a 4G ramdisk on a machine with only
> 4G of RAM, it seems like you end up testing a lot more than just
> the filesystem. Testing in such low memory situations can
> identify significant issues, but it is questionable as a 'which
> filesystem is better' benchmark.
A 1.3 GB tmpfs, and sorry, the machine actually has 10 GB of RAM (the
machine next to it is the 4 GB one).
I am careful to ensure the test environment does not have spurious
memory or cpu hogs. I will not claim that this is the most sterile test
environment possible, but it is adequate for the task at hand. Nearly
always, when I find big variations in the test numbers it turns out to
be a quirk of one filesystem that is not exhibited by the others.
Everything gets multiple runs and lands in a spreadsheet. Any fishy
variance is investigated.
By the way, the low variance kings by far are Ext4 and Tux3, and of
those two, guess which one is more consistent. XFS is usually steady,
but can get "emotional" with lots of tasks, and Btrfs has regular wild
mood swings whenever the stars change alignment. And while I'm making
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.
> Just a suggestion, but before you do a huge post about how
> great your filesystem is performing, making the code available
> so that others can test it when prompted by your post is
> probably a very good idea. If it means that you have to send out
> your post a week later, it's a very small cost for the benefit
> of having other people able to easily try it on hardware that
> you don't have access to.
Next time. This time I wanted it off my plate as soon as possible so I
could move on to enospc work. And this way is more involving, we get a
little suspense before the rematch.
> If there is a reason to post without the code being in the
> main, publicised repo, then your post should point people at
> what code they can use to duplicate it.
I could have included the patch in the post, it is small enough. If it
still isn't in the repo in a few days then I will post it, to avoid
giving the impression I'm desperately trying to fix obscure bugs in it,
which isn't the case.
> but really, 11 months without updating the main repo?? This is
> Open Source development, publish early and often.
It's not as bad as that:
https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi
https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi-user
> something to investigate, but I have seen problems on ext* in
> the past. ext4 may have fixed this, or it may just have moved
> the point where it triggers.
My spectrum of tests is small and I am not hunting for anomalies, only
reporting what happened to come up. It is not very surprising that some
odd things happen with 10,000 tasks, there is probably not much test
coverage there. On the whole I was surprised and impressed when all
filesystems mostly just worked. I was expecting to hit scheduler issues
for one thing, and nothing obvious came up. Also, not one oops on any
filesystem (even Tux3) and only one assert, already reported upstream
and turned out to be fixed a week or two ago.
>>> ...
>> Your machine makes mine look like a PCjr. ...
>
> The interesting thing here is that on the faster machine btrfs
> didn't speed up significantly while ext4 and xfs did. It will be
> interesting to see what the results are for tux3
The numbers are well into the something-is-really-wrong zone (and I
should have flagged that earlier but it was a long day). That test is
supposed to be -s, all synchronous, and his numbers are more typical of
async. Needs double checking all round, including here. Anybody can
replicate that test, it is only an apt-get install dbench away (hint
hint).
Differences: my numbers are kvm with loopback mount on tmpfs. His are
on ramdisk and probably native. I have to reboot to make a ramdisk big
enough to run dbench and I would rather not right now.
How important is it to get to the bottom of the variance in test
results running on RAM? Probably important in the long run, because
storage devices are looking more like RAM all the time, but as of
today, maybe not very urgent.
Also, I was half expecting somebody to question the wisdom of running
benchmarks under KVM instead of native, but nobody did. Just for the
record, I would respond: running virtual probably accounts for the
majority of server instances today.
> and both of you need to remember that while servers are getting
> faster, we are also seeing much lower power, weaker servers
> showing up as well. And while these smaller servers are not
>> trying to do the 10000 thread fsync workload, they are using
> flash based storage more frequently than they are spinning rust
>> (frequently through the bottleneck of an SD card) so continuing
> tests on low end devices is good.
Low end servers and embedded concern me more, indeed.
> what drives are available now? see if you can get a couple
> (either directly or donated)
Right, time to hammer on flash.
Regards,
Daniel
On the 2nd of May 2015 12:26, Daniel Phillips wrote:
Aloha everybody
> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>
>>>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>>>> is a high performance filesystem. It is also becoming more relevant
>>>> to general purpose systems as low cost storage gains capabilities
>>>> that used to be considered the domain of high performance storage...
>>>
>>> OK. Well, Tux3 is general purpose and that means we care about single
>>> spinning disk and small systems.
>>
>> keep in mind that if you optimize only for the small systems you may
>> not scale as well to the larger ones.
>
> Tux3 is designed to scale, and it will when the time comes. I look
> forward to putting Shardmap through its billion file test in due
> course. However, right now it would be wise to stay focused on basic
> functionality suited to a workstation because volunteer devs tend to
> have those. After that, phones are a natural direction, where hard
> core ACID commit and really smooth file ops are particularly attractive.
>
Does anybody else have a déjà vu?
Nevertheless, why don't you just put your fsync, your interpretations
(ACID, Shardmap, etc.) of my things (OntoFS, the file system of SASOS4Fun,
and OntoBase), and whatever other gimmicks you have in mind together into a
prototypical file system, test it, and send a message to this mailing list
with a short description focused solely on the foundational ideas (others'
and your own) and the technical features, a WWW address where the code can
be found, and your test results, without the marketing and self-promotion,
BEFORE anybody else does similar work or I get so annoyed that I implement
it myself, and not "in due course" and not after the fact?
Also, if it is so general that XFS and EXT4 should adapt it, then why
don't you help to improve these file systems?
Btw.: I have retracted the claims I made in that email, but I have
definitely not given up my copyright, if it is valid.
>> Per the ramdisk: possibly not as relevant as you may think. This
>> is why it's good to test on as many different systems as you can. As
>> you run into different types of performance behavior you can then pick
>> the tests to keep and run all the time.
>
> I keep being surprised how well it works for things we never tested
> before.
>
>> Single spinning disk is interesting now, but will be less interesting
>> later. multiple spinning disks in an array of some sort is going to
>> remain very interesting for quite a while.
>
> The way to do md well is to integrate it into the block layer like
> FreeBSD does (GEOM) and expose a richer interface for the filesystem.
> That is how I think Tux3 should work with big iron raid. I hope to be
> able to tackle that sometime before the stars start winking out.
>
>> now, some things take a lot more work to test than others. Getting
>> time on a system with a high performance, high capacity RAID is hard,
>> but getting hold of an SSD from Fry's is much easier. If it's a
>> budget item, ping me directly and I can donate one for testing (the
>> cost of a drive is within my unallocated budget and using that to
>> improve Linux is worthwhile)
>
> Thanks.
>
>> As I'm reading Dave's comments, he isn't attacking you the way you
>> seem to think he is. He is pointing out that there are problems with
>> your data, but he's also taking a lot of time to explain what's
>> happening (and yes, some of this is probably because your simple
>> tests with XFS made it look so bad)
>
> I hope the lightening up is a trend.
>
>> the other filesystems don't use naive algorithms, they use something
>> more complex, and while your current numbers are interesting, they
>> are only preliminary until you add something to handle fragmentation.
>> That can cause very significant problems.
>
> Fsync is pretty much agnostic to fragmentation, so those results are
> unlikely to change substantially even if we happen to do a lousy job
> on allocation policy, which I naturally consider unlikely. In fact,
> Tux3 fsync is going to get faster over time for a couple of reasons:
> the minimum blocks per commit will be reduced, and we will get rid of
> most of the seeks to beginning of volume that we currently suffer per
> commit.
>
>> Remember how fabulous btrfs looked in the initial reports? and then
>> corner cases were found that caused real problems and as the
>> algorithms have been changed to prevent those corner cases from being
>> so easy to hit, the common case has suffered somewhat. This isn't an
>> attack on Tux3 or btrfs, it's just a reality of programming. If you
>> are not accounting for all the corner cases, everything is easier,
>> and faster.
>
>>> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
>>> more substantial, so I can't compare my numbers directly to yours.
>>
>> If you are doing tests with a 4G ramdisk on a machine with only 4G of
>> RAM, it seems like you end up testing a lot more than just the
>> filesystem. Testing in such low memory situations can identify
>> significant issues, but it is questionable as a 'which filesystem is
>> better' benchmark.
>
> A 1.3 GB tmpfs, and sorry, the machine actually has 10 GB of RAM (the
> machine next to it is the 4 GB one).
> I am careful to ensure the test environment does not have spurious
> memory or cpu hogs. I will not claim that this is the most sterile
> test environment possible, but it is adequate for the task at hand.
> Nearly always, when I find big variations in the test numbers it turns
> out to be a quirk of one filesystem that is not exhibited by the
> others. Everything gets multiple runs and lands in a spreadsheet. Any
> fishy variance is investigated.
>
> By the way, the low variance kings by far are Ext4 and Tux3, and of
> those two, guess which one is more consistent. XFS is usually steady,
> but can get "emotional" with lots of tasks, and Btrfs has regular wild
> mood swings whenever the stars change alignment. And while I'm making
> gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.
>
>> Just a suggestion, but before you do a huge post about how great your
>> filesystem is performing, making the code available so that others
>> can test it when prompted by your post is probably a very good idea.
>> If it means that you have to send out your post a week later, it's a
>> very small cost for the benefit of having other people able to easily
>> try it on hardware that you don't have access to.
>
> Next time. This time I wanted it off my plate as soon as possible so I
> could move on to enospc work. And this way is more involving, we get a
> little suspense before the rematch.
>
>> If there is a reason to post without the code being in the main,
>> publicised repo, then your post should point people at what code they
>> can use to duplicate it.
>
> I could have included the patch in the post, it is small enough. If it
> still isn't in the repo in a few days then I will post it, to avoid
> giving the impression I'm desperately trying to fix obscure bugs in
> it, which isn't the case.
>
>> but really, 11 months without updating the main repo?? This is Open
>> Source development, publish early and often.
>
> It's not as bad as that:
>
> https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi
> https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi-user
>
>> something to investigate, but I have seen problems on ext* in the
>> past. ext4 may have fixed this, or it may just have moved the point
>> where it triggers.
>
> My spectrum of tests is small and I am not hunting for anomalies, only
> reporting what happened to come up. It is not very surprising that some
> odd things happen with 10,000 tasks, there is probably not much test
> coverage there. On the whole I was surprised and impressed when all
> filesystems mostly just worked. I was expecting to hit scheduler
> issues for one thing, and nothing obvious came up. Also, not one oops
> on any filesystem (even Tux3) and only one assert, already reported
> upstream and turned out to be fixed a week or two ago.
>
>>>> ...
>>> Your machine makes mine look like a PCjr. ...
>>
>> The interesting thing here is that on the faster machine btrfs didn't
>> speed up significantly while ext4 and xfs did. It will be interesting
>> to see what the results are for tux3
>
> The numbers are well into the something-is-really-wrong zone (and I
> should have flagged that earlier but it was a long day). That test is
> supposed to be -s, all synchronous, and his numbers are more typical of
> async. Needs double checking all round, including here. Anybody can
> replicate that test, it is only an apt-get install dbench away (hint
> hint).
>
> Differences: my numbers are kvm with loopback mount on tmpfs. His are
> on ramdisk and probably native. I have to reboot to make a ramdisk big
> enough to run dbench and I would rather not right now.
>
> How important is it to get to the bottom of the variance in test
> results running on RAM? Probably important in the long run, because
> storage devices are looking more like RAM all the time, but as of
> today, maybe not very urgent.
>
> Also, I was half expecting somebody to question the wisdom of running
> benchmarks under KVM instead of native, but nobody did. Just for the
> record, I would respond: running virtual probably accounts for the
> majority of server instances today.
>
>> and both of you need to remember that while servers are getting
>> faster, we are also seeing much lower power, weaker servers showing
>> up as well. And while these smaller servers are not trying to do the
>> 10000 thread fsync workload, they are using flash based storage more
>> frequently than they are spinning rust (frequently through the
>> bottleneck of an SD card) so continuing tests on low end devices is good.
>
> Low end servers and embedded concern me more, indeed.
>> what drives are available now? see if you can get a couple (either
>> directly or donated)
>
> Right, time to hammer on flash.
>
> Regards,
>
> Daniel
> --
Thanks
Best Regards
Have fun
C.S.
On Sat, May 2, 2015 at 6:00 PM, Christian Stroetmann
<[email protected]> wrote:
> On the 2nd of May 2015 12:26, Daniel Phillips wrote:
>
> Aloha everybody
>
>> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>>>
>>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>>>
>>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>>
>>>>>
>>>>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>>>>> is a high performance filesystem. It is also becoming more relevant
>>>>> to general purpose systems as low cost storage gains capabilities
>>>>> that used to be considered the domain of high performance storage...
>>>>
>>>>
>>>> OK. Well, Tux3 is general purpose and that means we care about single
>>>> spinning disk and small systems.
>>>
>>>
>>> keep in mind that if you optimize only for the small systems you may not
>>> scale as well to the larger ones.
>>
>>
>> Tux3 is designed to scale, and it will when the time comes. I look forward
>> to putting Shardmap through its billion file test in due course. However,
>> right now it would be wise to stay focused on basic functionality suited to
>> a workstation because volunteer devs tend to have those. After that, phones
>> are a natural direction, where hard core ACID commit and really smooth file
>> ops are particularly attractive.
>>
>
> Does anybody else have a déjà vu?
Yes, the onto-troll strikes again...
--
Thanks,
//richard
On 2nd of May 2015 18:30, Richard Weinberger wrote:
> On Sat, May 2, 2015 at 6:00 PM, Christian Stroetmann
> <[email protected]> wrote:
>> On the 2nd of May 2015 12:26, Daniel Phillips wrote:
>>
>> Aloha everybody
>>
>>> On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
>>>> On Fri, 1 May 2015, Daniel Phillips wrote:
>>>>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>>>>>
>>>>>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>>>>>> is a high performance filesystem. It is also becoming more relevant
>>>>>> to general purpose systems as low cost storage gains capabilities
>>>>>> that used to be considered the domain of high performance storage...
>>>>>
>>>>> OK. Well, Tux3 is general purpose and that means we care about single
>>>>> spinning disk and small systems.
>>>>
>>>> keep in mind that if you optimize only for the small systems you may not
>>>> scale as well to the larger ones.
>>>
>>> Tux3 is designed to scale, and it will when the time comes. I look forward
>>> to putting Shardmap through its billion file test in due course. However,
>>> right now it would be wise to stay focused on basic functionality suited to
>>> a workstation because volunteer devs tend to have those. After that, phones
>>> are a natural direction, where hard core ACID commit and really smooth file
>>> ops are particularly attractive.
>>>
>> Does anybody else have a déjà vu?
> Yes, the onto-troll strikes again...
>
Everybody has her/his own interpretation of what open source means.
I really thought there could be some kind of constructive discussion
about such a file system, or at least about interesting technical
features that could be used for other file systems, e.g. a potential
EXT5, when I relaxed my position some days ago and proposed that ideas
also be referenced correctly in relation to open source projects,
specifically in relation to Tux3.
Now, I have the impression that this is not possible, and because of
that any progress is hard to achieve.
Thanks
Best Regards
Do not feed the trolls.
C.S.
Hi!
> > It is a fact of life that when you change one aspect of an intimately interconnected system,
> > something else will change as well. You have naive/nonexistent free space management now; when you
> > design something workable there it is going to impact everything else you've already done. It's an
> > easy bet that the impact will be negative, the only question is to what degree.
>
> You might lose that bet. For example, suppose we do strictly linear allocation
> each delta, and just leave nice big gaps between the deltas for future
> expansion. Clearly, we run at similar or identical speed to the current naive
> strategy until we must start filling in the gaps, and at that point our layout
> is not any worse than XFS, which started bad and stayed that way.
Umm, are you sure? If "some areas of disk are faster than others" is
still true on today's hard drives, the gaps will decrease the
performance (as you'll "use up" the fast areas more quickly).
Anyway... you have a brand new filesystem. Of course it should be
faster/better/nicer than the existing filesystems. So don't be too
harsh with the XFS people.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
> Umm, are you sure? If "some areas of disk are faster than others" is
> still true on today's hard drives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).
It's still true. The difference between O.D. and I.D. (outer diameter
vs inner diameter) LBA's is typically a factor of 2. This is why
"short-stroking" works as a technique, and another way that people
doing competitive benchmarking can screw up and produce misleading
numbers. (If you use partitions instead of the whole disk, you have
to use the same partition in order to make sure you aren't comparing
apples with oranges.)
Cheers,
- Ted
Hi Pavel,
On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>> design something workable there it is going to impact everything else you've already done. It's an
>>> easy bet that the impact will be negative, the only question is to what degree.
>>
>> You might lose that bet. For example, suppose we do strictly linear allocation
>> each delta, and just leave nice big gaps between the deltas for future
>> expansion. Clearly, we run at similar or identical speed to the current naive
>> strategy until we must start filling in the gaps, and at that point our layout
>> is not any worse than XFS, which started bad and stayed that way.
>
> Umm, are you sure? If "some areas of disk are faster than others" is
> still true on today's hard drives, the gaps will decrease the
> performance (as you'll "use up" the fast areas more quickly).
That's why I hedged my claim with "similar or identical". The
difference in media speed seems to be a relatively small effect
compared to extra seeks. It seems that XFS puts big spaces between
new directories, and suffers a lot of extra seeks because of it.
I propose to batch new directories together initially, then change
the allocation goal to a new, relatively empty area if a big batch
of files lands on a directory in a crowded region. The "big" gaps
would be on the order of delta size, so not really very big.
Anyway, some people seem to have pounced on the words "naive" and
"linear allocation" and jumped to the conclusion that our whole
strategy is naive. Far from it. We don't just throw files randomly
at the disk. We sort and partition files and metadata, and we
carefully arrange the order of our allocation operations so that
linear allocation produces a nice layout for both read and write.
This turned out to be so much better than fiddling with the goal
of individual allocations that we concluded we would get best
results by sticking with linear allocation, but improve our sort
step. The new plan is to partition updates into batches according
to some affinity metrics, and set the linear allocation goal per
batch. So for example, big files and append-type files can get
special treatment in separate batches, while files that seem to
be related because of having the same directory parent and being
written in the same delta will continue to be streamed out using
"naive" linear allocation, which is not necessarily as naive as
one might think.
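(To make the batching idea concrete, here is a minimal user-space sketch.
The struct, the affinity test and the size cutoff are made-up stand-ins for
illustration, not actual Tux3 code; the only point is that the allocation
goal is chosen once per batch and allocation within a batch stays linear.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, for illustration only -- not Tux3 code */
struct dirty_file {
    uint64_t parent_dir;    /* inum of the parent directory */
    uint64_t bytes_dirty;   /* bytes dirtied in this delta */
    bool append_only;       /* heuristic: tail-append pattern observed */
    uint64_t goal;          /* linear allocation goal assigned below */
};

enum batch_kind { BATCH_SMALL, BATCH_BIG, BATCH_APPEND, BATCH_KINDS };

/* Classify a dirty file into an affinity batch. A real affinity metric
 * would also look at the parent directory and the write history. */
static enum batch_kind classify(const struct dirty_file *f)
{
    if (f->append_only)
        return BATCH_APPEND;
    if (f->bytes_dirty >= (uint64_t)1 << 20)  /* arbitrary "big file" cutoff */
        return BATCH_BIG;
    return BATCH_SMALL;
}

/* Give every file in a batch the same goal, so flushing the batch streams
 * out linearly from that goal onward. */
void set_delta_goals(struct dirty_file *files, size_t count,
                     const uint64_t batch_goal[BATCH_KINDS])
{
    for (size_t i = 0; i < count; i++)
        files[i].goal = batch_goal[classify(&files[i])];
}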
It will take time and a lot of performance testing to get this
right, but nobody should get the idea that this reflects any inherent
design limitation. The opposite is true: we have no restrictions
at all in media layout.
Compared to Ext4, we do need to address the issue that data moves
around when updated. This can cause rapid fragmentation. Btrfs has
shown issues with that for big, randomly updated files. We want to
fix it without falling back on update-in-place as Btrfs does.
Actually, Tux3 already has update-in-place, and unlike Btrfs, we
can switch to it for non-empty files. But we think that perfect data
isolation per delta is something worth fighting for, and we would
rather not force users to fiddle around with mode settings just to
make something work as well as it already does on Ext4. We will
tackle this issue by partitioning as above, and use a dedicated
allocation strategy for such files, which are easy to detect.
Metadata moving around per update does not seem to be a problem
because it is all single blocks that need very little slack space
to stay close to home.
> Anyway... you have a brand new filesystem. Of course it should be
> faster/better/nicer than the existing filesystems. So don't be too
> harsh with the XFS people.
They have done a lot of good work, but they still have a long way
to go. I don't see any shame in that.
Regards,
Daniel
On Mon, 11 May 2015, Daniel Phillips wrote:
> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>
>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>> each delta, and just leave nice big gaps between the deltas for future
>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>> strategy until we must start filling in the gaps, and at that point our layout
>>> is not any worse than XFS, which started bad and stayed that way.
>>
>> Umm, are you sure? If "some areas of disk are faster than others" is
>> still true on today's hard drives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
>
> That's why I hedged my claim with "similar or identical". The
> difference in media speed seems to be a relatively small effect
> compared to extra seeks. It seems that XFS puts big spaces between
> new directories, and suffers a lot of extra seeks because of it.
> I propose to batch new directories together initially, then change
> the allocation goal to a new, relatively empty area if a big batch
> of files lands on a directory in a crowded region. The "big" gaps
> would be on the order of delta size, so not really very big.
This is an interesting idea, but what happens if the files don't arrive as a big
batch, but rather trickle in over time (think a logserver that is putting files
into a bunch of directories at a fairly modest rate per directory)
And when you then decide that you have to move the directory/file info, doesn't
that create a potentially large amount of unexpected IO that could end up
interfering with what the user is trying to do?
David Lang
On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
>> Umm, are you sure? If "some areas of disk are faster than others" is
>> still true on today's hard drives, the gaps will decrease the
>> performance (as you'll "use up" the fast areas more quickly).
>
> It's still true. The difference between O.D. and I.D. (outer diameter
> vs inner diameter) LBA's is typically a factor of 2. This is why
> "short-stroking" works as a technique,
That is true, and the effect is not dominant compared to introducing
a lot of extra seeks.
> and another way that people
> doing competitive benchmarking can screw up and produce misleading
> numbers.
If you think we screwed up or produced misleading numbers, could you
please be up front about it instead of making insinuations and
continuing your tirade against benchmarking and those who do it.
> (If you use partitions instead of the whole disk, you have
> to use the same partition in order to make sure you aren't comparing
> apples with oranges.)
You can rest assured I did exactly that.
Somebody complained that things would look much different with seeks
factored out, so here are some new "competitive benchmarks" using
fs_mark on a ram disk:
tasks         1       16       64
---------------------------------
ext4:       231     2154     5439
btrfs:      152      962     2230
xfs:        268     2729     6466
tux3:       315     5529    20301
(Files per second, more is better)
The shell commands are:
fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s1048576 -w4096 -n1000 -t1
fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s65536 -w4096 -n1000 -t16
fs_mark -dtest -D5 -N100 -L1 -p5 -r5 -s4096 -w4096 -n1000 -t64
The ram disk removes seek overhead and greatly reduces media transfer
overhead. This does not change things much: it confirms that Tux3 is
significantly faster than the others at synchronous loads. This is
apparently true independently of media type, though to be sure SSD
remains to be tested.
The really interesting result is how much difference there is between
filesystems, even on a ram disk. Is it just CPU or is it synchronization
strategy and lock contention? Does our asynchronous front/back design
actually help a lot, instead of being a disadvantage as you predicted?
It is too bad that fs_mark caps number of tasks at 64, because I am
sure that some embarrassing behavior would emerge at high task counts,
as with my tests on spinning disk.
Anyway, everybody but you loves competitive benchmarks, that is why I
post them. They are not only useful for tracking down performance bugs,
but as you point out, they help us advertise the reasons why Tux3 is
interesting and ought to be merged.
Regards,
Daniel
Hi David,
On 05/11/2015 05:12 PM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
>
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure? If "some areas of disk are faster than others" is
>>> still true on today's hard drives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
>> compared to extra seeks. It seems that XFS puts big spaces between
>> new directories, and suffers a lot of extra seeks because of it.
>> I propose to batch new directories together initially, then change
>> the allocation goal to a new, relatively empty area if a big batch
>> of files lands on a directory in a crowded region. The "big" gaps
>> would be on the order of delta size, so not really very big.
>
> This is an interesting idea, but what happens if the files don't arrive as a big batch, but rather
> trickle in over time (think a logserver that is putting files into a bunch of directories at a
> fairly modest rate per directory)
If files are trickling in then we can afford to spend a lot more time
finding nice places to tuck them in. Log server files are an especially
irksome problem for a redirect-on-write filesystem because the final
block tends to be rewritten many times and we must move it to a new
location each time, so every extent ends up as one block. Oh well. If
we just make sure to have some free space at the end of the file that
only that file can use (until everywhere else is full) then the long
term result will be slightly ravelled blocks that nonetheless tend to
be on the same track or flash block as their logically contiguous
neighbours. There will be just zero or one empty data blocks mixed
into the file tail as we commit the tail block over and over with the
same allocation goal. Sometimes there will be a block or two of
metadata as well, which will eventually bake themselves into the
middle of contiguous data and stop moving around.
Putting this together, we have:
* At delta flush, break out all the log type files
* Dedicate some block groups to append type files
* Leave lots of space between files in those block groups
* Peek at the last block of the file to set the allocation goal
Something like that. What we don't want is to throw those files into
the middle of a lot of rewrite-all files, messing up both kinds of file.
We don't care much about keeping these files near the parent directory
because one big seek per log file in a grep is acceptable, we just need
to avoid thousands of big seeks within the file, and not dribble single
blocks all over the disk.
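(A rough sketch of that goal selection, with made-up types and parameters
standing in for the real allocator interface -- an illustration of the idea,
not Tux3 code:)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins -- not the real Tux3 allocator interface */
struct tail_info {
    bool has_blocks;        /* does the file already have data on media? */
    uint64_t last_block;    /* physical address of the current tail block */
    uint64_t slack;         /* free blocks reserved just past the tail */
};

/* Goal for an append-type (log style) file: stay glued to the existing
 * tail while reserved slack remains, otherwise start the file in a block
 * group dedicated to append files, leaving a generous gap per file. */
uint64_t append_goal(const struct tail_info *tail,
                     uint64_t append_group_base,
                     uint64_t per_file_gap,
                     unsigned int file_index)
{
    if (tail->has_blocks && tail->slack > 0)
        return tail->last_block + 1;   /* peek at the last block, land next to it */
    return append_group_base + (uint64_t)file_index * per_file_gap;
}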
It would also be nice to merge together extents somehow as the final
block is rewritten. One idea is to retain the final block dirty until
the next delta, and write it again into a contiguous position, so the
final block is always flushed twice. We already have the opportunistic
merge logic, but the redirty behavior and making sure it only happens
to log files would be a bit fiddly.
We will also play the incremental defragmentation card at some point,
but first we should try hard to control fragmentation in the first
place. Tux3 is well suited to online defragmentation because the delta
commit model makes it easy to move things around efficiently and safely,
but it does generate extra IO, so as a basic mechanism it is not ideal.
When we get to piling on features, that will be high on the list,
because it is relatively easy, and having that fallback gives a certain
sense of security.
> And when you then decide that you have to move the directory/file info, doesn't that create a
> potentially large amount of unexpected IO that could end up interfering with what the user is trying
> to do?
Right, we don't like that and don't plan to rely on it. What we hope
for is behavior that, when you slowly stir the pot, tends to improve the
layout just as often as it degrades it. It may indeed become harder to
find ideal places to put things as time goes by, but we also gain more
information to base decisions on.
Regards,
Daniel
On Mon, May 11, 2015 at 07:34:34PM -0700, Daniel Phillips wrote:
> Anyway, everybody but you loves competitive benchmarks, that is why I
I think Ted and I are on the same page here. "Competitive
benchmarks" only matter to the people who are trying to sell
something. You're trying to sell Tux3, but....
> post them. They are not only useful for tracking down performance bugs,
> but as you point out, they help us advertise the reasons why Tux3 is
> interesting and ought to be merged.
.... benchmarks won't get tux3 merged.
Addressing the significant issues that have been raised during
previous code reviews is what will get it merged. I posted that
list elsewhere in this thread, to which you replied that they were all
"on the list of things to do except for the page forking design".
The "except page forking design" statement is your biggest hurdle
for getting tux3 merged, not performance. Without page forking, tux3
cannot be merged at all. But it's not filesystem developers you need
to convince about the merits of the page forking design and
implementation - it's the mm and core kernel developers that need to
review and accept that code *before* we can consider merging tux3.
IOWs, you need to focus on the important things needed to achieve
your stated goal of getting tux3 merged. New filesystems should be
faster than those based on 20-25 year old designs, so you don't need
to waste time trying to convince people that tux3, when complete,
will be fast.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
> I think Ted and I are on the same page here. "Competitive
> benchmarks" only matter to the people who are trying to sell
> something. You're trying to sell Tux3, but....
By "same page", do you mean "transparently obvious about
obstructing other projects"?
> The "except page forking design" statement is your biggest hurdle
> for getting tux3 merged, not performance.
No, the "except page forking design" is because the design is
already good and effective. The small adjustments needed in core
are well worth merging because the benefits are proved by benchmarks.
So benchmarks are key and will not stop just because you don't like
the attention they bring to XFS issues.
> Without page forking, tux3
> cannot be merged at all. But it's not filesystem developers you need
> to convince about the merits of the page forking design and
> implementation - it's the mm and core kernel developers that need to
> review and accept that code *before* we can consider merging tux3.
Please do not say "we" when you know that I am just as much a "we"
as you are. Merging Tux3 is not your decision. The people whose
decision it actually is are perfectly capable of recognizing your
agenda for what it is.
http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
"XFS Developer Takes Shots At Btrfs, EXT4"
The real question is, has the Linux development process become
so political and toxic that worthwhile projects fail to benefit
from supposed grassroots community support. You are the poster
child for that.
> IOWs, you need to focus on the important things needed to acheive
> your stated goal of getting tux3 merged. New filesystems should be
> faster than those based on 20-25 year old designs, so you don't need
> to waste time trying to convince people that tux3, when complete,
> will be fast.
You know that Tux3 is already fast. Not just that of course. It
has a higher standard of data integrity than your metadata-only
journalling filesystem and a small enough code base that it can
be reasonably expected to reach the quality expected of an
enterprise class filesystem, quite possibly before XFS gets
there.
Regards,
Daniel
On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
>
>
> On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
> > On Tue, May 12, 2015 at 12:12:23AM +0200, Pavel Machek wrote:
> >> Umm, are you sure? If "some areas of disk are faster than others" is
> >> still true on today's hard drives, the gaps will decrease the
> >> performance (as you'll "use up" the fast areas more quickly).
> >
> > It's still true. The difference between O.D. and I.D. (outer diameter
> > vs inner diameter) LBA's is typically a factor of 2. This is why
> > "short-stroking" works as a technique,
>
> That is true, and the effect is not dominant compared to introducing
> a lot of extra seeks.
>
> > and another way that people
> > doing competitive benchmarking can screw up and produce misleading
> > numbers.
>
> If you think we screwed up or produced misleading numbers, could you
> please be up front about it instead of making insinuations and
> continuing your tirade against benchmarking and those who do it.
Aren't you a little harsh with Ted? He was polite.
> The ram disk removes seek overhead and greatly reduces media transfer
> overhead. This does not change things much: it confirms that Tux3 is
> significantly faster than the others at synchronous loads. This is
> apparently true independently of media type, though to be sure SSD
> remains to be tested.
>
> The really interesting result is how much difference there is between
> filesystems, even on a ram disk. Is it just CPU or is it synchronization
> strategy and lock contention? Does our asynchronous front/back design
> actually help a lot, instead of being a disadvantage as you predicted?
>
> It is too bad that fs_mark caps number of tasks at 64, because I am
> sure that some embarrassing behavior would emerge at high task counts,
> as with my tests on spinning disk.
I'd call a system with 65 tasks doing a heavy fsync load at the same time
"embarrassingly misconfigured" :-). It is nice if your filesystem can
stay fast in that case, but...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 05/12/2015 02:03 AM, Pavel Machek wrote:
> On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
>> On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
>>> and another way that people
>>> doing competitive benchmarking can screw up and produce misleading
>>> numbers.
>>
>> If you think we screwed up or produced misleading numbers, could you
>> please be up front about it instead of making insinuations and
>> continuing your tirade against benchmarking and those who do it.
>
> Aren't you a little harsh with Ted? He was polite.
Polite language does not include words like "screw up" and "misleading
numbers", those are combative words intended to undermine and disparage.
It is not clear how repeating the same words can be construed as less
polite than the original utterance.
>> The ram disk removes seek overhead and greatly reduces media transfer
>> overhead. This does not change things much: it confirms that Tux3 is
>> significantly faster than the others at synchronous loads. This is
>> apparently true independently of media type, though to be sure SSD
>> remains to be tested.
>>
>> The really interesting result is how much difference there is between
>> filesystems, even on a ram disk. Is it just CPU or is it synchronization
>> strategy and lock contention? Does our asynchronous front/back design
>> actually help a lot, instead of being a disadvantage as you predicted?
>>
>> It is too bad that fs_mark caps number of tasks at 64, because I am
>> sure that some embarrassing behavior would emerge at high task counts,
>> as with my tests on spinning disk.
>
> I'd call a system with 65 tasks doing a heavy fsync load at the same time
> "embarrassingly misconfigured" :-). It is nice if your filesystem can
> stay fast in that case, but...
Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
it tells us something about how Tux3 will scale on the big machines
that XFS currently lays claim to. And Java programmers are busy doing
all kinds of wild and crazy things with lots of tasks. Java almost
makes them do it. If they need their data durable then they can easily
create loads like my test case.
Suppose you have a web server meant to serve 10,000 transactions
simultaneously and it needs to survive crashes without losing client
state. How will you do it? You could install an expensive, finicky
database, or you could write some Java code that happens to work well
because Linux has a scheduler and a filesystem that can handle it.
Oh wait, we don't have the second one yet, but maybe we soon will.
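(For what it is worth, the per-transaction durability pattern such a server
generates is nothing exotic. Here is a minimal sketch in C, with a made-up
file name and record: append the client state, then fsync before replying.
Thousands of tasks running this loop concurrently is essentially the fsync
load measured above.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Append one client-state record and make it durable before replying */
static int commit_record(int fd, const char *record)
{
    size_t len = strlen(record);
    if (write(fd, record, len) != (ssize_t)len)
        return -1;
    return fsync(fd);   /* the client gets no reply until this returns */
}

int main(void)
{
    int fd = open("client-0001.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0 || commit_record(fd, "txn 1: state update\n") != 0) {
        perror("commit");
        return EXIT_FAILURE;
    }
    close(fd);
    return 0;
}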
I will not claim that stupidly fast and scalable fsync is the main
reason that somebody should want Tux3, however, the lack of a high
performance fsync was in fact used as a means of spreading FUD about
Tux3, so I had some fun going way beyond the call of duty to answer
that. By the way, I am still waiting for the original source of the
FUD to concede the point politely, but maybe he is waiting for the
code to land, which it still has not as of today, so I guess that is
fair. Note that it would have landed quite some time ago if Tux3 was
already merged.
Historical note: didn't Java motivate the O(1) scheduler?
Regards,
Daniel
Daniel Phillips wrote:
> On 05/12/2015 02:03 AM, Pavel Machek wrote:
>> I'd call a system with 65 tasks doing a heavy fsync load at the same time
>> "embarrassingly misconfigured" :-). It is nice if your filesystem can
>> stay fast in that case, but...
>
> Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
> 10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
> it tells us something about how Tux3 will scale on the big machines
> that XFS currently lays claim to. And Java programmers are busy doing
> all kinds of wild and crazy things with lots of tasks. Java almost
> makes them do it. If they need their data durable then they can easily
> create loads like my test case.
>
> Suppose you have a web server meant to serve 10,000 transactions
> simultaneously and it needs to survive crashes without losing client
> state. How will you do it? You could install an expensive, finicky
> database, or you could write some Java code that happens to work well
> because Linux has a scheduler and a filesystem that can handle it.
> Oh wait, we don't have the second one yet, but maybe we soon will.
>
> I will not claim that stupidly fast and scalable fsync is the main
> reason that somebody should want Tux3, however, the lack of a high
> performance fsync was in fact used as a means of spreading FUD about
> Tux3, so I had some fun going way beyond the call of duty to answer
> that. By the way, I am still waiting for the original source of the
> FUD to concede the point politely, but maybe he is waiting for the
> code to land, which it still has not as of today, so I guess that is
> fair. Note that it would have landed quite some time ago if Tux3 was
> already merged.
Well, stupidly fast and scalable fsync sounds wonderful to me; it's the
primary pain point in LMDB write performance now.
http://symas.com/mdb/ondisk/
I look forward to testing Tux3 when usable code shows up in a public repo.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 12.05.2015 06:36, Daniel Phillips wrote:
> Hi David,
>
> On 05/11/2015 05:12 PM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>
>>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>>> each delta, and just leave nice big gaps between the deltas for future
>>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>>> is not any worse than XFS, which started bad and stayed that way.
>>>> Umm, are you sure? If "some areas of disk are faster than others" is
>>>> still true on today's hard drives, the gaps will decrease the
>>>> performance (as you'll "use up" the fast areas more quickly).
>>> That's why I hedged my claim with "similar or identical". The
>>> difference in media speed seems to be a relatively small effect
>>> compared to extra seeks. It seems that XFS puts big spaces between
>>> new directories, and suffers a lot of extra seeks because of it.
>>> I propose to batch new directories together initially, then change
>>> the allocation goal to a new, relatively empty area if a big batch
>>> of files lands on a directory in a crowded region. The "big" gaps
>>> would be on the order of delta size, so not really very big.
>> This is an interesting idea, but what happens if the files don't arrive as a big batch, but rather
>> trickle in over time (think a logserver that is putting files into a bunch of directories at a
>> fairly modest rate per directory)
> If files are trickling in then we can afford to spend a lot more time
> finding nice places to tuck them in. Log server files are an especially
> irksome problem for a redirect-on-write filesystem because the final
> block tends to be rewritten many times and we must move it to a new
> location each time, so every extent ends up as one block. Oh well. If
> we just make sure to have some free space at the end of the file that
> only that file can use (until everywhere else is full) then the long
> term result will be slightly ravelled blocks that nonetheless tend to
> be on the same track or flash block as their logically contiguous
> neighbours. There will be just zero or one empty data blocks mixed
> into the file tail as we commit the tail block over and over with the
> same allocation goal. Sometimes there will be a block or two of
> metadata as well, which will eventually bake themselves into the
> middle of contiguous data and stop moving around.
>
> Putting this together, we have:
>
> * At delta flush, break out all the log type files
> * Dedicate some block groups to append type files
> * Leave lots of space between files in those block groups
> * Peek at the last block of the file to set the allocation goal
>
> Something like that. What we don't want is to throw those files into
> the middle of a lot of rewrite-all files, messing up both kinds of file.
> We don't care much about keeping these files near the parent directory
> because one big seek per log file in a grep is acceptable, we just need
> to avoid thousands of big seeks within the file, and not dribble single
> blocks all over the disk.
>
> It would also be nice to merge together extents somehow as the final
> block is rewritten. One idea is to retain the final block dirty until
> the next delta, and write it again into a contiguous position, so the
> final block is always flushed twice. We already have the opportunistic
> merge logic, but the redirty behavior and making sure it only happens
> to log files would be a bit fiddly.
>
> We will also play the incremental defragmentation card at some point,
> but first we should try hard to control fragmentation in the first
> place. Tux3 is well suited to online defragmentation because the delta
> commit model makes it easy to move things around efficiently and safely,
> but it does generate extra IO, so as a basic mechanism it is not ideal.
> When we get to piling on features, that will be high on the list,
> because it is relatively easy, and having that fallback gives a certain
> sense of security.
So we are again at some more features of SASOS4Fun.
That said, as an alleged troll expert I can see the agenda and strategy
behind this and related threads, but there is still no usable code or file
system at all, and hence nothing that even might be ready for merging, as I
understand the statements of the file system gurus.
So it is time for the developer(s) to decide what should eventually be
implemented and manifested in code, and then to show the complete result,
so that others can run the tests and the benchmarks.
Thanks
Best Regards
Do not feed the trolls.
C.S.
>> And when you then decide that you have to move the directory/file info, doesn't that create a
>> potentially large amount of unexpected IO that could end up interfering with what the user is trying
>> to do?
> Right, we don't like that and don't plan to rely on it. What we hope
> for is behavior that, when you slowly stir the pot, tends to improve the
> layout just as often as it degrades it. It may indeed become harder to
> find ideal places to put things as time goes by, but we also gain more
> information to base decisions on.
>
> Regards,
>
> Daniel
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Tux3 Report: How fast can we fail?
Tux3 now has a preliminary out of space handling algorithm. This might
sound like a small thing, but in fact handling out of space reliably
and efficiently is really hard, especially for Tux3. We developed an
original solution with unusually low overhead in the common case, and
simple enough to prove correct. Reliability seems good so far. But not
to keep anyone in suspense: Tux3 does not fail very fast, but it fails
very reliably. We like to think that Tux3 is better at succeeding than
failing.
We identified the following quality metrics for this algorithm:
1) Never fails to detect out of space in the front end.
2) Always fills a volume to 100% before reporting out of space.
3) Allows rm, rmdir and truncate even when a volume is full.
4) Writing to a nearly full volume is not excessively slow.
5) Overhead is insignificant when a volume is far from full.
Like every filesystem that does delayed allocation, Tux3 must guess how
much media space will be needed to commit any update it accepts into
cache. It must not guess low or the commit may fail and lose data. This
is especially tricky for Tux3 because it does not track individual
updates, but instead, partitions updates atomically into delta groups
and commits each delta as an atomic unit. A single delta can be as
large as writable cache, including thousands of individual updates.
This delta scheme ensures perfect namespace, metadata and data
consistency without complex tracking of relationships between thousands
of cache objects, and also does delayed allocation about as well as it
can be done. Given these benefits, it is not too hard to accept some
extra pain in out of space accounting.
Speaking of accounting, we borrow some of that terminology to talk
about the problem. Each delta has a "budget" and computes a "balance"
that declines each time a transaction "cost" is "charged" against it.
The budget is all of free space, plus some space that belongs to
the current disk image that we know will be released soon, minus a
reserve for taking care of certain backend duties. When the balance
goes negative, the transaction backs out its cost, triggers a delta
transition, and tries again. This has the effect of shrinking the delta
size as a volume approaches full. When the delta budget finally shrinks
to less than the transaction cost, the update fails with ENOSPC.
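To make that concrete, here is a tiny userspace sketch of the frontend
side, using C11 atomics. All of the names (delta_account, charge_cost,
request_delta_transition) and the numbers are invented for illustration;
this is not the Tux3 code, just the shape of the accounting described
above.

/* Sketch only: charge a worst case cost against the front delta. */
#include <stdatomic.h>
#include <stdio.h>
#include <errno.h>

struct delta_account {
	atomic_long balance;	/* budget minus all costs charged so far */
	long budget;		/* assigned by the backend at delta transition */
};

/* Placeholder: the real code would kick the backend to close the delta. */
static void request_delta_transition(struct delta_account *delta)
{
	(void)delta;
}

/* Returns 0 on success, -EAGAIN to retry after transition, -ENOSPC on failure. */
static int charge_cost(struct delta_account *delta, long cost)
{
	long before = atomic_fetch_sub(&delta->balance, cost);
	if (before - cost >= 0)
		return 0;				/* charge accepted */
	atomic_fetch_add(&delta->balance, cost);	/* back out our charge */
	if (cost > delta->budget)
		return -ENOSPC;				/* even a fresh delta cannot cover it */
	request_delta_transition(delta);		/* shrink the delta, caller retries */
	return -EAGAIN;
}

int main(void)
{
	struct delta_account delta = { .balance = 10, .budget = 10 };
	printf("charge 4: %d\n", charge_cost(&delta, 4));	/* accepted */
	printf("charge 8: %d\n", charge_cost(&delta, 8));	/* backs out, wants a transition */
	printf("charge 20: %d\n", charge_cost(&delta, 20));	/* bigger than the budget: ENOSPC */
	return 0;
}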
This is where the "how fast can we fail" question comes up. If our guess
at cost is way higher than actual blocks consumed, deltas take a long
time to shrink. Overestimating transaction cost by a factor of ten
can trigger over a hundred deltas before failing. Fortunately, deltas
are pretty fast, so we only keep the user waiting for a second or so
before delivering the bad news. We also slow down measurably, but not
horribly, when getting close to full. Ext4 by contrast flies along at
full speed right until it fills the volume, and stops on a dime
exactly at 100% full. I don't think that Tux3 will ever be as good at
failing as that, but we will try to get close.
Before I get into how Tux3's out of space behavior stacks up against
other filesystems, there are some interesting details to touch on about
how we go about things.
Tux3's front/back arrangement is lockless, which is great for
performance but turns into a problem when front and back need to
cooperate about something like free space accounting. If we were willing
to add a spinlock between front and back this would be easy, but we
don't want to do that. Not only are we jealously protective of our lockless
design, but if our fast path suddenly became slower because of adding
essential functionality we might need to revise some posted benchmark
results. Better that we should do it right and get our accounting
almost for free.
The world of lockless algorithms is an arcane one indeed, just ask Paul
McKenney about that. The solution we came up with needs just two atomic
adds per transaction, and we will eventually turn one of those into a
per-cpu counter. As mentioned above, a frontend transaction backs out
its cost when the delta balance goes negative, so from the backend's
point of view, the balance is going up and down unpredictably all the
time. Delta transition can happen at any time, and somehow, the backend
must assign the new front delta its budget exactly at transition.
Meanwhile, the front delta balance is still going up and down
unpredictably. See the problem? The issue is, delta transition is truly
asynchronous. We can't change that short of adding locks with the
contention and stalls that go along with them.
Fortunately, one consequence of delta transition is that the total cost
charged to the delta instantly becomes stable when the front delta
becomes the back delta. Volume free space is also stable because only
the backend accesses it. The backend can easily measure the actual
space consumed by the back delta: it is the difference between free
space before and after flushing to media. Updating the front delta
budget is easy because only the backend changes it, but updating the
front delta balance is much harder because the front delta is busy
changing it. If we get this wrong, the resulting slight discrepancies
between budget, balance and charged costs would mean that somebody,
somewhere will hit out of space in the middle of a commit and end up
sticking pins into a voodoo doll that looks like us.
A solution was found that only took a few lines of code and some pencil
pushing. The backend knows what the front delta balance must have been
exactly at transition, because it knows the amount charged to the back
delta, and it knows the original budget. It can therefore deduce how
much the front balance should have increased exactly at transition (it
must always increase) so it adds that difference atomically to the
front delta budget. This has exactly the same effect as setting the
balance atomically at transition time, if that were possible, which it
is not. This technique is airtight, and the whole algorithm ends up
costing less than a hundred nanoseconds per transaction.[1] This is a
good thing because each page of a Tux3 write is a separate transaction,
so any significant overhead would stick out like a sore thumb.
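Here is a sketch of that backend fixup, again with invented names and
plain C11 atomics standing in for the real thing:

/* Sketch only: at delta transition the old front delta becomes the back
 * delta, so its total charged cost is stable, and the backend can work
 * out what the shared balance counter read at the instant of transition:
 *
 *	balance_at_transition = old_budget - charged_to_back
 *
 * Adding (new_budget - balance_at_transition) to the live counter then
 * has the same effect as setting it to new_budget exactly at transition,
 * even though the frontend kept charging against it in the meantime. */
#include <stdatomic.h>
#include <stdio.h>

static void rebudget_at_transition(atomic_long *balance, long old_budget,
				   long charged_to_back, long new_budget)
{
	long at_transition = old_budget - charged_to_back;
	atomic_fetch_add(balance, new_budget - at_transition);
}

int main(void)
{
	/* old budget 100, 30 charged before transition, 5 charged since */
	atomic_long balance = 100 - 30 - 5;
	rebudget_at_transition(&balance, 100, 30, 80);
	printf("balance now %ld\n", atomic_load(&balance));	/* 80 - 5 = 75 */
	return 0;
}

Roughly speaking, the adjustment can never go negative, because the worst
case cost charged to the back delta is at least the space it actually
consumed, and that consumption is what reduces the new budget relative to
the old one.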
Accounting cost estimates properly and stopping when actually out of
space is just the core of the algorithm. We must feed that core with
good, provable worst case cost estimates. To get an initial idea of
whether the algorithm works, we just plugged in some magic numbers, and
lo and behold, suddenly we were not running out of space in the
backend any more. But to do the job properly we need to consider things
like the file index btree depth, because just plugging in a number large
enough to handle the deepest possible btree would slow down our failure
path way too much.
The best way to account for btree depth is to make it disappear entirely
by removing the btree updates from the delta commit path. We already do
that for bitmaps, which is a good thing because our bitmaps are just
blocks that live in a normal file. Multiplying our worst case by the
maximum number of bitmaps that could possibly be affected, and then
multiplying that by the worst case change to the bitmap metadata,
including its data btree, its inode, and the inode table btree, would be
a real horror. Instead, we log all changes that affect the bitmap and
only update the bitmaps periodically at unify cycles. A Tux3 filesystem
is consistent whether or not we unify, so if space becomes critically
tight the backend can just disable the unify. The only bad effect is
that the log chain can grow and make replay take longer, but that growth
is limited by the fact that there is not much space left for more log
blocks.
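To illustrate the shape of that, but not the actual Tux3 log format
(every name and number below is invented), a delta can append logical
allocation records to a log and roll them into the bitmap only at unify
time:

/* Sketch only: per-delta logical logging of bitmap changes. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { logmax = 256 };

struct alloc_rec { uint64_t start; uint32_t count; bool set; };

struct logbuf {
	struct alloc_rec rec[logmax];
	unsigned used;
};

/* Per delta: append a logical record instead of dirtying bitmap blocks,
 * the bitmap inode, and the btrees above them. */
static bool log_alloc(struct logbuf *log, uint64_t start, uint32_t count, bool set)
{
	if (log->used == logmax)
		return false;		/* log full: time to unify */
	log->rec[log->used++] = (struct alloc_rec){ start, count, set };
	return true;
}

/* At unify (skipped when space is critically tight): roll the logged
 * changes into the real bitmap so the log chain can be discarded. */
static void unify_bitmaps(struct logbuf *log, uint8_t *bitmap)
{
	for (unsigned i = 0; i < log->used; i++)
		for (uint32_t j = 0; j < log->rec[i].count; j++) {
			uint64_t bit = log->rec[i].start + j;
			if (log->rec[i].set)
				bitmap[bit >> 3] |= 1 << (bit & 7);
			else
				bitmap[bit >> 3] &= ~(1 << (bit & 7));
		}
	log->used = 0;
}

int main(void)
{
	static uint8_t bitmap[1 << 14];		/* one bit per block */
	struct logbuf log = { .used = 0 };

	log_alloc(&log, 1000, 16, true);	/* allocate a 16 block extent */
	log_alloc(&log, 1004, 4, false);	/* free part of it again */
	unify_bitmaps(&log, bitmap);		/* only now does the bitmap change */
	printf("bitmap byte 125: %02x\n", (unsigned)bitmap[125]);
	return 0;
}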
If we did not have this nice way of making bitmap overhead disappear,
we would not be anywhere close to a respectable implementation today.
Actually, we weren't even thinking about out of space accounting when
we developed this design element; we were just trying to get rid of
the overhead of updating bitmaps per delta. That worked well and is a
significant part of the reason why we can outrun Ext4 while having a
very similar structure. The benefit for space accounting dropped out
just by dumb luck.
The same technique we use for hiding bitmap update cost works just as
well for btree metadata. Eventually, we will move btree leaf redirecting
from the delta flush to the unify flush. That will speed it up by
coalescing some index block writes and also make it vanish from the
transaction cost estimate, saving frontend CPU and speeding up the
failure path. What's not to like? It is on the list of things to do.
Today, I refactored the budgeting algorithm to skip the cost estimate
if a page is already dirty, which tightened up the estimate by a factor
of four or so and made things run smoother. There will be more
incremental improvements as time goes by. For example, we currently
overestimate the cost of a rewrite because we would need to go poking
around in btrees to do that more accurately. Fixing that will be quite
a bit of work, but somebody will probably do it, sometime.
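A toy version of that rule, with invented names and a plain counter
standing in for the atomic balance, just to show why the estimate
tightened: only the clean-to-dirty edge of a page costs anything.

/* Sketch only: a page already dirty in this delta was budgeted when it
 * first went dirty, so rewriting it costs nothing extra. */
#include <stdbool.h>
#include <stdio.h>

enum { worst_case_page_cost = 8 };	/* invented per-page block estimate */

struct page_state { bool delta_dirty; };

static long balance = 100;		/* stands in for the atomic delta balance */

static int charge_page_write(struct page_state *page)
{
	if (page->delta_dirty)
		return 0;			/* already budgeted in this delta */
	if (balance < worst_case_page_cost)
		return -1;			/* would trigger transition or ENOSPC */
	balance -= worst_case_page_cost;
	page->delta_dirty = true;
	return 0;
}

int main(void)
{
	struct page_state page = { false };
	charge_page_write(&page);		/* first write: charged */
	charge_page_write(&page);		/* rewrite: free */
	printf("balance %ld\n", balance);	/* 92, not 84 */
	return 0;
}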
Now the fun part: performance and bugs. Being anxious to know where
Tux3 stands with respect to the usual suspects, I ran some tests and
found that Ext4 is amazingly good at this, while XFS and Btrfs have
some serious issues. Details below.
Tux3 slows down when approaching a full state, hopefully not too much.
To quantify that, here is what happens with a 200 MB dd to a loopback
mounted file on tmpfs:
                  Volume size   Run time
   No check at all:   1500 MB   0.306s
   Far from full:     1500 MB   0.318s
   Getting full:        30 MB   0.386s
   Just over full:      20 MB   0.624s
The slowdown used to be a lot more before I improved the cost estimate
for write earlier today. Here is how we compare to other filesystems:
            Far from full   Just over full
   tux3:       0.303s           0.468s
   ext4:       0.399s           0.400s
   xfs:        0.293s           0.326s
   btrfs:      0.499s           0.531s
   (20 MB dd to ramdisk)
XFS ekes out a narrow win on straight up dd to the ramdisk, good job.
The gap widens when hitting the failure path, but again, not as much as
it did earlier today.
I do most of these no space tests on a ramdisk (actually, a loopback
mount on tmpfs) because it is easy to fill up. To show that the ramdisk
results are not wildly different from a real disk, here we see that the
pattern is largely unchanged:
20 MB dd to a real disk
tux3: 1.568s
ext4: 1.523s
xfs: 1.466s
btrfs: 2.347s
XFS holds its dd lead on a real hard disk. We definitely need to learn
its trick.
Next we look at something with a bit more meat: unzipping the Git
source to multiple directories. Timings on ramdisk are the interesting
ones, because the volume approaches full on the longer test.
           10x to ram   40x to ram   10x to hdd   100x to hdd
   tux3:     2.251s       8.344s       2.671s       21.686s
   ext4:     2.343s       7.923s       3.165s       32.080s
   xfs:      2.682s      10.562s      11.870s      123.435s
   btrfs:    3.744s      15.825s       3.749s       72.405s
Tux3 is the fastest when not close to full, but Ext4 takes a slight
lead when close to full. Yesterday, that lead was much wider, and one
day we would be pleased to tie Ext4, which is really exemplary at this.
The hard disk times are there because they happened to be easy to get,
and it is interesting to see how much XFS and Btrfs are suffering on
traditional rust, XFS being the worst by far at 5.7 times slower than
Tux3.
The next one is a crash test: repeatedly untar a tarball until it
smashes into the wall, and see how long it takes to quit with an error.
Tar is nice for this because its failure handling is so awful: instead
of exiting on the first ENOSPC, it keeps banging at the full disk
until it has failed on each and every file in its archive. First I dd
the volume until just before full, then throw a tarball at it.
Time to fail when tar hits disk full:
tux3: 0.563s
ext4: 0.084s
xfs: 0.116s
btrfs: 0.460s
We respectfully concede that Ext4 is the king of fail and Tux3 is the
worst. However, we only need to be good enough on this one, with less
than a second being a very crude definition of good enough.
The next one is something I ran into when I was testing out of space
detection with rewrites. This uses the "blurt" program at the end of
this post to do 40K writes from 1000 tasks in parallel, 1K at a time,
using the bash one liner:
   for ((j=1;j<10;j++)); do \
      for ((i=1;i<10;i++)); do \
         echo step $j:$i 1>&2 && mkdir -p fs/$i && \
         ~/blurt fs/$i/f 40 1000 || break 2; \
      done; \
   done
Tux3: 4.136s (28% full when done)
Ext4: 5.780s (31% full when done)
XFS: 79.063s (!)
Btrfs: 15.489s (fails with out of space when 30% full)
Blurt is a minor revision of my fsync load generator without the fsync,
and with an error exit on disk full. The intent of the outer loop is
to do rewrites with a thousand tasks in parallel, and see if out of
space accounting is accurate. XFS and Btrfs both embarrassed themselves
horribly. XFS falls off a performance cliff that makes it 19 times
slower than Tux3, and Btrfs hits ENOSPC when only 30% full according to
df, or 47% full if you prefer to believe its own df command:
Data, single: total=728.00MiB, used=342.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=65.00MiB, used=5.97MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
It seems that Btrfs still has not put its epic ENOSPC nightmare behind
it. I fervently hope that such a fate does not await Tux3, which hope
would appear to be well on its way to being borne out.
XFS should not do such bizarre things after 23 years of development,
while being billed as a mature, enterprise grade filesystem. It simply
is not there yet. Ext4 is exemplary in terms of reliability, and Tux3
has been really good through this round of torture tests, though
I will not claim that it is properly hardened just yet. I know it isn't.
We don't have any open bugs, but that is probably because we only have
two users. But Tux3 is remarkably solid for the number of man years
that have gone into it. Maybe Tux3 really will be ready for the
enterprise before XFS is.
In all of these tests, Tux3, Ext4 and XFS managed to fill up their
volumes to exactly 100%. Tux3 actually has a 100 block emergency
reserve that it never fills, and wastes a few more blocks if the last
transaction does not exactly use up its budget, but apparently that
still falls within the df utility's definition of 100%. Btrfs never gets
this right: full for it tends to range from 96% to 98%, and sometimes is
much lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.
One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.
Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.
Regards,
Daniel
[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
/*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt <basename> <steps> <tasks>
 */

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

enum { chunk = 1024, sync = 0 };	/* sync = 1 turns this back into the fsync load */

char text[chunk] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;

	/* fork one writer per file, named <basename>0 .. <basename>N-1 */
	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}

	/* parent: reap the writers and report whether any of them failed */
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;

child:
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (sync)
			fsync(fd);
	}
	return 0;

fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}
On Mon, 11 May 2015, Daniel Phillips wrote:
> On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
>> I think Ted and I are on the same page here. "Competitive
>> benchmarks" only matter to the people who are trying to sell
>> something. You're trying to sell Tux3, but....
>
> By "same page", do you mean "transparently obvious about
> obstructing other projects"?
>
>> The "except page forking design" statement is your biggest hurdle
>> for getting tux3 merged, not performance.
>
> No, the "except page forking design" is because the design is
> already good and effective. The small adjustments needed in core
> are well worth merging because the benefits are proved by benchmarks.
> So benchmarks are key and will not stop just because you don't like
> the attention they bring to XFS issues.
>
>> Without page forking, tux3
>> cannot be merged at all. But it's not filesystem developers you need
>> to convince about the merits of the page forking design and
>> implementation - it's the mm and core kernel developers that need to
>> review and accept that code *before* we can consider merging tux3.
>
> Please do not say "we" when you know that I am just as much a "we"
> as you are. Merging Tux3 is not your decision. The people whose
> decision it actually is are perfectly capable of recognizing your
> agenda for what it is.
>
> http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
> "XFS Developer Takes Shots At Btrfs, EXT4"
umm, Phoronix has no input on what gets merged into the kernel. they also have a
reputation for trying to turn anything into click-bait by making it sound like a
fight when it isn't.
> The real question is, has the Linux development process become
> so political and toxic that worthwhile projects fail to benefit
> from supposed grassroots community support. You are the poster
> child for that.
The linux development process is making code available, responding to concerns
from the experts in the community, and letting the code talk for itself.
There have been many people pushing code for inclusion that has not gotten into
the kernel, or has not been used by any distros after it's made it into the
kernel, in spite of benchmarks being posted that seem to show how wonderful the
new code is. ReiserFS was one of the first, and part of what tarnished its
reputation with many people was how much they were pushing the benchmarks that
were shown to be faulty (the one I remember most vividly was that the entire
benchmark completed in <30 seconds, and they had the FS tuned to not start
flushing data to disk for 30 seconds, so the entire 'benchmark' ran out of RAM
without ever touching the disk)
So when Ted and Dave point out problems with the benchmark (the difference in
behavior between a single spinning disk, different partitions on the same disk,
SSDs, and ramdisks), you would be better off acknowledging them and if you can't
adjust and re-run the benchmarks, don't start attacking them as a result.
As Dave says above, it's not the other filesystem people you have to convince,
it's the core VFS and Memory Management folks you have to convince. You may need
a little benchmarking to show that there is a real advantage to be gained, but
the real discussion is going to be on the impact that page forking is going to
have on everything else (both in complexity and in performance impact to other
things)
>> IOWs, you need to focus on the important things needed to achieve
>> your stated goal of getting tux3 merged. New filesystems should be
>> faster than those based on 20-25 year old designs, so you don't need
>> to waste time trying to convince people that tux3, when complete,
>> will be fast.
>
> You know that Tux3 is already fast. Not just that of course. It
> has a higher standard of data integrity than your metadata-only
> journalling filesystem and a small enough code base that it can
> be reasonably expected to reach the quality expected of an
> enterprise class filesystem, quite possibly before XFS gets
> there.
We wouldn't expect anyone developing a new filesystem to believe any
differently. If they didn't believe this, why would they be working on the
filesystem instead of just using an existing filesystem.
The ugly reality is that everyone's early versions of their new filesystem look
really good. The problem is when they extend it to cover the corner cases and
when it gets stressed by real-world (as opposed to benchmark) workloads. This
isn't saying that you are wrong in your belief, just that you may not be right,
and nobody will know until you are to a usable state and other people can start
beating on it.
David Lang
On 05/12/2015 11:39 AM, David Lang wrote:
> On Mon, 11 May 2015, Daniel Phillips wrote:
>>> ...it's the mm and core kernel developers that need to
>>> review and accept that code *before* we can consider merging tux3.
>>
>> Please do not say "we" when you know that I am just as much a "we"
>> as you are. Merging Tux3 is not your decision. The people whose
>> decision it actually is are perfectly capable of recognizing your
>> agenda for what it is.
>>
>> http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>> "XFS Developer Takes Shots At Btrfs, EXT4"
>
> umm, Phoronix has no input on what gets merged into the kernel. they also have a reputation for
> trying to turn anything into click-bait by making it sound like a fight when it isn't.
Perhaps you misunderstood. Linus decides what gets merged. Andrew
decides. Greg decides. Dave Chinner does not decide, he just does
his level best to create the impression that our project is unfit
to merge. Any chance there might be an agenda?
Phoronix published a headline that identifies Dave Chinner as
someone who takes shots at other projects. Seems pretty much on
the money to me, and it ought to be obvious why he does it.
>> The real question is, has the Linux development process become
>> so political and toxic that worthwhile projects fail to benefit
>> from supposed grassroots community support. You are the poster
>> child for that.
>
> The linux development process is making code available, responding to concerns from the experts in
> the community, and letting the code talk for itself.
Nice idea, but it isn't working. Did you let the code talk to you?
Right, you let the code talk to Dave Chinner, then you listen to
what Dave Chinner has to say about it. Any chance that there might
be some creative licence acting somewhere in that chain?
> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
> 'benchmark' ran out of ram without ever touching the disk)
You know what to do about checking for faulty benchmarks.
> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
> attacking them as a result.
Ted and Dave failed to point out any actual problem with any
benchmark. They invented issues with benchmarks and promoted those
as FUD.
> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
> is a real advantage to be gained, but the real discussion is going to be on the impact that page
> forking is going to have on everything else (both in complexity and in performance impact to other
> things)
Yet he clearly wrote "we" as if he believes he is part of it.
Now that ENOSPC is done to a standard way beyond what Btrfs had
when it was merged, the next item on the agenda is writeback. That
involves us and VFS people as you say, and not Dave Chinner, who
only intends to obstruct the process as much as he possibly can. He
should get back to work on his own project. Nobody will miss his
posts if he doesn't make them. They contribute nothing of value,
create a lot of bad blood, and just serve to further besmirch the
famously tarnished reputation of LKML.
>> You know that Tux3 is already fast. Not just that of course. It
>> has a higher standard of data integrity than your metadata-only
>> journalling filesystem and a small enough code base that it can
>> be reasonably expected to reach the quality expected of an
>> enterprise class filesystem, quite possibly before XFS gets
>> there.
>
> We wouldn't expect anyone developing a new filesystem to believe any differently.
It is not a matter of belief, it is a matter of testable fact. For
example, you can count the lines. You can run the same benchmarks.
Proving the data consistency claims would be a little harder, you
need tools for that, and some of those aren't built yet. Or, if you
have technical ability, you can read the code and the copious design
material that has been posted and convince yourself that, yes, there
is something cool here, why didn't anybody do it that way before?
But of course that starts to sound like work. Debating nontechnical
issues and playing politics seems so much more like fun.
> If they didn't
> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
Right, and it is my job to convince you that what I believe for
perfectly valid, demonstrable technical reasons, is really true. I do
not see why you feel it is your job to convince me that the obviously
broken Linux community process is not in fact broken, and that a
certain person who obviously has an agenda, is not actually obstructing.
> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
> may not be right, and nobody will know until you are to a usable state and other people can start
> beating on it.
With ENOSPC we are at that state. Tux3 would get more testing and advance
faster if it was merged. Things like ifdefs, grandiose new schemes for
writeback infrastructure, dumb little hooks in the mkwrite path, those
are all just manufactured red herrings. Somebody wanted those to be
issues, so now they are issues. Fake ones.
Nobody is trying to trick you. Just stating a fact. You ought to be able
to figure out by now that Tux3 is worth merging.
You might possibly have an argument that merging a filesystem that
crashes as soon as it fills the disk is just sheer stupidity that can
only lead to embarrassment in the long run, but then you would need to
explain why Btrfs was merged. As I recall, it went something like, Chris
had it on a laptop, so it must be a filesystem, and wow look at that
feature list. Then it got merged in a completely unusable state and got
worked on. If it had not been merged, Btrfs would most likely be dead
right now. After all, who cares about an out of tree filesystem?
By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
2009, with Tux3 running as my root filesystem. By the standard applied
to Btrfs, Tux3 should have been merged then, right? After all, our
nospace handling worked just as well as theirs at that time.
Regards,
Daniel
On Tue, 12 May 2015, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>>> ...it's the mm and core kernel developers that need to
>>>> review and accept that code *before* we can consider merging tux3.
>>>
>>> Please do not say "we" when you know that I am just as much a "we"
>>> as you are. Merging Tux3 is not your decision. The people whose
>>> decision it actually is are perfectly capable of recognizing your
>>> agenda for what it is.
>>>
>>> http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>> "XFS Developer Takes Shots At Btrfs, EXT4"
>>
>> umm, Phoronix has no input on what gets merged into the kernel. they also hae a reputation for
>> trying to turn anything into click-bait by making it sound like a fight when it isn't.
>
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
>
> Phoronix published a headline that identifies Dave Chinner as
> someone who takes shots at other projects. Seems pretty much on
> the money to me, and it ought to be obvious why he does it.
Phoronix turns any correction or criticism into an attack.
You need to get out of the mindset that Ted and Dave are Enemies that you need
to overcome, they are friendly competitors, not Enemies. They assume that you
are working in good faith (but are inexperienced compared to them), and you need
to assume that they are working in good faith. If they ever do resort to
underhanded means to sabotage you, Linus and the other kernel developers will
take action. But pointing out limits in your current implementation, problems in
your benchmarks based on how they are run, and concepts that are going to be
difficult to merge is not underhanded, it's exactly the type of assistance that
you should be grateful for in friendly competition.
You were the one who started crowing about how badly XFS performed. Dave gave a
long and detailed explanation about the reasons for the differences, and showed
benchmarks on other hardware where XFS works very well. That's
not an attack on EXT4 (or Tux3), it's an explanation.
>>> The real question is, has the Linux development process become
>>> so political and toxic that worthwhile projects fail to benefit
>>> from supposed grassroots community support. You are the poster
>>> child for that.
>>
>> The linux development process is making code available, responding to concerns from the experts in
>> the community, and letting the code talk for itself.
>
> Nice idea, but it isn't working. Did you let the code talk to you?
> Right, you let the code talk to Dave Chinner, then you listen to
> what Dave Chinner has to say about it. Any chance that there might
> be some creative licence acting somewhere in that chain?
I have my own concerns about how things are going to work (I've voiced some of
them), but no, I haven't tried running Tux3 because you say it's not ready yet.
>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>> 'benchmark' ran out of ram without ever touching the disk)
>
> You know what to do about checking for faulty benchmarks.
That requires that the code be readily available, which last I heard, Tux3
wasn't. Has this been fixed?
>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>> attacking them as a result.
>
> Ted and Dave failed to point out any actual problem with any
> benchmark. They invented issues with benchmarks and promoted those
> as FUD.
They pointed out problems with using ramdisk to simulate an SSD and huge
differences between spinning rust and an SSD (or disk array). Those aren't FUD.
>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>> forking is going to have on everything else (both in complexity and in performance impact to other
>> things)
>
> Yet he clearly wrote "we" as if he believes he is part of it.
He is part of the group of people who use and work with this stuff, so he is
part of it.
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
> should get back to work on his own project. Nobody will miss his
> posts if he doesn't make them. They contribute nothing of value,
> create a lot of bad blood, and just serve to further besmirch the
> famously tarnished reputation of LKML.
BTRFS is a perfect example of how not to introduce a new filesystem. Lots of
hype, and the presumption that it is going to replace all the existing
filesystems because it's so much better (especially according to benchmarks).
But then progress stalled before it was really ready, and it's still something
most people avoid.
>>> You know that Tux3 is already fast. Not just that of course. It
>>> has a higher standard of data integrity than your metadata-only
>>> journalling filesystem and a small enough code base that it can
>>> be reasonably expected to reach the quality expected of an
>>> enterprise class filesystem, quite possibly before XFS gets
>>> there.
>>
>> We wouldn't expect anyone developing a new filesystem to believe any differently.
>
> It is not a matter of belief, it is a matter of testable fact. For
> example, you can count the lines. You can run the same benchmarks.
>
> Proving the data consistency claims would be a little harder, you
> need tools for that, and some of those aren't built yet. Or, if you
> have technical ability, you can read the code and the copious design
> material that has been posted and convince yourself that, yes, there
> is something cool here, why didn't anybody do it that way before?
> But of course that starts to sound like work. Debating nontechnical
> issues and playing politics seems so much more like fun.
Why are you picking a fight? There was no attack in my statement.
>> If they didn't
>> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
>
> Right, and it is my job to convince you that what I believe for
> perfectly valid, demonstrable technical reasons, is really true. I do
> not see why you feel it is your job to convince me that the obviously
> broken Linux community process is not in fact broken, and that a
> certain person who obviously has an agenda, is not actually obstructing.
You will need to have a fully working, usable system before you can convince
people that you are right. A partial system may look good, but how much is
fixing the corner cases that you haven't gotten to yet going to hurt it? That
there are going to be such cases is pretty much a given, and that changing
things to add code to work around the pathological conditions is going to hurt
the common case is pretty close to a given (it's one of those things that isn't
mathematically guaranteed, but happens on 99.99999+% of projects).
>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>> may not be right, and nobody will know until you are to a usable state and other people can start
>> beating on it.
>
> With ENOSPC we are at that state. Tux3 would get more testing and advance
> faster if it was merged. Things like ifdefs, grandiose new schemes for
> writeback infrastructure, dumb little hooks in the mkwrite path, those
> are all just manufactured red herrings. Somebody wanted those to be
> issues, so now they are issues. Fake ones.
Ok, so you are happy with your allocation strategy? You didn't seem to be a few
e-mails ago.
But if you think it's ready for users, then start working to submit it in the
next merge window. Dave said that except for one part, there was no reason not
to merge it. That's pretty good. So you need to be discussing that one part with
the folks that Dave pointed you at.
> Nobody is trying to trick you. Just stating a fact. You ought to be able
> to figure out by now that Tux3 is worth merging.
>
> You might possibly have an argument that merging a filesystem that
> crashes as soon as it fills the disk is just sheer stupidity than can
> only lead to embarrassment in the long run, but then you would need to
> explain why Btrfs was merged. As I recall, it went something like, Chris
> had it on a laptop, so it must be a filesystem, and wow look at that
> feature list. Then it got merged in a completely unusable state and got
> worked on. If it had not been merged, Btrfs would most likely be dead
> right now. After all, who cares about an out of tree filesystem?
As I said above, Btrfs is a perfect example of how not to do things.
The other thing you need to realize is that getting something into the kernel
isn't a one-time effort; the code needs to be maintained over time (especially
for a filesystem), and it's very possible for a developer/team/company to be so
toxic and hostile to others that the Linux folks don't want to deal with the
hassle of dealing with them. You are starting out on a path to put yourself into
that category. Calm down and stop taking offense at everything. Your succeeding
doesn't require that other people lose, so stop talking as if it's a zero-sum
game and you have to beat down the enemy to get your code accepted.
David Lang
> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
> 2009, with Tux3 running as my root filesystem. By the standard applied
> to Btrfs, Tux3 should have been merged then, right? After all, our
> nospace handling worked just as well as theirs at that time.
>
> Regards,
>
> Daniel
>
On 12.05.2015 22:54, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
>> On Mon, 11 May 2015, Daniel Phillips wrote:
>>>> ...it's the mm and core kernel developers that need to
>>>> review and accept that code *before* we can consider merging tux3.
>>> Please do not say "we" when you know that I am just as much a "we"
>>> as you are. Merging Tux3 is not your decision. The people whose
>>> decision it actually is are perfectly capable of recognizing your
>>> agenda for what it is.
>>>
>>> http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
>>> "XFS Developer Takes Shots At Btrfs, EXT4"
>> umm, Phoronix has no input on what gets merged into the kernel. they also hae a reputation for
>> trying to turn anything into click-bait by making it sound like a fight when it isn't.
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
>
> Phoronix published a headline that identifies Dave Chinner as
> someone who takes shots at other projects. Seems pretty much on
> the money to me, and it ought to be obvious why he does it.
Maybe Dave has convincing arguments that have been misinterpreted by
that website, which is an interesting but also highly manipulative
publication.
>>> The real question is, has the Linux development process become
>>> so political and toxic that worthwhile projects fail to benefit
>>> from supposed grassroots community support. You are the poster
>>> child for that.
>> The linux development process is making code available, responding to concerns from the experts in
>> the community, and letting the code talk for itself.
> Nice idea, but it isn't working. Did you let the code talk to you?
> Right, you let the code talk to Dave Chinner, then you listen to
> what Dave Chinner has to say about it. Any chance that there might
> be some creative licence acting somewhere in that chain?
We are still missing the complete, usable thing.
>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>> 'benchmark' ran out of ram without ever touching the disk)
> You know what to do about checking for faulty benchmarks.
>
>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>> attacking them as a result.
> Ted and Dave failed to point out any actual problem with any
> benchmark. They invented issues with benchmarks and promoted those
> as FUD.
In general, benchmarks are a critical issue. In that regard, let me
adapt a line attributed to Churchill:
Do not trust a benchmark that you have not forged yourself.
>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>> forking is going to have on everything else (both in complexity and in performance impact to other
>> things)
> Yet he clearly wrote "we" as if he believes he is part of it.
>
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
> should get back to work on his own project. Nobody will miss his
> posts if he doesn't make them. They contribute nothing of value,
> create a lot of bad blood, and just serve to further besmirch the
> famously tarnished reputation of LKML.
I, at least, would miss his contributions, specifically his technical
explanations but also his opinions.
>>> You know that Tux3 is already fast. Not just that of course. It
>>> has a higher standard of data integrity than your metadata-only
>>> journalling filesystem and a small enough code base that it can
>>> be reasonably expected to reach the quality expected of an
>>> enterprise class filesystem, quite possibly before XFS gets
>>> there.
>> We wouldn't expect anyone developing a new filesystem to believe any differently.
> It is not a matter of belief, it is a matter of testable fact. For
> example, you can count the lines. You can run the same benchmarks.
>
> Proving the data consistency claims would be a little harder, you
> need tools for that, and some of those aren't built yet. Or, if you
> have technical ability, you can read the code and the copious design
> material that has been posted and convince yourself that, yes, there
> is something cool here, why didn't anybody do it that way before?
> But of course that starts to sound like work. Debating nontechnical
> issues and playing politics seems so much more like fun.
>
>> If they didn't
>> believe this, why would they be working on the filesystem instead of just using an existing filesystem.
> Right, and it is my job to convince you that what I believe for
> perfectly valid, demonstrable technical reasons, is really true. I do
> not see why you feel it is your job to convince me that the obviously
> broken Linux community process is not in fact broken, and that a
> certain person who obviously has an agenda, is not actually obstructing.
>
>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>> may not be right, and nobody will know until you are to a usable state and other people can start
>> beating on it.
> With ENOSPC we are at that state. Tux3 would get more testing and advance
> faster if it was merged. Things like ifdefs, grandiose new schemes for
> writeback infrastructure, dumb little hooks in the mkwrite path, those
> are all just manufactured red herrings. Somebody wanted those to be
> issues, so now they are issues. Fake ones.
>
> Nobody is trying to trick you. Just stating a fact. You ought to be able
> to figure out by now that Tux3 is worth merging.
>
> You might possibly have an argument that merging a filesystem that
> crashes as soon as it fills the disk is just sheer stupidity than can
> only lead to embarrassment in the long run, but then you would need to
> explain why Btrfs was merged. As I recall, it went something like, Chris
> had it on a laptop, so it must be a filesystem, and wow look at that
> feature list. Then it got merged in a completely unusable state and got
> worked on. If it had not been merged, Btrfs would most likely be dead
> right now. After all, who cares about an out of tree filesystem?
I would like to make two points about this statement:
Firstly, Btrfs was backed by Oracle, which is definitely a totally
different scale than a small group of developers.
Secondly, you are right in your complaints. That said, we do not want
to make the same mistake with Tux3 or any other filesystem again.
>
> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
> 2009, with Tux3 running as my root filesystem. By the standard applied
> to Btrfs, Tux3 should have been merged then, right? After all, our
> nospace handling worked just as well as theirs at that time.
As far as I can remember from the posts on the mailing list, Tux3 has
changed so significantly in the last six years, with the features that
I keep referencing, that it can hardly be compared anymore with what
was presented in 2009.
>
> Regards,
>
> Daniel
Thanks
Best regards
Have fun
C.S.
On 05/12/2015 02:30 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> Phoronix published a headline that identifies Dave Chinner as
>> someone who takes shots at other projects. Seems pretty much on
>> the money to me, and it ought to be obvious why he does it.
>
> Phoronix turns any correction or criticism into an attack.
Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we
need is a monoculture in Linux news, and we are dangerously
close to that now.
So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience,
that claim is absurd. Add to that the first hand experience
of roughly two billion other people. Seems to be a bit self
serving too, or was that just an accident.
> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
> friendly competitors, not Enemies.
You are wrong about Dave. These are not the words of any friend:
"I don't think I'm alone in my suspicion that there was something
stinky about your numbers." -- Dave Chinner
Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.
Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on; he just picked up
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words by Dave, and more subtly by Ted, but the
intent is clear and unmistakable. Apologies from both are still
in order, but it will be a rainy day in that hot place before we
ever see either of them do the right thing.
> They assume that you are working in good faith (but are
> inexperienced compared to them), and you need to assume that they are working in good faith. If they
> ever do resort to underhanded means to sabotage you, Linus and the other kernel developers will take
> action. But pointing out limits in your current implementation, problems in your benchmarks based on
> how they are run, and concepts that are going to be difficult to merge is not underhanded, it's
> exactly the type of assistance that you should be greatful for in friendly competition.
>
> You were the one who started crowing about how badly XFS performed.
Not at all; somebody else posted the terrible XFS benchmark
result, then Dave put up a big smokescreen to try to deflect
attention from it. There is a term for that kind of logical
fallacy:
http://en.wikipedia.org/wiki/Proof_by_intimidation
Seems to have worked well on you. But after all those words,
XFS does not run any faster, and it clearly needs to.
> Dave gave a long and detailed
> explination about the reasons for the differences, and showing benchmarks on other hardware that
> showed that XFS works very well there. That's not an attack on EXT4 (or Tux3), it's an explination.
>
>>>> The real question is, has the Linux development process become
>>>> so political and toxic that worthwhile projects fail to benefit
>>>> from supposed grassroots community support. You are the poster
>>>> child for that.
>>>
>>> The linux development process is making code available, responding to concerns from the experts in
>>> the community, and letting the code talk for itself.
>>
>> Nice idea, but it isn't working. Did you let the code talk to you?
>> Right, you let the code talk to Dave Chinner, then you listen to
>> what Dave Chinner has to say about it. Any chance that there might
>> be some creative licence acting somewhere in that chain?
>
> I have my own concerns about how things are going to work (I've voiced some of them), but no, I
> haven't tried running Tux3 because you say it's not ready yet.
>
>>> There have been many people pushing code for inclusion that has not gotten into the kernel, or has
>>> not been used by any distros after it's made it into the kernel, in spite of benchmarks being posted
>>> that seem to show how wonderful the new code is. ReiserFS was one of the first, and part of what
>>> tarnished it's reputation with many people was how much they were pushing the benchmarks that were
>>> shown to be faulty (the one I remember most vividly was that the entire benchmark completed in <30
>>> seconds, and they had the FS tuned to not start flushing data to disk for 30 seconds, so the entire
>>> 'benchmark' ran out of ram without ever touching the disk)
>>
>> You know what to do about checking for faulty benchmarks.
>
> That requires that the code be readily available, which last I heard, Tux3 wasn't. Has this been fixed?
>
>>> So when Ted and Dave point out problems with the benchmark (the difference in behavior between a
>>> single spinning disk, different partitions on the same disk, SSDs, and ramdisks), you would be
>>> better off acknowledging them and if you can't adjust and re-run the benchmarks, don't start
>>> attacking them as a result.
>>
>> Ted and Dave failed to point out any actual problem with any
>> benchmark. They invented issues with benchmarks and promoted those
>> as FUD.
>
> They pointed out problems with using ramdisk to simulate a SSD and huge differences between spinning
> rust and an SSD (or disk array). Those aren't FUD.
>
>>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>>> forking is going to have on everything else (both in complexity and in performance impact to other
>>> things)
>>
>> Yet he clearly wrote "we" as if he believes he is part of it.
>
> He is part of the group of people who use and work with this stuff, so he is part of it.
>
>> Now that ENOSPC is done to a standard way beyond what Btrfs had
>> when it was merged, the next item on the agenda is writeback. That
>> involves us and VFS people as you say, and not Dave Chinner, who
>> only intends to obstruct the process as much as he possibly can. He
>> should get back to work on his own project. Nobody will miss his
>> posts if he doesn't make them. They contribute nothing of value,
>> create a lot of bad blood, and just serve to further besmirch the
>> famously tarnished reputation of LKML.
>
> BTRFS is a perfect example of how not to introduce a new filesystem. Lots of hype, the presumption
> that is is going to replace all the existing filesystems because it's so much better (especially
> according to benchmarks). But then progress stalled before it was really ready, and it's still
> something most people avoid.
>
>>>> You know that Tux3 is already fast. Not just that of course. It
>>>> has a higher standard of data integrity than your metadata-only
>>>> journalling filesystem and a small enough code base that it can
>>>> be reasonably expected to reach the quality expected of an
>>>> enterprise class filesystem, quite possibly before XFS gets
>>>> there.
>>>
>>> We wouldn't expect anyone developing a new filesystem to believe any differently.
>>
>> It is not a matter of belief, it is a matter of testable fact. For
>> example, you can count the lines. You can run the same benchmarks.
>>
>> Proving the data consistency claims would be a little harder, you
>> need tools for that, and some of those aren't built yet. Or, if you
>> have technical ability, you can read the code and the copious design
>> material that has been posted and convince yourself that, yes, there
>> is something cool here, why didn't anybody do it that way before?
>> But of course that starts to sound like work. Debating nontechnical
>> issues and playing politics seems so much more like fun.
>
> why are you picking a fight? there was no attack in my statement?
>
>>> If they didn't
>>> believe this, why would they be working on the filesystem instead of just using an existing
>>> filesystem.
>>
>> Right, and it is my job to convince you that what I believe for
>> perfectly valid, demonstrable technical reasons, is really true. I do
>> not see why you feel it is your job to convince me that the obviously
>> broken Linux community process is not in fact broken, and that a
>> certain person who obviously has an agenda, is not actually obstructing.
>
> You will need to have a fully working, usable system before you can convince people that you are
> right. A partial system may look good, but how much is fixing the corner cases that you haven't
> gotten to yet going to hurt it? That there are going to be such cases is pretty much a given, and
> that changing things to add code to work around the pathalogical conditions is going to hurt the
> common case is pretty close to a given (it's one of those things that isn't mathamatically
> guaranteed, but happens on 99.99999+% of projects)
>
>>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>>> may not be right, and nobody will know until you are to a usable state and other people can start
>>> beating on it.
>>
>> With ENOSPC we are at that state. Tux3 would get more testing and advance
>> faster if it was merged. Things like ifdefs, grandiose new schemes for
>> writeback infrastructure, dumb little hooks in the mkwrite path, those
>> are all just manufactured red herrings. Somebody wanted those to be
>> issues, so now they are issues. Fake ones.
>
> Ok, so you are happy with your allocation strategy? you didn't seem to be a few e-mail ago.
>
> but if you think it's ready for users, then start working to submit it in the next merge window.
> Dave said that except for one part, there was no reason not to merge it. That's pretty good. So you
> need to be discussing that one part with the the folks that Dave pointed you at.
>
>> Nobody is trying to trick you. Just stating a fact. You ought to be able
>> to figure out by now that Tux3 is worth merging.
>>
>> You might possibly have an argument that merging a filesystem that
>> crashes as soon as it fills the disk is just sheer stupidity than can
>> only lead to embarrassment in the long run, but then you would need to
>> explain why Btrfs was merged. As I recall, it went something like, Chris
>> had it on a laptop, so it must be a filesystem, and wow look at that
>> feature list. Then it got merged in a completely unusable state and got
>> worked on. If it had not been merged, Btrfs would most likely be dead
>> right now. After all, who cares about an out of tree filesystem?
>
> As I said above, Btrfs is a perfect example of how not to do things.
>
> The other think you need to realize is that getting something in the kernel isn't a one-time effort,
> the code needs to be maintained over time (especially for a filesystem), and it's very possible for
> a developer/team/company to be so toxic and hostile to others that the Linux folks don't want to
> deal with the hassle of dealing with them. You are starting out on a path to put yourself into that
> category. Calm down and stop taking offense at everything. Your succeeding doesn't require that
> other people loose, so stop talking as if it's a zero sum game and you have to beat down the enemy
> to get your code accepted.
>
> David Lang
>
>> By the way, I gave my Tux3 presentation at SCALE 7x in Los Angeles in
>> 2009, with Tux3 running as my root filesystem. By the standard applied
>> to Btrfs, Tux3 should have been merged then, right? After all, our
>> nospace handling worked just as well as theirs at that time.
>>
>> Regards,
>>
>> Daniel
>>
>
On Tue, 12 May 2015, Daniel Phillips wrote:
> On 05/12/2015 02:30 PM, David Lang wrote:
>> On Tue, 12 May 2015, Daniel Phillips wrote:
>>> Phoronix published a headline that identifies Dave Chinner as
>>> someone who takes shots at other projects. Seems pretty much on
>>> the money to me, and it ought to be obvious why he does it.
>>
>> Phoronix turns any correction or criticism into an attack.
>
> Phoronix gets attacked in an unseemly way by a number of people
> in the developer community who should behave better. You are
> doing it yourself, seemingly oblivious to the valuable role that
> the publication plays in our community. Google for filesystem
> benchmarks. Where do you find them? Right. Not to mention the
> Xorg coverage, community issues, etc etc. The last thing we
> need is a monoculture in Linux news, and we are dangerously
> close to that now.
It's on my 'sites to check daily' list, but they have also had some pretty nasty
errors in their benchmarks, some of which have been pointed out repeatedly over
the years (doing fsync-dependent workloads in situations where one FS actually
honors the fsyncs and another doesn't is a classic).
> So, how is "EXT4 is not as stable or as well tested as most
> people think" not a cheap shot? By my first hand experience,
> that claim is absurd. Add to that the first hand experience
> of roughly two billion other people. Seems to be a bit self
> serving too, or was that just an accident.
I happen to think that it's correct. It's not that Ext4 isn't tested, but that
people's expectations of how much it's been tested, and at what scale, don't
match the reality.
>> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
>> friendly competitors, not Enemies.
>
> You are wrong about Dave These are not the words of any friend:
>
> "I don't think I'm alone in my suspicion that there was something
> stinky about your numbers." -- Dave Chinner
You are looking for offense. That just means that something is wrong with them,
not that they were deliberately falsified.
> Basically allegations of cheating. And wrong. Maybe Dave just
> lives in his own dreamworld where everybody is out to get him, so
> he has to attack people he views as competitors first.
You are the one doing the attacking. Please stop. Take a break if needed, and
then get back to producing software rather than complaining about how everyone
is out to get you.
David Lang
On Tue, May 12, 2015 at 03:35:43PM -0700, David Lang wrote:
>
> I happen to think that it's correct. It's not that Ext4 isn't tested, but
> that people's expectations of how much it's been tested, and at what scale
> don't match the reality.
Ext4 is used at Google, on a very large number of disks. Exactly how
large is not something I'm allowed to say, but there's a very amusing
Ted Talk by Randall Munroe (of xkcd fame) on that topic:
http://tedsummaries.com/2014/05/14/randall-munroe-comics-that-ask-what-if/
One thing I can say is that shortly after we deployed ext4 at Google,
thanks to having a very large number of disks, and because we have
very good system monitoring, we detected a file system corruption
problem that happened with a very low probability, but we had enough
disks that we could detect the pattern. (Fortunately, because
Google's cluster file system has replication and/or erasure coding, no
user data was lost.) Even though we could notice the problem, it took
us several months to track it down.
When we finally did, it turned out to be a race condition which only
took place under high memory pressure. What was *very* amusing was
that after fixing the problem for ext4, I looked at ext3 and discovered
that (a) the bug ext4 had inherited was also in ext3, and (b) the
bug in ext3 had not been noticed in several enterprise distribution
testing runs done by Red Hat, SuSE, and IBM --- for well over a
**decade**.
What this means is that it's hard for *any* file system to be that
well tested; it's hard to substitute for years and years of production
use, hopefully in systems that have very rigorous monitoring so you
would notice if data or file system metadata is getting corrupted in
ways that can't be explained as hardware errors. The fact that we
found a bug that was never discovered in ext3 after years and years of
use in many enterprises is a testimony to that fact.
(This is also why the fact that Facebook has started using btrfs in
production is going to be a very good thing for btrfs. I'm sure they
will find all sorts of problems once they start running at large
scale, which is a _good_ thing; that's how those problems get fixed.)
Of course, using xfstests certainly helps a lot, and so in my opinion
all serious file system developers should be regularly using xfstests
as a part of the daily development cycle, and should be extremely
ruthless about not allowing any test regressions.
Best regards,
- Ted
On 05/12/2015 02:30 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> Phoronix published a headline that identifies Dave Chinner as
>> someone who takes shots at other projects. Seems pretty much on
>> the money to me, and it ought to be obvious why he does it.
>
> Phoronix turns any correction or criticism into an attack.
Phoronix gets attacked in an unseemly way by a number of people
in the developer community who should behave better. You are
doing it yourself, seemingly oblivious to the valuable role that
the publication plays in our community. Google for filesystem
benchmarks. Where do you find them? Right. Not to mention the
Xorg coverage, community issues, etc etc. The last thing we need
is a monoculture in Linux news, and we are dangerously close to
that now.
So, how is "EXT4 is not as stable or as well tested as most
people think" not a cheap shot? By my first hand experience, that
claim is absurd. Add to that the first hand experience of roughly
two billion other people. Seems to be a bit self serving too, or
was that just an accident.
> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
> friendly competitors, not Enemies.
You are wrong about Dave. These are not the words of any friend:
"I don't think I'm alone in my suspicion that there was something
stinky about your numbers." -- Dave Chinner
Basically allegations of cheating. And wrong. Maybe Dave just
lives in his own dreamworld where everybody is out to get him, so
he has to attack people he views as competitors first.
Ted has more taste and his FUD attack was more artful, but it
still amounted to nothing more than piling on; he just picked up
Dave's straw man uncritically and proceeded to knock it down
some more. Nice way of distracting attention from the fact that
we actually did what we claimed, and instead of getting the
appropriate recognition for it, we were called cheaters. More or
less in so many words by Dave, and more subtly by Ted, but the
intent is clear and unmistakable. Apologies from both are still
in order, but it will be a rainy day in that hot place before we
ever see either of them do the right thing.
That said, Ted is no enemy; he is brilliant and usually conducts
himself admirably. Except sometimes. I wish I could say the same
about Dave, but what I see there is a guy who has invested his
entire identity in his XFS career and is insecure that something
might conspire against him to disrupt it. I mean, come on, if you
convince Red Hat management to elevate your life's work to the
status of something that most of the paid-for servers in the
world are going to run, do you continue attacking your peers or
do you chill a bit?
> They assume that you are working in good faith (but are
> inexperienced compared to them), and you need to assume that they are working in good faith. If they
> ever do resort to underhanded means to sabotage you, Linus and the other kernel developers will take
> action. But pointing out limits in your current implementation, problems in your benchmarks based on
> how they are run, and concepts that are going to be difficult to merge is not underhanded, it's
> exactly the type of assistance that you should be greatful for in friendly competition.
>
> You were the one who started crowing about how badly XFS performed.
Not at all; somebody else posted the terrible XFS benchmark result,
then Dave put up a big smokescreen to try to deflect attention from
it. There is a term for that kind of logical fallacy:
http://en.wikipedia.org/wiki/Proof_by_intimidation
Seems to have worked well on you. But after all those words, XFS
does not run any faster, and it clearly needs to.
> Dave gave a long and detailed explination about the reasons for the differences, and showing
> benchmarks on other hardware that
> showed that XFS works very well there. That's not an attack on EXT4 (or Tux3), it's an explination.
Long, detailed, and bogus. Summary: "oh, XFS doesn't work well on
that hardware? Get new hardware." Excuse me, but other filesystems
do work well on that hardware; the problem is not with the hardware.
> I have my own concerns about how things are going to work (I've voiced some of them), but no, I
> haven't tried running Tux3 because you say it's not ready yet.
I did not say that. I said it is not ready for users. It is more
than ready for anybody who wants to develop it, or benchmark it,
or put test data on it, and has been for a long time. Except for
enospc, and that was apparently not an issue for Btrfs, was it?
>> You know what to do about checking for faulty benchmarks.
>
> That requires that the code be readily available, which last I heard, Tux3 wasn't. Has this been fixed?
You heard wrong. The code is readily available and you can clone it
from here:
https://github.com/OGAWAHirofumi/linux-tux3.git
The hirofumi-user branch has the user tools including mkfs and basic
fsck, and the hirofumi branch is a 3.19 Linus kernel that includes Tux3.
(So is the hirofumi-user branch, but Hirofumi likes people to build from
the other one, which is pure kernel.)
We do of course have patches not pushed to the public repository yet,
which include enospc, so the public code is easily crashable. If I
were you, I would wait for enospc to land, but that is by no means
necessary if your objective is just to verify that we tell the truth.
> They pointed out problems with using ramdisk to simulate a SSD and huge differences between spinning
> rust and an SSD (or disk array). Those aren't FUD.
Not FUD perhaps, but wrong all the same. I have plenty of evidence
at hand to be sure of that, so I don't need to theorize about it.
Ramdisk is surprisingly predictive of performance on other media,
and is arguably closer to what the new generation of NVRAM behaves
like than flash is.
>>> As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
>>> Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
>>> is a real advantage to be gained, but the real discussion is going to be on the impact that page
>>> forking is going to have on everything else (both in complexity and in performance impact to other
>>> things)
>>
>> Yet he clearly wrote "we" as if he believes he is part of it.
>
> He is part of the group of people who use and work with this stuff, so he is part of it.
He is not part of a committee that decides what to merge, yet he
spoke as if he was. Just a slip maybe? Let's call it that. Slip or
not, it is a divisive and offensive attitude.
> BTRFS is a perfect example of how not to introduce a new filesystem. Lots of hype, the presumption
> that is is going to replace all the existing filesystems because it's so much better (especially
> according to benchmarks). But then progress stalled before it was really ready, and it's still
> something most people avoid.
Disagree. Merging Btrfs was the only way to save it. Not everyone
avoids it. Btrfs has its share of ardent supporters, ready or not.
One day Btrfs will be ready and the rough spots will be a fading
memory. That is healthy. What Dave is trying to do to Tux3 is kind
of sick.
Even though I do not like the Btrfs design, I hope it succeeds and
fills that void where a big, fat, full featured filesystem that does
everything including sending email should be.
>> Proving the data consistency claims would be a little harder, you
>> need tools for that, and some of those aren't built yet. Or, if you
>> have technical ability, you can read the code and the copious design
>> material that has been posted and convince yourself that, yes, there
>> is something cool here, why didn't anybody do it that way before?
>> But of course that starts to sound like work. Debating nontechnical
>> issues and playing politics seems so much more like fun.
>
> why are you picking a fight? there was no attack in my statement?
Sorry, did I pick a fight? You *are* debating nontechnical issues
and politics, and it *does* sound like work to go do your own
benchmarks. And if it is not fun for you, then why are you doing it?
Please do not take that the wrong way; you obviously enjoy it and
there is nothing wrong with that.
>>> If they didn't
>>> believe this, why would they be working on the filesystem instead of just using an existing
>>> filesystem.
>>
>> Right, and it is my job to convince you that what I believe for
>> perfectly valid, demonstrable technical reasons, is really true. I do
>> not see why you feel it is your job to convince me that the obviously
>> broken Linux community process is not in fact broken, and that a
>> certain person who obviously has an agenda, is not actually obstructing.
>
> You will need to have a fully working, usable system before you can convince people that you are
> right.
Logical fallacy alert. You say there is only one way to convince
somebody of something, when in fact more ways may exist. And "fully
working" translates as "I get to decide what fully working means".
Ask yourself this: in order to convince you that you will die if you
jump off the Empire State Building, do I actually need to jump off
it, or may I explain to you the principles of gravitation instead?
Anyway, I will offer "has enospc" as a reasonable definition of "fully
working". Tux3 has actually been doing the things (out of space
handling excepted) a normal filesystem does for years. Just not
always as fast or reliably as it now does.
> A partial system may look good, but how much is fixing the corner cases that you haven't
> gotten to yet going to hurt it?
Straw man. To which corner cases do you refer, and why should we fix
them now instead of attending to the issues that we feel are important?
> That there are going to be such cases is pretty much a given, and
> that changing things to add code to work around the pathalogical conditions is going to hurt the
> common case is pretty close to a given (it's one of those things that isn't mathamatically
> guaranteed, but happens on 99.99999+% of projects)
Another straw man. To which pathological condition do you refer, and
why is it so important that we need to drop everything and attend to
it now?
>>> The ugly reality is that everyone's early versions of their new filesystem looks really good. The
>>> problem is when they extend it to cover the corner cases and when it gets stressed by real-world (as
>>> opposed to benchmark) workloads. This isn't saying that you are wrong in your belief, just that you
>>> may not be right, and nobody will know until you are to a usable state and other people can start
>>> beating on it.
>>
>> With ENOSPC we are at that state. Tux3 would get more testing and advance
>> faster if it was merged. Things like ifdefs, grandiose new schemes for
>> writeback infrastructure, dumb little hooks in the mkwrite path, those
>> are all just manufactured red herrings. Somebody wanted those to be
>> issues, so now they are issues. Fake ones.
>
> Ok, so you are happy with your allocation strategy? you didn't seem to be a few e-mail ago.
I am not happy with our allocation strategy; it can be improved
immensely. It is also not the most important thing in the world,
because nobody intends to put their mission-critical files on it.
I do see people trying to raise that issue as a merge blocker, which
would be an excellent example of how broken our community process is
if it did actually turn out to block our merge. If it concerns you
then store some files on it yourself and see if it really is a killer
problem. Alternatively, it might be exactly the sort of thing that
an interested contributor could take on, and if that is true, then
delaying merge so it can bottleneck on me instead would not make
sense.
If you actually go look at the code, you will see there is some rather
nice infrastructure in there for supporting allocation policy, and
there actually is a preliminary allocation policy; it just does not
meet our standards for production work.
> but if you think it's ready for users, then start working to submit it in the next merge window.
Red Herring. It is not supposed to be ready for users. It is supposed
to be ready for developers. Development kernel, right? Experimental
status and all that. Users are cordially invited to stay away until
further notice.
> Dave said that except for one part, there was no reason not to merge it. That's pretty good. So you
> need to be discussing that one part with the the folks that Dave pointed you at.
Oops, I missed that; are you sure? Perhaps you mean the writeback
interface. Already started on that, already talking. But do keep in
mind that his demand was always a make-work project and, frankly, a
nonsensical way to go about things. It's an >internal< API, see.
Internal APIs are declared to be flexible, by Linus himself. We
already have a nice, simple patch that implements a simple API that
works fine; we use it all the time. Dave was the one who suggested
we do it exactly like that, so we did. Then Dave moved the goalposts
by insisting that we should throw that one away and tackle a much
bigger project in core that is essentially an R&D project. Not
willing to play that game for a possibly endless number of iterations,
I turned instead to things that actually matter.
Anyway, the writeback project involves us and the VFS developers;
you know who they are. I would prefer that Dave not be involved. For
the record, Jan Kara is great to work with; did you see that patch
set he produced for us? Sadly, I was not able to get into it to
the extent it deserved at the time.
> As I said above, Btrfs is a perfect example of how not to do things.
Unfair. It worked. The alternative is most probably no Btrfs, ever.
Which do you choose?
The fact that Hirofumi and I kept on with Tux3 and got it to where it
is today, after all the nasty things that went on and are still going
on, is nothing short of a miracle. Thank Hirofumi. If it were not for
him I would have quit years ago and that would have been the end of
it. There are a lot more fun things to do in life than put up with
incessant FUD attacks from the ilk of Dave Chinner. You should tattoo
that on your arm so you can contemplate it when thinking about whether
the Linux community is dysfunctional or not.
> The other think you need to realize is that getting something in the kernel isn't a one-time effort,
> the code needs to be maintained over time (especially for a filesystem), and it's very possible for
> a developer/team/company to be so toxic and hostile to others that the Linux folks don't want to
> deal with the hassle of dealing with them. You are starting out on a path to put yourself into that
> category. Calm down and stop taking offense at everything. Your succeeding doesn't require that
> other people loose, so stop talking as if it's a zero sum game and you have to beat down the enemy
> to get your code accepted.
That argument is "blame the victim", with a bit of intimidation thrown
in. If we are to work together in an atmosphere of harmony and mutual
respect, then let's see some effort from more than one side, please.
Regards,
Daniel
On 05/12/2015 03:35 PM, David Lang wrote:
> On Tue, 12 May 2015, Daniel Phillips wrote:
>> On 05/12/2015 02:30 PM, David Lang wrote:
>>> You need to get out of the mindset that Ted and Dave are Enemies that you need to overcome, they are
>>> friendly competitors, not Enemies.
>>
>> You are wrong about Dave These are not the words of any friend:
>>
>> "I don't think I'm alone in my suspicion that there was something
>> stinky about your numbers." -- Dave Chinner
>
> you are looking for offense. That just means that something is wrong with them, not that they were
> deliberatly falsified.
I am not mistaken. Dave made sure to eliminate any doubt about
what he meant. He said "Oh, so nicely contrived. But terribly
obvious now that I've found it" among other things.
Good work, Dave. Never mind that we did not hide it.
Let's look at some more of the story. Hirofumi ran the test and
I posted the results and explained the significance. I did not
even know that dbench had fsyncs at that time, since I had never
used it myself, nor that Hirofumi had taken them out in order to
test the things he was interested in. Which turned out to be very
interesting, don't you agree?
Anyway, Hirofumi followed up with a clear explanation, here:
http://phunq.net/pipermail/tux3/2013-May/002022.html
Instead of accepting that, Dave chose to ride right over it and
carry on with his thinly veiled allegations of intellectual fraud,
using such words as "it's deceptive at best." Dave managed to
insult two people that day.
Dave dismissed the basic breakthrough we had made as "silly
marketing fluff". By now I hope you understand that the result in
question was anything but silly marketing fluff. There are real,
technical reasons that Tux3 wins benchmarks, and the specific
detail that Dave attacked so ungraciously is one of them.
Are you beginning to see who the victim of this mugging was?
>> Basically allegations of cheating. And wrong. Maybe Dave just
>> lives in his own dreamworld where everybody is out to get him, so
>> he has to attack people he views as competitors first.
>
> you are the one doing the attacking.
Defending, not attacking. There is a distinction.
> Please stop. Take a break if needed, and then get back to
> producing software rather than complaining about how everyone is out to get you.
Dave is not "everyone", and a "shut up" will not fix this.
What will fix this is a simple, professional statement that
an error was made, that there was no fraud or anything even
remotely resembling it, and that instead a technical
contribution was made. It is not even important that it come
from Dave. But it is important that the aspersions that were
cast be recognized for what they were.
By the way, do you remember the scene from "Unforgiven" where
the sheriff is kicking the guy on the ground and saying "I'm
not kicking you"? It feels like that.
As far as who should take a break goes, note that either of
us can stop the thread. Does it necessarily have to be me?
If you would prefer some light reading, you could read "How fast
can we fail?", which I believe is relevant to the question of
whether Tux3 is mergeable or not.
https://lkml.org/lkml/2015/5/12/663
Regards,
Daniel
On Tue 2015-05-12 13:54:58, Daniel Phillips wrote:
> On 05/12/2015 11:39 AM, David Lang wrote:
> > On Mon, 11 May 2015, Daniel Phillips wrote:
> >>> ...it's the mm and core kernel developers that need to
> >>> review and accept that code *before* we can consider merging tux3.
> >>
> >> Please do not say "we" when you know that I am just as much a "we"
> >> as you are. Merging Tux3 is not your decision. The people whose
> >> decision it actually is are perfectly capable of recognizing your
> >> agenda for what it is.
> >>
> >> http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
> >> "XFS Developer Takes Shots At Btrfs, EXT4"
> >
> > umm, Phoronix has no input on what gets merged into the kernel. they also hae a reputation for
> > trying to turn anything into click-bait by making it sound like a fight when it isn't.
>
> Perhaps you misunderstood. Linus decides what gets merged. Andrew
> decides. Greg decides. Dave Chinner does not decide, he just does
> his level best to create the impression that our project is unfit
> to merge. Any chance there might be an agenda?
Dunno. _Your_ agenda seems to be "attack other maintainers so much
that you can later claim they are biased".
Not going to work, sorry.
> > As Dave says above, it's not the other filesystem people you have to convince, it's the core VFS and
> > Memory Mangement folks you have to convince. You may need a little benchmarking to show that there
> > is a real advantage to be gained, but the real discussion is going to be on the impact that page
> > forking is going to have on everything else (both in complexity and in performance impact to other
> > things)
>
> Yet he clearly wrote "we" as if he believes he is part of it.
>
> Now that ENOSPC is done to a standard way beyond what Btrfs had
> when it was merged, the next item on the agenda is writeback. That
> involves us and VFS people as you say, and not Dave Chinner, who
> only intends to obstruct the process as much as he possibly can. He
Why would he do that? Aha, maybe because you keep attacking him all
the time. Or maybe because your code is not up to the kernel
standards. You want to claim it is the former, but it really looks
like the latter.
Just stop doing that. You are not creating a nice atmosphere and you
are not getting tux3 merged that way.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon 2015-05-11 16:53:10, Daniel Phillips wrote:
> Hi Pavel,
>
> On 05/11/2015 03:12 PM, Pavel Machek wrote:
> >>> It is a fact of life that when you change one aspect of an intimately interconnected system,
> >>> something else will change as well. You have naive/nonexistent free space management now; when you
> >>> design something workable there it is going to impact everything else you've already done. It's an
> >>> easy bet that the impact will be negative, the only question is to what degree.
> >>
> >> You might lose that bet. For example, suppose we do strictly linear allocation
> >> each delta, and just leave nice big gaps between the deltas for future
> >> expansion. Clearly, we run at similar or identical speed to the current naive
> >> strategy until we must start filling in the gaps, and at that point our layout
> >> is not any worse than XFS, which started bad and stayed that way.
> >
> > Umm, are you sure. If "some areas of disk are faster than others" is
> > still true on todays harddrives, the gaps will decrease the
> > performance (as you'll "use up" the fast areas more quickly).
>
> That's why I hedged my claim with "similar or identical". The
> difference in media speed seems to be a relatively small effect
When you knew it couldn't be identical? That's rather confusing, right?
Perhaps you should post more details on how your benchmark is structured
next time, so we can see you did not make any trivial mistakes...?
Or just clean the code up so that it can get merged, so that we can
benchmark it ourselves...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 05/13/2015 12:25 AM, Pavel Machek wrote:
> On Mon 2015-05-11 16:53:10, Daniel Phillips wrote:
>> Hi Pavel,
>>
>> On 05/11/2015 03:12 PM, Pavel Machek wrote:
>>>>> It is a fact of life that when you change one aspect of an intimately interconnected system,
>>>>> something else will change as well. You have naive/nonexistent free space management now; when you
>>>>> design something workable there it is going to impact everything else you've already done. It's an
>>>>> easy bet that the impact will be negative, the only question is to what degree.
>>>>
>>>> You might lose that bet. For example, suppose we do strictly linear allocation
>>>> each delta, and just leave nice big gaps between the deltas for future
>>>> expansion. Clearly, we run at similar or identical speed to the current naive
>>>> strategy until we must start filling in the gaps, and at that point our layout
>>>> is not any worse than XFS, which started bad and stayed that way.
>>>
>>> Umm, are you sure. If "some areas of disk are faster than others" is
>>> still true on todays harddrives, the gaps will decrease the
>>> performance (as you'll "use up" the fast areas more quickly).
>>
>> That's why I hedged my claim with "similar or identical". The
>> difference in media speed seems to be a relatively small effect
>
> When you knew it can't be identical? That's rather confusing, right?
Maybe. The top of the thread is about a measured performance deficit of
a factor of five. Next to that, a media transfer rate variation by
a factor of two already starts to look small, and gets smaller when
scrutinized.
Let's say our delta size is 400MB (typical under load) and we leave
a "nice big gap" of 112 MB after flushing each one. Let's say we do
two thousand of those before deciding that we have enough information
available to switch to some smarter strategy. We used one GB of a
4TB disk, say. The media transfer rate decreased by a factor of:
(1 - 2/1000) = .2%.
The performance deficit in question and the difference in media rate are
three orders of magnitude apart; does that justify the term "similar or
identical"?
> Perhaps you should post more details how your benchmark is structured
> next time, so we can see you did not make any trivial mistakes...?
Makes sense to me, though I do take considerable care to ensure that
my results are reproducible. That is borne out by the fact that Mike
did reproduce them, albeit from the published branch, which is a bit
behind current work. And he went on to do some original testing of his own.
I had no idea Tux3 was so much faster than XFS on the Git self test,
because we never specifically tested anything like that, or optimized
for it. Of course I was interested in why. And that was not all: Mike
also noticed a really interesting fact about latency that I failed to
reproduce. That went onto the list of things to investigate as time
permits.
I reproduced Mike's results according to his description, by actually
building Git in the VM and running the selftests just to see if the same
thing happened, which it did. I didn't think that was worth mentioning
at the time, because if somebody publishes benchmarks, my first instinct
is to trust them. Trust and verify.
> Or just clean the code up so that it can get merged, so that we can
> benchmark ourselves...
Third possibility: build from our repository, as Mike did. Obviously,
we need to merge to master so the build process matches the Wiki. But
Hirofumi is busy with other things, so please be patient.
Regards,
Daniel
On 05/13/2015 04:31 AM, Daniel Phillips wrote:
Let me be the first to catch that arithmetic error....
> Let's say our delta size is 400MB (typical under load) and we leave
> a "nice big gap" of 112 MB after flushing each one. Let's say we do
> two thousand of those before deciding that we have enough information
> available to switch to some smarter strategy. We used one GB of
> a 4TB disk, say. The media transfer rate decreased by a factor of:
>
> (1 - 2/1000) = .2%.
Ahem, no, we used 1/8th of the disk. The time/data rate increased
from unity to 1.125, for an average of 1.0625 across the region.
If we only use 1/10th of the disk instead, by not leaving gaps,
then the average time/data across the region is 1.05. The
difference, 1.0625 - 1.05, means the gap strategy increases media
transfer time by 1.25%, which is not significant compared to the
performance deficit in question of 400%. So, same argument: the
change in media transfer rate is just a distraction from the
original question.
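To spell out the assumption behind those averages: if transfer time per
unit of data rises roughly linearly from 1.0 at the fast end of the disk
to 2.0 at the slow end, the figures fall out directly. A throwaway check
in C, under that (simplified, assumed) model:

#include <stdio.h>

int main(void)
{
        /*
         * Assume time per unit of data grows linearly from 1.0 to 2.0
         * across the disk, so the average over the first fraction f of
         * the disk is 1 + f/2.
         */
        double with_gaps = 1.0 + (1.0 / 8) / 2;   /* 512MB per delta: 1/8 of disk */
        double no_gaps = 1.0 + (1.0 / 10) / 2;    /* 400MB per delta: 1/10 of disk */

        printf("average time/data %.4f vs %.4f, difference %.2f%%\n",
               with_gaps, no_gaps, (with_gaps - no_gaps) * 100);
        return 0;
}

This prints 1.0625 vs 1.0500, a difference of 1.25%, matching the
figure above.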
In any case, we probably want to start using a smarter strategy
sooner than 1000 commits, maybe after ten or a hundred commits,
which would make the change in media transfer rate even less
relevant.
The thing is, when data first starts landing on media, we do not
have much information about what the long term load will be. So we
just analyze the clues we have in the early commits and put those
early deltas onto disk in the most efficient format, which for
Tux3 seems to be linear per delta. There would be exceptions, but
that is the common case.
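For concreteness, the early-commit layout being described could be as
simple as the following sketch (invented names, not the Tux3 allocator):
allocate each delta strictly linearly at a cursor, then skip a gap
before the next delta.

#include <stdint.h>

struct layout_cursor {
        uint64_t next;          /* next free block */
        uint64_t gap;           /* blocks to skip between deltas */
};

/* Lay the whole delta out contiguously, then leave room to grow. */
static uint64_t alloc_delta_extent(struct layout_cursor *c, uint64_t blocks)
{
        uint64_t start = c->next;

        c->next += blocks + c->gap;     /* strictly linear within the delta */
        return start;
}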
Then get smarter later. The intent is to get the best of both:
early efficiency, and long term nice aging behavior. I do not
accept the proposition that one must be sacrificed for the
other, I find that reasoning faulty.
> The performance deficit in question and the difference in media rate are
> three orders of magnitude apart; does that justify the term "similar or
> identical"?
Regards,
Daniel
On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
> Third possibility: build from our repository, as Mike did.
Sorry about that folks. I've lost all interest, it won't happen again.
-Mike
On 05/13/2015 06:08 AM, Mike Galbraith wrote:
> On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
>> Third possibility: build from our repository, as Mike did.
>
> Sorry about that folks. I've lost all interest, it won't happen again.
Thanks for your valuable contribution. Now we are seeing a steady
stream of people heading to the repository, after you showed
it could be done.
Regards,
Daniel
On Tuesday, 12 May 2015 at 18:26:28, Daniel Phillips wrote:
> On 05/12/2015 03:35 PM, David Lang wrote:
> > On Tue, 12 May 2015, Daniel Phillips wrote:
> >> On 05/12/2015 02:30 PM, David Lang wrote:
> >>> You need to get out of the mindset that Ted and Dave are Enemies that
> >>> you need to overcome, they are friendly competitors, not Enemies.
> >>
> >> You are wrong about Dave. These are not the words of any friend:
> >> "I don't think I'm alone in my suspicion that there was something
> >> stinky about your numbers." -- Dave Chinner
> >
> >
> >
> > you are looking for offense. That just means that something is wrong
> > with them, not that they were deliberately falsified.
>
> I am not mistaken. Dave made sure to eliminate any doubt about
> what he meant. He said "Oh, so nicely contrived. But terribly
> obvious now that I've found it" among other things.
Daniel, what are you trying to achieve here?
I thought you wanted to create interest for your filesystem and acceptance
for merging it.
What I see you are actually creating, though, is something different.
Is what you see after you send your mails really what you want to see? If
not… why not? And if you seek change, where can you create change?
I really like to see Tux3 inside the kernel for easier testing, yet I also
see that the way you, in your opinion, "defend" it, does not seem to move
that goal any closer, quite the opposite. It triggers polarity and
resistance.
I believe it to be more productive to work together with the people who will
decide about what goes into the kernel and the people whose opinions are
respected by them, instead of against them.
"Assume good faith" can help here. No amount of accusing people of bad
intention will change them. The only thing you have the power to change is
your approach. You absolutely and ultimately do not have the power to change
other people. You can't force Tux3 in by sheer willpower or attacking
people.
On any account for anyone discussing here: I believe that any personal
attacks, counter-attacks or "you are wrong" kind of speech will not help to
move this discussion out of the circling it seems to be in at the moment.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> Daniel, what are you trying to achieve here?
>
> I thought you wanted to create interest for your filesystem and acceptance
> for merging it.
>
> What I see you are actually creating, though, is something different.
>
> Is what you see after you send your mails really what you want to see? If
> not… why not? And if you seek change, where can you create change?
That is the question indeed, whether to try and change the system
while merging, or just keep smiling and get the job done. The problem
is, I am just too stupid to realize that I can't change the system,
which is famously unpleasant for submitters.
> I really like to see Tux3 inside the kernel for easier testing, yet I also
> see that the way you, in your opinion, "defend" it, does not seem to move
> that goal any closer, quite the opposite. It triggers polarity and
> resistance.
>
> I believe it to be more productive to work together with the people who will
> decide about what goes into the kernel and the people whose opinions are
> respected by them, instead of against them.
Obviously true.
> "Assume good faith" can help here. No amount of accusing people of bad
> intention will change them. The only thing you have the power to change is
> your approach. You absolutely and ultimately do not have the power to change
> other people. You can't force Tux3 in by sheer willpower or attacking
> people.
>
> On any account for anyone discussing here: I believe that any personal
> attacks, counter-attacks or "you are wrong" kind of speech will not help to
> move this discussion out of the circling it seems to be in at the moment.
Thanks for the sane commentary. I have the power to change my behavior.
But if nobody else changes their behavior, the process remains just as
unpleasant for us as it ever was (not just me!). Obviously, this is
not the first time I have been through this, and it has never been
pleasant. After a while, contributors just get tired of the grind and
move on to something more fun. I know I did, and I am far from the
only one.
Regards,
Daniel
On Wed, May 13, 2015 at 12:37:41PM -0700, Daniel Phillips wrote:
> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
>
> > "Assume good faith" can help here. No amount of accusing people of bad
> > intention will change them. The only thing you have the power to change is
> > your approach. You absolutely and ultimately do not have the power to change
> > other people. You can't force Tux3 in by sheer willpower or attacking
> > people.
> >
> > On any account for anyone discussing here: I believe that any personal
> > attacks, counter-attacks or "you are wrong" kind of speech will not help to
> > move this discussion out of the circling it seems to be in at the moment.
>
> Thanks for the sane commentary. I have the power to change my behavior.
> But if nobody else changes their behavior, the process remains just as
> unpleasant for us as it ever was (not just me!). Obviously, this is
> not the first time I have been through this, and it has never been
> pleasant. After a while, contributors just get tired of the grind and
> move on to something more fun. I know I did, and I am far from the
> only one.
Daniel, please listen to Martin. He speaks a fundamental truth
here.
As you know, I am also interested in Tux3, and would love to
see it as a filesystem option for NAS servers using Samba. But
please think about the way you're interacting with people on the
list, and whether that makes this outcome more or less likely.
Cheers,
Jeremy.
On Wednesday, May 13, 2015 1:02:34 PM PDT, Jeremy Allison wrote:
> On Wed, May 13, 2015 at 12:37:41PM -0700, Daniel Phillips wrote:
>> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
>> ...
>
> Daniel, please listen to Martin. He speaks a fundamental truth
> here.
>
> As you know, I am also interested in Tux3, and would love to
> see it as a filesystem option for NAS servers using Samba. But
> please think about the way you're interacting with people on the
> list, and whether that makes this outcome more or less likely.
Thanks Jeremy, that means more from you than anyone.
Regards,
Daniel
On Wednesday, 13 May 2015 at 12:37:41, Daniel Phillips wrote:
> On 05/13/2015 12:09 PM, Martin Steigerwald wrote:
> > Daniel, what are you trying to achieve here?
> >
> > I thought you wanted to create interest for your filesystem and
> > acceptance for merging it.
> >
> > What I see you are actually creating, though, is something different.
> >
> > Is what you see after you send your mails really what you want to see?
> > If
> > not… why not? And if you seek change, where can you create change?
>
> That is the question indeed, whether to try and change the system
> while merging, or just keep smiling and get the job done. The problem
> is, I am just too stupid to realize that I can't change the system,
> which is famously unpleasant for submitters.
>
> > I really like to see Tux3 inside the kernel for easier testing, yet I
> > also see that the way you, in your opinion, "defend" it, does not seem
> > to move that goal any closer, quite the opposite. It triggers polarity
> > and resistance.
> >
> > I believe it to be more productive to work together with the people who
> > will decide about what goes into the kernel and the people whose
> > opinions are respected by them, instead of against them.
>
> Obviously true.
>
> > "Assume good faith" can help here. No amount of accusing people of bad
> > intention will change them. The only thing you have the power to change
> > is your approach. You absolutely and ultimately do not have the power
> > to change other people. You can't force Tux3 in by sheer willpower or
> > attacking people.
> >
> > On any account for anyone discussing here: I believe that any personal
> > attacks, counter-attacks or "you are wrong" kind of speech will not help
> > to move this discussion out of the circling it seems to be in at the
> > moment.
> Thanks for the sane commentary. I have the power to change my behavior.
> But if nobody else changes their behavior, the process remains just as
> unpleasant for us as it ever was (not just me!). Obviously, this is
> not the first time I have been through this, and it has never been
> pleasant. After a while, contributors just get tired of the grind and
> move on to something more fun. I know I did, and I am far from the
> only one.
Daniel, if you want to change the process of patch review and inclusion into
the kernel, model an example of how you would like it to be. This has way
better chances to inspire others to change their behaviors themselves than
accusing them of bad faith.
It's yours to choose.
What outcome do you want to create?
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
On Wednesday, May 13, 2015 1:25:38 PM PDT, Martin Steigerwald wrote:
> Am Mittwoch, 13. Mai 2015, 12:37:41 schrieb Daniel Phillips:
>> On 05/13/2015 12:09 PM, Martin Steigerwald wrote: ...
>
> Daniel, if you want to change the process of patch review and
> inclusion into
> the kernel, model an example of how you would like it to be. This has way
> better chances to inspire others to change their behaviors themselves than
> accusing them of bad faith.
>
> It's yours to choose.
>
> What outcome do you want to create?
The outcome I would like is:
* Everybody has a good think about what has gone wrong in the past,
not only with troublesome submitters, but with mutual respect and
collegial conduct.
* Tux3 is merged on its merits so we get more developers and
testers and move it along faster.
* I left LKML better than I found it.
* Group hugs
Well, group hugs are optional, that one would be situational.
Regards,
Daniel
On Wednesday, 13 May 2015 at 13:38:24, Daniel Phillips wrote:
> On Wednesday, May 13, 2015 1:25:38 PM PDT, Martin Steigerwald wrote:
> > Am Mittwoch, 13. Mai 2015, 12:37:41 schrieb Daniel Phillips:
> >> On 05/13/2015 12:09 PM, Martin Steigerwald wrote: ...
> >
> > Daniel, if you want to change the process of patch review and
> > inclusion into
> > the kernel, model an example of how you would like it to be. This has
> > way
> > better chances to inspire others to change their behaviors themselves
> > than accusing them of bad faith.
> >
> > It's yours to choose.
> >
> > What outcome do you want to create?
>
> The outcome I would like is:
>
> * Everybody has a good think about what has gone wrong in the past,
> not only with troublesome submitters, but with mutual respect and
> collegial conduct.
>
> * Tux3 is merged on its merits so we get more developers and
> testers and move it along faster.
>
> * I left LKML better than I found it.
>
> * Group hugs
>
> Well, group hugs are optional, that one would be situational.
Great stuff!
Looking forward to it.
Thank you,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Addendum to that post...
On 05/12/2015 10:46 AM, I wrote:
> ...For example, we currently
> overestimate the cost of a rewrite because we would need to go poking
> around in btrees to do that more accurately. Fixing that will be quite
> a bit of work...
Ah no, I was wrong about that, it will not be a lot of work because
it does not need to be done.
Charging the full cost of a rewrite as if it is a new write is the
right thing to do because we need to be sure the commit can allocate
space to redirect the existing blocks before it frees the old ones.
So that means there is no need for the front end to go delving into
file metadata, ever, which is a relief because that would have been
expensive and messy.
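A minimal sketch of that accounting rule, with invented names (this is
not the actual Tux3 front end code): a rewrite reserves exactly as many
blocks as a fresh write, so the front end never needs to look at file
metadata to decide.

#include <errno.h>

struct commit_budget {
        unsigned long reserved;         /* blocks promised to the current delta */
        unsigned long available;        /* free blocks known to the front end */
};

static int charge_blocks(struct commit_budget *budget, unsigned long blocks)
{
        /*
         * No discount for rewrites: the commit must be able to allocate
         * the redirected blocks before the old ones are freed.
         */
        if (budget->reserved + blocks > budget->available)
                return -ENOSPC;         /* caller should force a delta commit */
        budget->reserved += blocks;
        return 0;
}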
Regards,
Daniel
Greetings,
This diff against head (f59558a04c5ad052dc03ceeda62ccf31f4ab0004) of
https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi-user
provides the optimized fsync code that was used to generate the
benchmark results here:
https://lkml.org/lkml/2015/4/28/838
"How fast can we fsync?"
This patch also applies to:
https://github.com/OGAWAHirofumi/linux-tux3/tree/hirofumi
which is a 3.19 kernel cloned from mainline. (Preferred)
Build instructions are on the wiki:
https://github.com/OGAWAHirofumi/linux-tux3/wiki
There is some slight skew in the instructions because this is
not on master yet.
****************************************************************
***** Caveat: No out of space handling on this branch! *******
*** If you run out of space you will get a mysterious assert ***
****************************************************************
Enjoy!
Daniel
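If you want something quick to throw at the new code path, a loop like
this exercises the small append-plus-fsync case the patch targets
(plain POSIX; an illustration only, not the benchmark utility from the
earlier fsync report):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        const char *path = argc > 1 ? argv[1] : "fsynctest";
        char buf[1024];
        int fd, i;

        memset(buf, 'x', sizeof buf);
        fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 1000; i++) {
                if (write(fd, buf, sizeof buf) != sizeof buf) {
                        perror("write");
                        return 1;
                }
                if (fsync(fd)) {        /* reaches tux3_sync_file() on Tux3 */
                        perror("fsync");
                        return 1;
                }
        }
        close(fd);
        return 0;
}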
diff --git a/fs/tux3/buffer.c b/fs/tux3/buffer.c
index ef0d917..a141687 100644
--- a/fs/tux3/buffer.c
+++ b/fs/tux3/buffer.c
@@ -29,7 +29,7 @@ TUX3_DEFINE_STATE_FNS(unsigned long, buf, BUFDELTA_AVAIL, BUFDELTA_BITS,
* may not work on all arch (If set_bit() and cmpxchg() is not
* exclusive, this has race).
*/
-static void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
+void tux3_set_bufdelta(struct buffer_head *buffer, int delta)
{
unsigned long state, old_state;
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..955c441a 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -289,12 +289,13 @@ static int commit_delta(struct sb *sb)
req_flag |= REQ_NOIDLE | REQ_FLUSH | REQ_FUA;
}
- trace("commit %i logblocks", be32_to_cpu(sb->super.logcount));
+ trace("commit %i logblocks", logcount(sb));
err = save_metablock(sb, req_flag);
if (err)
return err;
- tux3_wake_delta_commit(sb);
+ if (!fsync_mode(sb))
+ tux3_wake_delta_commit(sb);
/* Commit was finished, apply defered bfree. */
return unstash(sb, &sb->defree, apply_defered_bfree);
@@ -314,8 +315,7 @@ static void post_commit(struct sb *sb, unsigned delta)
static int need_unify(struct sb *sb)
{
- static unsigned crudehack;
- return !(++crudehack % 3);
+ return logcount(sb) > 300; /* FIXME: should be based on bandwidth and tunable */
}
/* For debugging */
@@ -359,7 +359,7 @@ static int do_commit(struct sb *sb, int flags)
* FIXME: there is no need to commit if normal inodes are not
* dirty? better way?
*/
- if (!(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
+ if (0 && !(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
goto out;
/* Prepare to wait I/O */
@@ -402,6 +402,7 @@ static int do_commit(struct sb *sb, int flags)
#endif
if ((!no_unify && need_unify(sb)) || (flags & __FORCE_UNIFY)) {
+ trace("unify %u, delta %u", sb->unify, delta);
err = unify_log(sb);
if (err)
goto error; /* FIXME: error handling */
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..31cd51e 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -198,6 +198,8 @@ long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
if (work->reason == WB_REASON_SYNC)
goto out;
+ trace("tux3_writeback, reason = %i", work->reason);
+
if (work->reason == WB_REASON_TUX3_PENDING) {
struct tux3_wb_work *wb_work;
/* Specified target delta for staging. */
@@ -343,3 +345,7 @@ static void schedule_flush_delta(struct sb *sb, struct delta_ref *delta_ref)
sb->delta_pending++;
wake_up_all(&sb->delta_transition_wq);
}
+
+#ifdef __KERNEL__
+#include "commit_fsync.c"
+#endif
diff --git a/fs/tux3/commit_fsync.c b/fs/tux3/commit_fsync.c
new file mode 100644
index 0000000..9a59c59
--- /dev/null
+++ b/fs/tux3/commit_fsync.c
@@ -0,0 +1,341 @@
+/*
+ * Optimized fsync.
+ *
+ * Copyright (c) 2015 Daniel Phillips
+ */
+
+#include <linux/delay.h>
+
+static inline int fsync_pending(struct sb *sb)
+{
+ return atomic_read(&sb->fsync_pending);
+}
+
+static inline int delta_needed(struct sb *sb)
+{
+ return waitqueue_active(&sb->delta_transition_wq);
+}
+
+static inline int fsync_drain(struct sb *sb)
+{
+ return test_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+}
+
+static inline unsigned fsync_group(struct sb *sb)
+{
+ return atomic_read(&sb->fsync_group);
+}
+
+static int suspend_transition(struct sb *sb)
+{
+ while (sb->suspended == NULL) {
+ if (!test_and_set_bit(TUX3_STATE_TRANSITION_BIT, &sb->backend_state)) {
+ sb->suspended = delta_get(sb);
+ return 1;
+ }
+ cpu_relax();
+ }
+ return 0;
+}
+
+static void resume_transition(struct sb *sb)
+{
+ delta_put(sb, sb->suspended);
+ sb->suspended = NULL;
+
+ if (need_unify(sb))
+ delta_transition(sb);
+
+ /* Make sure !suspended is visible before transition clear */
+ smp_mb__before_atomic();
+ clear_bit(TUX3_STATE_TRANSITION_BIT, &sb->backend_state);
+ /* Make sure transition clear is visible before drain clear */
+ smp_mb__before_atomic();
+ clear_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+ wake_up_all(&sb->delta_transition_wq);
+}
+
+static void tux3_wait_for_free(struct sb *sb, unsigned delta)
+{
+ unsigned free_delta = delta + TUX3_MAX_DELTA;
+ /* FIXME: better to be killable */
+ wait_event(sb->delta_transition_wq,
+ delta_after_eq(sb->delta_free, free_delta));
+}
+
+/*
+ * Write log and commit. (Mostly borrowed from do_commit)
+ *
+ * This needs specific handling for the commit block, so
+ * maybe add an fsync flag to commit_delta.
+ */
+static int commit_fsync(struct sb *sb, unsigned delta, struct blk_plug *plug)
+{
+ write_btree(sb, delta);
+ write_log(sb);
+ blk_finish_plug(plug);
+ commit_delta(sb);
+ post_commit(sb, delta);
+ return 0;
+}
+
+enum { groups_per_commit = 4 };
+
+/*
+ * Backend fsync commit task, serialized with delta backend.
+ */
+void fsync_backend(struct work_struct *work)
+{
+ struct sb *sb = container_of(work, struct fsync_work, work)->sb;
+ struct syncgroup *back = &sb->fsync[(fsync_group(sb) - 1) % fsync_wrap];
+ struct syncgroup *front = &sb->fsync[fsync_group(sb) % fsync_wrap];
+ struct syncgroup *idle = &sb->fsync[(fsync_group(sb) + 1) % fsync_wrap];
+ unsigned back_delta = sb->suspended->delta - 1;
+ unsigned start = fsync_group(sb), groups = 0;
+ struct blk_plug plug;
+ int err; /* How to report?? */
+
+ trace("enter fsync backend, delta = %i", sb->suspended->delta);
+ tux3_start_backend(sb);
+ sb->flags |= SB_FSYNC_FLUSH_FLAG;
+
+ while (1) {
+ sb->ioinfo = NULL;
+ assert(list_empty(&tux3_sb_ddc(sb, back_delta)->dirty_inodes));
+ while (atomic_read(&front->busy)) {
+ struct ioinfo ioinfo;
+ unsigned i;
+ /*
+ * Verify that the tail of the group queue is idle in
+ * the sense that all waiting fsyncs woke up and released
+ * their busy counts. This busy wait is only theoretical
+ * because fsync tasks have plenty of time to wake up
+ * while the next group commits to media, but handle
+ * it anyway for completeness.
+ */
+ for (i = 0; atomic_read(&idle->busy); i++)
+ usleep_range(10, 1000);
+ if (i)
+ tux3_warn(sb, "*** %u spins on queue full ***", i);
+ reinit_completion(&idle->wait);
+
+ /*
+ * Bump the fsync group counter so fsync backend owns the
+ * next group of fsync inodes and can walk stable lists
+ * while new fsyncs go onto the new frontend lists.
+ */
+ spin_lock(&sb->fsync_lock);
+ atomic_inc(&sb->fsync_group);
+ spin_unlock(&sb->fsync_lock);
+
+ back = front;
+ front = idle;
+ idle = &sb->fsync[(fsync_group(sb) + 1) % fsync_wrap];
+
+ trace("fsync flush group %tu, queued = %i, busy = %i",
+ back - sb->fsync, atomic_read(&sb->fsync_pending),
+ atomic_read(&back->busy));
+
+ if (!sb->ioinfo) {
+ tux3_io_init(&ioinfo, REQ_SYNC);
+ sb->ioinfo = &ioinfo;
+ blk_start_plug(&plug);
+ }
+
+ /*
+ * NOTE: this may flush same inode multiple times, and those
+ * blocks are submitted under plugging. So, by reordering,
+ * later requests by tux3_flush_inodes() can be flushed
+ * before former submitted requests. We do page forking, and
+ * don't free until commit, so reorder should not be problem.
+ * But we should remember this surprise.
+ */
+ err = tux3_flush_inodes_list(sb, back_delta, &back->list);
+ if (err) {
+ tux3_warn(sb, "tux3_flush_inodes_list error %i!", -err);
+ goto ouch;
+ }
+ list_splice_init(&back->list, &tux3_sb_ddc(sb, back_delta)->dirty_inodes);
+ atomic_sub(atomic_read(&back->busy), &sb->fsync_pending);
+
+ if (++groups < groups_per_commit && atomic_read(&front->busy)) {
+ trace("fsync merge group %u", fsync_group(sb));
+ continue;
+ }
+
+ commit_fsync(sb, back_delta, &plug);
+ sb->ioinfo = NULL;
+ wake_up_all(&sb->fsync_collide);
+
+ /*
+ * Wake up commit waiters for all groups in this commit.
+ */
+ trace("complete %i groups, %i to %i", groups, start, start + groups -1);
+ for (i = 0; i < groups; i++) {
+ struct syncgroup *done = &sb->fsync[(start + i) % fsync_wrap];
+ complete_all(&done->wait);
+ }
+
+ if (!fsync_pending(sb) || delta_needed(sb) || need_unify(sb))
+ set_bit(TUX3_FSYNC_DRAIN_BIT, &sb->backend_state);
+
+ start = fsync_group(sb);
+ groups = 0;
+ }
+
+ if (fsync_drain(sb) && !fsync_pending(sb))
+ break;
+
+ usleep_range(10, 500);
+ }
+
+ouch:
+ tux3_end_backend();
+ sb->flags &= ~SB_FSYNC_FLUSH_FLAG;
+ resume_transition(sb);
+ trace("leave fsync backend, group = %i", fsync_group(sb));
+ return; /* FIXME: error? */
+}
+
+int tux3_sync_inode(struct sb *sb, struct inode *inode)
+{
+ void tux3_set_bufdelta(struct buffer_head *buffer, int delta);
+ struct tux3_inode *tuxnode = tux_inode(inode);
+ struct inode_delta_dirty *front_dirty, *back_dirty;
+ struct buffer_head *buffer;
+ struct syncgroup *front;
+ unsigned front_delta;
+ int err = 0, start_backend = 0;
+
+ trace("fsync inode %Lu", (long long)tuxnode->inum);
+
+ /*
+ * Prevent new fsyncs from queuing if fsync_backend wants to exit.
+ */
+ if (fsync_drain(sb))
+ wait_event(sb->delta_transition_wq, !fsync_drain(sb));
+
+ /*
+ * Prevent fsync_backend from exiting and delta from changing until
+ * this fsync is queued and flushed.
+ */
+ atomic_inc(&sb->fsync_pending);
+ start_backend = suspend_transition(sb);
+ front_delta = sb->suspended->delta;
+ front_dirty = tux3_inode_ddc(inode, front_delta);
+ back_dirty = tux3_inode_ddc(inode, front_delta - 1);
+ tux3_wait_for_free(sb, front_delta - 1);
+
+ /*
+ * If another fsync is in progress on this inode then wait to
+ * avoid block collisions.
+ */
+ if (tux3_inode_test_and_set_flag(TUX3_INODE_FSYNC_BIT, inode)) {
+ trace("parallel fsync of inode %Lu", (long long)tuxnode->inum);
+ if (start_backend) {
+ queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+ start_backend = 0;
+ }
+ err = wait_event_killable(sb->fsync_collide,
+ !tux3_inode_test_and_set_flag(TUX3_INODE_FSYNC_BIT, inode));
+ if (err) {
+ tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode);
+ atomic_dec(&sb->fsync_pending);
+ goto fail;
+ }
+ }
+
+ /*
+ * We own INODE_FSYNC and the delta backend is not running so
+ * if the inode is dirty here then it will still be dirty when we
+ * move it to the backend dirty list. Otherwise, the inode is
+ * clean and fsync should exit here. We owned INODE_FSYNC for a
+ * short time so there might be tasks waiting on fsync_collide.
+ * Similarly, we might own FSYNC_RUNNING and therefore must start
+ * the fsync backend in case some other task failed to own it and
+ * therefore assumes it is running.
+ */
+ if (!tux3_dirty_flags1(inode, front_delta)) {
+ trace("inode %Lu is already clean", (long long)tuxnode->inum);
+ tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode);
+ atomic_dec(&sb->fsync_pending);
+ if (start_backend)
+ queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+ wake_up_all(&sb->fsync_collide);
+ return 0;
+ }
+
+ /*
+ * Exclude new dirties.
+ * Lock order: i_mutex => truncate_lock
+ */
+ mutex_lock(&inode->i_mutex); /* Exclude most dirty sources */
+ down_write(&tux_inode(inode)->truncate_lock); /* Exclude mmap */
+
+ /*
+ * Force block dirty state to previous delta for each dirty
+ * block so block fork protects block data against modify by
+ * parallel tasks while this task waits for commit.
+ *
+ * This walk should not discover any dirty blocks belonging
+ * to the previous delta due to the above wait for delta
+ * commit.
+ */
+ list_for_each_entry(buffer, &front_dirty->dirty_buffers, b_assoc_buffers) {
+ //assert(tux3_bufsta_get_delta(buffer->b_state) != delta - 1);
+ tux3_set_bufdelta(buffer, front_delta - 1);
+ }
+
+ /*
+ * Move the front end dirty block list to the backend, which
+ * is now empty because the previous delta was completed. Remove
+ * the inode from the frontend dirty list and add it to the front
+ * fsync list. Note: this is not a list move because different
+ * link fields are involved. Later, the inode will be moved to
+ * the backend inode dirty list to be flushed but we cannot put
+ * it there right now because it might clobber the previous fsync
+ * group. Update the inode dirty flags to indicate the inode is
+ * dirty in the back, not the front. The list moves must be
+ * under the spin lock to prevent the back end from bumping
+ * the group counter and proceeding with the commit.
+ */
+ trace("fsync queue inode %Lu to group %u",
+ (long long)tuxnode->inum, fsync_group(sb));
+ spin_lock(&tuxnode->lock);
+ spin_lock(&sb->dirty_inodes_lock);
+ //assert(<inode is not dirty in back>);
+ assert(list_empty(&back_dirty->dirty_buffers));
+ assert(list_empty(&back_dirty->dirty_holes));
+ assert(!list_empty(&front_dirty->dirty_list));
+ list_splice_init(&front_dirty->dirty_buffers, &back_dirty->dirty_buffers);
+ list_splice_init(&front_dirty->dirty_holes, &back_dirty->dirty_holes);
+ list_del_init(&front_dirty->dirty_list);
+ spin_unlock(&sb->dirty_inodes_lock);
+
+ tux3_dirty_switch_to_prev(inode, front_delta);
+ spin_unlock(&tuxnode->lock);
+
+ spin_lock(&sb->fsync_lock);
+ front = &sb->fsync[fsync_group(sb) % fsync_wrap];
+ list_add_tail(&back_dirty->dirty_list, &front->list);
+ atomic_inc(&front->busy); /* detect queue full */
+ assert(sb->current_delta->delta == front_delta); /* last chance to check */
+ spin_unlock(&sb->fsync_lock);
+
+ /*
+ * Allow more dirties during the wait. These will be isolated from
+ * the commit by block forking.
+ */
+ up_write(&tux_inode(inode)->truncate_lock);
+ mutex_unlock(&inode->i_mutex);
+
+ if (start_backend)
+ queue_work(sb->fsync_workqueue, &sb->fsync_work.work);
+
+ wait_for_completion(&front->wait);
+ atomic_dec(&front->busy);
+fail:
+ if (err)
+ tux3_warn(sb, "error %i!!!", err);
+ return err;
+}
diff --git a/fs/tux3/iattr.c b/fs/tux3/iattr.c
index 57a383b..7ac73f5 100644
--- a/fs/tux3/iattr.c
+++ b/fs/tux3/iattr.c
@@ -276,6 +276,8 @@ static int iattr_decode(struct btree *btree, void *data, void *attrs, int size)
}
decode_attrs(inode, attrs, size); // error???
+ tux_inode(inode)->nlink_base = inode->i_nlink;
+
if (tux3_trace)
dump_attrs(inode);
if (tux_inode(inode)->xcache)
diff --git a/fs/tux3/inode.c b/fs/tux3/inode.c
index f747c0e..a10ce38 100644
--- a/fs/tux3/inode.c
+++ b/fs/tux3/inode.c
@@ -922,22 +922,18 @@ void iget_if_dirty(struct inode *inode)
atomic_inc(&inode->i_count);
}
+enum { fsync_fallback = 0 };
+
/* Synchronize changes to a file and directory. */
int tux3_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
struct inode *inode = file->f_mapping->host;
struct sb *sb = tux_sb(inode->i_sb);
- /* FIXME: this is sync(2). We should implement real one */
- static int print_once;
- if (!print_once) {
- print_once++;
- tux3_warn(sb,
- "fsync(2) fall-back to sync(2): %Lx-%Lx, datasync %d",
- start, end, datasync);
- }
+ if (fsync_fallback || S_ISDIR(inode->i_mode))
+ return sync_current_delta(sb);
- return sync_current_delta(sb);
+ return tux3_sync_inode(sb, inode);
}
int tux3_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
diff --git a/fs/tux3/log.c b/fs/tux3/log.c
index bb26c73..a934659 100644
--- a/fs/tux3/log.c
+++ b/fs/tux3/log.c
@@ -83,6 +83,7 @@ unsigned log_size[] = {
[LOG_BNODE_FREE] = 7,
[LOG_ORPHAN_ADD] = 9,
[LOG_ORPHAN_DEL] = 9,
+ [LOG_FSYNC_ORPHAN] = 9,
[LOG_FREEBLOCKS] = 7,
[LOG_UNIFY] = 1,
[LOG_DELTA] = 1,
@@ -470,6 +471,11 @@ void log_bnode_free(struct sb *sb, block_t bnode)
log_u48(sb, LOG_BNODE_FREE, bnode);
}
+void log_fsync_orphan(struct sb *sb, unsigned version, tuxkey_t inum)
+{
+ log_u16_u48(sb, LOG_FSYNC_ORPHAN, version, inum);
+}
+
/*
* Handle inum as orphan inode
* (this is log of frontend operation)
diff --git a/fs/tux3/orphan.c b/fs/tux3/orphan.c
index 68d08e8..3ea2d6a 100644
--- a/fs/tux3/orphan.c
+++ b/fs/tux3/orphan.c
@@ -336,7 +336,30 @@ static int load_orphan_inode(struct sb *sb, inum_t inum, struct list_head *head)
tux3_mark_inode_orphan(tux_inode(inode));
/* List inode up, then caller will decide what to do */
list_add(&tux_inode(inode)->orphan_list, head);
+ return 0;
+}
+int replay_fsync_orphan(struct replay *rp, unsigned version, inum_t inum)
+{
+ struct sb *sb = rp->sb;
+ struct inode *inode = tux3_iget(sb, inum);
+ if (IS_ERR(inode)) {
+ int err = PTR_ERR(inode);
+ return err == -ENOENT ? 0 : err;
+ }
+
+ /*
+ * Multiple fsyncs of new inode can create multiple fsync orphan
+ * log records for the same inode. A later delta may have added a
+ * link.
+ */
+ if (inode->i_nlink != 0 || tux3_inode_is_orphan(tux_inode(inode))) {
+ iput(inode);
+ return 0;
+ }
+
+ tux3_mark_inode_orphan(tux_inode(inode));
+ list_add(&tux_inode(inode)->orphan_list, &rp->orphan_in_otree);
return 0;
}
diff --git a/fs/tux3/replay.c b/fs/tux3/replay.c
index f1f77e8..99361d6 100644
--- a/fs/tux3/replay.c
+++ b/fs/tux3/replay.c
@@ -29,6 +29,7 @@ static const char *const log_name[] = {
X(LOG_BNODE_FREE),
X(LOG_ORPHAN_ADD),
X(LOG_ORPHAN_DEL),
+ X(LOG_FSYNC_ORPHAN),
X(LOG_FREEBLOCKS),
X(LOG_UNIFY),
X(LOG_DELTA),
@@ -117,20 +118,20 @@ static void replay_unpin_logblocks(struct sb *sb, unsigned i, unsigned logcount)
static struct replay *replay_prepare(struct sb *sb)
{
block_t logchain = be64_to_cpu(sb->super.logchain);
- unsigned i, logcount = be32_to_cpu(sb->super.logcount);
+ unsigned i, count = logcount(sb);
struct replay *rp;
struct buffer_head *buffer;
int err;
/* FIXME: this address array is quick hack. Rethink about log
* block management and log block address. */
- rp = alloc_replay(sb, logcount);
+ rp = alloc_replay(sb, count);
if (IS_ERR(rp))
return rp;
/* FIXME: maybe, we should use bufvec to read log blocks */
- trace("load %u logblocks", logcount);
- i = logcount;
+ trace("load %u logblocks", count);
+ i = count;
while (i-- > 0) {
struct logblock *log;
@@ -156,7 +157,7 @@ static struct replay *replay_prepare(struct sb *sb)
error:
free_replay(rp);
- replay_unpin_logblocks(sb, i, logcount);
+ replay_unpin_logblocks(sb, i, count);
return ERR_PTR(err);
}
@@ -169,7 +170,7 @@ static void replay_done(struct replay *rp)
clean_orphan_list(&rp->log_orphan_add); /* for error path */
free_replay(rp);
- sb->logpos.next = be32_to_cpu(sb->super.logcount);
+ sb->logpos.next = logcount(sb);
replay_unpin_logblocks(sb, 0, sb->logpos.next);
log_finish_cycle(sb, 0);
}
@@ -319,6 +320,7 @@ static int replay_log_stage1(struct replay *rp, struct buffer_head *logbuf)
case LOG_BFREE_RELOG:
case LOG_LEAF_REDIRECT:
case LOG_LEAF_FREE:
+ case LOG_FSYNC_ORPHAN:
case LOG_ORPHAN_ADD:
case LOG_ORPHAN_DEL:
case LOG_UNIFY:
@@ -450,6 +452,7 @@ static int replay_log_stage2(struct replay *rp, struct buffer_head *logbuf)
return err;
break;
}
+ case LOG_FSYNC_ORPHAN:
case LOG_ORPHAN_ADD:
case LOG_ORPHAN_DEL:
{
@@ -459,6 +462,9 @@ static int replay_log_stage2(struct replay *rp, struct buffer_head *logbuf)
data = decode48(data, &inum);
trace("%s: version 0x%x, inum 0x%Lx",
log_name[code], version, inum);
+ if (code == LOG_FSYNC_ORPHAN)
+ err = replay_fsync_orphan(rp, version, inum);
+ else
if (code == LOG_ORPHAN_ADD)
err = replay_orphan_add(rp, version, inum);
else
@@ -514,11 +520,11 @@ static int replay_logblocks(struct replay *rp, replay_log_t replay_log_func)
{
struct sb *sb = rp->sb;
struct logpos *logpos = &sb->logpos;
- unsigned logcount = be32_to_cpu(sb->super.logcount);
+ unsigned count = logcount(sb);
int err;
logpos->next = 0;
- while (logpos->next < logcount) {
+ while (logpos->next < count) {
trace("log block %i, blocknr %Lx, unify %Lx",
logpos->next, rp->blocknrs[logpos->next],
rp->unify_index);
diff --git a/fs/tux3/super.c b/fs/tux3/super.c
index b104dc7..0913d26 100644
--- a/fs/tux3/super.c
+++ b/fs/tux3/super.c
@@ -63,6 +63,7 @@ static void tux3_inode_init_always(struct tux3_inode *tuxnode)
tuxnode->xcache = NULL;
tuxnode->generic = 0;
tuxnode->state = 0;
+ tuxnode->nlink_base = 0;
#ifdef __KERNEL__
tuxnode->io = NULL;
#endif
@@ -246,6 +247,9 @@ static void __tux3_put_super(struct sb *sbi)
sbi->idefer_map = NULL;
/* FIXME: add more sanity check */
assert(link_empty(&sbi->forked_buffers));
+
+ if (sbi->fsync_workqueue)
+ destroy_workqueue(sbi->fsync_workqueue);
}
static struct inode *create_internal_inode(struct sb *sbi, inum_t inum,
@@ -384,6 +388,21 @@ static int init_sb(struct sb *sb)
for (i = 0; i < ARRAY_SIZE(sb->s_ddc); i++)
INIT_LIST_HEAD(&sb->s_ddc[i].dirty_inodes);
+ for (i = 0; i < fsync_wrap; i++) {
+ INIT_LIST_HEAD(&sb->fsync[i].list);
+ init_completion(&sb->fsync[i].wait);
+ atomic_set(&sb->fsync[i].busy, 0);
+ }
+
+ if (!(sb->fsync_workqueue = create_workqueue("tux3-work")))
+ return -ENOMEM;
+
+ atomic_set(&sb->fsync_group, 0);
+ atomic_set(&sb->fsync_pending, 0);
+ spin_lock_init(&sb->fsync_lock);
+ init_waitqueue_head(&sb->fsync_collide);
+ INIT_WORK(&sb->fsync_work.work, fsync_backend);
+ sb->fsync_work.sb = sb;
sb->idefer_map = tux3_alloc_idefer_map();
if (!sb->idefer_map)
return -ENOMEM;
@@ -773,7 +792,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
goto error;
}
}
- tux3_dbg("s_blocksize %lu", sb->s_blocksize);
+ tux3_dbg("s_blocksize %lu, sb = %p", sb->s_blocksize, tux_sb(sb));
rp = tux3_init_fs(sbi);
if (IS_ERR(rp)) {
@@ -781,6 +800,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
goto error;
}
+ sb->s_flags |= MS_ACTIVE;
err = replay_stage3(rp, 1);
if (err) {
rp = NULL;
diff --git a/fs/tux3/tux3.h b/fs/tux3/tux3.h
index e2f2d9b..cf4bcc6 100644
--- a/fs/tux3/tux3.h
+++ b/fs/tux3/tux3.h
@@ -252,6 +252,7 @@ enum {
LOG_BNODE_FREE, /* Log of freeing bnode */
LOG_ORPHAN_ADD, /* Log of adding orphan inode */
LOG_ORPHAN_DEL, /* Log of deleting orphan inode */
+ LOG_FSYNC_ORPHAN, /* Log inode fsync with no links */
LOG_FREEBLOCKS, /* Log of freeblocks in bitmap on unify */
LOG_UNIFY, /* Log of marking unify */
LOG_DELTA, /* just for debugging */
@@ -310,6 +311,29 @@ struct tux3_mount_opt {
unsigned int flags;
};
+/* Per fsync group dirty inodes and synchronization */
+struct syncgroup {
+ struct list_head list; /* dirty inodes */
+ struct completion wait; /* commit wait */
+ atomic_t busy; /* fsyncs not completed */
+};
+
+struct fsync_work {
+ struct work_struct work;
+ struct sb *sb;
+};
+
+enum { fsync_wrap = 1 << 4 }; /* Maximum fsync groups in flight */
+
+enum sb_state_bits {
+ TUX3_STATE_TRANSITION_BIT,
+ TUX3_FSYNC_DRAIN_BIT, /* force fsync queue to drain */
+};
+
+enum sb_flag_bits {
+ SB_FSYNC_FLUSH_FLAG = 1 << 0, /* fsync specific actions on flush path */
+};
+
struct tux3_idefer_map;
/* Tux3-specific sb is a handle for the entire volume state */
struct sb {
@@ -321,10 +345,8 @@ struct sb {
struct delta_ref __rcu *current_delta; /* current delta */
struct delta_ref delta_refs[TUX3_MAX_DELTA];
unsigned unify; /* log unify cycle */
-
-#define TUX3_STATE_TRANSITION_BIT 0
unsigned long backend_state; /* delta state */
-
+ unsigned long flags; /* non atomic state */
#ifdef TUX3_FLUSHER_SYNC
struct rw_semaphore delta_lock; /* delta transition exclusive */
#else
@@ -403,7 +425,28 @@ struct sb {
#else
struct super_block vfs_sb; /* Userland superblock */
#endif
-};
+ /*
+ * Fsync and fsync backend
+ */
+ spinlock_t fsync_lock;
+ wait_queue_head_t fsync_collide; /* parallel fsync on same inode */
+ atomic_t fsync_group; /* current fsync group */
+ atomic_t fsync_pending; /* fsyncs started but not yet queued */
+ struct syncgroup fsync[fsync_wrap]; /* fsync commit groups */
+ struct workqueue_struct *fsync_workqueue;
+ struct fsync_work fsync_work;
+ struct delta_ref *suspended;
+ };
+
+static inline int fsync_mode(struct sb *sb)
+{
+ return sb->flags & SB_FSYNC_FLUSH_FLAG;
+}
+
+static inline unsigned logcount(struct sb *sb)
+{
+ return be32_to_cpu(sb->super.logcount);
+}
/* Block segment (physical block extent) info */
#define BLOCK_SEG_HOLE (1 << 0)
@@ -475,6 +518,7 @@ struct tux3_inode {
};
/* Per-delta dirty data for inode */
+ unsigned nlink_base; /* link count on media for fsync */
unsigned state; /* inode dirty state */
unsigned present; /* Attributes decoded from or
* to be encoded to itree */
@@ -553,6 +597,8 @@ static inline struct list_head *tux3_dirty_buffers(struct inode *inode,
enum {
/* Deferred inum allocation, and not stored into itree yet. */
TUX3_I_DEFER_INUM = 0,
+ /* Fsync in progress (protected by i_mutex) */
+ TUX3_INODE_FSYNC_BIT = 1,
/* No per-delta buffers, and no page forking */
TUX3_I_NO_DELTA = 29,
@@ -579,6 +625,11 @@ static inline void tux3_inode_clear_flag(int bit, struct inode *inode)
clear_bit(bit, &tux_inode(inode)->flags);
}
+static inline int tux3_inode_test_and_set_flag(int bit, struct inode *inode)
+{
+ return test_and_set_bit(bit, &tux_inode(inode)->flags);
+}
+
static inline int tux3_inode_test_flag(int bit, struct inode *inode)
{
return test_bit(bit, &tux_inode(inode)->flags);
@@ -723,6 +774,8 @@ static inline block_t bufindex(struct buffer_head *buffer)
/* commit.c */
long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
struct wb_writeback_work *work);
+int tux3_sync_inode(struct sb *sb, struct inode *inode);
+void fsync_backend(struct work_struct *work);
/* dir.c */
extern const struct file_operations tux_dir_fops;
@@ -967,6 +1020,7 @@ void log_bnode_merge(struct sb *sb, block_t src, block_t dst);
void log_bnode_del(struct sb *sb, block_t node, tuxkey_t key, unsigned count);
void log_bnode_adjust(struct sb *sb, block_t node, tuxkey_t from, tuxkey_t to);
void log_bnode_free(struct sb *sb, block_t bnode);
+void log_fsync_orphan(struct sb *sb, unsigned version, tuxkey_t inum);
void log_orphan_add(struct sb *sb, unsigned version, tuxkey_t inum);
void log_orphan_del(struct sb *sb, unsigned version, tuxkey_t inum);
void log_freeblocks(struct sb *sb, block_t freeblocks);
@@ -995,6 +1049,7 @@ void replay_iput_orphan_inodes(struct sb *sb,
struct list_head *orphan_in_otree,
int destroy);
int replay_load_orphan_inodes(struct replay *rp);
+int replay_fsync_orphan(struct replay *rp, unsigned version, inum_t inum);
/* super.c */
struct replay *tux3_init_fs(struct sb *sbi);
@@ -1045,6 +1100,8 @@ static inline void tux3_mark_inode_dirty_sync(struct inode *inode)
__tux3_mark_inode_dirty(inode, I_DIRTY_SYNC);
}
+unsigned tux3_dirty_flags1(struct inode *inode, unsigned delta);
+void tux3_dirty_switch_to_prev(struct inode *inode, unsigned delta);
void tux3_dirty_inode(struct inode *inode, int flags);
void tux3_mark_inode_to_delete(struct inode *inode);
void tux3_iattrdirty(struct inode *inode);
@@ -1058,6 +1115,7 @@ void tux3_mark_inode_orphan(struct tux3_inode *tuxnode);
int tux3_inode_is_orphan(struct tux3_inode *tuxnode);
int tux3_flush_inode_internal(struct inode *inode, unsigned delta, int req_flag);
int tux3_flush_inode(struct inode *inode, unsigned delta, int req_flag);
+int tux3_flush_inodes_list(struct sb *sb, unsigned delta, struct list_head *dirty_inodes);
int tux3_flush_inodes(struct sb *sb, unsigned delta);
int tux3_has_dirty_inodes(struct sb *sb, unsigned delta);
void tux3_clear_dirty_inodes(struct sb *sb, unsigned delta);
diff --git a/fs/tux3/user/libklib/libklib.h b/fs/tux3/user/libklib/libklib.h
index 31daad5..ae9bba6 100644
--- a/fs/tux3/user/libklib/libklib.h
+++ b/fs/tux3/user/libklib/libklib.h
@@ -117,4 +117,7 @@ extern int __build_bug_on_failed;
#define S_IWUGO (S_IWUSR|S_IWGRP|S_IWOTH)
#define S_IXUGO (S_IXUSR|S_IXGRP|S_IXOTH)
+struct work_struct { };
+struct workqueue_struct { };
+
#endif /* !LIBKLIB_H */
diff --git a/fs/tux3/user/super.c b/fs/tux3/user/super.c
index e34a1b4..0743551 100644
--- a/fs/tux3/user/super.c
+++ b/fs/tux3/user/super.c
@@ -15,6 +15,15 @@
#define trace trace_off
#endif
+static struct workqueue_struct *create_workqueue(char *name) {
+ static struct workqueue_struct fakework = { };
+ return &fakework;
+}
+
+static void destroy_workqueue(struct workqueue_struct *wq) { }
+
+#define INIT_WORK(work, fn)
+
#include "../super.c"
struct inode *__alloc_inode(struct super_block *sb)
diff --git a/fs/tux3/writeback.c b/fs/tux3/writeback.c
index fc20635..5c6bcf0 100644
--- a/fs/tux3/writeback.c
+++ b/fs/tux3/writeback.c
@@ -102,6 +102,22 @@ static inline unsigned tux3_dirty_flags(struct inode *inode, unsigned delta)
return ret;
}
+unsigned tux3_dirty_flags1(struct inode *inode, unsigned delta)
+{
+ return (tux_inode(inode)->state >> tux3_dirty_shift(delta)) & I_DIRTY;
+}
+
+static inline unsigned tux3_iattrsta_update(unsigned state, unsigned delta);
+void tux3_dirty_switch_to_prev(struct inode *inode, unsigned delta)
+{
+ struct tux3_inode *tuxnode = tux_inode(inode);
+ unsigned state = tuxnode->state;
+
+ state |= tux3_dirty_mask(tux3_dirty_flags(inode, delta) & I_DIRTY, delta - 1);
+ state &= ~tux3_dirty_mask(I_DIRTY, delta);
+ tuxnode->state = tux3_iattrsta_update(state, delta - 1);
+}
+
/* This is hook of __mark_inode_dirty() and called I_DIRTY_PAGES too */
void tux3_dirty_inode(struct inode *inode, int flags)
{
@@ -226,6 +242,8 @@ static void tux3_clear_dirty_inode_nolock(struct inode *inode, unsigned delta,
/* Update state if inode isn't dirty anymore */
if (!(tuxnode->state & ~NON_DIRTY_FLAGS))
inode->i_state &= ~I_DIRTY;
+
+ tux3_inode_clear_flag(TUX3_INODE_FSYNC_BIT, inode); /* ugly */
}
/* Clear dirty flags for delta */
@@ -502,12 +520,31 @@ int tux3_flush_inode(struct inode *inode, unsigned delta, int req_flag)
dirty = tux3_dirty_flags(inode, delta);
if (dirty & (TUX3_DIRTY_BTREE | I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+ struct tux3_inode *tuxnode = tux_inode(inode);
+ struct sb *sb = tux_sb(inode->i_sb);
/*
* If there is btree root, adjust present after
* tux3_flush_buffers().
*/
tux3_iattr_adjust_for_btree(inode, &idata);
+ if (fsync_mode(sb)) {
+ if (idata.i_nlink != tuxnode->nlink_base) {
+ /*
+ * FIXME: we redirty inode attributes here so next delta
+ * will flush correct nlinks. This means that an fsync
+ * of the same inode before the next delta will flush
+ * it again even it has not been changed.
+ */
+ tux3_iattrdirty_delta(inode, sb->suspended->delta);
+ tux3_mark_inode_dirty_sync(inode);
+ idata.i_nlink = tuxnode->nlink_base;
+ }
+ if (!idata.i_nlink)
+ log_fsync_orphan(sb, sb->version, tuxnode->inum);
+ } else
+ tuxnode->nlink_base = idata.i_nlink;
+
err = tux3_save_inode(inode, &idata, delta);
if (err && !ret)
ret = err;
@@ -569,10 +606,8 @@ static int inode_inum_cmp(void *priv, struct list_head *a, struct list_head *b)
return 0;
}
-int tux3_flush_inodes(struct sb *sb, unsigned delta)
+int tux3_flush_inodes_list(struct sb *sb, unsigned delta, struct list_head *dirty_inodes)
{
- struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
- struct list_head *dirty_inodes = &s_ddc->dirty_inodes;
struct inode_delta_dirty *i_ddc, *safe;
inum_t private;
int err;
@@ -612,6 +647,12 @@ error:
return err;
}
+int tux3_flush_inodes(struct sb *sb, unsigned delta)
+{
+ struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
+ return tux3_flush_inodes_list(sb, delta, &s_ddc->dirty_inodes);
+}
+
int tux3_has_dirty_inodes(struct sb *sb, unsigned delta)
{
struct sb_delta_dirty *s_ddc = tux3_sb_ddc(sb, delta);
@@ -663,3 +704,4 @@ unsigned tux3_check_tuxinode_state(struct inode *inode)
{
return tux_inode(inode)->state & ~NON_DIRTY_FLAGS;
}
+
diff --git a/fs/tux3/writeback_iattrfork.c b/fs/tux3/writeback_iattrfork.c
index 658c012..c50a8c2 100644
--- a/fs/tux3/writeback_iattrfork.c
+++ b/fs/tux3/writeback_iattrfork.c
@@ -54,10 +54,9 @@ static void idata_copy(struct inode *inode, struct tux3_iattr_data *idata)
*
* FIXME: this is better to call tux3_mark_inode_dirty() too?
*/
-void tux3_iattrdirty(struct inode *inode)
+void tux3_iattrdirty_delta(struct inode *inode, unsigned delta)
{
struct tux3_inode *tuxnode = tux_inode(inode);
- unsigned delta = tux3_inode_delta(inode);
unsigned state = tuxnode->state;
/* If dirtied on this delta, nothing to do */
@@ -107,6 +106,11 @@ void tux3_iattrdirty(struct inode *inode)
spin_unlock(&tuxnode->lock);
}
+void tux3_iattrdirty(struct inode *inode)
+{
+ tux3_iattrdirty_delta(inode, tux3_inode_delta(inode));
+}
+
/* Caller must hold tuxnode->lock */
static void tux3_iattr_clear_dirty(struct tux3_inode *tuxnode)
{
Hi Rik,
Our linux-tux3 tree currently currently carries this 652 line diff
against core, to make Tux3 work. This is mainly by Hirofumi, except
the fs-writeback.c hook, which is by me. The main part you may be
interested in is rmap.c, which addresses the issues raised at the
2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
LSFMM: Page forking
http://lwn.net/Articles/548091/
This is just a FYI. An upcoming Tux3 report will be a tour of the page
forking design and implementation. For now, this is just to give a
general sense of what we have done. We heard there are concerns about
how ptrace will work. I really am not familiar with the issue, could
you please explain what you were thinking of there?
Enjoy,
Daniel
[1] Which happened to be a 15 minute bus ride away from me at the time.
diffstat tux3.core.patch
fs/Makefile | 1
fs/fs-writeback.c | 100 +++++++++++++++++++++++++--------
include/linux/fs.h | 6 +
include/linux/mm.h | 5 +
include/linux/pagemap.h | 2
include/linux/rmap.h | 14 ++++
include/linux/writeback.h | 23 +++++++
mm/filemap.c | 82 +++++++++++++++++++++++++++
mm/rmap.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 98 ++++++++++++++++++++------------
10 files changed, 411 insertions(+), 59 deletions(-)
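As a quick illustration of where the new hook plugs in (a sketch with
invented myfs_* helpers; the real user is tux3_writeback() in the
filesystem patch): a filesystem that owns its writeback implements the
new super operation and flushes in its own units, returning roughly how
much it wrote.

static long myfs_writeback(struct super_block *super,
                           struct bdi_writeback *wb,
                           struct wb_writeback_work *work)
{
        /*
         * Flush in filesystem-defined units (for Tux3, a delta) instead
         * of letting generic_writeback_sb_inodes() walk inodes one by
         * one. myfs_flush_units() is a made-up placeholder.
         */
        return myfs_flush_units(super, work->nr_pages, work->sync_mode);
}

static const struct super_operations myfs_super_ops = {
        /* ... other operations ... */
        .writeback      = myfs_writeback,
};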
diff --git a/fs/Makefile b/fs/Makefile
index 91fcfa3..44d7192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/
obj-$(CONFIG_JBD) += jbd/
obj-$(CONFIG_JBD2) += jbd2/
obj-$(CONFIG_TUX3) += tux3/
-obj-$(CONFIG_TUX3_MMAP) += tux3/
obj-$(CONFIG_CRAMFS) += cramfs/
obj-$(CONFIG_SQUASHFS) += squashfs/
obj-y += ramfs/
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..fcd1c61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
*/
#define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10))
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
- long nr_pages;
- struct super_block *sb;
- unsigned long *older_than_this;
- enum writeback_sync_modes sync_mode;
- unsigned int tagged_writepages:1;
- unsigned int for_kupdate:1;
- unsigned int range_cyclic:1;
- unsigned int for_background:1;
- unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
- enum wb_reason reason; /* why was writeback initiated? */
-
- struct list_head list; /* pending work list */
- struct completion *done; /* set if the caller waits */
-};
-
/**
* writeback_in_progress - determine whether there is writeback in progress
* @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
}
/*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+ spin_lock(&bdi->wb.list_lock);
+ spin_lock(&inode->i_lock);
+ if (!(inode->i_state & I_DIRTY))
+ list_del_init(&inode->i_wb_list);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+ spin_lock(&bdi->wb.list_lock);
+ inode->dirtied_when = jiffies;
+ list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+ spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
* Redirty an inode: set its when-it-was dirtied timestamp and move it to the
* furthest end of its superblock's dirty-inode list.
*
@@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
*
* Return the number of pages and/or inodes written.
*/
-static long writeback_sb_inodes(struct super_block *sb,
- struct bdi_writeback *wb,
- struct wb_writeback_work *work)
+static long generic_writeback_sb_inodes(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
struct writeback_control wbc = {
.sync_mode = work->sync_mode,
@@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
return wrote;
}
+static long writeback_sb_inodes(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+ if (sb->s_op->writeback) {
+ long ret;
+
+ spin_unlock(&wb->list_lock);
+ ret = sb->s_op->writeback(sb, wb, work);
+ spin_lock(&wb->list_lock);
+ return ret;
+ }
+
+ return generic_writeback_sb_inodes(sb, wb, work);
+}
+
static long __writeback_inodes_wb(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
@@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb)
}
/**
+ * writeback_queue_work_sb - schedule writeback work from given super_block
+ * @sb: the superblock
+ * @work: work item to queue
+ *
+ * Schedule writeback work on this super_block. This is usually used
+ * to interact with the sb->s_op->writeback callback. The caller must
+ * guarantee that @work is not freed while the bdi flusher is using it
+ * (for example, be safe against umount).
+ */
+void writeback_queue_work_sb(struct super_block *sb,
+ struct wb_writeback_work *work)
+{
+ if (sb->s_bdi == &noop_backing_dev_info)
+ return;
+
+ /* Allow only following fields to use. */
+ *work = (struct wb_writeback_work){
+ .sb = sb,
+ .sync_mode = work->sync_mode,
+ .tagged_writepages = work->tagged_writepages,
+ .done = work->done,
+ .nr_pages = work->nr_pages,
+ .reason = work->reason,
+ };
+ bdi_queue_work(sb->s_bdi, work);
+}
+EXPORT_SYMBOL(writeback_queue_work_sb);
+
+/**
* writeback_inodes_sb_nr - writeback dirty inodes from given super_block
* @sb: the superblock
* @nr: the number of pages to write
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42efe13..29833d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -356,6 +356,8 @@ struct address_space_operations {
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
+ void (*truncatepage)(struct address_space *, struct page *,
+ unsigned int, unsigned int, int);
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
@@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+struct bdi_writeback;
+struct wb_writeback_work;
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
@@ -1599,6 +1603,8 @@ struct super_operations {
int (*drop_inode) (struct inode *);
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
+ long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
+ struct wb_writeback_work *work);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_super) (struct super_block *);
int (*freeze_fs) (struct super_block *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd5ea30..075f59f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
}
/* truncate.c */
+void generic_truncate_partial_page(struct address_space *mapping,
+ struct page *page, unsigned int start,
+ unsigned int len);
+void generic_truncate_full_page(struct address_space *mapping,
+ struct page *page, int wait);
extern void truncate_inode_pages(struct address_space *, loff_t);
extern void truncate_inode_pages_range(struct address_space *,
loff_t lstart, loff_t lend);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f..13b70160 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
extern void delete_from_page_cache(struct page *page);
extern void __delete_from_page_cache(struct page *page, void *shadow);
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
+void cow_delete_from_page_cache(struct page *page);
/*
* Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d9d7e7e..9b67360 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
int page_mkclean(struct page *);
/*
+ * Make clone page for page forking.
+ *
+ * Note: only clones page state so other state such as buffer_heads
+ * must be cloned by caller.
+ */
+struct page *cow_clone_page(struct page *oldpage);
+
+/*
+ * Changes the PTES of shared mappings except the PTE in orig_vma.
+ */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+ struct page *newpage);
+
+/*
* called in munlock()/munmap() path to check for other vmas holding
* the page mlocked.
*/
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0004833..0784b9d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -59,6 +59,25 @@ enum wb_reason {
};
/*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+ long nr_pages;
+ struct super_block *sb;
+ unsigned long *older_than_this;
+ enum writeback_sync_modes sync_mode;
+ unsigned int tagged_writepages:1;
+ unsigned int for_kupdate:1;
+ unsigned int range_cyclic:1;
+ unsigned int for_background:1;
+ unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
+ enum wb_reason reason; /* why was writeback initiated? */
+
+ struct list_head list; /* pending work list */
+ struct completion *done; /* set if the caller waits */
+};
+
+/*
* A control structure which tells the writeback code what to do. These are
* always on the stack, and hence need no locking. They are always initialised
* in a manner such that unspecified fields are set to zero.
@@ -90,6 +109,10 @@ struct writeback_control {
* fs/fs-writeback.c
*/
struct bdi_writeback;
+void inode_writeback_done(struct inode *inode);
+void inode_writeback_touch(struct inode *inode);
+void writeback_queue_work_sb(struct super_block *sb,
+ struct wb_writeback_work *work);
void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
diff --git a/mm/filemap.c b/mm/filemap.c
index 673e458..8c641d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
}
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
+/*
+ * Atomically replace oldpage with newpage.
+ *
+ * Similar to migrate_pages(), but the oldpage is for writeout.
+ */
+int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
+{
+ struct address_space *mapping = oldpage->mapping;
+ void **pslot;
+
+ VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+ VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+
+ /* Get refcount for radix-tree */
+ page_cache_get(newpage);
+
+ /* Replace page in radix tree. */
+ spin_lock_irq(&mapping->tree_lock);
+ /* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */
+ if (PageDirty(oldpage))
+ radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
+ PAGECACHE_TAG_DIRTY);
+ /* The refcount to newpage is used for radix tree. */
+ pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
+ radix_tree_replace_slot(pslot, newpage);
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ spin_unlock_irq(&mapping->tree_lock);
+
+ /* mem_cgroup codes must not be called under tree_lock */
+ mem_cgroup_migrate(oldpage, newpage, true);
+
+ /* Release refcount for radix-tree */
+ page_cache_release(oldpage);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(cow_replace_page_cache);
+
+/*
+ * Delete page from radix-tree, leaving page->mapping unchanged.
+ *
+ * Similar to delete_from_page_cache(), but the deleted page is for writeout.
+ */
+void cow_delete_from_page_cache(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+
+ /* Delete page from radix tree. */
+ spin_lock_irq(&mapping->tree_lock);
+ /*
+ * if we're uptodate, flush out into the cleancache, otherwise
+ * invalidate any existing cleancache entries. We can't leave
+ * stale data around in the cleancache once our page is gone
+ */
+ if (PageUptodate(page) && PageMappedToDisk(page))
+ cleancache_put_page(page);
+ else
+ cleancache_invalidate_page(mapping, page);
+
+ page_cache_tree_delete(mapping, page, NULL);
+#if 0 /* FIXME: backend is assuming page->mapping is available */
+ page->mapping = NULL;
+#endif
+ /* Leave page->index set: truncation lookup relies upon it */
+
+ __dec_zone_page_state(page, NR_FILE_PAGES);
+ BUG_ON(page_mapped(page));
+
+ /*
+	 * The following dirty accounting is done by the writeback
+	 * path, so we don't need to do it here.
+ *
+ * dec_zone_page_state(page, NR_FILE_DIRTY);
+ * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ */
+ spin_unlock_irq(&mapping->tree_lock);
+
+ page_cache_release(page);
+}
+EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
+
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index 71cd5bd..9125246 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -923,6 +923,145 @@ int page_mkclean(struct page *page)
}
EXPORT_SYMBOL_GPL(page_mkclean);
+/*
+ * Make clone page for page forking. (Based on migrate_page_copy())
+ *
+ * Note: only clones page state so other state such as buffer_heads
+ * must be cloned by caller.
+ */
+struct page *cow_clone_page(struct page *oldpage)
+{
+ struct address_space *mapping = oldpage->mapping;
+ gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
+ struct page *newpage = __page_cache_alloc(gfp_mask);
+ int cpupid;
+
+ newpage->mapping = oldpage->mapping;
+ newpage->index = oldpage->index;
+ copy_highpage(newpage, oldpage);
+
+ /* FIXME: right? */
+ BUG_ON(PageSwapCache(oldpage));
+ BUG_ON(PageSwapBacked(oldpage));
+ BUG_ON(PageHuge(oldpage));
+ if (PageError(oldpage))
+ SetPageError(newpage);
+ if (PageReferenced(oldpage))
+ SetPageReferenced(newpage);
+ if (PageUptodate(oldpage))
+ SetPageUptodate(newpage);
+ if (PageActive(oldpage))
+ SetPageActive(newpage);
+ if (PageMappedToDisk(oldpage))
+ SetPageMappedToDisk(newpage);
+
+ /*
+ * Copy NUMA information to the new page, to prevent over-eager
+ * future migrations of this same page.
+ */
+ cpupid = page_cpupid_xchg_last(oldpage, -1);
+ page_cpupid_xchg_last(newpage, cpupid);
+
+ mlock_migrate_page(newpage, oldpage);
+ ksm_migrate_page(newpage, oldpage);
+
+ /* Lock newpage before visible via radix tree */
+ BUG_ON(PageLocked(newpage));
+ __set_page_locked(newpage);
+
+ return newpage;
+}
+EXPORT_SYMBOL_GPL(cow_clone_page);
+
+static int page_cow_one(struct page *oldpage, struct page *newpage,
+ struct vm_area_struct *vma, unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t oldptval, ptval, *pte;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ pte = page_check_address(oldpage, mm, address, &ptl, 1);
+ if (!pte)
+ goto out;
+
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ oldptval = ptep_clear_flush(vma, address, pte);
+
+ /* Take refcount for PTE */
+ page_cache_get(newpage);
+
+ /*
+	 * vm_page_prot doesn't have the writable bit, so another page
+	 * fault occurs immediately after this one returns. That second
+	 * fault is then resolved against the forked page installed here.
+ */
+ ptval = mk_pte(newpage, vma->vm_page_prot);
+#if 0
+ /* FIXME: we should check following too? Otherwise, we would
+ * get additional read-only => write fault at least */
+ if (pte_write)
+ ptval = pte_mkwrite(ptval);
+ if (pte_dirty(oldptval))
+ ptval = pte_mkdirty(ptval);
+ if (pte_young(oldptval))
+ ptval = pte_mkyoung(ptval);
+#endif
+ set_pte_at(mm, address, pte, ptval);
+
+ /* Update rmap accounting */
+ BUG_ON(!PageMlocked(oldpage)); /* Caller should migrate mlock flag */
+ page_remove_rmap(oldpage);
+ page_add_file_rmap(newpage);
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, pte);
+
+ pte_unmap_unlock(pte, ptl);
+
+ mmu_notifier_invalidate_page(mm, address);
+
+ /* Release refcount for PTE */
+ page_cache_release(oldpage);
+out:
+ return ret;
+}
+
+/* Change old page in PTEs to new page exclude orig_vma */
+int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
+ struct page *newpage)
+{
+ struct address_space *mapping = page_mapping(oldpage);
+ pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int ret = 0;
+
+ BUG_ON(!PageLocked(oldpage));
+ BUG_ON(!PageLocked(newpage));
+ BUG_ON(PageAnon(oldpage));
+ BUG_ON(mapping == NULL);
+
+ i_mmap_lock_read(mapping);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ /*
+ * The orig_vma's PTE is handled by caller.
+ * (e.g. ->page_mkwrite)
+ */
+ if (vma == orig_vma)
+ continue;
+
+ if (vma->vm_flags & VM_SHARED) {
+ unsigned long address = vma_address(oldpage, vma);
+ ret += page_cow_one(oldpage, newpage, vma, address);
+ }
+ }
+ i_mmap_unlock_read(mapping);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(page_cow_file);
+
/**
* page_move_anon_rmap - move a page to our anon_vma
* @page: the page to move to our anon_vma
diff --git a/mm/truncate.c b/mm/truncate.c
index f1e4d60..e5b4673 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page)
return invalidate_complete_page(mapping, page);
}
+void generic_truncate_partial_page(struct address_space *mapping,
+ struct page *page, unsigned int start,
+ unsigned int len)
+{
+ wait_on_page_writeback(page);
+ zero_user_segment(page, start, start + len);
+ if (page_has_private(page))
+ do_invalidatepage(page, start, len);
+}
+EXPORT_SYMBOL(generic_truncate_partial_page);
+
+static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
+ unsigned int start, unsigned int len)
+{
+ struct page *page = find_lock_page(mapping, index);
+ if (!page)
+ return;
+
+ if (!mapping->a_ops->truncatepage)
+ generic_truncate_partial_page(mapping, page, start, len);
+ else
+ mapping->a_ops->truncatepage(mapping, page, start, len, 1);
+
+ cleancache_invalidate_page(mapping, page);
+ unlock_page(page);
+ page_cache_release(page);
+}
+
+void generic_truncate_full_page(struct address_space *mapping,
+ struct page *page, int wait)
+{
+ if (wait)
+ wait_on_page_writeback(page);
+ else if (PageWriteback(page))
+ return;
+
+ truncate_inode_page(mapping, page);
+}
+EXPORT_SYMBOL(generic_truncate_full_page);
+
+static void truncate_full_page(struct address_space *mapping, struct page *page,
+ int wait)
+{
+ if (!mapping->a_ops->truncatepage)
+ generic_truncate_full_page(mapping, page, wait);
+ else
+ mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
+ wait);
+}
+
/**
* truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
- if (PageWriteback(page)) {
- unlock_page(page);
- continue;
- }
- truncate_inode_page(mapping, page);
+ truncate_full_page(mapping, page, 0);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
@@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping,
}
if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
- if (page) {
- unsigned int top = PAGE_CACHE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
- }
- wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
- unlock_page(page);
- page_cache_release(page);
- }
- }
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
- if (page) {
- wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
- cleancache_invalidate_page(mapping, page);
- if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
- unlock_page(page);
- page_cache_release(page);
+ unsigned int top = PAGE_CACHE_SIZE;
+ if (start > end) {
+ /* Truncation within a single page */
+ top = partial_end;
+ partial_end = 0;
}
+ truncate_partial_page(mapping, start - 1, partial_start,
+ top - partial_start);
}
+ if (partial_end)
+ truncate_partial_page(mapping, end, 0, partial_end);
+
/*
* If the truncation happened within a single page no pages
* will be released, just zeroed, so we can bail out now.
@@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
lock_page(page);
WARN_ON(page->index != index);
- wait_on_page_writeback(page);
- truncate_inode_page(mapping, page);
+ truncate_full_page(mapping, page, 1);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
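
The helpers above (cow_clone_page(), page_cow_file(), cow_replace_page_cache())
are the heart of page forking. Purely as a reading aid, and assuming one
plausible call order, here is a minimal sketch of how they might compose in a
filesystem's ->page_mkwrite path when a task writes to a page the backend is
writing out. It is not Tux3's actual code, and fs_hand_off_for_writeout() is a
hypothetical stand-in for whatever filesystem-private step keeps oldpage for
the backend.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>

/* Sketch only: error handling and filesystem-private locking omitted.
 * oldpage is the page cache page the task faulted on, locked by the
 * fault path and currently owned by the writeback backend. */
static void example_fork_on_write(struct vm_area_struct *vma,
				  struct page *oldpage)
{
	/* Copy the data and page state; the clone comes back locked. */
	struct page *newpage = cow_clone_page(oldpage);

	/* Swap the clone into the radix tree. oldpage keeps ->mapping
	 * and ->index, so the backend can still complete its writeout. */
	cow_replace_page_cache(oldpage, newpage);

	/* Repoint the PTEs of all other shared mappings at the clone;
	 * the faulting vma's own PTE is installed by the fault return. */
	page_cow_file(vma, oldpage, newpage);

	/* fs_hand_off_for_writeout(oldpage);  hypothetical: the backend
	 * finishes IO against oldpage and then releases it. */

	/* newpage stays locked here; the caller installs it for the
	 * faulting vma and unlocks it when done. */
}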
On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> Hi Rik,
>
> Our linux-tux3 tree currently currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
>
> LSFMM: Page forking
> http://lwn.net/Articles/548091/
>
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?
The issue is that things like ptrace, AIO, infiniband
RDMA, and other direct memory access subsystems can take
a reference to page A, which Tux3 clones into a new page B
when the process writes it.
However, while the process now points at page B, ptrace,
AIO, infiniband, etc will still be pointing at page A.
This causes the process and the other subsystem to each
look at a different page, instead of at shared state,
causing ptrace to do nothing, AIO and RDMA data to be
invisible (or corrupted), etc...
--
All rights reversed
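
To make the divergence concrete, here is an illustration only (not from the
patch) of how a long-lived pin taken with get_user_pages() can end up pointing
at a different struct page than the one the page cache serves after a fork;
the signatures are those of the kernels of that era.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/* Illustration only: a pin on "page A" taken before a fork is not
 * updated when the page cache starts serving "page B". */
static void illustrate_divergence(struct mm_struct *mm, unsigned long addr,
				  struct address_space *mapping, pgoff_t index)
{
	struct page *pinned, *served;

	/* Long-lived pin such as ptrace/AIO/RDMA would take (page A). */
	down_read(&mm->mmap_sem);
	if (get_user_pages(current, mm, addr, 1, 1, 0, &pinned, NULL) != 1) {
		up_read(&mm->mmap_sem);
		return;
	}
	up_read(&mm->mmap_sem);

	/* ... the task writes, the filesystem forks the page ... */

	/* What the page cache serves now (page B after a fork). */
	served = find_get_page(mapping, index);

	WARN_ON(served != pinned);	/* the two views have diverged */

	if (served)
		page_cache_release(served);
	page_cache_release(pinned);	/* drop the pin */
}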
Hi Rik,
Added Mel, Andrea and Peterz to CC as interested parties. There are
probably others, please just jump in.
On 05/14/2015 05:59 AM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
>>
>> LSFMM: Page forking
>> http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
>
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
>
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
>
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
Is this a bit like page migration?
Regards,
Daniel
On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> Hi Rik,
>
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
>
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
>>>
>>> Our linux-tux3 tree currently currently carries this 652 line diff
>>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>>> the fs-writeback.c hook, which is by me. The main part you may be
>>> interested in is rmap.c, which addresses the issues raised at the
>>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
>>>
>>> LSFMM: Page forking
>>> http://lwn.net/Articles/548091/
>>>
>>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>>> forking design and implementation. For now, this is just to give a
>>> general sense of what we have done. We heard there are concerns about
>>> how ptrace will work. I really am not familiar with the issue, could
>>> you please explain what you were thinking of there?
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
>> This causes the process and the other subsystem to each
>> look at a different page, instead of at shared state,
>> causing ptrace to do nothing, AIO and RDMA data to be
>> invisible (or corrupted), etc...
>
> Is this a bit like page migration?
Yes. Page migration will fail if there is an "extra"
reference to the page that is not accounted for by
the migration code.
Only pages that have no extra refcount can be migrated.
Similarly, your cow code needs to fail if there is an
extra reference count pinning the page. As long as
the page has a user that you cannot migrate, you cannot
move any of the other users over. They may rely on data
written by the hidden-to-you user, and the hidden-to-you
user may write to the page when you think it is a read
only stable snapshot.
--
All rights reversed
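
For what it's worth, the extra-reference test described here can be written
down directly. The helper below is hypothetical (it is not in the posted
patch) and its accounting is an assumption: one reference for the page cache,
one held by the caller, one per mapped PTE, and one if buffer heads are
attached; anything beyond that means some hidden user holds the page and the
fork should fall back to waiting for writeback.

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Hypothetical check, by analogy with page migration's refusal of
 * pages with unaccounted references. The caller is assumed to hold
 * a reference and the page lock. */
static bool page_fork_safe(struct page *page)
{
	/* caller's ref + page cache ref + one per PTE + buffer heads */
	int expected = 2 + page_mapcount(page) + page_has_private(page);

	return page_count(page) == expected;
}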
On Thu, May 14, 2015 at 05:06:39PM -0700, Daniel Phillips wrote:
> Hi Rik,
>
> Added Mel, Andrea and Peterz to CC as interested parties. There are
> probably others, please just jump in.
>
> On 05/14/2015 05:59 AM, Rik van Riel wrote:
> > On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >> Hi Rik,
> >>
> >> Our linux-tux3 tree currently currently carries this 652 line diff
> >> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >> the fs-writeback.c hook, which is by me. The main part you may be
> >> interested in is rmap.c, which addresses the issues raised at the
> >> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
> >>
> >> LSFMM: Page forking
> >> http://lwn.net/Articles/548091/
> >>
> >> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >> forking design and implementation. For now, this is just to give a
> >> general sense of what we have done. We heard there are concerns about
> >> how ptrace will work. I really am not familiar with the issue, could
> >> you please explain what you were thinking of there?
> >
> > The issue is that things like ptrace, AIO, infiniband
> > RDMA, and other direct memory access subsystems can take
> > a reference to page A, which Tux3 clones into a new page B
> > when the process writes it.
> >
> > However, while the process now points at page B, ptrace,
> > AIO, infiniband, etc will still be pointing at page A.
> >
> > This causes the process and the other subsystem to each
> > look at a different page, instead of at shared state,
> > causing ptrace to do nothing, AIO and RDMA data to be
> > invisible (or corrupted), etc...
>
> Is this a bit like page migration?
>
No, it's not.
--
Mel Gorman
SUSE Labs
On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> > Hi Rik,
> >
> > Added Mel, Andrea and Peterz to CC as interested parties. There are
> > probably others, please just jump in.
> >
> > On 05/14/2015 05:59 AM, Rik van Riel wrote:
> >> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
> >>> Hi Rik,
> >>>
> >>> Our linux-tux3 tree currently currently carries this 652 line diff
> >>> against core, to make Tux3 work. This is mainly by Hirofumi, except
> >>> the fs-writeback.c hook, which is by me. The main part you may be
> >>> interested in is rmap.c, which addresses the issues raised at the
> >>> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
> >>>
> >>> LSFMM: Page forking
> >>> http://lwn.net/Articles/548091/
> >>>
> >>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> >>> forking design and implementation. For now, this is just to give a
> >>> general sense of what we have done. We heard there are concerns about
> >>> how ptrace will work. I really am not familiar with the issue, could
> >>> you please explain what you were thinking of there?
> >>
> >> The issue is that things like ptrace, AIO, infiniband
> >> RDMA, and other direct memory access subsystems can take
> >> a reference to page A, which Tux3 clones into a new page B
> >> when the process writes it.
> >>
> >> However, while the process now points at page B, ptrace,
> >> AIO, infiniband, etc will still be pointing at page A.
> >>
> >> This causes the process and the other subsystem to each
> >> look at a different page, instead of at shared state,
> >> causing ptrace to do nothing, AIO and RDMA data to be
> >> invisible (or corrupted), etc...
> >
> > Is this a bit like page migration?
>
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
>
When I said it's not like page migration, I was referring to the fact
that a COW on a pinned page for RDMA is a different problem to page
migration. The COW of a pinned page can lead to lost writes or
corruption depending on the ordering of events. Page migration fails
when there are unexpected problems to avoid this class of issue which is
fine for page migration but may be a critical failure in a filesystem
depending on exactly why the copy is required.
--
Mel Gorman
SUSE Labs
On 05/14/2015 08:06 PM, Rik van Riel wrote:
> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>> This causes the process and the other subsystem to each
>>> look at a different page, instead of at shared state,
>>> causing ptrace to do nothing, AIO and RDMA data to be
>>> invisible (or corrupted), etc...
>>
>> Is this a bit like page migration?
>
> Yes. Page migration will fail if there is an "extra"
> reference to the page that is not accounted for by
> the migration code.
>
> Only pages that have no extra refcount can be migrated.
>
> Similarly, your cow code needs to fail if there is an
> extra reference count pinning the page. As long as
> the page has a user that you cannot migrate, you cannot
> move any of the other users over. They may rely on data
> written by the hidden-to-you user, and the hidden-to-you
> user may write to the page when you think it is a read
> only stable snapshot.
Please bear with me as I study these cases one by one.
First one is ptrace. Only for executable files, right?
Maybe we don't need to fork pages in executable files.
Uprobes... If somebody puts a breakpoint in a page and
we fork it, the replacement page has a copy of the
breakpoint, and all the code on the page. Did anything
break?
Note: we have the option of being cowardly and just not
doing page forking for mmapped files, or certain kinds
of mmapped files, etc. But first we should give it the
old college try, to see if absolute perfection is
possible and practical.
Regards,
Daniel
On 05/15/2015 01:09 AM, Mel Gorman wrote:
> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>> The issue is that things like ptrace, AIO, infiniband
>>>> RDMA, and other direct memory access subsystems can take
>>>> a reference to page A, which Tux3 clones into a new page B
>>>> when the process writes it.
>>>>
>>>> However, while the process now points at page B, ptrace,
>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>
>>>> This causes the process and the other subsystem to each
>>>> look at a different page, instead of at shared state,
>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>> invisible (or corrupted), etc...
>>>
>>> Is this a bit like page migration?
>>
>> Yes. Page migration will fail if there is an "extra"
>> reference to the page that is not accounted for by
>> the migration code.
>
> When I said it's not like page migration, I was referring to the fact
> that a COW on a pinned page for RDMA is a different problem to page
> migration. The COW of a pinned page can lead to lost writes or
> corruption depending on the ordering of events.
I see the lost writes case, but not the corruption case. Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?
If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.
> Page migration fails
> when there are unexpected problems to avoid this class of issue which is
> fine for page migration but may be a critical failure in a filesystem
> depending on exactly why the copy is required.
Regards,
Daniel
On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
>
>
> On 05/15/2015 01:09 AM, Mel Gorman wrote:
> > On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>> The issue is that things like ptrace, AIO, infiniband
> >>>> RDMA, and other direct memory access subsystems can take
> >>>> a reference to page A, which Tux3 clones into a new page B
> >>>> when the process writes it.
> >>>>
> >>>> However, while the process now points at page B, ptrace,
> >>>> AIO, infiniband, etc will still be pointing at page A.
> >>>>
> >>>> This causes the process and the other subsystem to each
> >>>> look at a different page, instead of at shared state,
> >>>> causing ptrace to do nothing, AIO and RDMA data to be
> >>>> invisible (or corrupted), etc...
> >>>
> >>> Is this a bit like page migration?
> >>
> >> Yes. Page migration will fail if there is an "extra"
> >> reference to the page that is not accounted for by
> >> the migration code.
> >
> > When I said it's not like page migration, I was referring to the fact
> > that a COW on a pinned page for RDMA is a different problem to page
> > migration. The COW of a pinned page can lead to lost writes or
> > corruption depending on the ordering of events.
>
> I see the lost writes case, but not the corruption case,
Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and the fork is combined with writes from another thread, then
when the IO completes the reads may not be visible. The application may take
improper action at that point.
Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.
You can choose to not define this as data corruption because the kernel
is not directly involved and that's your call.
> Do you
> mean corruption by changing a page already in writeout? If so,
> don't all filesystems have that problem?
>
No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.
> If RDMA to a mmapped file races with write(2) to the same file,
> maybe it is reasonable and expected to lose some data.
>
In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.
--
Mel Gorman
SUSE Labs
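
For reference, the non-forking behaviour described above is essentially what
a generic ->page_mkwrite does today. The sketch below is loosely modeled on
the in-tree filemap_page_mkwrite() and is only an illustration, not a
proposal for Tux3.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Non-forking reference behaviour: redirty the page so a later
 * writeback picks up the new contents, and block only if the backing
 * device demands stable pages during IO. */
static int sketch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);

	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;		/* raced with truncate */
	}

	set_page_dirty(page);		/* data is not lost, it is rewritten */
	wait_for_stable_page(page);	/* no-op unless stable pages required */

	return VM_FAULT_LOCKED;		/* page returned locked and dirty */
}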
On Fri, 15 May 2015, Mel Gorman wrote:
> On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
>>
>>
>> On 05/15/2015 01:09 AM, Mel Gorman wrote:
>>> On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
>>>> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
>>>>>> The issue is that things like ptrace, AIO, infiniband
>>>>>> RDMA, and other direct memory access subsystems can take
>>>>>> a reference to page A, which Tux3 clones into a new page B
>>>>>> when the process writes it.
>>>>>>
>>>>>> However, while the process now points at page B, ptrace,
>>>>>> AIO, infiniband, etc will still be pointing at page A.
>>>>>>
>>>>>> This causes the process and the other subsystem to each
>>>>>> look at a different page, instead of at shared state,
>>>>>> causing ptrace to do nothing, AIO and RDMA data to be
>>>>>> invisible (or corrupted), etc...
>>>>>
>>>>> Is this a bit like page migration?
>>>>
>>>> Yes. Page migration will fail if there is an "extra"
>>>> reference to the page that is not accounted for by
>>>> the migration code.
>>>
>>> When I said it's not like page migration, I was referring to the fact
>>> that a COW on a pinned page for RDMA is a different problem to page
>>> migration. The COW of a pinned page can lead to lost writes or
>>> corruption depending on the ordering of events.
>>
>> I see the lost writes case, but not the corruption case,
>
> Data corruption can occur depending on the ordering of events and the
> applications expectations. If a process starts IO, RDMA pins the page
> for read and forks are combined with writes from another thread then when
> the IO completes the reads may not be visible. The application may take
> improper action at that point.
If tux3 forks the page and writes the copy while the original page is being
modified by other things, some of the changes won't be in the version
written (and this could catch partial writes with 'interesting' results
if the forking happens at the wrong time).
But if the original page gets re-marked as needing to be written out when it's
changed by one of the other things that are accessing it, there shouldn't be any
long-term corruption.
As far as short-term corruption goes, any time you have a page mmapped it could
get written out at any time, with only some of the application changes applied
to it, so this sort of corruption could happen anyway, couldn't it?
> Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> class of problem.
>
> You can choose to not define this as data corruption because thge kernel
> is not directly involved and that's your call.
>
>> Do you
>> mean corruption by changing a page already in writeout? If so,
>> don't all filesystems have that problem?
>>
>
> No, the problem is different. Backing devices requiring stable pages will
> block the write until the IO is complete. For those that do not require
> stable pages it's ok to allow the write as long as the page is dirtied so
> that it'll be written out again and no data is lost.
So if tux3 is prevented from forking the page in cases where the write would
be blocked, and otherwise the page just gets forked again for follow-up
writes when it is modified again, won't this amount to the same thing?
David Lang
>> If RDMA to a mmapped file races with write(2) to the same file,
>> maybe it is reasonable and expected to lose some data.
>>
>
> In the RDMA case, there is at least application awareness to work around
> the problems. Normally it's ok to have both mapped and write() access
> to data although userspace might need a lock to co-ordinate updates and
> event ordering.
>
>
On 05/14/2015 03:59 PM, Rik van Riel wrote:
> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>> Hi Rik,
<>
>
> The issue is that things like ptrace, AIO, infiniband
> RDMA, and other direct memory access subsystems can take
> a reference to page A, which Tux3 clones into a new page B
> when the process writes it.
>
> However, while the process now points at page B, ptrace,
> AIO, infiniband, etc will still be pointing at page A.
>
All these problems can also happen with truncate+new-extending-write.
It is the responsibility of the application to take file/range locks
to prevent these page-pinned problems.
> This causes the process and the other subsystem to each
> look at a different page, instead of at shared state,
> causing ptrace to do nothing, AIO and RDMA data to be
> invisible (or corrupted), etc...
>
Again, these problems already exist. Consider each in-place write as a
truncate (punch hole) + new write; is that not the same?
Cheers
Boaz
On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>> Hi Rik,
> <>
>>
>> The issue is that things like ptrace, AIO, infiniband
>> RDMA, and other direct memory access subsystems can take
>> a reference to page A, which Tux3 clones into a new page B
>> when the process writes it.
>>
>> However, while the process now points at page B, ptrace,
>> AIO, infiniband, etc will still be pointing at page A.
>>
>
> All these problems can also happen with truncate+new-extending-write
>
> It is the responsibility of the application to take file/range locks
> to prevent these page-pinned problems.
It is unreasonable to expect a process that is being ptraced
(potentially without its knowledge) to take special measures
to protect the ptraced memory from disappearing.
It is impossible for the debugger to take those special measures
for anonymous memory, or unlinked inodes.
I don't think your requirement is workable or reasonable.
--
All rights reversed
On 05/18/2015 05:20 AM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> On 05/14/2015 04:26 AM, Daniel Phillips wrote:
>>>> Hi Rik,
>> <>
>>>
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>>
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
>
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
If the memory disappears, that's a bug. No, the memory is still there;
it just does not reflect the latest content of the fs-file.
>
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
>
Why? One line of added code after the open and before the mmap: do an flock.
> I don't think your requirement is workable or reasonable.
>
Therefore it is unreasonable to write/modify a ptraced process's file.
Again, what I'm saying is that COWing a page on write has the same effect
as truncate+write. They are both allowed and both might give you the same
"stale" effect. So the precedent is there. We are not introducing a new
anomaly, just introducing a new instance of it. I guess the question
is what applications/procedures are going to break. Need lots of testing
and real-life installations to answer that, I guess.
Thanks
Boaz
On Sat, May 16, 2015 at 03:38:04PM -0700, David Lang wrote:
> On Fri, 15 May 2015, Mel Gorman wrote:
>
> >On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> >>
> >>
> >>On 05/15/2015 01:09 AM, Mel Gorman wrote:
> >>>On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >>>>On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>>>>The issue is that things like ptrace, AIO, infiniband
> >>>>>>RDMA, and other direct memory access subsystems can take
> >>>>>>a reference to page A, which Tux3 clones into a new page B
> >>>>>>when the process writes it.
> >>>>>>
> >>>>>>However, while the process now points at page B, ptrace,
> >>>>>>AIO, infiniband, etc will still be pointing at page A.
> >>>>>>
> >>>>>>This causes the process and the other subsystem to each
> >>>>>>look at a different page, instead of at shared state,
> >>>>>>causing ptrace to do nothing, AIO and RDMA data to be
> >>>>>>invisible (or corrupted), etc...
> >>>>>
> >>>>>Is this a bit like page migration?
> >>>>
> >>>>Yes. Page migration will fail if there is an "extra"
> >>>>reference to the page that is not accounted for by
> >>>>the migration code.
> >>>
> >>>When I said it's not like page migration, I was referring to the fact
> >>>that a COW on a pinned page for RDMA is a different problem to page
> >>>migration. The COW of a pinned page can lead to lost writes or
> >>>corruption depending on the ordering of events.
> >>
> >>I see the lost writes case, but not the corruption case,
> >
> >Data corruption can occur depending on the ordering of events and the
> >applications expectations. If a process starts IO, RDMA pins the page
> >for read and forks are combined with writes from another thread then when
> >the IO completes the reads may not be visible. The application may take
> >improper action at that point.
>
> if tux3 forks the page and writes the copy while the original page
> is being modified by other things, this means that some of the
> changes won't be in the version written (and this could catch
> partial writes with 'interesting' results if the forking happens at
> the wrong time)
>
Potentially yes. There is likely to be some elevated memory usage but I
imagine that can be controlled.
> But if the original page gets re-marked as needing to be written out
> when it's changed by one of the other things that are accessing it,
> there shouldn't be any long-term corruption.
>
> As far as short-term corruption goes, any time you have a page
> mmapped it could get written out at any time, with only some of the
> application changes applied to it, so this sort of corruption could
> happen anyway couldn't it?
>
That becomes the responsibility of the application. It's up to it to sync
appropriately when it knows updates are complete.
> >Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
> >class of problem.
> >
> >You can choose to not define this as data corruption because thge kernel
> >is not directly involved and that's your call.
> >
> >>Do you
> >>mean corruption by changing a page already in writeout? If so,
> >>don't all filesystems have that problem?
> >>
> >
> >No, the problem is different. Backing devices requiring stable pages will
> >block the write until the IO is complete. For those that do not require
> >stable pages it's ok to allow the write as long as the page is dirtied so
> >that it'll be written out again and no data is lost.
>
> so if tux3 is prevented from forking the page in cases where the
> write would be blocked, and will get forked again for follow-up
> writes if it's modified again otherwise, won't this be the same
> thing?
>
Functionally and from a correctness point of view, it *might* be
equivalent. It depends on the implementation and the page life cycle,
particularly the details of how the writeback and dirty state are coordinated
between the user-visible pages and the page being written back. I've read
none of the code or background so I cannot answer whether it's really
equivalent or not. Just be aware that it's not the same problem as page
migration and that it's not the same as how writeback and dirty state is
handled today.
--
Mel Gorman
SUSE Labs
On 05/17/2015 07:20 PM, Rik van Riel wrote:
> On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
>> On 05/14/2015 03:59 PM, Rik van Riel wrote:
>>> The issue is that things like ptrace, AIO, infiniband
>>> RDMA, and other direct memory access subsystems can take
>>> a reference to page A, which Tux3 clones into a new page B
>>> when the process writes it.
>>>
>>> However, while the process now points at page B, ptrace,
>>> AIO, infiniband, etc will still be pointing at page A.
>>
>> All these problems can also happen with truncate+new-extending-write
>>
>> It is the responsibility of the application to take file/range locks
>> to prevent these page-pinned problems.
>
> It is unreasonable to expect a process that is being ptraced
> (potentially without its knowledge) to take special measures
> to protect the ptraced memory from disappearing.
>
> It is impossible for the debugger to take those special measures
> for anonymous memory, or unlinked inodes.
>
> I don't think your requirement is workable or reasonable.
Hi Rik,
You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive into ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment; I have only heard that it
is a horrible hack we inherited from some place far away and a time
long ago.
A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?
Regards,
Daniel
On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
> Hi Rik,
>
> Our linux-tux3 tree currently currently carries this 652 line diff
> against core, to make Tux3 work. This is mainly by Hirofumi, except
> the fs-writeback.c hook, which is by me. The main part you may be
> interested in is rmap.c, which addresses the issues raised at the
> 2013 Linux Storage Filesystem and MM Summit 2015 in San Francisco.[1]
>
> LSFMM: Page forking
> http://lwn.net/Articles/548091/
>
> This is just a FYI. An upcoming Tux3 report will be a tour of the page
> forking design and implementation. For now, this is just to give a
> general sense of what we have done. We heard there are concerns about
> how ptrace will work. I really am not familiar with the issue, could
> you please explain what you were thinking of there?
So here are a few things I find problematic about page forking (besides
the cases with elevated page_count already discussed in this thread - there
I believe that anything more complex than "wait for the IO instead of
forking when page has elevated use count" isn't going to work. There are
too many users depending on too subtle details of the behavior...). Some
of them are actually mentioned in the above LWN article:
When you create a copy of a page and replace it in the radix tree, nobody
in mm subsystem is aware that oldpage may be under writeback. That causes
interesting issues:
* truncate_inode_pages() can finish before all IO for the file is finished.
So far filesystems rely on the fact that once truncate_inode_pages()
finishes, all running IO against the file is completed and new IO cannot be
submitted.
* Writeback can come and try to write newpage while oldpage is still under
IO. Then you'll have two IOs against one block which has undefined
results.
* filemap_fdatawait() called from fsync() has the additional problem that it
is not aware of oldpage and thus may return although IO hasn't finished yet
(see the sketch below).
I understand that Tux3 may avoid these issues due to some other mechanisms
it internally has, but if page forking is to get into the mm subsystem, the
above must work.
Honza
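
To illustrate that third point, here is a simplified rendering (editorial,
with range limits and error handling stripped) of what filemap_fdatawait()
boils down to in mainline: it only finds pages that are still in the mapping's
radix tree and tagged as under writeback, so a forked oldpage that has been
replaced there is never waited on.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>

/* Simplified shape of filemap_fdatawait_range(): wait on every page
 * the radix tree still knows about and has tagged as under writeback. */
static void sketch_fdatawait(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t index = 0;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup_tag(&pvec, mapping, &index,
				  PAGECACHE_TAG_WRITEBACK, PAGEVEC_SIZE)) {
		int i;

		for (i = 0; i < pagevec_count(&pvec); i++)
			wait_on_page_writeback(pvec.pages[i]);

		pagevec_release(&pvec);
		cond_resched();
	}
	/* A forked oldpage no longer in the radix tree is invisible here,
	 * so fsync() can return while its IO is still in flight. */
}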
> diffstat tux3.core.patch
> fs/Makefile | 1
> fs/fs-writeback.c | 100 +++++++++++++++++++++++++--------
> include/linux/fs.h | 6 +
> include/linux/mm.h | 5 +
> include/linux/pagemap.h | 2
> include/linux/rmap.h | 14 ++++
> include/linux/writeback.h | 23 +++++++
> mm/filemap.c | 82 +++++++++++++++++++++++++++
> mm/rmap.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++
> mm/truncate.c | 98 ++++++++++++++++++++------------
> 10 files changed, 411 insertions(+), 59 deletions(-)
>
> diff --git a/fs/Makefile b/fs/Makefile
> index 91fcfa3..44d7192 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -70,7 +70,6 @@ obj-$(CONFIG_EXT4_FS) += ext4/
> obj-$(CONFIG_JBD) += jbd/
> obj-$(CONFIG_JBD2) += jbd2/
> obj-$(CONFIG_TUX3) += tux3/
> -obj-$(CONFIG_TUX3_MMAP) += tux3/
> obj-$(CONFIG_CRAMFS) += cramfs/
> obj-$(CONFIG_SQUASHFS) += squashfs/
> obj-y += ramfs/
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..fcd1c61 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -34,25 +34,6 @@
> */
> #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10))
>
> -/*
> - * Passed into wb_writeback(), essentially a subset of writeback_control
> - */
> -struct wb_writeback_work {
> - long nr_pages;
> - struct super_block *sb;
> - unsigned long *older_than_this;
> - enum writeback_sync_modes sync_mode;
> - unsigned int tagged_writepages:1;
> - unsigned int for_kupdate:1;
> - unsigned int range_cyclic:1;
> - unsigned int for_background:1;
> - unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> - enum wb_reason reason; /* why was writeback initiated? */
> -
> - struct list_head list; /* pending work list */
> - struct completion *done; /* set if the caller waits */
> -};
> -
> /**
> * writeback_in_progress - determine whether there is writeback in progress
> * @bdi: the device's backing_dev_info structure.
> @@ -192,6 +173,36 @@ void inode_wb_list_del(struct inode *inode)
> }
>
> /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> + spin_lock(&bdi->wb.list_lock);
> + spin_lock(&inode->i_lock);
> + if (!(inode->i_state & I_DIRTY))
> + list_del_init(&inode->i_wb_list);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> + spin_lock(&bdi->wb.list_lock);
> + inode->dirtied_when = jiffies;
> + list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> + spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> +
> +/*
> * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> * furthest end of its superblock's dirty-inode list.
> *
> @@ -610,9 +621,9 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
> *
> * Return the number of pages and/or inodes written.
> */
> -static long writeback_sb_inodes(struct super_block *sb,
> - struct bdi_writeback *wb,
> - struct wb_writeback_work *work)
> +static long generic_writeback_sb_inodes(struct super_block *sb,
> + struct bdi_writeback *wb,
> + struct wb_writeback_work *work)
> {
> struct writeback_control wbc = {
> .sync_mode = work->sync_mode,
> @@ -727,6 +738,22 @@ static long writeback_sb_inodes(struct super_block *sb,
> return wrote;
> }
>
> +static long writeback_sb_inodes(struct super_block *sb,
> + struct bdi_writeback *wb,
> + struct wb_writeback_work *work)
> +{
> + if (sb->s_op->writeback) {
> + long ret;
> +
> + spin_unlock(&wb->list_lock);
> + ret = sb->s_op->writeback(sb, wb, work);
> + spin_lock(&wb->list_lock);
> + return ret;
> + }
> +
> + return generic_writeback_sb_inodes(sb, wb, work);
> +}
> +
> static long __writeback_inodes_wb(struct bdi_writeback *wb,
> struct wb_writeback_work *work)
> {
> @@ -1293,6 +1320,35 @@ static void wait_sb_inodes(struct super_block *sb)
> }
>
> /**
> + * writeback_queue_work_sb - schedule writeback work from given super_block
> + * @sb: the superblock
> + * @work: work item to queue
> + *
> + * Schedule writeback work on this super_block. This usually used to
> + * interact with sb->s_op->writeback callback. The caller must
> + * guarantee to @work is not freed while bdi flusher is using (for
> + * example, be safe against umount).
> + */
> +void writeback_queue_work_sb(struct super_block *sb,
> + struct wb_writeback_work *work)
> +{
> + if (sb->s_bdi == &noop_backing_dev_info)
> + return;
> +
> + /* Allow only following fields to use. */
> + *work = (struct wb_writeback_work){
> + .sb = sb,
> + .sync_mode = work->sync_mode,
> + .tagged_writepages = work->tagged_writepages,
> + .done = work->done,
> + .nr_pages = work->nr_pages,
> + .reason = work->reason,
> + };
> + bdi_queue_work(sb->s_bdi, work);
> +}
> +EXPORT_SYMBOL(writeback_queue_work_sb);
> +
> +/**
> * writeback_inodes_sb_nr - writeback dirty inodes from given super_block
> * @sb: the superblock
> * @nr: the number of pages to write
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 42efe13..29833d2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -356,6 +356,8 @@ struct address_space_operations {
>
> /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
> sector_t (*bmap)(struct address_space *, sector_t);
> + void (*truncatepage)(struct address_space *, struct page *,
> + unsigned int, unsigned int, int);
> void (*invalidatepage) (struct page *, unsigned int, unsigned int);
> int (*releasepage) (struct page *, gfp_t);
> void (*freepage)(struct page *);
> @@ -1590,6 +1592,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
> unsigned long, loff_t *);
>
> +struct bdi_writeback;
> +struct wb_writeback_work;
> struct super_operations {
> struct inode *(*alloc_inode)(struct super_block *sb);
> void (*destroy_inode)(struct inode *);
> @@ -1599,6 +1603,8 @@ struct super_operations {
> int (*drop_inode) (struct inode *);
> void (*evict_inode) (struct inode *);
> void (*put_super) (struct super_block *);
> + long (*writeback)(struct super_block *super, struct bdi_writeback *wb,
> + struct wb_writeback_work *work);
> int (*sync_fs)(struct super_block *sb, int wait);
> int (*freeze_super) (struct super_block *);
> int (*freeze_fs) (struct super_block *);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dd5ea30..075f59f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1909,6 +1909,11 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
> }
>
> /* truncate.c */
> +void generic_truncate_partial_page(struct address_space *mapping,
> + struct page *page, unsigned int start,
> + unsigned int len);
> +void generic_truncate_full_page(struct address_space *mapping,
> + struct page *page, int wait);
> extern void truncate_inode_pages(struct address_space *, loff_t);
> extern void truncate_inode_pages_range(struct address_space *,
> loff_t lstart, loff_t lend);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b3736f..13b70160 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -653,6 +653,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
> extern void delete_from_page_cache(struct page *page);
> extern void __delete_from_page_cache(struct page *page, void *shadow);
> int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage);
> +void cow_delete_from_page_cache(struct page *page);
>
> /*
> * Like add_to_page_cache_locked, but used to add newly allocated pages:
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d9d7e7e..9b67360 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -228,6 +228,20 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
> int page_mkclean(struct page *);
>
> /*
> + * Make clone page for page forking.
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage);
> +
> +/*
> + * Changes the PTES of shared mappings except the PTE in orig_vma.
> + */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> + struct page *newpage);
> +
> +/*
> * called in munlock()/munmap() path to check for other vmas holding
> * the page mlocked.
> */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 0004833..0784b9d 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -59,6 +59,25 @@ enum wb_reason {
> };
>
> /*
> + * Passed into wb_writeback(), essentially a subset of writeback_control
> + */
> +struct wb_writeback_work {
> + long nr_pages;
> + struct super_block *sb;
> + unsigned long *older_than_this;
> + enum writeback_sync_modes sync_mode;
> + unsigned int tagged_writepages:1;
> + unsigned int for_kupdate:1;
> + unsigned int range_cyclic:1;
> + unsigned int for_background:1;
> + unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> + enum wb_reason reason; /* why was writeback initiated? */
> +
> + struct list_head list; /* pending work list */
> + struct completion *done; /* set if the caller waits */
> +};
> +
> +/*
> * A control structure which tells the writeback code what to do. These are
> * always on the stack, and hence need no locking. They are always initialised
> * in a manner such that unspecified fields are set to zero.
> @@ -90,6 +109,10 @@ struct writeback_control {
> * fs/fs-writeback.c
> */
> struct bdi_writeback;
> +void inode_writeback_done(struct inode *inode);
> +void inode_writeback_touch(struct inode *inode);
> +void writeback_queue_work_sb(struct super_block *sb,
> + struct wb_writeback_work *work);
> void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
> void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
> enum wb_reason reason);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 673e458..8c641d0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -639,6 +639,88 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
> }
> EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
>
> +/*
> + * Atomically replace oldpage with newpage.
> + *
> + * Similar to migrate_pages(), but the oldpage is for writeout.
> + */
> +int cow_replace_page_cache(struct page *oldpage, struct page *newpage)
> +{
> + struct address_space *mapping = oldpage->mapping;
> + void **pslot;
> +
> + VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
> + VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> +
> + /* Get refcount for radix-tree */
> + page_cache_get(newpage);
> +
> + /* Replace page in radix tree. */
> + spin_lock_irq(&mapping->tree_lock);
> + /* PAGECACHE_TAG_DIRTY represents the view of frontend. Clear it. */
> + if (PageDirty(oldpage))
> + radix_tree_tag_clear(&mapping->page_tree, page_index(oldpage),
> + PAGECACHE_TAG_DIRTY);
> + /* The refcount to newpage is used for radix tree. */
> + pslot = radix_tree_lookup_slot(&mapping->page_tree, oldpage->index);
> + radix_tree_replace_slot(pslot, newpage);
> + __inc_zone_page_state(newpage, NR_FILE_PAGES);
> + __dec_zone_page_state(oldpage, NR_FILE_PAGES);
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + /* mem_cgroup codes must not be called under tree_lock */
> + mem_cgroup_migrate(oldpage, newpage, true);
> +
> + /* Release refcount for radix-tree */
> + page_cache_release(oldpage);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(cow_replace_page_cache);
> +
> +/*
> + * Delete page from radix-tree, leaving page->mapping unchanged.
> + *
> + * Similar to delete_from_page_cache(), but the deleted page is for writeout.
> + */
> +void cow_delete_from_page_cache(struct page *page)
> +{
> + struct address_space *mapping = page->mapping;
> +
> + /* Delete page from radix tree. */
> + spin_lock_irq(&mapping->tree_lock);
> + /*
> + * if we're uptodate, flush out into the cleancache, otherwise
> + * invalidate any existing cleancache entries. We can't leave
> + * stale data around in the cleancache once our page is gone
> + */
> + if (PageUptodate(page) && PageMappedToDisk(page))
> + cleancache_put_page(page);
> + else
> + cleancache_invalidate_page(mapping, page);
> +
> + page_cache_tree_delete(mapping, page, NULL);
> +#if 0 /* FIXME: backend is assuming page->mapping is available */
> + page->mapping = NULL;
> +#endif
> + /* Leave page->index set: truncation lookup relies upon it */
> +
> + __dec_zone_page_state(page, NR_FILE_PAGES);
> + BUG_ON(page_mapped(page));
> +
> + /*
> + * The following dirty accounting is done by writeback
> + * path. So, we don't need to do here.
> + *
> + * dec_zone_page_state(page, NR_FILE_DIRTY);
> + * dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> + */
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + page_cache_release(page);
> +}
> +EXPORT_SYMBOL_GPL(cow_delete_from_page_cache);
> +
> #ifdef CONFIG_NUMA
> struct page *__page_cache_alloc(gfp_t gfp)
> {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 71cd5bd..9125246 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -923,6 +923,145 @@ int page_mkclean(struct page *page)
> }
> EXPORT_SYMBOL_GPL(page_mkclean);
>
> +/*
> + * Make clone page for page forking. (Based on migrate_page_copy())
> + *
> + * Note: only clones page state so other state such as buffer_heads
> + * must be cloned by caller.
> + */
> +struct page *cow_clone_page(struct page *oldpage)
> +{
> + struct address_space *mapping = oldpage->mapping;
> + gfp_t gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
> + struct page *newpage = __page_cache_alloc(gfp_mask);
> + int cpupid;
> +
> + newpage->mapping = oldpage->mapping;
> + newpage->index = oldpage->index;
> + copy_highpage(newpage, oldpage);
> +
> + /* FIXME: right? */
> + BUG_ON(PageSwapCache(oldpage));
> + BUG_ON(PageSwapBacked(oldpage));
> + BUG_ON(PageHuge(oldpage));
> + if (PageError(oldpage))
> + SetPageError(newpage);
> + if (PageReferenced(oldpage))
> + SetPageReferenced(newpage);
> + if (PageUptodate(oldpage))
> + SetPageUptodate(newpage);
> + if (PageActive(oldpage))
> + SetPageActive(newpage);
> + if (PageMappedToDisk(oldpage))
> + SetPageMappedToDisk(newpage);
> +
> + /*
> + * Copy NUMA information to the new page, to prevent over-eager
> + * future migrations of this same page.
> + */
> + cpupid = page_cpupid_xchg_last(oldpage, -1);
> + page_cpupid_xchg_last(newpage, cpupid);
> +
> + mlock_migrate_page(newpage, oldpage);
> + ksm_migrate_page(newpage, oldpage);
> +
> + /* Lock newpage before visible via radix tree */
> + BUG_ON(PageLocked(newpage));
> + __set_page_locked(newpage);
> +
> + return newpage;
> +}
> +EXPORT_SYMBOL_GPL(cow_clone_page);
> +
> +static int page_cow_one(struct page *oldpage, struct page *newpage,
> + struct vm_area_struct *vma, unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pte_t oldptval, ptval, *pte;
> + spinlock_t *ptl;
> + int ret = 0;
> +
> + pte = page_check_address(oldpage, mm, address, &ptl, 1);
> + if (!pte)
> + goto out;
> +
> + flush_cache_page(vma, address, pte_pfn(*pte));
> + oldptval = ptep_clear_flush(vma, address, pte);
> +
> + /* Take refcount for PTE */
> + page_cache_get(newpage);
> +
> + /*
> + * vm_page_prot doesn't have the writable bit, so a write fault
> + * will occur again immediately after this page fault returns.
> + * That second fault will then be resolved against the forked
> + * page installed here.
> + */
> + ptval = mk_pte(newpage, vma->vm_page_prot);
> +#if 0
> + /* FIXME: should we check the following too? Otherwise, we would
> + * at least get an additional read-only => write fault */
> + if (pte_write(oldptval))
> + ptval = pte_mkwrite(ptval);
> + if (pte_dirty(oldptval))
> + ptval = pte_mkdirty(ptval);
> + if (pte_young(oldptval))
> + ptval = pte_mkyoung(ptval);
> +#endif
> + set_pte_at(mm, address, pte, ptval);
> +
> + /* Update rmap accounting */
> + BUG_ON(PageMlocked(oldpage)); /* Caller should have migrated the mlock flag */
> + page_remove_rmap(oldpage);
> + page_add_file_rmap(newpage);
> +
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, address, pte);
> +
> + pte_unmap_unlock(pte, ptl);
> +
> + mmu_notifier_invalidate_page(mm, address);
> +
> + /* Release refcount for PTE */
> + page_cache_release(oldpage);
> +out:
> + return ret;
> +}
> +
> +/* Change old page in PTEs to new page exclude orig_vma */
> +int page_cow_file(struct vm_area_struct *orig_vma, struct page *oldpage,
> + struct page *newpage)
> +{
> + struct address_space *mapping = page_mapping(oldpage);
> + pgoff_t pgoff = oldpage->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + struct vm_area_struct *vma;
> + int ret = 0;
> +
> + BUG_ON(!PageLocked(oldpage));
> + BUG_ON(!PageLocked(newpage));
> + BUG_ON(PageAnon(oldpage));
> + BUG_ON(mapping == NULL);
> +
> + i_mmap_lock_read(mapping);
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> + /*
> + * The orig_vma's PTE is handled by caller.
> + * (e.g. ->page_mkwrite)
> + */
> + if (vma == orig_vma)
> + continue;
> +
> + if (vma->vm_flags & VM_SHARED) {
> + unsigned long address = vma_address(oldpage, vma);
> + ret += page_cow_one(oldpage, newpage, vma, address);
> + }
> + }
> + i_mmap_unlock_read(mapping);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_cow_file);
> +
> /**
> * page_move_anon_rmap - move a page to our anon_vma
> * @page: the page to move to our anon_vma
> diff --git a/mm/truncate.c b/mm/truncate.c
> index f1e4d60..e5b4673 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -216,6 +216,56 @@ int invalidate_inode_page(struct page *page)
> return invalidate_complete_page(mapping, page);
> }
>
> +void generic_truncate_partial_page(struct address_space *mapping,
> + struct page *page, unsigned int start,
> + unsigned int len)
> +{
> + wait_on_page_writeback(page);
> + zero_user_segment(page, start, start + len);
> + if (page_has_private(page))
> + do_invalidatepage(page, start, len);
> +}
> +EXPORT_SYMBOL(generic_truncate_partial_page);
> +
> +static void truncate_partial_page(struct address_space *mapping, pgoff_t index,
> + unsigned int start, unsigned int len)
> +{
> + struct page *page = find_lock_page(mapping, index);
> + if (!page)
> + return;
> +
> + if (!mapping->a_ops->truncatepage)
> + generic_truncate_partial_page(mapping, page, start, len);
> + else
> + mapping->a_ops->truncatepage(mapping, page, start, len, 1);
> +
> + cleancache_invalidate_page(mapping, page);
> + unlock_page(page);
> + page_cache_release(page);
> +}
> +
> +void generic_truncate_full_page(struct address_space *mapping,
> + struct page *page, int wait)
> +{
> + if (wait)
> + wait_on_page_writeback(page);
> + else if (PageWriteback(page))
> + return;
> +
> + truncate_inode_page(mapping, page);
> +}
> +EXPORT_SYMBOL(generic_truncate_full_page);
> +
> +static void truncate_full_page(struct address_space *mapping, struct page *page,
> + int wait)
> +{
> + if (!mapping->a_ops->truncatepage)
> + generic_truncate_full_page(mapping, page, wait);
> + else
> + mapping->a_ops->truncatepage(mapping, page, 0, PAGE_CACHE_SIZE,
> + wait);
> +}
> +
> /**
> * truncate_inode_pages_range - truncate range of pages specified by start & end byte offsets
> * @mapping: mapping to truncate
> @@ -298,11 +348,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (!trylock_page(page))
> continue;
> WARN_ON(page->index != index);
> - if (PageWriteback(page)) {
> - unlock_page(page);
> - continue;
> - }
> - truncate_inode_page(mapping, page);
> + truncate_full_page(mapping, page, 0);
> unlock_page(page);
> }
> pagevec_remove_exceptionals(&pvec);
> @@ -312,37 +358,18 @@ void truncate_inode_pages_range(struct address_space *mapping,
> }
>
> if (partial_start) {
> - struct page *page = find_lock_page(mapping, start - 1);
> - if (page) {
> - unsigned int top = PAGE_CACHE_SIZE;
> - if (start > end) {
> - /* Truncation within a single page */
> - top = partial_end;
> - partial_end = 0;
> - }
> - wait_on_page_writeback(page);
> - zero_user_segment(page, partial_start, top);
> - cleancache_invalidate_page(mapping, page);
> - if (page_has_private(page))
> - do_invalidatepage(page, partial_start,
> - top - partial_start);
> - unlock_page(page);
> - page_cache_release(page);
> - }
> - }
> - if (partial_end) {
> - struct page *page = find_lock_page(mapping, end);
> - if (page) {
> - wait_on_page_writeback(page);
> - zero_user_segment(page, 0, partial_end);
> - cleancache_invalidate_page(mapping, page);
> - if (page_has_private(page))
> - do_invalidatepage(page, 0,
> - partial_end);
> - unlock_page(page);
> - page_cache_release(page);
> + unsigned int top = PAGE_CACHE_SIZE;
> + if (start > end) {
> + /* Truncation within a single page */
> + top = partial_end;
> + partial_end = 0;
> }
> + truncate_partial_page(mapping, start - 1, partial_start,
> + top - partial_start);
> }
> + if (partial_end)
> + truncate_partial_page(mapping, end, 0, partial_end);
> +
> /*
> * If the truncation happened within a single page no pages
> * will be released, just zeroed, so we can bail out now.
> @@ -386,8 +413,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>
> lock_page(page);
> WARN_ON(page->index != index);
> - wait_on_page_writeback(page);
> - truncate_inode_page(mapping, page);
> + truncate_full_page(mapping, page, 1);
> unlock_page(page);
> }
> pagevec_remove_exceptionals(&pvec);
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Jan,
On 05/19/2015 07:00 AM, Jan Kara wrote:
> On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]
>>
>> LSFMM: Page forking
>> http://lwn.net/Articles/548091/
>>
>> This is just an FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue; could
>> you please explain what you were thinking of there?
> So here are a few things I find problematic about page forking (besides
> the cases with elevated page_count already discussed in this thread - there
> I believe that anything more complex than "wait for the IO instead of
> forking when page has elevated use count" isn't going to work. There are
> too many users depending on too subtle details of the behavior...). Some
> of them are actually mentioned in the above LWN article:
>
> When you create a copy of a page and replace it in the radix tree, nobody
> in mm subsystem is aware that oldpage may be under writeback. That causes
> interesting issues:
> * truncate_inode_pages() can finish before all IO for the file is finished.
> So far filesystems rely on the fact that once truncate_inode_pages()
> finishes, all running IO against the file is completed and new IO cannot be
> submitted.
We do not use truncate_inode_pages because of issues like that. We use
the existing truncate helpers where they were available, or our own Tux3
implementations where they were not, to make everything work properly. The details
are Hirofumi's stomping grounds. I am pretty sure that his solution is
good and tight, or Tux3 would not pass its torture tests.
> * Writeback can come and try to write newpage while oldpage is still under
> IO. Then you'll have two IOs against one block which has undefined
> results.
Those writebacks only come from Tux3 (or indirectly from fs-writeback,
through our writeback) so we are able to ensure that a dirty block is
only written once. (If redirtied, the block will fork so two dirty
blocks are written in two successive deltas.)
> * filemap_fdatawait() called from fsync() has the additional problem that it is
> not aware of oldpage and thus may return although IO hasn't finished yet.
We do not use filemap_fdatawait; instead, we wait on completion of our
own writeback, which is under our control.
> I understand that Tux3 may avoid these issues due to some other mechanisms
> it internally has but if page forking should get into mm subsystem, the
> above must work.
It does work, and by example, it does not need a lot of code to make
it work, but the changes are not trivial. Tux3's delta writeback model
will not suit everyone, so you can't just lift our code and add it to
Ext4. Using it in Ext4 would require a per-inode writeback model, which
looks practical to me but far from a weekend project. Maybe something
to consider for Ext5.
It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high, you would want jbd3 for one thing.
I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that the
question?
Regards,
Daniel
On Tue, 19 May 2015, Daniel Phillips wrote:
>> I understand that Tux3 may avoid these issues due to some other mechanisms
>> it internally has but if page forking should get into mm subsystem, the
>> above must work.
>
> It does work, and by example, it does not need a lot of code to make
> it work, but the changes are not trivial. Tux3's delta writeback model
> will not suit everyone, so you can't just lift our code and add it to
> Ext4. Using it in Ext4 would require a per-inode writeback model, which
> looks practical to me but far from a weekend project. Maybe something
> to consider for Ext5.
>
> It is the job of new designs like Tux3 to chase after that final drop
> of performance, not our trusty Ext4 workhorse. Though stranger things
> have happened - as I recall, Ext4 had O(n) directory operations at one
> time. Fixing that was not easy, but we did it because we had to. Fixing
> Ext4's write performance is not urgent by comparison, and the barrier
> is high, you would want jbd3 for one thing.
>
> I think the meta-question you are asking is, where is the second user
> for this new CoW functionality? With a possible implication that if
> there is no second user then Tux3 cannot be merged. Is that the
> question?
I don't think they are asking for a second user. What they are saying is that
for this functionality to be accepted in the mm subsystem, these problem cases
need to work reliably, not just work for Tux3 because of your implementation.
So for things that you don't use, you need to make it an error if they get used
on a page that's been forked (or not be an error and 'do the right thing')
For cases where it doesn't matter because Tux3 controls the writeback, and it's
undefined in general what happens if writeback is triggered twice on the same
page, you will need to figure out how to either prevent the second writeback
from triggering if there's one in process, or define how the two writebacks are
going to happen so that you can't end up with them re-ordered by some other
filesystem.
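The conventional guard in the generic path looks roughly like this, for
reference (a sketch of what write_cache_pages() does today, not anything
Tux3-specific):

	lock_page(page);
	if (PageWriteback(page)) {
		if (wbc->sync_mode != WB_SYNC_NONE)
			wait_on_page_writeback(page);	/* don't start a second IO */
		else
			goto continue_unlock;	/* leave the page dirty, revisit it later */
	}
	/* only now is it safe to clear the dirty bit and submit the page */

The open question is what the equivalent rule becomes once the page under IO and
the page sitting in the page cache are no longer the same page.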
I think that that's what's meant by the top statement that I left in the quote.
Even if your implementation details make it safe, these need to be safe even
without your implementation details to be acceptable in the core kernel.
David Lang
On Tue 19-05-15 13:33:31, David Lang wrote:
> On Tue, 19 May 2015, Daniel Phillips wrote:
>
> >>I understand that Tux3 may avoid these issues due to some other mechanisms
> >>it internally has but if page forking should get into mm subsystem, the
> >>above must work.
> >
> >It does work, and by example, it does not need a lot of code to make
> >it work, but the changes are not trivial. Tux3's delta writeback model
> >will not suit everyone, so you can't just lift our code and add it to
> >Ext4. Using it in Ext4 would require a per-inode writeback model, which
> >looks practical to me but far from a weekend project. Maybe something
> >to consider for Ext5.
> >
> >It is the job of new designs like Tux3 to chase after that final drop
> >of performance, not our trusty Ext4 workhorse. Though stranger things
> >have happened - as I recall, Ext4 had O(n) directory operations at one
> >time. Fixing that was not easy, but we did it because we had to. Fixing
> >Ext4's write performance is not urgent by comparison, and the barrier
> >is high, you would want jbd3 for one thing.
> >
> >I think the meta-question you are asking is, where is the second user
> >for this new CoW functionality? With a possible implication that if
> >there is no second user then Tux3 cannot be merged. Is that the
> >question?
>
> I don't think they are asking for a second user. What they are
> saying is that for this functionality to be accepted in the mm
> subsystem, these problem cases need to work reliably, not just work
> for Tux3 because of your implementation.
>
> So for things that you don't use, you need to make it an error if
> they get used on a page that's been forked (or not be an error and
> 'do the right thing')
>
> For cases where it doesn't matter because Tux3 controls the
> writeback, and it's undefined in general what happens if writeback
> is triggered twice on the same page, you will need to figure out how
> to either prevent the second writeback from triggering if there's
> one in process, or define how the two writebacks are going to happen
> so that you can't end up with them re-ordered by some other
> filesystem.
>
> I think that that's what's meant by the top statement that I left in
> the quote. Even if your implementation details make it safe, these
> need to be safe even without your implementation details to be
> acceptable in the core kernel.
Yeah, that's what I meant. If you create a function which manipulates
page cache, you better make it work with other functions manipulating page
cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
developer. Sure you can document all the conditions under which the
function is safe to use but a function that has several paragraphs in front
> of it explaining when it is safe to use isn't a very good API...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 05/20/2015 07:44 AM, Jan Kara wrote:
> On Tue 19-05-15 13:33:31, David Lang wrote:
>> On Tue, 19 May 2015, Daniel Phillips wrote:
>>
>>>> I understand that Tux3 may avoid these issues due to some other mechanisms
>>>> it internally has but if page forking should get into mm subsystem, the
>>>> above must work.
>>>
>>> It does work, and by example, it does not need a lot of code to make
>>> it work, but the changes are not trivial. Tux3's delta writeback model
>>> will not suit everyone, so you can't just lift our code and add it to
>>> Ext4. Using it in Ext4 would require a per-inode writeback model, which
>>> looks practical to me but far from a weekend project. Maybe something
>>> to consider for Ext5.
>>>
>>> It is the job of new designs like Tux3 to chase after that final drop
>>> of performance, not our trusty Ext4 workhorse. Though stranger things
>>> have happened - as I recall, Ext4 had O(n) directory operations at one
>>> time. Fixing that was not easy, but we did it because we had to. Fixing
>>> Ext4's write performance is not urgent by comparison, and the barrier
>>> is high, you would want jbd3 for one thing.
>>>
>>> I think the meta-question you are asking is, where is the second user
>>> for this new CoW functionality? With a possible implication that if
>>> there is no second user then Tux3 cannot be merged. Is that the
>>> question?
>>
>> I don't think they are asking for a second user. What they are
>> saying is that for this functionality to be accepted in the mm
>> subsystem, these problem cases need to work reliably, not just work
>> for Tux3 because of your implementation.
>>
>> So for things that you don't use, you need to make it an error if
>> they get used on a page that's been forked (or not be an error and
>> 'do the right thing')
>>
>> For cases where it doesn't matter because Tux3 controls the
>> writeback, and it's undefined in general what happens if writeback
>> is triggered twice on the same page, you will need to figure out how
>> to either prevent the second writeback from triggering if there's
>> one in process, or define how the two writebacks are going to happen
>> so that you can't end up with them re-ordered by some other
>> filesystem.
>>
>> I think that that's what's meant by the top statement that I left in
>> the quote. Even if your implementation details make it safe, these
>> need to be safe even without your implementation details to be
>> acceptable in the core kernel.
> Yeah, that's what I meant. If you create a function which manipulates
> page cache, you better make it work with other functions manipulating page
> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
> developer. Sure you can document all the conditions under which the
> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't a very good API...
Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:
* cow_replace_page_cache
* cow_delete_from_page_cache
* cow_clone_page
* page_cow_one
* page_cow_file
Would it be useful to drill down into those, starting from the top
of the list?
Regards,
Daniel
On Wed, 20 May 2015, Daniel Phillips wrote:
> On 05/20/2015 07:44 AM, Jan Kara wrote:
>> Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't a very good API...
>
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
>
> * cow_replace_page_cache
> * cow_delete_from_page_cache
> * cow_clone_page
> * page_cow_one
> * page_cow_file
>
> Would it be useful to drill down into those, starting from the top
> of the list?
It's a little more than determining that these 5 functions are sane; it's making
sure that if someone mixes the use of these functions with other existing
functions, the result is sane.
but it's probably a good starting point to look at each of these five functions
in detail and consider how they work and could interact badly with other things
touching the page cache.
David Lang
On 05/20/2015 12:22 PM, Daniel Phillips wrote:
> On 05/20/2015 07:44 AM, Jan Kara wrote:
>> On Tue 19-05-15 13:33:31, David Lang wrote:
>> Yeah, that's what I meant. If you create a function which manipulates
>> page cache, you better make it work with other functions manipulating page
>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>> developer. Sure you can document all the conditions under which the
>> function is safe to use but a function that has several paragraphs in front
>> of it explaining when it is safe to use isn't a very good API...
>
> Violent agreement, of course. To put it in concrete terms, each of
> the page fork support functions must be examined and determined
> sane. They are:
>
> * cow_replace_page_cache
> * cow_delete_from_page_cache
> * cow_clone_page
> * page_cow_one
> * page_cow_file
>
> Would it be useful to drill down into those, starting from the top
> of the list?
How do these interact with other page cache functions, like
find_get_page() ?
How does tux3 prevent a user of find_get_page() from reading from
or writing into the pre-COW page, instead of the current page?
--
All rights reversed
On 05/20/2015 12:53 PM, Rik van Riel wrote:
> On 05/20/2015 12:22 PM, Daniel Phillips wrote:
>> On 05/20/2015 07:44 AM, Jan Kara wrote:
>>> On Tue 19-05-15 13:33:31, David Lang wrote:
>
>>> Yeah, that's what I meant. If you create a function which manipulates
>>> page cache, you better make it work with other functions manipulating page
>>> cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
>>> developer. Sure you can document all the conditions under which the
>>> function is safe to use but a function that has several paragraphs in front
>>> of it explaining when it is safe to use isn't a very good API...
>>
>> Violent agreement, of course. To put it in concrete terms, each of
>> the page fork support functions must be examined and determined
>> sane. They are:
>>
>> * cow_replace_page_cache
>> * cow_delete_from_page_cache
>> * cow_clone_page
>> * page_cow_one
>> * page_cow_file
>>
>> Would it be useful to drill down into those, starting from the top
>> of the list?
>
> How do these interact with other page cache functions, like
> find_get_page() ?
Nicely:
https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/filemap_mmap.c#L182
> How does tux3 prevent a user of find_get_page() from reading from
> or writing into the pre-COW page, instead of the current page?
Careful control of the dirty bits (we have two of them, one each
for front and back). That is what pagefork_for_blockdirty is about.
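Very roughly, the dirty path does something like the sketch below. This is
simplified pseudo-Tux3, not the real code, and tux3_dirty_for_backend() is a
made-up name standing in for the actual front/back dirty test:

	/*
	 * Frontend wants to dirty a block whose page may already be dirty
	 * for the delta that the backend is busy committing.
	 * The page is assumed locked by the caller.
	 */
	static struct page *pagefork_sketch(struct page *page)
	{
		struct page *clone;

		/* Clean, or dirty only for the front delta: modify in place. */
		if (!tux3_dirty_for_backend(page))	/* stand-in predicate */
			return page;

		/* Dirty for the committing delta: clone it, then swap the
		 * clone into the radix tree and into other mappings' ptes.
		 * (In the ->page_mkwrite path the faulting vma is passed
		 * instead of NULL and handled by the caller.) */
		clone = cow_clone_page(page);
		cow_replace_page_cache(page, clone);
		page_cow_file(NULL, page, clone);

		/* The old page stays out of the page cache, owned by the
		 * backend until its writeout completes; the frontend now
		 * dirties the clone. */
		return clone;
	}

The real code also has to carry buffer state across and handle allocation
failure; anyone who already did find_get_page() on the old page simply keeps
reading the pre-fork data, as discussed above.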
Regards,
Daniel
On 05/20/2015 03:51 PM, Daniel Phillips wrote:
> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>> How does tux3 prevent a user of find_get_page() from reading from
>> or writing into the pre-COW page, instead of the current page?
>
> Careful control of the dirty bits (we have two of them, one each
> for front and back). That is what pagefork_for_blockdirty is about.
Ah, and of course it does not matter if a reader is on the
pre-cow page. It would be reading the earlier copy, which might
no longer be the current copy, but it raced with the write so
nobody should be surprised. That is a race even without page fork.
Regards,
Daniel
On Wed, 20 May 2015, Daniel Phillips wrote:
> On 05/20/2015 03:51 PM, Daniel Phillips wrote:
>> On 05/20/2015 12:53 PM, Rik van Riel wrote:
>>> How does tux3 prevent a user of find_get_page() from reading from
>>> or writing into the pre-COW page, instead of the current page?
>>
>> Careful control of the dirty bits (we have two of them, one each
>> for front and back). That is what pagefork_for_blockdirty is about.
>
> Ah, and of course it does not matter if a reader is on the
> pre-cow page. It would be reading the earlier copy, which might
> no longer be the current copy, but it raced with the write so
> nobody should be surprised. That is a race even without page fork.
how do you prevent it from continuing to interact with the old version of the
page and never see updates or have its changes reflected on the current page?
David Lang
Hi Josef,
This is a rollup patch for preliminary nospace handling in Tux3, in
line with my post here:
http://lkml.iu.edu/hypermail/linux/kernel/1505.1/03167.html
You still have ENOSPC issues. Maybe it would be helpful to look at
what we have done. I saw a reproducible case with 1,000 tasks in
parallel last week that went nospace while 28% full. You also are not
giving a very good picture of the true full state via df.
Our algorithm is pretty simple, reliable and fast. I do not see any
reason why Btrfs could not do it basically the same way. In one way it
is easier for you - you are not forced to commit the entire delta, you
can choose the bits you want to force to disk as convenient. You have
more different kinds of cache objects to account, but that should be
just detail. Your current frontend accounting looks plausible.
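In outline, the frontend side of each operation reduces to the pattern below
(names as in the patch further down; "cost" is just an upper bound on the
blocks the backend might need for the change):

	if (change_begin(sb, cost, limit))
		return -ENOSPC;	/* only after a forced delta commit confirmed it */
	/* ... make the change; the backend allocates at most "cost" blocks ... */
	change_end(sb);

When the balance would drop below the limit, change_begin() backs out, syncs
the current delta and retries, so a transient shortfall just stalls the
frontend briefly instead of returning a false ENOSPC.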
We're trying something a bit different with df, to see how it flies -
we don't always return the same number for f_blocks; we actually return
the volume size less the accounting reserve, which is variable. The
reserve gets smaller as freespace gets smaller, so it is not a nasty
surprise to the user to see it change, rather a pleasant surprise. What
it does is make the 100% really be 100%, less just a handful of blocks,
and it makes "used" and "available" add up exactly to "blocks". If the
user wants to know how many blocks they really have, they can look at
/proc/partitions.
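To put made-up numbers on it: on a one million block volume the reserve clamps
to 128 blocks, so we report f_blocks = 999,872. With 300,000 blocks allocated,
f_bfree = f_bavail = 699,872, and "used" (300,000) plus "available" adds up to
"blocks" exactly; df only shows 100% when the volume really cannot accept
another delta.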
Regards,
Daniel
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index 909a222..7043580 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -297,6 +297,7 @@ static int commit_delta(struct sb *sb)
tux3_wake_delta_commit(sb);
/* Commit was finished, apply defered bfree. */
+ sb->defreed = 0;
return unstash(sb, &sb->defree, apply_defered_bfree);
}
@@ -321,13 +322,13 @@ static int need_unify(struct sb *sb)
/* For debugging */
void tux3_start_backend(struct sb *sb)
{
- assert(current->journal_info == NULL);
+ assert(!change_active());
current->journal_info = sb;
}
void tux3_end_backend(void)
{
- assert(current->journal_info);
+ assert(change_active());
current->journal_info = NULL;
}
@@ -337,12 +338,103 @@ int tux3_under_backend(struct sb *sb)
return current->journal_info == sb;
}
+/* Internal use only */
+static struct delta_ref *to_delta_ref(struct sb *sb, unsigned delta)
+{
+ return &sb->delta_refs[tux3_delta(delta)];
+}
+
+static block_t newfree(struct sb *sb)
+{
+ return sb->freeblocks + sb->defreed;
+}
+
+/*
+ * Reserve size should vary with budget. The reserve can include the
+ * log block overhead on the assumption that every block in the budget
+ * is a data block that generates one log record (or two?).
+ */
+block_t set_budget(struct sb *sb)
+{
+ block_t reserve = sb->freeblocks >> 7; /* FIXME: magic number */
+
+ if (1) {
+ if (reserve > max_reserve_blocks)
+ reserve = max_reserve_blocks;
+ if (reserve < min_reserve_blocks)
+ reserve = min_reserve_blocks;
+ } else if (0)
+ reserve = 10;
+
+ block_t budget = newfree(sb) - reserve;
+ if (1)
+ tux3_msg(sb, "set_budget: free %Li, budget %Li, reserve %Li", newfree(sb), budget, reserve);
+ sb->reserve = reserve;
+ atomic_set(&sb->budget, budget);
+ return reserve;
+}
+
+/*
+ * After transition, the front delta may have used some of the balance
+ * left over from this delta. The charged amount of the back delta is
+ * now stable and gives the exact balance at transition by subtracting
+ * from the old budget. The difference between the new budget and the
+ * balance at transition, which must never be negative, is added to
+ * the current balance, so the effect is exactly the same as if we had
+ * set the new budget and balance atomically at transition time. But
+ * we do not know the new balance at transition time and even if we
+ * did, we would need to add serialization against frontend changes,
+ * which are currently lockless and would like to stay that way. So we
+ * let the current delta charge against the remaining balance until
+ * flush is done, here, then adjust the balance to what it would have
+ * been if the budget had been reset exactly at transition.
+ *
+ * We have:
+ *
+ * consumed = oldfree - free
+ * oldbudget = oldfree - reserve
+ * newbudget = free - reserve
+ * transition_balance = oldbudget - charged
+ *
+ * Factoring out the reserve, the balance adjustment is:
+ *
+ * adjust = newbudget - transition_balance
+ * = (free - reserve) - ((oldfree - reserve) - charged)
+ * = free + (charged - oldfree)
+ * = charged + (free - oldfree)
+ * = charged - consumed
+ *
+ * To extend for variable reserve size, add the difference between
+ * old and new reserve size to the balance adjustment.
+ */
+void reset_balance(struct sb *sb, unsigned delta, block_t unify_cost)
+{
+ enum { initial_logblock = 0 };
+ unsigned charged = atomic_read(&to_delta_ref(sb, delta)->charged);
+ block_t consumed = sb->oldfree - newfree(sb);
+ //block_t old_reserve = sb->reserve;
+
+ if (1)
+ tux3_msg(sb, "budget %i, balance %i, charged %u, consumed %Li, free %Lu, defree %Lu, unify %Lu",
+ atomic_read(&sb->budget), atomic_read(&sb->balance),
+ charged, consumed, sb->freeblocks, sb->defreed, unify_cost);
+
+ sb->oldfree = newfree(sb);
+ set_budget(sb); /* maybe should set in size dependent order */
+ atomic_add(charged - consumed /*+ (old_reserve - sb->reserve)*/, &sb->balance);
+
+ if (consumed - initial_logblock - unify_cost > charged)
+ tux3_warn(sb, "delta %u estimate exceeded by %Lu blocks",
+ delta, consumed - charged);
+}
+
static int do_commit(struct sb *sb, int flags)
{
unsigned delta = sb->delta_staging;
int no_unify = flags & __NO_UNIFY;
struct blk_plug plug;
struct ioinfo ioinfo;
+ block_t unify_cost = 0;
int err = 0;
trace(">>>>>>>>> commit delta %u", delta);
@@ -359,8 +451,10 @@ static int do_commit(struct sb *sb, int flags)
* FIXME: there is no need to commit if normal inodes are not
* dirty? better way?
*/
- if (!(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta))
+ if (1 && !(flags & __FORCE_DELTA) && !tux3_has_dirty_inodes(sb, delta)) {
+ reset_balance(sb, delta, 0);
goto out;
+ }
/* Prepare to wait I/O */
tux3_io_init(&ioinfo, flags);
@@ -402,9 +496,11 @@ static int do_commit(struct sb *sb, int flags)
#endif
if ((!no_unify && need_unify(sb)) || (flags & __FORCE_UNIFY)) {
+ unify_cost = sb->freeblocks;
err = unify_log(sb);
if (err)
goto error; /* FIXME: error handling */
+ unify_cost -= sb->freeblocks;
/* Add delta log for debugging. */
log_delta(sb);
@@ -414,6 +510,8 @@ static int do_commit(struct sb *sb, int flags)
write_log(sb);
blk_finish_plug(&plug);
+ reset_balance(sb, delta, unify_cost);
+
/*
* Commit last block (for now, this is sync I/O).
*
@@ -455,12 +553,6 @@ error:
((int)((a) - (b)) >= 0))
#define delta_before_eq(a,b) delta_after_eq(b,a)
-/* Internal use only */
-static struct delta_ref *to_delta_ref(struct sb *sb, unsigned delta)
-{
- return &sb->delta_refs[tux3_delta(delta)];
-}
-
static int flush_delta(struct sb *sb, int flags)
{
int err;
@@ -510,6 +602,13 @@ static struct delta_ref *delta_get(struct sb *sb)
* free ->current_delta, so we don't need rcu_read_lock().
*/
do {
+ barrier();
+ /*
+ * NOTE: Without this barrier(), at least, gcc-4.8.2 ignores
+ * volatile dereference of sb->current_delta in this loop,
+ * and instead uses the cached value.
+ * (Looks like a gcc bug, this barrier() is the workaround)
+ */
delta_ref = rcu_dereference_check(sb->current_delta, 1);
} while (!atomic_inc_not_zero(&delta_ref->refcount));
@@ -540,6 +639,7 @@ static void __delta_transition(struct sb *sb, struct delta_ref *delta_ref,
reinit_completion(&delta_ref->waitref_done);
/* Assign the delta number */
delta_ref->delta = new_delta;
+ atomic_set(&delta_ref->charged, 0);
/*
* Update current delta, then release reference.
@@ -587,6 +687,7 @@ void tux3_delta_init(struct sb *sb)
for (i = 0; i < ARRAY_SIZE(sb->delta_refs); i++) {
atomic_set(&sb->delta_refs[i].refcount, 0);
+ atomic_set(&sb->delta_refs[i].charged, 0);
init_completion(&sb->delta_refs[i].waitref_done);
}
#ifdef TUX3_FLUSHER_SYNC
@@ -620,11 +721,16 @@ void tux3_delta_setup(struct sb *sb)
#endif
}
-unsigned tux3_get_current_delta(void)
+static inline struct delta_ref *current_delta(void)
{
struct delta_ref *delta_ref = current->journal_info;
assert(delta_ref != NULL);
- return delta_ref->delta;
+ return delta_ref;
+}
+
+unsigned tux3_get_current_delta(void)
+{
+ return current_delta()->delta;
}
/* Choice sb->delta or sb->unify from inode */
@@ -654,7 +760,7 @@ unsigned tux3_inode_delta(struct inode *inode)
*/
void change_begin_atomic(struct sb *sb)
{
- assert(current->journal_info == NULL);
+ assert(!change_active());
current->journal_info = delta_get(sb);
}
@@ -662,7 +768,7 @@ void change_begin_atomic(struct sb *sb)
void change_end_atomic(struct sb *sb)
{
struct delta_ref *delta_ref = current->journal_info;
- assert(delta_ref != NULL);
+ assert(change_active());
current->journal_info = NULL;
delta_put(sb, delta_ref);
}
@@ -694,12 +800,52 @@ void change_end_atomic_nested(struct sb *sb, void *ptr)
* and blocked if disabled asynchronous backend and backend is
* running.
*/
-void change_begin(struct sb *sb)
+
+int change_begin_nospace(struct sb *sb, int cost, int limit)
{
#ifdef TUX3_FLUSHER_SYNC
down_read(&sb->delta_lock);
#endif
+ if (1)
+ tux3_msg(sb, "check space, budget %i, balance %i, cost %u, limit %i",
+ atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+
change_begin_atomic(sb);
+ if (atomic_sub_return(cost, &sb->balance) >= limit) {
+ atomic_add(cost, &current_delta()->charged);
+ return 0;
+ }
+ atomic_add(cost, &sb->balance);
+ if (1)
+ tux3_msg(sb, "wait space, budget %i, balance %i, cost %u, limit %i",
+ atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+ change_end_atomic(sb);
+ return 1;
+}
+
+int change_nospace(struct sb *sb, int cost, int limit)
+{
+ assert(!change_active());
+ sync_current_delta(sb);
+ if (1)
+ tux3_msg(sb, "final check, budget %i, balance %i, cost %u, limit %i",
+ atomic_read(&sb->budget), atomic_read(&sb->balance), cost, limit);
+ if (cost > atomic_read(&sb->budget) + limit) {
+ if (1)
+ tux3_msg(sb, "*** out of space ***");
+ return 1;
+ }
+ return 0;
+}
+
+int change_begin(struct sb *sb, int cost, int limit)
+{
+ while (change_begin_nospace(sb, cost, limit)) {
+ if (change_nospace(sb, cost, limit)) {
+ return 1;
+ }
+ }
+ return 0;
}
int change_end(struct sb *sb)
@@ -714,34 +860,3 @@ int change_end(struct sb *sb)
#endif
return err;
}
-
-/*
- * This is used for simplify the error path, or separates big chunk to
- * small chunk in loop.
- *
- * E.g. the following
- *
- * change_begin()
- * while (stop) {
- * change_begin_if_need()
- * if (do_something() < 0)
- * break;
- * change_end_if_need()
- * }
- * change_end_if_need()
- */
-void change_begin_if_needed(struct sb *sb, int need_sep)
-{
- if (current->journal_info == NULL)
- change_begin(sb);
- else if (need_sep) {
- change_end(sb);
- change_begin(sb);
- }
-}
-
-void change_end_if_needed(struct sb *sb)
-{
- if (current->journal_info)
- change_end(sb);
-}
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 59d6781..f543cfc 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -187,6 +187,10 @@ long tux3_writeback(struct super_block *super, struct bdi_writeback *wb,
unsigned target_delta;
int err;
+ if (0)
+ tux3_msg(sb, "writeback delta %i, reason %i",
+ container_of(work, struct tux3_wb_work, work)->delta, work->reason);
+
/* If we didn't finish replay yet, don't flush. */
if (!(super->s_flags & MS_ACTIVE))
return 0;
diff --git a/fs/tux3/filemap.c b/fs/tux3/filemap.c
index a8811c2..e79a148 100644
--- a/fs/tux3/filemap.c
+++ b/fs/tux3/filemap.c
@@ -834,7 +834,6 @@ static int __tux3_file_write_begin(struct file *file,
int tux3_flags)
{
int ret;
-
ret = tux3_write_begin(mapping, pos, len, flags, pagep,
tux3_da_get_block, tux3_flags);
if (ret < 0)
@@ -877,8 +876,8 @@ static int tux3_file_write_end(struct file *file, struct address_space *mapping,
/* Separate big write transaction to small chunk. */
assert(S_ISREG(mapping->host->i_mode));
- change_end_if_needed(tux_sb(mapping->host->i_sb));
-
+ if (change_active())
+ change_end(tux_sb(mapping->host->i_sb));
return ret;
}
diff --git a/fs/tux3/filemap_blocklib.c b/fs/tux3/filemap_blocklib.c
index 1e0127f..aa13810 100644
--- a/fs/tux3/filemap_blocklib.c
+++ b/fs/tux3/filemap_blocklib.c
@@ -167,16 +167,33 @@ static int tux3_write_begin(struct address_space *mapping, loff_t pos,
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
struct page *page;
int status;
-
retry:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
if (tux3_flags & TUX3_F_SEP_DELTA) {
+ struct sb *sb = tux_sb(mapping->host->i_sb);
+ int cost = one_page_cost(mapping->host);
/* Separate big write transaction to small chunk. */
assert(S_ISREG(mapping->host->i_mode));
- change_begin_if_needed(tux_sb(mapping->host->i_sb), 1);
+
+ if (change_active())
+ change_end(sb);
+
+ if (PageDirty(page))
+ change_begin_atomic(sb);
+ else if (change_begin_nospace(sb, cost, 0)) {
+ unlock_page(page);
+ page_cache_release(page);
+ if (change_nospace(sb, cost, 0)) {
+ /* fail path will truncate page */
+ change_begin_atomic(sb);
+ status = -ENOSPC;
+ goto fail;
+ }
+ goto retry;
+ }
}
/*
@@ -207,6 +224,7 @@ retry:
if (unlikely(status)) {
unlock_page(page);
page_cache_release(page);
+fail:
page = NULL;
}
diff --git a/fs/tux3/inode.c b/fs/tux3/inode.c
index f747c0e..7c97285 100644
--- a/fs/tux3/inode.c
+++ b/fs/tux3/inode.c
@@ -984,7 +984,10 @@ int tux3_setattr(struct dentry *dentry, struct iattr *iattr)
if (need_lock)
down_write(&tux_inode(inode)->truncate_lock);
- change_begin(sb);
+ if (change_begin(sb, 2, 0)) {
+ err = -ENOSPC;
+ goto unlock;
+ }
if (need_truncate)
err = tux3_truncate(inode, iattr->ia_size);
@@ -995,6 +998,7 @@ int tux3_setattr(struct dentry *dentry, struct iattr *iattr)
}
change_end(sb);
+unlock:
if (need_lock)
up_write(&tux_inode(inode)->truncate_lock);
@@ -1060,7 +1064,8 @@ static int tux3_special_update_time(struct inode *inode, struct timespec *time,
return 0;
/* FIXME: no i_mutex, so this is racy */
- change_begin(sb);
+ if (change_begin(sb, 1, 0))
+ return -ENOSPC;
if (flags & S_VERSION)
inode_inc_iversion(inode);
if (flags & S_CTIME)
diff --git a/fs/tux3/inode_vfslib.c b/fs/tux3/inode_vfslib.c
index afae9b8..bdecf53 100644
--- a/fs/tux3/inode_vfslib.c
+++ b/fs/tux3/inode_vfslib.c
@@ -21,10 +21,15 @@ static ssize_t tux3_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
mutex_lock(&inode->i_mutex);
/* For each ->write_end() calls change_end(). */
- change_begin(sb);
+ if (change_begin(sb, 1, 0)) {
+ mutex_unlock(&inode->i_mutex);
+ return -ENOSPC;
+ }
+
/* FIXME: file_update_time() in this can be race with mmap */
ret = __generic_file_write_iter(iocb, from);
- change_end_if_needed(sb);
+ if (change_active())
+ change_end(sb);
mutex_unlock(&inode->i_mutex);
if (ret > 0) {
diff --git a/fs/tux3/log.c b/fs/tux3/log.c
index bb26c73..fdc36b0 100644
--- a/fs/tux3/log.c
+++ b/fs/tux3/log.c
@@ -634,6 +634,7 @@ int defer_bfree(struct sb *sb, struct stash *defree,
assert(count > 0);
assert(block + count <= sb->volblocks);
+ sb->defreed += count;
/*
* count field of stash is 16bits. So, this separates to
diff --git a/fs/tux3/namei.c b/fs/tux3/namei.c
index cb8e0b2..2b1355d 100644
--- a/fs/tux3/namei.c
+++ b/fs/tux3/namei.c
@@ -37,13 +37,16 @@ static int __tux3_mknod(struct inode *dir, struct dentry *dentry,
struct tux_iattr *iattr)
{
struct inode *inode;
+ struct sb *sb = tux_sb(dir->i_sb);
int err;
if (!huge_valid_dev(iattr->rdev) &&
(S_ISBLK(iattr->mode) || S_ISCHR(iattr->mode)))
return -EINVAL;
- change_begin(tux_sb(dir->i_sb));
+ if (change_begin(sb, 5, 0))
+ return -ENOSPC;
+
inode = tux_create_dirent_and_inode(dir, &dentry->d_name, iattr);
if (IS_ERR(inode)) {
err = PTR_ERR(inode);
@@ -56,7 +59,7 @@ static int __tux3_mknod(struct inode *dir, struct dentry *dentry,
inode_inc_link_count(dir);
err = 0;
out:
- change_end(tux_sb(dir->i_sb));
+ change_end(sb);
return err;
}
@@ -93,7 +96,8 @@ static int tux3_link(struct dentry *old_dentry, struct inode *dir,
struct sb *sb = tux_sb(inode->i_sb);
int err;
- change_begin(sb);
+ if (change_begin(sb, 5, 0))
+ return -ENOSPC;
tux3_iattrdirty(inode);
inode->i_ctime = gettime();
inode_inc_link_count(inode);
@@ -134,7 +138,8 @@ static int __tux3_symlink(struct inode *dir, struct dentry *dentry,
if (len > PAGE_CACHE_SIZE)
return -ENAMETOOLONG;
- change_begin(sb);
+ if (change_begin(sb, 6, 0))
+ return -ENOSPC;
inode = tux_create_dirent_and_inode(dir, &dentry->d_name, iattr);
if (IS_ERR(inode)) {
err = PTR_ERR(inode);
@@ -181,7 +186,8 @@ static int tux3_unlink(struct inode *dir, struct dentry *dentry)
struct inode *inode = dentry->d_inode;
struct sb *sb = tux_sb(inode->i_sb);
- change_begin(sb);
+ if (change_begin(sb, 1, min_reserve_blocks * -.75))
+ return -ENOSPC;
int err = tux_del_dirent(dir, dentry);
if (!err) {
tux3_iattrdirty(inode);
@@ -201,7 +207,8 @@ static int tux3_rmdir(struct inode *dir, struct dentry *dentry)
int err = tux_dir_is_empty(inode);
if (!err) {
- change_begin(sb);
+ if (change_begin(sb, 3, min_reserve_blocks * -.75))
+ return -ENOSPC;
err = tux_del_dirent(dir, dentry);
if (!err) {
tux3_iattrdirty(inode);
@@ -237,7 +244,8 @@ static int tux3_rename(struct inode *old_dir, struct dentry *old_dentry,
/* FIXME: is this needed? */
assert(be64_to_cpu(old_entry->inum) == tux_inode(old_inode)->inum);
- change_begin(sb);
+ if (change_begin(sb, 20, 0))
+ return -ENOSPC;
delta = tux3_get_current_delta();
new_subdir = S_ISDIR(old_inode->i_mode) && new_dir != old_dir;
diff --git a/fs/tux3/super.c b/fs/tux3/super.c
index b104dc7..29b17e8 100644
--- a/fs/tux3/super.c
+++ b/fs/tux3/super.c
@@ -370,6 +370,7 @@ static int init_sb(struct sb *sb)
INIT_LIST_HEAD(&sb->orphan_add);
INIT_LIST_HEAD(&sb->orphan_del);
+ sb->defreed = 0;
stash_init(&sb->defree);
stash_init(&sb->deunify);
INIT_LIST_HEAD(&sb->unify_buffers);
@@ -421,6 +422,12 @@ static void __setup_sb(struct sb *sb, struct disksuper *super)
sb->blocksize = 1 << sb->blockbits;
sb->blockmask = (1 << sb->blockbits) - 1;
+#ifdef __KERNEL__
+ sb->blocks_per_page_bits = PAGE_CACHE_SHIFT - sb->blockbits;
+#else
+ sb->blocks_per_page_bits = 0;
+#endif
+ sb->blocks_per_page = 1 << sb->blocks_per_page_bits;
sb->groupbits = 13; // FIXME: put in disk super?
sb->volmask = roundup_pow_of_two64(sb->volblocks) - 1;
sb->entries_per_node = calc_entries_per_node(sb->blocksize);
@@ -656,12 +663,13 @@ static int tux3_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct sb *sbi = tux_sb(sb);
-
+ block_t reserve = sbi->reserve, avail = sbi->freeblocks + sbi->defreed;
+ avail -= avail < reserve ? avail : reserve;
buf->f_type = sb->s_magic;
buf->f_bsize = sbi->blocksize;
- buf->f_blocks = sbi->volblocks;
- buf->f_bfree = sbi->freeblocks;
- buf->f_bavail = sbi->freeblocks;
+ buf->f_blocks = sbi->volblocks - reserve;
+ buf->f_bfree = avail;
+ buf->f_bavail = avail; /* FIXME: no special privilege for root yet */
buf->f_files = MAX_INODES;
buf->f_ffree = sbi->freeinodes;
#if 0
@@ -773,7 +781,7 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
goto error;
}
}
- tux3_dbg("s_blocksize %lu", sb->s_blocksize);
+ tux3_dbg("s_blocksize %lu, sb = %p", sb->s_blocksize, tux_sb(sb));
rp = tux3_init_fs(sbi);
if (IS_ERR(rp)) {
@@ -794,6 +802,9 @@ static int tux3_fill_super(struct super_block *sb, void *data, int silent)
goto error;
}
+ sbi->oldfree = sbi->freeblocks;
+ set_budget(sbi);
+ atomic_set(&sbi->balance, atomic_read(&sbi->budget));
return 0;
error:
diff --git a/fs/tux3/tux3.h b/fs/tux3/tux3.h
index e2f2d9b..4eae938 100644
--- a/fs/tux3/tux3.h
+++ b/fs/tux3/tux3.h
@@ -277,6 +277,7 @@ struct delta_ref {
atomic_t refcount;
unsigned delta;
struct completion waitref_done;
+ atomic_t charged; /* block allocation upper bound for this delta */
};
/* Per-delta data structure for sb */
@@ -355,7 +356,8 @@ struct sb {
unsigned blocksize, blockbits, blockmask, groupbits;
u64 freeinodes; /* Number of free inode numbers. This is
* including the deferred allocated inodes */
- block_t volblocks, volmask, freeblocks, nextblock;
+ block_t volblocks, volmask, freeblocks, oldfree, reserve, nextblock;
+ unsigned blocks_per_page, blocks_per_page_bits;
inum_t nextinum; /* FIXME: temporary hack to avoid to find
* same area in itree for free inum. */
unsigned entries_per_node; /* must be per-btree type, get rid of this */
@@ -376,6 +378,7 @@ struct sb {
struct list_head orphan_add; /* defered orphan inode add list */
struct list_head orphan_del; /* defered orphan inode del list */
+ block_t defreed; /* total deferred free blocks */
struct stash defree; /* defer extent frees until after delta */
struct stash deunify; /* defer extent frees until after unify */
@@ -387,6 +390,7 @@ struct sb {
/*
* For frontend and backend
*/
+ atomic_t budget, balance;
spinlock_t countmap_lock;
struct countmap_pin countmap_pin;
struct tux3_idefer_map *idefer_map;
@@ -515,6 +519,12 @@ static inline struct block_device *sb_dev(struct sb *sb)
{
return sb->vfs_sb->s_bdev;
}
+#else
+static inline struct sb *tux_sb(struct super_block *sb)
+{
+ return container_of(sb, struct sb, vfs_sb);
+}
+
#endif /* __KERNEL__ */
/* Get delta from free running counter */
@@ -686,6 +696,15 @@ static inline int has_no_root(struct btree *btree)
return btree->root.depth == 0;
}
+/* Estimate backend allocation cost per data page */
+static inline unsigned one_page_cost(struct inode *inode)
+{
+ struct sb *sb = tux_sb(inode->i_sb);
+ struct btree *btree = &tux_inode(inode)->btree;
+ unsigned depth = has_root(btree) ? btree->root.depth : 0;
+ return sb->blocks_per_page + 2 * depth + 1;
+}
+
/* Redirect ptr which is pointing data of src from src to dst */
static inline void *ptr_redirect(void *ptr, void *src, void *dst)
{
@@ -832,6 +851,14 @@ int replay_bnode_del(struct replay *rp, block_t bnode, tuxkey_t key, unsigned co
int replay_bnode_adjust(struct replay *rp, block_t bnode, tuxkey_t from, tuxkey_t to);
/* commit.c */
+
+enum { min_reserve_blocks = 8, max_reserve_blocks = 128 };
+
+static inline int change_active(void)
+{
+ return !!current->journal_info;
+}
+
void tux3_start_backend(struct sb *sb);
void tux3_end_backend(void);
int tux3_under_backend(struct sb *sb);
@@ -844,10 +871,11 @@ void change_begin_atomic(struct sb *sb);
void change_end_atomic(struct sb *sb);
void change_begin_atomic_nested(struct sb *sb, void **ptr);
void change_end_atomic_nested(struct sb *sb, void *ptr);
-void change_begin(struct sb *sb);
+int change_begin_nospace(struct sb *sb, int cost, int limit);
+int change_nospace(struct sb *sb, int cost, int limit);
+int change_begin(struct sb *sb, int cost, int limit);
int change_end(struct sb *sb);
-void change_begin_if_needed(struct sb *sb, int need_sep);
-void change_end_if_needed(struct sb *sb);
+block_t set_budget(struct sb *sb);
/* dir.c */
void tux_set_entry(struct buffer_head *buffer, struct tux3_dirent *entry,
diff --git a/fs/tux3/user/filemap.c b/fs/tux3/user/filemap.c
index 8d8c812..6c99948 100644
--- a/fs/tux3/user/filemap.c
+++ b/fs/tux3/user/filemap.c
@@ -309,7 +309,7 @@ int tuxwrite(struct file *file, const void *data, unsigned len)
{
struct sb *sb = tux_sb(file->f_inode->i_sb);
int ret;
- change_begin(sb);
+ change_begin(sb, 2 * len, 0);
ret = tuxio(file, (void *)data, len, 1);
change_end(sb);
return ret;
diff --git a/fs/tux3/user/inode.c b/fs/tux3/user/inode.c
index 21823a1..70d5602 100644
--- a/fs/tux3/user/inode.c
+++ b/fs/tux3/user/inode.c
@@ -354,7 +354,7 @@ int tuxtruncate(struct inode *inode, loff_t size)
struct sb *sb = tux_sb(inode->i_sb);
int err;
- change_begin(sb);
+ change_begin(sb, 1, -10);
err = __tuxtruncate(inode, size);
change_end(sb);
diff --git a/fs/tux3/user/tux3user.h b/fs/tux3/user/tux3user.h
index 5b68e5c..a298a91 100644
--- a/fs/tux3/user/tux3user.h
+++ b/fs/tux3/user/tux3user.h
@@ -55,11 +55,6 @@ static inline map_t *mapping(struct inode *inode);
#include "../tux3.h"
-static inline struct sb *tux_sb(struct super_block *sb)
-{
- return container_of(sb, struct sb, vfs_sb);
-}
-
static inline struct super_block *vfs_sb(struct sb *sb)
{
return &sb->vfs_sb;
diff --git a/fs/tux3/xattr.c b/fs/tux3/xattr.c
index c4ea5f9..9aff2e7 100644
--- a/fs/tux3/xattr.c
+++ b/fs/tux3/xattr.c
@@ -724,7 +724,7 @@ int set_xattr(struct inode *inode, const char *name, unsigned len,
struct inode *atable = sb->atable;
mutex_lock(&atable->i_mutex);
- change_begin(sb);
+ change_begin(sb, 0, 0);
atom_t atom;
int err = make_atom(atable, name, len, &atom);
@@ -748,7 +748,7 @@ int del_xattr(struct inode *inode, const char *name, unsigned len)
int err;
mutex_lock(&atable->i_mutex);
- change_begin(sb);
+ change_begin(sb, 0, 0);
atom_t atom;
err = find_atom(atable, name, len, &atom);
On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
> how do you prevent it from continuing to interact with the old
> version of the page and never see updates or have its changes
> reflected on the current page?
Why would it do that, and what would be surprising about it? Did
you have a specific case in mind?
Regards,
Daniel
On 05/21/2015 03:53 PM, Daniel Phillips wrote:
> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>> how do you prevent it from continuing to interact with the old version
>> of the page and never see updates or have its changes reflected on
>> the current page?
>
> Why would it do that, and what would be surprising about it? Did
> you have a specific case in mind?
After a get_page(), page_cache_get(), or other equivalent
function, a piece of code has the expectation that it can
continue using that page until after it has released the
reference count.
This can be an arbitrarily long period of time.
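That is, the usual pattern is simply something like this (schematic;
do_something_slow() is just a placeholder):

	struct page *page = find_get_page(mapping, index);

	if (page) {
		/* Nothing here requires the caller to re-check whether this
		 * is still the page in the page cache; the reference is
		 * expected to keep it both valid and current. */
		do_something_slow(page);
		page_cache_release(page);
	}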
--
All rights reversed
On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>> how do you prevent it from continuing to interact with the old version
>>> of the page and never see updates or have its changes reflected on
>>> the current page?
>>
>> Why would it do that, and what would be surprising about it? Did
>> you have a specific case in mind?
>
> After a get_page(), page_cache_get(), or other equivalent
> function, a piece of code has the expectation that it can
> continue using that page until after it has released the
> reference count.
>
> This can be an arbitrarily long period of time.
It is perfectly welcome to keep using that page as long as it
wants, Tux3 does not care. When it lets go of the last reference
(and Tux3 has finished with it) then the page is freeable. Did
you have a more specific example where this would be an issue?
Are you talking about kernel or userspace code?
Regards,
Daniel
On Mon, 25 May 2015, Daniel Phillips wrote:
> On Monday, May 25, 2015 9:25:44 PM PDT, Rik van Riel wrote:
>> On 05/21/2015 03:53 PM, Daniel Phillips wrote:
>>> On Wednesday, May 20, 2015 8:51:46 PM PDT, David Lang wrote:
>>>> how do you prevent it from continuing to interact with the old version
>>>> of the page and never see updates or have its changes reflected on
>>>> the current page?
>>>
>>> Why would it do that, and what would be surprising about it? Did
>>> you have a specific case in mind?
>>
>> After a get_page(), page_cache_get(), or other equivalent
>> function, a piece of code has the expectation that it can
>> continue using that page until after it has released the
>> reference count.
>>
>> This can be an arbitrarily long period of time.
>
> It is perfectly welcome to keep using that page as long as it
> wants, Tux3 does not care. When it lets go of the last reference
> (and Tux3 has finished with it) then the page is freeable. Did
> you have a more specific example where this would be an issue?
> Are you talking about kernel or userspace code?
if the page gets modified again, will that cause any issues? what if the page
gets modified before the copy gets written out, so that there are two dirty
copies of the page in the process of being written?
David Lang
On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> if the page gets modified again, will that cause any issues?
> what if the page gets modified before the copy gets written out,
> so that there are two dirty copies of the page in the process of
> being written?
>
> David Lang
How is the page going to get modified again? A forked page isn't
mapped by a pte, so userspace can't modify it by mmap. The forked
page is not in the page cache, so userspace can't modify it by
posix file ops. So the writer would have to be in kernel. Tux3
knows what it is doing, so it won't modify the page. What kernel
code besides Tux3 will modify the page?
Regards,
Daniel
On Mon, 25 May 2015, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
>> if the page gets modified again, will that cause any issues? what if the
>> page gets modified before the copy gets written out, so that there are two
>> dirty copies of the page in the process of being written?
>>
>> David Lang
>
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?
I'm assuming that Rik is talking about whatever has the reference to the page
via one of the methods that he talked about.
David Lang
On Mon 25-05-15 23:11:11, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:04:39 PM PDT, David Lang wrote:
> >if the page gets modified again, will that cause any issues? what
> >if the page gets modified before the copy gets written out, so
> >that there are two dirty copies of the page in the process of
> >being written?
> >
> >David Lang
>
> How is the page going to get modified again? A forked page isn't
> mapped by a pte, so userspace can't modify it by mmap. The forked
> page is not in the page cache, so userspace can't modify it by
> posix file ops. So the writer would have to be in kernel. Tux3
> knows what it is doing, so it won't modify the page. What kernel
> code besides Tux3 will modify the page?
E.g. video drivers (or infiniband or direct IO for that matter) which
have buffers in user memory (may be mmapped file), grab references to pages
and hand out PFNs of those pages to the hardware to store data in them...
If you fork a page after the driver has handed PFNs to the hardware, you've
just lost all the writes hardware will do.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> E.g. video drivers (or infiniband or direct IO for that matter) which
> have buffers in user memory (may be mmapped file), grab references to pages
> and hand out PFNs of those pages to the hardware to store data in them...
> If you fork a page after the driver has handed PFNs to the hardware, you've
> just lost all the writes hardware will do.
Hi Jan,
The page forked because somebody wrote to it with write(2) or mmap write at
the same time as a video driver (or infiniband or direct IO) was doing io to
it. Isn't the application trying hard to lose data in that case? It would
not need page fork to lose data that way.
Regards,
Daniel
On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> I'm assuming that Rik is talking about whatever has the
> reference to the page via one of the methods that he talked
> about.
This would be a good moment to provide specifics.
Regards,
Daniel
On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> > E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
>
> Hi Jan,
>
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was
> doing io to
> it. Isn't the application trying hard to lose data in that case? It
> would not need page fork to lose data that way.
So I can think of two valid uses:
1) You setup IO to part of a page and modify from userspace a different
part of a page.
2) At least for video drivers there is one ioctl() which creates object
with buffers in memory and another ioctl() to actually ship it to hardware
(may be called repeatedly). So in theory app could validly dirty the pages
before it ships them to hardware. If this happens repeatedly and interacts
badly with background writeback, you will end up with a forked page in a
buffer and from that point on things are broken.
So my opinion is: Don't fork the page if page_count is elevated. You can
just wait for the IO if you need stable pages in that case. It's slow but
it's safe and it should be pretty rare. Is there any problem with that?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
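A minimal sketch of the fallback Jan suggests, with hypothetical helper names and deliberately simple refcount accounting (a real implementation would have to audit what the expected count is; buffer heads, for example, also hold a reference):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of "don't fork if page_count is elevated": if the page has more
 * references than the page cache and its ptes account for, assume some
 * other party (e.g. a get_user_pages() caller) may still write to it,
 * and fall back to waiting for the in-flight writeback instead of
 * forking.  Hypothetical helpers, not actual Tux3 code.
 */
static bool sketch_refcount_is_expected(struct page *page)
{
	/* one reference for the page cache, one per pte mapping it */
	int expected = 1 + page_mapcount(page);

	return page_count(page) <= expected;
}

static void sketch_stable_or_fork(struct page *page)
{
	if (sketch_refcount_is_expected(page)) {
		/* normal case: fork the page, the writer does not stall */
		/* sketch_fork_for_commit(page); */
	} else {
		/* rare case: wait for the I/O, slow but safe */
		wait_on_page_writeback(page);
	}
}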
> We identified the following quality metrics for this algorithm:
>
> 1) Never fails to detect out of space in the front end.
> 2) Always fills a volume to 100% before reporting out of space.
> 3) Allows rm, rmdir and truncate even when a volume is full.
Hmm. Can you also overwrite existing data in files when a volume is
full? I guess applications expect that to work...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Tue 2015-05-26 01:09:59, Daniel Phillips wrote:
> On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:
> >I'm assuming that Rik is talking about whatever has the reference to the
> >page via one of the methods that he talked about.
>
> This would be a good moment to provide specifics.
Hmm. This seems like a good moment for you to audit the whole kernel, to
make sure it does not do stuff you don't expect it to.
You are changing core semantics, stuff that was allowed before is not
allowed now, so it looks like you should do the auditing...
You may want to start with video4linux, as Jan pointed out.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On (05/26/15 01:08), Daniel Phillips wrote:
> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> > E.g. video drivers (or infiniband or direct IO for that matter) which
> >have buffers in user memory (may be mmapped file), grab references to pages
> >and hand out PFNs of those pages to the hardware to store data in them...
> >If you fork a page after the driver has handed PFNs to the hardware, you've
> >just lost all the writes hardware will do.
>
> Hi Jan,
>
> The page forked because somebody wrote to it with write(2) or mmap write at
> the same time as a video driver (or infiniband or direct IO) was doing io to
> it. Isn't the application trying hard to lose data in that case? It would
> not need page fork to lose data that way.
>
Hello,
is it possible to page-fork-bomb the system by some 'malicious' app?
-ss
On Tue 26-05-15 19:22:39, Sergey Senozhatsky wrote:
> On (05/26/15 01:08), Daniel Phillips wrote:
> > On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> > > E.g. video drivers (or infiniband or direct IO for that matter) which
> > >have buffers in user memory (may be mmapped file), grab references to pages
> > >and hand out PFNs of those pages to the hardware to store data in them...
> > >If you fork a page after the driver has handed PFNs to the hardware, you've
> > >just lost all the writes hardware will do.
> >
> > Hi Jan,
> >
> > The page forked because somebody wrote to it with write(2) or mmap write at
> > the same time as a video driver (or infiniband or direct IO) was doing io to
> > it. Isn't the application trying hard to lose data in that case? It would
> > not need page fork to lose data that way.
> >
>
> Hello,
>
> is it possible to page-fork-bomb the system by some 'malicious' app?
Well, you can have only two copies of each page - the one under writeout
and the one in page cache. Furthermore you are limited by dirty throttling
so I don't think this would allow any out-of-the-ordinary DOS vector...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Sergey,
On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
>
> Hello,
>
> is it possible to page-fork-bomb the system by some 'malicious' app?
Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.
This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.
Regards,
Daniel
On 05/26/2015 02:00 AM, Jan Kara wrote:
> On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
>> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
>>> E.g. video drivers (or infiniband or direct IO for that matter) which
>>> have buffers in user memory (may be mmapped file), grab references to pages
>>> and hand out PFNs of those pages to the hardware to store data in them...
>>> If you fork a page after the driver has handed PFNs to the hardware, you've
>>> just lost all the writes hardware will do.
>>
>> Hi Jan,
>>
>> The page forked because somebody wrote to it with write(2) or mmap write at
>> the same time as a video driver (or infiniband or direct IO) was
>> doing io to
>> it. Isn't the application trying hard to lose data in that case? It
>> would not need page fork to lose data that way.
>
> So I can think of two valid uses:
>
> 1) You setup IO to part of a page and modify from userspace a different
> part of a page.
Suppose the use case is reading textures from video memory into a mmapped
file, and at the same time, the application is allowed to update the
textures in the file via mmap or write(2). Fork happens at mkwrite time.
If the page is already dirty, we do not fork it. The video API must have
made the page writable and dirty, so I do not see an issue.
> 2) At least for video drivers there is one ioctl() which creates object
> with buffers in memory and another ioctl() to actually ship it to hardware
> (may be called repeatedly). So in theory app could validly dirty the pages
> before it ships them to hardware. If this happens repeatedly and interacts
> badly with background writeback, you will end up with a forked page in a
> buffer and from that point on things are broken.
Writeback does not fork pages. An app may dirty a page that is in the process
of being shipped to hardware (must be a distinct part of the page, or it is
a race) and the data being sent to hardware will not be disturbed. If there
is an issue here, I do not see it.
> So my opinion is: Don't fork the page if page_count is elevated. You can
> just wait for the IO if you need stable pages in that case. It's slow but
> it's safe and it should be pretty rare. Is there any problem with that?
That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.
With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.
Regards,
Daniel
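As a sketch of the mkwrite-time decision described above (hypothetical code, not the actual Tux3 ->page_mkwrite(); the 2015-era vm_operations signature is assumed): a page that is already dirty in the current delta is written in place, and only a clean page that may belong to a commit in progress would be forked.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical sketch of the decision described above.  Error handling
 * is omitted and the fork helper is only named, not implemented.
 */
static int sketch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);
	if (page->mapping != vma->vm_file->f_mapping) {
		/* raced with truncate/invalidate: tell the fault to retry */
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}
	if (!PageDirty(page)) {
		/*
		 * Clean page that may belong to a commit in progress:
		 * fork it so the old contents go to disk untouched while
		 * the writer gets a fresh copy.  (Hypothetical helper.)
		 */
		/* page = sketch_fork_and_replace(page); */
	}
	/* an already-dirty page is simply written to in place */
	set_page_dirty(page);
	return VM_FAULT_LOCKED;		/* page is returned locked */
}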
On 05/26/2015 04:22 PM, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:
>> So my opinion is: Don't fork the page if page_count is elevated. You can
>> just wait for the IO if you need stable pages in that case. It's slow but
>> it's safe and it should be pretty rare. Is there any problem with that?
>
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
>
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On
However, avoiding data corruption by erring on the side of safety is
a pretty basic requirement.
> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.
If you skip the page fork when there is an elevated page count, tux3
should be safe (at least from that aspect). Only do the COW when there
is no "strange" use of the page going on.
--
All rights reversed
On 05/26/2015 02:36 PM, Rik van Riel wrote:
> On 05/26/2015 04:22 PM, Daniel Phillips wrote:
>> On 05/26/2015 02:00 AM, Jan Kara wrote:
>>> So my opinion is: Don't fork the page if page_count is elevated. You can
>>> just wait for the IO if you need stable pages in that case. It's slow but
>>> it's safe and it should be pretty rare. Is there any problem with that?
>>
>> That would be our fallback if anybody discovers a specific case where page
>> fork breaks something, which so far has not been demonstrated.
>>
>> With a known fallback, it is hard to see why we should delay merging over
>> that. Perfection has never been a requirement for merging filesystems. On
>
> However, avoiding data corruption by erring on the side of safety is
> a pretty basic requirement.
Erring on the side of safety is still an error. As a community we have
never been fond of adding code or overhead to fix theoretical bugs. I
do not see why we should relax that principle now.
We can fix actual bugs, but theoretical bugs are only shapeless specters
passing in the night. We should not become frozen in fear of them.
>> the contrary, imperfection is a reason for merging, so that the many
>> eyeballs effect may prove its value.
>
> If you skip the page fork when there is an elevated page count, tux3
> should be safe (at least from that aspect). Only do the COW when there
> is no "strange" use of the page going on.
Then you break the I in ACID. There must be a compelling reason to do
that.
Regards,
Daniel
Jan Kara <[email protected]> writes:
Hi,
> So there are a few things to have in mind:
> 1) There is nothing like a "writeable" page. Page is always writeable (at
> least on x86 architecture). When a page is mapped into some virtual address
> space (or more of them), this *mapping* can be either writeable or read-only.
> mkwrite changes the mapping from read-only to writeable but kernel /
> hardware is free to write to the page regardless of the mapping.
>
> 2) When kernel / hardware writes to the page, it first modifies the page
> and then marks it dirty.
>
> So what can happen in this scenario is:
>
> 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
> page is dirtied, kernel notes a PFN of the page somewhere internally.
>
> 2) Writeback comes and starts writeback for the page.
>
> 3) Kernel ships the PFN to the hardware.
>
> 4) Userspace comes and wants to write to the page (different part than the
> HW is instructed to use). page_mkwrite is called, page is forked.
> Userspace writes to the forked page.
>
> 5) HW stores its data in the original page.
>
> Userspace never sees data from the HW! Data corrupted where without page
> forking everything would work just fine.
I'm not sure I'm understanding your pseudocode logic correctly though.
This logic doesn't seem to be a page-forking-specific issue. And
this pseudocode logic seems to be missing the locking and revalidation of
the page.
If you can show more details, it would be helpful to see more, and
discuss the issue of page forking, or we can think about how to handle
the corner cases.
Well, before that, why do we need more details?
For example, replace the page fork at (4) with "truncate", "punch
hole", or "invalidate page".
Those operations remove the old page from the radix tree, so the
userspace write creates a new page, and the HW still references the
old page. (I.e. the situation should be the same as with page forking, in my
understanding of this pseudocode logic.)
IOW, this pseudocode logic seems to be broken even without page forking if
there is no locking and revalidation. Usually, we prevent unpleasant I/O with
lock_page or PG_writeback, and an obsolete page is revalidated under
lock_page.
For page forking, we may also be able to prevent a similar situation with
locking, flags, and revalidation. But those details might differ
from the current code, because the page states are different.
> Another possible scenario:
>
> 1) Userspace app tells kernel to setup a HW buffer in a page.
>
> 2) Userspace app fills page with data -> page_mkwrite is called, page is
> dirtied.
>
> 3) Userspace app tells kernel to ship buffer to video HW.
>
> 4) Writeback comes and starts writeback for the page
>
> 5) Video HW is done with the page. Userspace app fills new set of data into
> the page -> page_mkwrite is called, page is forked.
>
> 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> old data from the original page.
>
> Again a data corruption issue where previously things were working fine.
This logic seems to be the same as above. Replace the page fork at (5).
With no revalidation of the page, (6) will use the old page.
Thanks.
--
OGAWA Hirofumi <[email protected]>
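For reference, here is a generic sketch of the lock-and-revalidate pattern mentioned above (not tied to any particular filesystem): after locking a page found in the page cache, callers recheck that it still belongs to the expected mapping and index before trusting it, since truncate, hole punch and invalidation clear page->mapping.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Generic sketch of lock-and-revalidate: look the page up, lock it, then
 * confirm it was not truncated or invalidated while we waited for the
 * lock.  A stale page is dropped and the lookup is retried.
 */
static struct page *sketch_get_valid_page(struct address_space *mapping,
					  pgoff_t index)
{
	struct page *page;

retry:
	page = find_get_page(mapping, index);
	if (!page)
		return NULL;		/* caller allocates/reads a new page */

	lock_page(page);
	if (page->mapping != mapping || page->index != index) {
		unlock_page(page);	/* raced with truncate etc. */
		put_page(page);
		goto retry;
	}
	return page;			/* locked and revalidated */
}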
On Mon 22-06-15 00:36:00, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
> > So there are a few things to have in mind:
> > 1) There is nothing like a "writeable" page. Page is always writeable (at
> > least on x86 architecture). When a page is mapped into some virtual address
> > space (or more of them), this *mapping* can be either writeable or read-only.
> > mkwrite changes the mapping from read-only to writeable but kernel /
> > hardware is free to write to the page regardless of the mapping.
> >
> > 2) When kernel / hardware writes to the page, it first modifies the page
> > and then marks it dirty.
> >
> > So what can happen in this scenario is:
> >
> > 1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
> > page is dirtied, kernel notes a PFN of the page somewhere internally.
> >
> > 2) Writeback comes and starts writeback for the page.
> >
> > 3) Kernel ships the PFN to the hardware.
> >
> > 4) Userspace comes and wants to write to the page (different part than the
> > HW is instructed to use). page_mkwrite is called, page is forked.
> > Userspace writes to the forked page.
> >
> > 5) HW stores its data in the original page.
> >
> > Userspace never sees data from the HW! Data corrupted where without page
> > forking everything would work just fine.
>
> I'm not sure I'm understanding your pseudocode logic correctly though.
> This logic doesn't seems to be a page forking specific issue. And
> this pseudocode logic seems to be missing the locking and revalidate of
> page.
>
> If you can show more details, it would be helpful to see more, and
> discuss the issue of page forking, or we can think about how to handle
> the corner cases.
>
> Well, before that, why need more details?
>
> For example, replace the page fork at (4) with "truncate", "punch
> hole", or "invalidate page".
>
> Those operations remove the old page from radix tree, so the
> userspace's write creates the new page, and HW still refererences the
> old page. (I.e. situation should be same with page forking, in my
> understand of this pseudocode logic.)
Yes, if userspace truncates the file, the situation we end up with is
basically the same. However for truncate to happen some malicious process
has to come and truncate the file - a failure scenario that is acceptable
for most use cases since it doesn't happen unless someone is actively
trying to screw you. With page forking it is enough for flusher thread
to start writeback for that page to trigger the problem - an event that is
basically bound to happen without any other userspace application
interfering.
> IOW, this pseudocode logic seems to be broken without page forking if
> no lock and revalidate. Usually, we prevent unpleasant I/O by
> lock_page or PG_writeback, and an obsolated page is revalidated under
> lock_page.
Well, good luck with converting all the get_user_pages() users in kernel to
use lock_page() or PG_writeback checks to avoid issues with page forking. I
don't think that's really feasible.
> For page forking, we may also be able to prevent similar situation by
> locking, flags, and revalidate. But those details might be different
> with current code, because page states are different.
Sorry, I don't understand what you mean in this paragraph. Can you
explain it a bit more?
> > Another possible scenario:
> >
> > 1) Userspace app tells kernel to setup a HW buffer in a page.
> >
> > 2) Userspace app fills page with data -> page_mkwrite is called, page is
> > dirtied.
> >
> > 3) Userspace app tells kernel to ship buffer to video HW.
> >
> > 4) Writeback comes and starts writeback for the page
> >
> > 5) Video HW is done with the page. Userspace app fills new set of data into
> > the page -> page_mkwrite is called, page is forked.
> >
> > 6) Userspace app tells kernel to ship buffer to video HW. But HW gets the
> > old data from the original page.
> >
> > Again a data corruption issue where previously things were working fine.
>
> This logic seems to be same as above. Replace the page fork at (5).
> With no revalidate of page, (6) will use the old page.
Yes, the same arguments as above apply...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
>> I'm not sure I'm understanding your pseudocode logic correctly though.
>> This logic doesn't seems to be a page forking specific issue. And
>> this pseudocode logic seems to be missing the locking and revalidate of
>> page.
>>
>> If you can show more details, it would be helpful to see more, and
>> discuss the issue of page forking, or we can think about how to handle
>> the corner cases.
>>
>> Well, before that, why need more details?
>>
>> For example, replace the page fork at (4) with "truncate", "punch
>> hole", or "invalidate page".
>>
>> Those operations remove the old page from radix tree, so the
>> userspace's write creates the new page, and HW still refererences the
>> old page. (I.e. situation should be same with page forking, in my
>> understand of this pseudocode logic.)
>
> Yes, if userspace truncates the file, the situation we end up with is
> basically the same. However for truncate to happen some malicious process
> has to come and truncate the file - a failure scenario that is acceptable
> for most use cases since it doesn't happen unless someone is actively
> trying to screw you. With page forking it is enough for flusher thread
> to start writeback for that page to trigger the problem - event that is
> basically bound to happen without any other userspace application
> interfering.
Where does that 'acceptable' conclusion come from? That pseudocode logic doesn't
say anything about usage at all. And even if we assume it is acceptable, as far as I
can see, for example /proc/sys/vm/drop_caches is enough to trigger it, or a
page on a non-existent block (sparse file, i.e. a missing disk space check in
your logic). And if there is really no lock/check at all, there would be other
races.
>> IOW, this pseudocode logic seems to be broken without page forking if
>> no lock and revalidate. Usually, we prevent unpleasant I/O by
>> lock_page or PG_writeback, and an obsolated page is revalidated under
>> lock_page.
>
> Well, good luck with converting all the get_user_pages() users in kernel to
> use lock_page() or PG_writeback checks to avoid issues with page forking. I
> don't think that's really feasible.
What does converting all the get_user_pages() users mean? Well, more or
less right; I also think there is an issue in/around get_user_pages() that
we have to tackle.
IMO, if there is code that actually follows that pseudocode logic, that code is the
breakage. And "it is acceptable, it is a limitation, give up on fixing it" is not,
I think, the right way to go. If there is really code broken
like your logic, I think we should fix it.
Could you point out which code uses your logic? Since that seems to be
so racy, I can't yet believe such racy code actually exists.
>> For page forking, we may also be able to prevent similar situation by
>> locking, flags, and revalidate. But those details might be different
>> with current code, because page states are different.
>
> Sorry, I don't understand what do you mean in this paragraph. Can you
> explain it a bit more?
This just means that a forked page (the old page) and a truncated page have
different sets of flags and state, so we may have to adjust the revalidation.
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Sun 05-07-15 21:54:45, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> I'm not sure I'm understanding your pseudocode logic correctly though.
> >> This logic doesn't seems to be a page forking specific issue. And
> >> this pseudocode logic seems to be missing the locking and revalidate of
> >> page.
> >>
> >> If you can show more details, it would be helpful to see more, and
> >> discuss the issue of page forking, or we can think about how to handle
> >> the corner cases.
> >>
> >> Well, before that, why need more details?
> >>
> >> For example, replace the page fork at (4) with "truncate", "punch
> >> hole", or "invalidate page".
> >>
> >> Those operations remove the old page from radix tree, so the
> >> userspace's write creates the new page, and HW still refererences the
> >> old page. (I.e. situation should be same with page forking, in my
> >> understand of this pseudocode logic.)
> >
> > Yes, if userspace truncates the file, the situation we end up with is
> > basically the same. However for truncate to happen some malicious process
> > has to come and truncate the file - a failure scenario that is acceptable
> > for most use cases since it doesn't happen unless someone is actively
> > trying to screw you. With page forking it is enough for flusher thread
> > to start writeback for that page to trigger the problem - event that is
> > basically bound to happen without any other userspace application
> > interfering.
>
> Acceptable conclusion is where came from? That pseudocode logic doesn't
> say about usage at all. And even if assume it is acceptable, as far as I
> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
> page on non-exists block (sparse file. i.e. missing disk space check in
> your logic). And if really no any lock/check, there would be another
> races.
So drop_caches won't cause any issues because it avoids mmaped pages.
Also page reclaim or page migration don't cause any issues because
they avoid pages with increased refcount (and increased refcount would stop
drop_caches from reclaiming the page as well if it was not for the mmaped
check before). Generally, elevated page refcount currently guarantees page
isn't migrated, reclaimed, or otherwise detached from the mapping (except
for truncate where the combination of mapping-index becomes invalid) and
your page forking would change that assumption - which IMHO has a big
potential for some breakage somewhere. And frankly I fail to see why you
and Daniel care so much about this corner case because from performance POV
it's IMHO a non-issue and you bother with page forking because of
performance, don't you?
> >> IOW, this pseudocode logic seems to be broken without page forking if
> >> no lock and revalidate. Usually, we prevent unpleasant I/O by
> >> lock_page or PG_writeback, and an obsolated page is revalidated under
> >> lock_page.
> >
> > Well, good luck with converting all the get_user_pages() users in kernel to
> > use lock_page() or PG_writeback checks to avoid issues with page forking. I
> > don't think that's really feasible.
>
> What does all get_user_pages() conversion mean? Well, maybe right more
> or less, I also think there is the issue in/around get_user_pages() that
> we have to tackle.
>
>
> IMO, if there is a code that pseudocode logic actually, it is the
> breakage. And "it is acceptable and limitation, and give up to fix", I
> don't think it is the right way to go. If there is really code broken
> like your logic, I think we should fix.
>
> Could you point which code is using your logic? Since that seems to be
> so racy, I can't believe yet there are that racy codes actually.
So you can have a look for example at
drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
of a video device buffer at virtual address specified by user. Now I don't
know whether there really is any userspace video program that sets up the
video buffer in mmaped file. I would agree with you that it would be a
strange thing to do but I've seen enough strange userspace code that I
would not be too surprised.
Another example of similar kind is at
drivers/infiniband/core/umem.c where we again set up buffer for infiniband
cards at users specified virtual address. And there are more drivers in
kernel like that.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
>> > Yes, if userspace truncates the file, the situation we end up with is
>> > basically the same. However for truncate to happen some malicious process
>> > has to come and truncate the file - a failure scenario that is acceptable
>> > for most use cases since it doesn't happen unless someone is actively
>> > trying to screw you. With page forking it is enough for flusher thread
>> > to start writeback for that page to trigger the problem - event that is
>> > basically bound to happen without any other userspace application
>> > interfering.
>>
>> Acceptable conclusion is where came from? That pseudocode logic doesn't
>> say about usage at all. And even if assume it is acceptable, as far as I
>> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
>> page on non-exists block (sparse file. i.e. missing disk space check in
>> your logic). And if really no any lock/check, there would be another
>> races.
>
> So drop_caches won't cause any issues because it avoids mmaped pages.
> Also page reclaim or page migration don't cause any issues because
> they avoid pages with increased refcount (and increased refcount would stop
> drop_caches from reclaiming the page as well if it was not for the mmaped
> check before). Generally, elevated page refcount currently guarantees page
> isn't migrated, reclaimed, or otherwise detached from the mapping (except
> for truncate where the combination of mapping-index becomes invalid) and
> your page forking would change that assumption - which IMHO has a big
> potential for some breakage somewhere.
Lifetime and visibility from userspace are different topics. The issue here
is visibility. Of course, those are related more or less, but a
refcount doesn't stop a page from being dropped from the radix tree at all.
Well, anyway, your claim seems to assume that the userspace app
works around the issues. And it sounds like it still doesn't work around the
ENOSPC issue (validation at page fault/GUP time) even assuming userspace
behaves perfectly. Calling that a kernel assumption is strange.
If your claim is that this strange logic is already widely used, and of course
we can't simply break it because of compatibility, I would be able to
agree. But your claim sounds like that logic is sane and well-designed
behavior. So I disagree.
> And frankly I fail to see why you and Daniel care so much about this
> corner case because from performance POV it's IMHO a non-issue and you
> bother with page forking because of performance, don't you?
Penalizing the corner-case path, instead of the normal path, is what we should
try first. Penalizing the normal path to allow the corner-case path is basically
insane.
Making the normal path faster and more reliable is what we are trying to do.
> So you can have a look for example at
> drivers/media/v4l2-core/videobuf2-dma-contig.c which implements setting up
> of a video device buffer at virtual address specified by user. Now I don't
> know whether there really is any userspace video program that sets up the
> video buffer in mmaped file. I would agree with you that it would be a
> strange thing to do but I've seen enough strange userspace code that I
> would not be too surprised.
>
> Another example of similar kind is at
> drivers/infiniband/core/umem.c where we again set up buffer for infiniband
> cards at users specified virtual address. And there are more drivers in
> kernel like that.
Unfortunately, I haven't looked at those yet. I guess they would be
helpful for seeing the details.
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
> Returning ENOSPC when you have free space you can't yet prove is safer than
> not returning it and risking a data loss when you get hit by a write/commit
> storm. :)
Remember when delayed allocation was scary and unproven, because proving
that ENOSPC will always be returned when needed is extremely difficult?
But the performance advantage was compelling, so we just worked at it
until it worked. There were times when it didn't work properly, but the
code was in the tree so it got fixed.
It's like that now with page forking - a new technique with compelling
advantages, and some challenges. In the past, we (the Linux community)
would rise to the challenge and err on the side of pushing optimizations
in early. That was our mojo, and that is how Linux became the dominant
operating system it is today. Do we, the Linux community, still have that
mojo?
Regards,
Daniel
On Fri, 31 Jul 2015, Daniel Phillips wrote:
> Subject: Re: [FYI] tux3: Core changes
>
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer than
>> not returning it and risking a data loss when you get hit by a write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?
We, the Linux community, have less tolerance for losing people's data and
preventing them from operating than we used to when it was all tinkerers'
personal data and secondary systems.
So rather than pushing optimizations out to everyone and seeing what breaks, we
now do more testing and checking for failures before pushing things out.
This means that when something new is introduced, we default to the safe,
slightly slower way initially (there will be enough other bugs to deal with in
any case), and then as we gain experience from the tinkerers enabling the
performance optimizations, we make those optimizations reliable and only then
push them out to all users.
If you define this as "loosing our mojo", then yes we have. But most people see
the pace of development as still being high, just with more testing and
polishing before it gets out to users.
David Lang
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> If you define this as "loosing our mojo", then yes we have.
A pity. There remains so much to do that simply will not get
done in the absence of mojo.
Regards,
Daniel
On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
> We, the Linux Community have less tolerance for losing people's data and preventing them from operating than we used to when it was all tinkerer's personal data and secondary systems.
>
> So rather than pushing optimizations out to everyone and seeing what breaks, we now do more testing and checking for failures before pushing things out.
By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.
Regards,
Daniel
On Fri, 31 Jul 2015, Daniel Phillips wrote:
> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
>> We, the Linux Community have less tolerance for losing people's data and
>> preventing them from operating than we used to when it was all tinkerer's
>> personal data and secondary systems.
>>
>> So rather than pushing optimizations out to everyone and seeing what
>> breaks, we now do more testing and checking for failures before pushing
>> things out.
>
> By the way, I am curious about whose data you think will get lost
> as a result of pushing out Tux3 with a possible theoretical bug
> in a wildly improbable scenario that has not actually been
> described with sufficient specificity to falsify, let alone
> demonstrated.
you weren't asking about any particular feature of Tux3, you were asking if we
were still willing to push out stuff that breaks for users and fix it later.
Especially for filesystems, which can lose the data of whoever is using them, the
answer seems to be a clear no.
there may be bugs in what's pushed out that we don't know about. But we don't
push out potential data corruption bugs that we do know about (or think we do)
so if you think this should be pushed out with this known corner case that's not
handled properly, you have to convince people that it's _so_ improbable that
they shouldn't care about it.
David Lang
On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:
> On Fri, 31 Jul 2015, Daniel Phillips wrote:
>
>> On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...
>
> you weren't asking about any particular feature of Tux, you
> were asking if we were still willing to push out stuff that
> breaks for users and fix it later.
I think you left a key word out of my ask: "theoretical".
> Especially for filesystems that can loose the data of whoever
> is using it, the answer seems to be a clear no.
>
> there may be bugs in what's pushed out that we don't know
> about. But we don't push out potential data corruption bugs that
> we do know about (or think we do)
>
> so if you think this should be pushed out with this known
> corner case that's not handled properly, you have to convince
> people that it's _so_ improbable that they shouldn't care about
> it.
There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in the case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.
I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.
OK, let's get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?
Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, let's go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.
Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.
Regards,
Daniel
On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> Note: Hirofumi's email is clear, logical and speaks to the
> question. This branch of the thread is largely pointless, though
> it essentially says the same thing in non-technical terms. Perhaps
> your next response should be to Hirofumi, and perhaps it should be
> technical.
Now, let me try to lead the way by being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA api
does not let you use memory mmaped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?
Regards,
Daniel
On Fri 31-07-15 17:16:45, Daniel Phillips wrote:
> On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:
> >Note: Hirofumi's email is clear, logical and speaks to the
> >question. This branch of the thread is largely pointless, though
> >it essentially says the same thing in non-technical terms. Perhaps
> >your next response should be to Hirofumi, and perhaps it should be
> >technical.
>
> Now, let me try to lead the way, but being specific. RDMA was raised
> as a potential failure case for Tux3 page forking. But the RDMA api
> does not let you use memory mmaped by Tux3 as a source or destination
> of IO. Instead, it sets up its own pages and hands them out to the
> RDMA app from a pool. So no issue. One down, right?
Can you please tell me how you arrived to that conclusion? As far as I'm
looking at the code in drivers/infiniband/ I don't see anything there
preventing userspace from passing in mmapped memory...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 31-07-15 13:44:44, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> > Yes, if userspace truncates the file, the situation we end up with is
> >> > basically the same. However for truncate to happen some malicious process
> >> > has to come and truncate the file - a failure scenario that is acceptable
> >> > for most use cases since it doesn't happen unless someone is actively
> >> > trying to screw you. With page forking it is enough for flusher thread
> >> > to start writeback for that page to trigger the problem - event that is
> >> > basically bound to happen without any other userspace application
> >> > interfering.
> >>
> >> Acceptable conclusion is where came from? That pseudocode logic doesn't
> >> say about usage at all. And even if assume it is acceptable, as far as I
> >> can see, for example /proc/sys/vm/drop_caches is enough to trigger, or a
> >> page on non-exists block (sparse file. i.e. missing disk space check in
> >> your logic). And if really no any lock/check, there would be another
> >> races.
> >
> > So drop_caches won't cause any issues because it avoids mmaped pages.
> > Also page reclaim or page migration don't cause any issues because
> > they avoid pages with increased refcount (and increased refcount would stop
> > drop_caches from reclaiming the page as well if it was not for the mmaped
> > check before). Generally, elevated page refcount currently guarantees page
> > isn't migrated, reclaimed, or otherwise detached from the mapping (except
> > for truncate where the combination of mapping-index becomes invalid) and
> > your page forking would change that assumption - which IMHO has a big
> > potential for some breakage somewhere.
>
> Lifetime and visibility from user are different topic. The issue here
> is visibility. Of course, those has relation more or less though,
> refcount doesn't stop to drop page from radix-tree at all.
Well, a refcount prevents dropping a page from the radix tree in some cases -
memory pressure and page migration, to name the most prominent ones. It doesn't
prevent a page from being dropped because of truncate, that is correct. In
general, the rule we currently obey is that the kernel doesn't detach a page
with an increased refcount from a radix tree unless there is a syscall asking
the kernel to do that.
> Well, anyway, your claim seems to be assuming the userspace app
> workarounds the issues. And it sounds like still not workarounds the
> ENOSPC issue (validate at page fault/GUP) even if assuming userspace
> behave as perfect. Calling it as kernel assumption is strange.
Realistically, I don't think userspace apps work around anything. They just
do what happens to work. Nobody deletes files while an application
works on them and expects the application to gracefully handle that. So everyone
is happy. I'm not sure which ENOSPC issue you are speaking about, BTW. Can
you please elaborate?
> If you claim, there is strange logic widely used already, and of course,
> we can't simply break it because of compatibility. I would be able to
> agree. But your claim sounds like that logic is sane and well designed
> behavior. So I disagree.
To me the rule: "Do not detach a page from a radix tree if it has an elevated
refcount unless explicitly requested by a syscall" looks like a sane one.
Yes.
> > And frankly I fail to see why you and Daniel care so much about this
> > corner case because from performance POV it's IMHO a non-issue and you
> > bother with page forking because of performance, don't you?
>
> Trying to penalize the corner case path, instead of normal path, should
> try at first. Penalizing normal path to allow corner case path is insane
> basically.
>
> Make normal path faster and more reliable is what we are trying.
Elevated refcount of a page is in my opinion a corner case path. That's why
I think that penalizing that case by waiting for IO instead of forking is
acceptable cost for the improved compatibility & maintainability of the
code.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> please ellaborate?
1. GUP simulates a page fault and prepares to modify the page
2. writeback clears the dirty bit and makes the PTE read-only
3. snapshot/reflink makes the block COW
4. the driver that called GUP modifies the page, and dirties the page without simulating a page fault
>> If you claim, there is strange logic widely used already, and of course,
>> we can't simply break it because of compatibility. I would be able to
>> agree. But your claim sounds like that logic is sane and well designed
>> behavior. So I disagree.
>
> To me the rule: "Do not detach a page from a radix tree if it has an elevated
> refcount unless explicitely requested by a syscall" looks like a sane one.
> Yes.
>
>> > And frankly I fail to see why you and Daniel care so much about this
>> > corner case because from performance POV it's IMHO a non-issue and you
>> > bother with page forking because of performance, don't you?
>>
>> Trying to penalize the corner case path, instead of normal path, should
>> try at first. Penalizing normal path to allow corner case path is insane
>> basically.
>>
>> Make normal path faster and more reliable is what we are trying.
>
> Elevated refcount of a page is in my opinion a corner case path. That's why
> I think that penalizing that case by waiting for IO instead of forking is
> acceptable cost for the improved compatibility & maintainability of the
> code.
What is "elevated refcount"? What is difference with normal refcount?
Are you saying "refcount >= specified threshold + waitq/wakeup" or
such? If so, it is not the path. It is the state. IOW, some group may
not hit much, but some group may hit much, on normal path.
So it sounds like yet another "stable page". I.e. unpredictable
performance. (BTW, by recall of "stable page", noticed "stable page"
would not provide stabled page data for that logic too.)
Well, assuming "elevated refcount == threshold + waitq/wakeup", so
IMO, it is not attractive. Rather the last option if there is no
others as design choice.
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
> > please ellaborate?
>
> 1. GUP simulate page fault, and prepare to modify
> 2. writeback clear dirty, and make PTE read-only
> 3. snapshot/reflink make block cow
I assume by point 3. you mean that snapshot / reflink happens now and thus
the page / block is marked as COW. Am I right?
> 4. driver called GUP modifies page, and dirty page without simulate page fault
OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
the page gets modified without triggering another page fault so COW for the
modified page isn't triggered. Modified page contents will be in both the
original and the reflinked file, won't it?
And I agree that the fact that a snapshotted file's original contents can
still get modified is a bug. One which is difficult to fix.
> >> If you claim, there is strange logic widely used already, and of course,
> >> we can't simply break it because of compatibility. I would be able to
> >> agree. But your claim sounds like that logic is sane and well designed
> >> behavior. So I disagree.
> >
> > To me the rule: "Do not detach a page from a radix tree if it has an elevated
> > refcount unless explicitely requested by a syscall" looks like a sane one.
> > Yes.
> >
> >> > And frankly I fail to see why you and Daniel care so much about this
> >> > corner case because from performance POV it's IMHO a non-issue and you
> >> > bother with page forking because of performance, don't you?
> >>
> >> Trying to penalize the corner case path, instead of normal path, should
> >> try at first. Penalizing normal path to allow corner case path is insane
> >> basically.
> >>
> >> Make normal path faster and more reliable is what we are trying.
> >
> > Elevated refcount of a page is in my opinion a corner case path. That's why
> > I think that penalizing that case by waiting for IO instead of forking is
> > acceptable cost for the improved compatibility & maintainability of the
> > code.
>
> What is "elevated refcount"? What is difference with normal refcount?
> Are you saying "refcount >= specified threshold + waitq/wakeup" or
> such? If so, it is not the path. It is the state. IOW, some group may
> not hit much, but some group may hit much, on normal path.
Yes, by "elevated refcount" I meant refcount > 2 (one for pagecache, one for
your code inspecting the page).
> So it sounds like yet another "stable page". I.e. unpredictable
> performance. (BTW, by recall of "stable page", noticed "stable page"
> would not provide stabled page data for that logic too.)
>
> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
> IMO, it is not attractive. Rather the last option if there is no
> others as design choice.
I agree the performance will be less predictable and that is not good. But
changing what is visible in the file when writeback races with GUP is a
worse problem to me.
Maybe GUP could mark the pages it got a ref for, so that we could trigger the slow
behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so
that pages pinned by get_user_pages() would be properly accounted and then
we could use PG_mlocked and elevated refcount as a more reliable indication
of pages that need special handling).
Honza
[1] http://thread.gmane.org/gmane.linux.kernel.mm/117679
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> On Sun 09-08-15 22:42:42, OGAWA Hirofumi wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > I'm not sure about which ENOSPC issue you are speaking BTW. Can you
>> > please ellaborate?
>>
>> 1. GUP simulate page fault, and prepare to modify
>> 2. writeback clear dirty, and make PTE read-only
>> 3. snapshot/reflink make block cow
>
> I assume by point 3. you mean that snapshot / reflink happens now and thus
> the page / block is marked as COW. Am I right?
Right.
>> 4. driver called GUP modifies page, and dirty page without simulate page fault
>
> OK, but this doesn't hit ENOSPC because as you correctly write in point 4.,
> the page gets modified without triggering another page fault so COW for the
> modified page isn't triggered. Modified page contents will be in both the
> original and the reflinked file, won't it?
And the above result can be ENOSPC too, depending on the implementation and race
conditions. Also, if the FS converts zeroed blocks to holes like hammerfs does,
ENOSPC simply happens. I.e. another process uses all the space, but then there is no
->page_mkwrite() callback to check for ENOSPC.
> And I agree that the fact that snapshotted file's original contents can
> still get modified is a bug. A one which is difficult to fix.
Yes, that is why I think this logic is an issue, even before page forking.
>> So it sounds like yet another "stable page". I.e. unpredictable
>> performance. (BTW, by recall of "stable page", noticed "stable page"
>> would not provide stabled page data for that logic too.)
>>
>> Well, assuming "elevated refcount == threshold + waitq/wakeup", so
>> IMO, it is not attractive. Rather the last option if there is no
>> others as design choice.
>
> I agree the performance will be less predictable and that is not good. But
> changing what is visible in the file when writeback races with GUP is a
> worse problem to me.
>
> Maybe if GUP marked pages it got ref for so that we could trigger the slow
> behavior only for them (Peter Zijlstra proposed in [1] an infrastructure so
> that pages pinned by get_user_pages() would be properly accounted and then
> we could use PG_mlocked and elevated refcount as a more reliable indication
> of pages that need special handling).
I haven't read Peter's patchset fully yet, but it looks good, and it is
maybe a similar strategy to what I currently have in mind. I'm also thinking of adding
a callback for the FS at the start and end of GUP's pin window. (Just as an
example, the callback could be used by the FS to stop writeback if the FS wants.)
Thanks.
--
OGAWA Hirofumi <[email protected]>
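Purely to illustrate the idea (no such hooks exist in the kernel; every name here is hypothetical), callbacks bracketing a GUP pin window might look something like this, letting the filesystem hold off writeback or page forking for the pinned pages:

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Hypothetical illustration only: these hooks do not exist.  The idea is
 * to notify the filesystem when get_user_pages() pins pages of one of
 * its mappings and when that pin window ends, so it can adjust its
 * behavior (e.g. defer writeback or forking) in between.
 */
struct gup_window_ops {
	int  (*pin_begin)(struct address_space *mapping,
			  struct page **pages, int nr);
	void (*pin_end)(struct address_space *mapping,
			struct page **pages, int nr);
};

/* sketch of how a GUP caller might bracket its pin window */
static int sketch_pin_with_hooks(struct address_space *mapping,
				 const struct gup_window_ops *ops,
				 struct page **pages, int nr)
{
	int err = ops->pin_begin(mapping, pages, nr);

	if (err)
		return err;

	/* ... hardware or driver uses the pinned pages here ... */

	ops->pin_end(mapping, pages, nr);
	return 0;
}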
On 07/31/2015 01:27 PM, Daniel Phillips wrote:
> On Friday, July 31, 2015 8:37:35 AM PDT, Raymond Jennings wrote:
>> Returning ENOSPC when you have free space you can't yet prove is safer
>> than
>> not returning it and risking a data loss when you get hit by a
>> write/commit
>> storm. :)
>
> Remember when delayed allocation was scary and unproven, because proving
> that ENOSPC will always be returned when needed is extremely difficult?
> But the performance advantage was compelling, so we just worked at it
> until it worked. There were times when it didn't work properly, but the
> code was in the tree so it got fixed.
>
> It's like that now with page forking - a new technique with compelling
> advantages, and some challenges. In the past, we (the Linux community)
> would rise to the challenge and err on the side of pushing optimizations
> in early. That was our mojo, and that is how Linux became the dominant
> operating system it is today. Do we, the Linux community, still have that
> mojo?
Do you have the mojo to come up with a proposal on how
to make things work, in a way that ensures data consistency
for Linux users?
Yes, we know page forking is not compatible with the way
Linux currently uses refcounts.
The question is, does anyone have an idea on how we could
fix that?
Not necessarily an implementation yet, just an idea might
be enough to move forward at this stage.
However, if nobody wants to work on even an idea, page
forking may simply not be a safe thing to do.