2004-11-15 21:19:55

by Miklos Szeredi

Subject: [PATCH] [Request for inclusion] Filesystem in Userspace

Andrew, Linus!

Please consider adding the FUSE filesystem to the mainline kernel.

FUSE exports the filesystem functionality to userspace. The
communication interface is designed to be simple, efficient, secure
and able to support most of the usual filesystem semantics.

Reasons why I think inclusion is a good idea:

- It is widely used. There are many filesystems available which use
FUSE, and there are probably even more in-house applications

- It's been around for 3 years, it's very stable and both the kernel
interface and the library API have matured

- It's non-intrusive, the patch doesn't touch other parts of the
kernel

Patches for 2.6.10-rc2 and 2.6.10-rc1-mm5 are available from:

http://fuse.sourceforge.net/kernel_patches/

More information can be found on the homepage at:

http://fuse.sourceforge.net/

Comments are welcome.

Thanks,
Miklos


2004-11-15 21:45:33

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Mon, Nov 15, 2004 at 10:15:03PM +0100, Miklos Szeredi wrote:
> Patches for 2.6.10-rc2 and 2.6.10-rc1-mm5 are available from:
>
> http://fuse.sourceforge.net/kernel_patches/

Instructions for submitting patches can be found in
Documentation/SubmittingPatches. Please follow those rules instead of
just pointing people at a web site.


thanks,

greg k-h

2004-11-15 22:36:43

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Mon, 15 Nov 2004, Miklos Szeredi wrote:
>
> Please consider adding the FUSE filesystem to the mainline kernel.

Quite frankly I think it's too messy.

I'd like FUSE a whole lot more if it _only_ did the general page cache
reading, but it seems to do a whole lot more, most of it broken.

In other words, I think it's fundamentally wrong to have a special
"fuse_file_read". If it isn't just "generic_file_read()" (possibly
together with a re-validation callback but even that is very debatable
indeed) there's something wrong with it imho.

The code looks like it was started before the page cache was all done, and
nobody ever cleaned it up to use the full VFS power - or for some suspect
reason decided that they wanted to support insane filesystems.

Together with removing the 2.4.x code and sending a real patch that has
the cleanups, and maybe I'd reconsider.

Linus

2004-11-16 09:09:09

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Linus,

I did send a pointer to the cleaned up patch, maybe this wasn't
explicit enough:

http://fuse.sourceforge.net/kernel_patches/fuse-2.1-2.6.10-rc2.patch

It's 90k uncompressed so I didn't want to include it inline, but I can
send it privately if you want.

> I'd like FUSE a whole lot more if it _only_ did the general page cache
> reading, but it seems to do a whole lot more, most of it broken.

The cruft is the 2.4 code, and it _is_ removed from the patch.

Most of the 2.6 code is the page cache reading and writing. It's
complicated because of

- clustered reads with readpages()

- async writes

Both are non-essential, but both improve performance.

The need to have a special fuse_file_read()/fuse_file_write() comes
from the fact that some filesystems want a 1 to 1 mapping between the
read/write syscalls and the read/write operations. This is a sort of
"direct IO" operating mode. Again this feature is non-essential, but
isn't the result of unmaintained code.

Miklos

2004-11-16 09:18:45

by Arjan van de Ven

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, 2004-11-16 at 10:08 +0100, Miklos Szeredi wrote:
> Linus,
>
> I did send a pointer to the cleaned up patch, maybe this wasn't
> explicit enough:
>
> http://fuse.sourceforge.net/kernel_patches/fuse-2.1-2.6.10-rc2.patch

+static void request_wait_answer(struct fuse_req *req)
+{
+	spin_unlock(&fuse_lock);
+	wait_event(req->waitq, req->finished);
+	spin_lock(&fuse_lock);
+}

+	spin_lock(&fuse_lock);
+	req->out.h.error = -ENOTCONN;
+	if (fc->file) {
+		req->in.h.unique = get_unique(fc);
+		list_add_tail(&req->list, &fc->pending);
+		wake_up(&fc->waitq);
+		request_wait_answer(req);
+		list_del(&req->list);
+	}
+	spin_unlock(&fuse_lock);

somehow I find dropping the lock and then doing a list_del() without any kind of verification very suspicious.
Either you need the lock or you don't. If you do, the code is wrong. If you don't... don't take the lock :)




2004-11-16 09:44:05

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> somehow I find dropping the lock and then doing a list_del() without
> any kind of verification very suspicious.

list_del() is done with the lock held. Look closely.

Miklos

2004-11-16 09:56:02

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> yes but how do you know the entry is still on the list and valid ?

Because, it's always kept on one of two lists: pending and processing.
The entry is valid because it's "owned" by the caller, it's
never freed inside request_send().

> you dropped the lock. A normal code pattern is that you then HAVE
> to revalidate the assumptions which you guard by that lock.

The lock guards the list not the list element which is being removed.

Miklos
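
The locking rule described in the message above (the lock guards the lists, while the request itself stays owned by the caller) can be sketched in userspace. This is only an illustration under stated assumptions: a pthread mutex stands in for the kernel spinlock, demo_conn, demo_req and the req_* helpers are hypothetical names that do not appear in the patch, and the sleep between submitting and completing is marked by a comment rather than implemented.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static inline void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* Unlink e from whichever list it is currently on. */
static inline void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	list_init(e);
}

/* Hypothetical stand-ins for fuse_conn and fuse_req. */
struct demo_conn {
	pthread_mutex_t lock;		/* guards both lists and the linkage */
	struct list_head pending;
	struct list_head processing;
};

struct demo_req {
	struct list_head list;		/* first member, so the addresses coincide */
	int finished;
};

void conn_init(struct demo_conn *c)
{
	pthread_mutex_init(&c->lock, NULL);
	list_init(&c->pending);
	list_init(&c->processing);
}

/* Caller: queue the request. The caller keeps ownership of r throughout. */
void req_submit(struct demo_conn *c, struct demo_req *r)
{
	pthread_mutex_lock(&c->lock);
	r->finished = 0;
	list_add_tail(&r->list, &c->pending);
	pthread_mutex_unlock(&c->lock);
	/* The kernel code would now sleep, with the lock dropped. */
}

/* Daemon: move the oldest pending request to processing, or return NULL. */
struct demo_req *req_fetch(struct demo_conn *c)
{
	struct demo_req *r = NULL;

	pthread_mutex_lock(&c->lock);
	if (!list_empty(&c->pending)) {
		void *first = c->pending.next;	/* list is the first member */

		r = first;
		list_del(&r->list);
		list_add_tail(&r->list, &c->processing);
	}
	pthread_mutex_unlock(&c->lock);
	return r;
}

/* Daemon: mark the request as answered. */
void req_reply(struct demo_conn *c, struct demo_req *r)
{
	pthread_mutex_lock(&c->lock);
	r->finished = 1;
	pthread_mutex_unlock(&c->lock);
}

/* Caller, after waking up: retake the lock and unlink the request. */
void req_complete(struct demo_conn *c, struct demo_req *r)
{
	pthread_mutex_lock(&c->lock);
	assert(r->finished);	/* the condition the caller slept on */
	list_del(&r->list);	/* correct on either list */
	pthread_mutex_unlock(&c->lock);
}
```

In this sketch list_del() never needs to know which list the request is on, which is why moving it from pending to processing while the caller sleeps does not invalidate the later removal, provided every list operation happens under the lock.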

2004-11-16 09:53:19

by Arjan van de Ven

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, 2004-11-16 at 10:40 +0100, Miklos Szeredi wrote:
> > somehow I find dropping the lock and then doing a list_del() without
> > any kind of verification very suspicious.
>
> list_del() is done with the lock held. Look closely.

yes but how do you know the entry is still on the list and valid ?
you dropped the lock. A normal code pattern is that you then HAVE
to revalidate the assumptions which you guard by that lock.
If there are no such assumptions... then you didn't need the lock.


2004-11-16 10:14:33

by Pekka Enberg

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi Miklos,

On Tue, 16 Nov 2004 10:08:54 +0100, Miklos Szeredi <[email protected]> wrote:
> I did send a pointer to the cleaned up patch, maybe this wasn't
> explicit enough:
>
> http://fuse.sourceforge.net/kernel_patches/fuse-2.1-2.6.10-rc2.patch

Few comments:

- Breaks if CONFIG_PROC_FS is not enabled.
- Explicit casts are not needed when converting void pointers
(found in various places).

Pekka

2004-11-16 10:20:33

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> - Breaks if CONFIG_PROC_FS is not enabled.

Yes. Would a device node be better? Perhaps. This way there's no
need to allocate a major/minor for a device.

> - Explicit casts are not needed when converting void pointers
> (found in various places).

But they don't hurt either. At least I can be sure to assign the
right kind of pointer.

Thanks,
Miklos

2004-11-16 10:22:26

by David Woodhouse

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, 2004-11-16 at 10:52 +0100, Miklos Szeredi wrote:
> > yes but how do you know the entry is still on the list and valid ?
>
> Because, it's always kept on one of two lists: pending and processing.
> The entry is valid because it's "owned" by the caller, it's
> never freed inside request_send().
>
> > you dropped the lock. A normal code pattern is that you then HAVE
> > to revalidate the assumptions which you guard by that lock.
>
> The lock guards the list not the list element which is being removed.

Locking rules like that need to be clearly documented.

--
dwmw2

2004-11-16 10:25:57

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> >
> > The lock guards the list not the list element which is being removed.
>
> Locking rules like that need to be clearly documented.

OK, I'll add some comments.

Thanks,
Miklos

2004-11-16 10:39:35

by Pekka Enberg

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi,

On Tue, 16 Nov 2004 11:20:22 +0100, Miklos Szeredi <[email protected]> wrote:
> > - Breaks if CONFIG_PROC_FS is not enabled.
>
> Yes. Would a device node be better? Perhaps. This way there's no
> need to allocate a major/minor for a device.

...or fix your Kconfig to select procfs. :)

On Tue, 16 Nov 2004 11:20:22 +0100, Miklos Szeredi <[email protected]> wrote:
> > - Explicit casts are not needed when converting void pointers
> > (found in various places).
>
> But they don't hurt either. At least I can be sure to assign the
> right kind of pointer.

Hmm? The conversion is guaranteed by the standard which makes them
redundant. And redundancy does hurt maintainability. There have been
patches to get rid of the existing casts so please don't introduce new
ones.

Pekka

2004-11-16 10:45:30

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> >
> > Yes. Would a device node be better? Perhaps. This way there's no
> > need to allocate a major/minor for a device.
>
> ...or fix your Kconfig to select procfs. :)

Yep, that's clearly the winning solution :)

> Hmm? The conversion is guaranteed by the standard which makes them
> redundant. And redundancy does hurt maintainability. There have been
> patches to get rid of the existing casts so please don't introduce new
> ones.

OK, I don't really care either way.

Miklos

2004-11-16 10:59:50

by Simon Braunschmidt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



Pekka Enberg wrote:
> Hi,
>
> On Tue, 16 Nov 2004 11:20:22 +0100, Miklos Szeredi <[email protected]> wrote:
>
>>> - Breaks if CONFIG_PROC_FS is not enabled.
>>
>>Yes. Would a device node be better? Perhaps. This way there's no
>>need to allocate a major/minor for a device.
>
>
> ...or fix your Kconfig to select procfs. :)
>
> On Tue, 16 Nov 2004 11:20:22 +0100, Miklos Szeredi <[email protected]> wrote:
>
>>> - Explicit casts are not needed when converting void pointers
>>>(found in various places).
>>
>>But they don't hurt either. At least I can be sure to assign the
>>right kind of pointer.
>
>
> Hmm? The conversion is guaranteed by the standard which makes them
> redundant.

> And redundancy does hurt maintainability.

Naturally, it would be the other way around.
Sure you can write all your code in binary, or even better compressed,
but I wouldn't maintain those beasts ;-)

> There have been
> patches to get rid of the existing casts so please don't introduce new
> ones.
>
> Pekka

I vote for explicit casts, makes code more readable.

*duck*
Simon

2004-11-16 11:03:22

by Jan Kratochvil

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi,

On Tue, Nov 16, 2004 at 11:20:22AM +0100, Miklos Szeredi wrote:
> > - Breaks if CONFIG_PROC_FS is not enabled.
>
> Yes. Would a device node be better? Perhaps. This way there's no
> need to allocate a major/minor for a device.

"fuse/version" you have in /proc while it belongs to /proc
"fuse/dev" you have in /proc while it belongs to /dev

Also I am not sure human-readable "fuse/version" is required there at all.
Regular FUSE request enlisted in 'enum fuse_opcode' would be enough.


Regards,
Lace

2004-11-16 11:20:57

by Pekka Enberg

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi,

On Tue, 16 Nov 2004 12:01:06 +0100, Simon Braunschmidt
<[email protected]> wrote:
> And redundancy does hurt maintainability.
>
> Naturally, it would be the other way around.
> Sure you can write all your code in binary, or even better compressed,
> but I wouldn't maintain those beasts ;-)

No, that is obfuscation and has nothing to do with this. The cast I
mentioned is _redundant_ because the common case is:

struct foo * f = (struct foo *) priv; /* priv is void pointer */

And the cast gives you absolutely zero benefit in terms of
readability. For arithmetic types, you use casts to be explicit about
different conversions, but for void pointers there's only one
conversion which makes sense and that's what the standard guarantees.

On Tue, 16 Nov 2004 12:01:06 +0100, Simon Braunschmidt
<[email protected]> wrote:
> I vote for explicit casts, makes code more readable.

I vote for the established kernel coding style.

Pekka
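
The point about void pointers made above can be shown with a small sketch. The names struct foo, foo_value and foo_to_priv are illustrative only and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
	int value;
};

/* Callback in the usual "void *priv" style: C needs no cast here. */
int foo_value(void *priv)
{
	/*
	 * Implicit void * to struct foo * conversion. If priv were later
	 * changed to some other pointer type, this line would draw a
	 * compiler diagnostic; an explicit cast would silence it.
	 */
	struct foo *f = priv;

	assert(priv != NULL);
	return f->value;
}

/* The conversion is implicit in the other direction as well. */
void *foo_to_priv(struct foo *f)
{
	return f;
}
```

A caller would pass foo_to_priv(&some_foo) as the private pointer and get the same object back inside foo_value() without any cast on either side.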

2004-11-16 12:19:58

by Pekka Enberg

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi Miklos,

I have a couple of more comments.

> --- /dev/null Wed Dec 31 16:00:00 196900
> +++ b/fs/fuse/dev.c 2004-11-15 20:20:16 +01:00
> @@ -0,0 +1,606 @@
> +static int get_unique(struct fuse_conn *fc)
> +{
> +	do fc->reqctr++;
> +	while (!fc->reqctr);
> +	return fc->reqctr;
> +}
> +

What are you doing here? Why do you need to avoid zero? Anyway, if you
really need to do that, this would be more readable IMHO:

static int get_unique(struct fuse_conn *fc)
{
	if (++fc->reqctr == 0)
		fc->reqctr = 1;
	return fc->reqctr;
}

An added bonus of producing better code for architectures that support
conditional move...
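
The two forms can be compared side by side in a standalone sketch. The type of reqctr is not visible in the quoted fragment, so an unsigned counter is assumed here to keep the wraparound well defined; struct ctr_conn and both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for struct fuse_conn; unsigned counter assumed. */
struct ctr_conn {
	unsigned int reqctr;
};

/* Loop form from the patch: keep incrementing until the value is non-zero. */
unsigned int get_unique_loop(struct ctr_conn *fc)
{
	do
		fc->reqctr++;
	while (!fc->reqctr);
	assert(fc->reqctr != 0);
	return fc->reqctr;
}

/* Suggested form: skip zero explicitly when the counter wraps. */
unsigned int get_unique_branch(struct ctr_conn *fc)
{
	if (++fc->reqctr == 0)
		fc->reqctr = 1;
	assert(fc->reqctr != 0);
	return fc->reqctr;
}
```

Under that assumption both functions return the same sequence, including at the wrap from UINT_MAX, where each yields 1.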

> +static struct proc_dir_entry *proc_fs_fuse;
> +struct proc_dir_entry *proc_fuse_dev;
>
> +int fuse_dev_init(void)
> +{
> +	proc_fs_fuse = NULL;
> +	proc_fuse_dev = NULL;

Pointers with static storage class are initialized to NULL by default
so these are redundant.
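
A short sketch of the rule referred to above: in C, objects with static storage duration start out zero-initialized, whether or not they are declared with the static keyword. struct entry is an opaque, hypothetical stand-in for struct proc_dir_entry.

```c
#include <assert.h>
#include <stddef.h>

struct entry;	/* opaque, hypothetical stand-in for struct proc_dir_entry */

static struct entry *file_scope_ptr;	/* internal linkage, static storage */
struct entry *global_ptr;		/* external linkage, static storage */

/* Verifies that all three pointers start out as NULL without assignment. */
void check_static_defaults(void)
{
	static struct entry *local_static;	/* block scope, same rule */

	assert(file_scope_ptr == NULL);
	assert(global_ptr == NULL);
	assert(local_static == NULL);
}
```

Note that proc_fuse_dev in the quoted fragment is not declared static, but as a file-scope object it falls under the same rule.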

> --- /dev/null Wed Dec 31 16:00:00 196900
> +++ b/fs/fuse/inode.c 2004-11-15 20:20:16 +01:00
> @@ -0,0 +1,523 @@

[snip]

> +enum { opt_fd,
> + opt_rootmode,
> + opt_uid,
> + opt_default_permissions,
> + opt_allow_other,
> + opt_allow_root,
> + opt_kernel_cache,
> + opt_large_read,
> + opt_direct_io,
> + opt_max_read,
> + opt_err };
> +

Enums in upper case, please.

Pekka
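
For illustration, the same enum in upper case together with a tiny userspace lookup. The option strings are assumptions derived from the enumerator names, since the patch's own token table is not quoted in this thread, and opt_lookup is a hypothetical helper rather than the kernel's option parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
	OPT_FD,
	OPT_ROOTMODE,
	OPT_UID,
	OPT_DEFAULT_PERMISSIONS,
	OPT_ALLOW_OTHER,
	OPT_ALLOW_ROOT,
	OPT_KERNEL_CACHE,
	OPT_LARGE_READ,
	OPT_DIRECT_IO,
	OPT_MAX_READ,
	OPT_ERR
};

/* Assumed option names, one per token, indexed by the enum value. */
static const char *const opt_names[OPT_ERR] = {
	[OPT_FD] = "fd",
	[OPT_ROOTMODE] = "rootmode",
	[OPT_UID] = "uid",
	[OPT_DEFAULT_PERMISSIONS] = "default_permissions",
	[OPT_ALLOW_OTHER] = "allow_other",
	[OPT_ALLOW_ROOT] = "allow_root",
	[OPT_KERNEL_CACHE] = "kernel_cache",
	[OPT_LARGE_READ] = "large_read",
	[OPT_DIRECT_IO] = "direct_io",
	[OPT_MAX_READ] = "max_read",
};

/* Map "name" or "name=value" to its token; OPT_ERR if the name is unknown. */
int opt_lookup(const char *opt)
{
	size_t keylen;
	int i;

	assert(opt != NULL);
	keylen = strcspn(opt, "=");	/* length of the part before '=' */
	for (i = 0; i < OPT_ERR; i++) {
		if (strlen(opt_names[i]) == keylen &&
		    strncmp(opt, opt_names[i], keylen) == 0)
			return i;
	}
	return OPT_ERR;
}
```

Keeping OPT_ERR as the last enumerator lets it double as the table size and as the "no match" result.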

2004-11-16 12:31:19

by Jan Engelhardt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>> The have been
>> patches to get rid of the existing casts so please don't introduce new
>> ones.
>
>I vote for explicit casts, makes code more readable.

And makes it more error prone. Once upon a time, a user wrote:

ptr = (int *)malloc(...)

And justified the use of the cast because gcc generated a warning, and I
replied that if he'd included <stdlib.h> (yeah, user space), the warning would
be gone, even without a cast. Sigh.



Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-16 14:01:28

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> "fuse/version" you have in /proc while it belongs to /proc
> "fuse/dev" you have in /proc while it belongs to /dev

Well, 'Documentation/devices.txt' says:

THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
TREE. At Linus' request, no more allocations will be made official
for Linus' kernel tree; the 3 June 2001 version of this list is the
official final version of this registry.

So placing it in /proc doesn't seem to me such a bad idea.

> Also I am not sure human-readable "fuse/version" is required there at all.
> Regular FUSE request enlisted in 'enum fuse_opcode' would be enough.

This would break the assumption that no requests can be received until
the filesystem is mounted.

Miklos

2004-11-16 15:19:33

by Ralph Corderoy

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


Hi,

Pekka Enberg wrote:
> No, that is obfuscation and has nothing to do with this. The cast I
> mentioned is _redundant_ because the common case is:
>
> struct foo * f = (struct foo *) priv; /* priv is void pointer */

These casts are also a problem when priv changes type; the compiler is
being told to not complain.

Cheers,


Ralph.

2004-11-16 16:41:26

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, Nov 16, 2004 at 03:01:10PM +0100, Miklos Szeredi wrote:
> > "fuse/version" you have in /proc while it belongs to /proc
> > "fuse/dev" you have in /proc while it belongs to /dev
>
> Well, 'Documentation/devices.txt' says:
>
> THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
> TREE. At Linus' request, no more allocations will be made official
> for Linus' kernel tree; the 3 June 2001 version of this list is the
> official final version of this registry.

Not true, you can get new numbers.

Don't put things that should be in /dev into /proc, not allowed.

> So placing it in /proc doesn't seem to me such a bad idea.

No. Actually, put it in sysfs, and then udev will create your /dev node
for you automatically. And in sysfs you can put your other stuff
(version, etc.) which is the proper place for it.

thanks,

greg k-h

2004-11-16 16:49:11

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> > Well, 'Documentation/devices.txt' says:
> >
> > THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
> > TREE. At Linus' request, no more allocations will be made official
> > for Linus' kernel tree; the 3 June 2001 version of this list is the
> > official final version of this registry.
>
> Not true, you can get new numbers.

I don't understand, what's the reason for this warning then? To scare
away developers wanting to allocate lots of device numbers?

> Don't put things that should be in /dev into /proc, not allowed.
>
> > So placing it in /proc doesn't seem to me such a bad idea.
>
> No. Actually, put it in sysfs, and then udev will create your /dev node
> for you automatically. And in sysfs you can put your other stuff
> (version, etc.) which is the proper place for it.

I'll do that.

Thanks,
Miklos

2004-11-16 17:04:21

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, Nov 16, 2004 at 05:45:25PM +0100, Miklos Szeredi wrote:
> > > Well, 'Documentation/devices.txt' says:
> > >
> > > THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
> > > TREE. At Linus' request, no more allocations will be made official
> > > for Linus' kernel tree; the 3 June 2001 version of this list is the
> > > official final version of this registry.
> >
> > Not true, you can get new numbers.
>
> I don't understand, what's the reason for this warning then? To scare
> away developers wanting to allocate lots of device numbers?

It's an old message, and yes, it's there to scare people away. Glad to
see it's working :)

thanks,

greg k-h

2004-11-16 17:51:39

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> It's an old message, and yes, it's there to scare people away. Glad to
> see it's working :)

So if I only need a single device number should I register a "misc"
device? misc_register() seems to create the relevant sysfs entry.

Thanks,
Miklos

2004-11-16 17:59:34

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, Nov 16, 2004 at 06:50:52PM +0100, Miklos Szeredi wrote:
>
> > It's an old message, and yes, it's there to scare people away. Glad to
> > see it's working :)
>
> So if I only need a single device number should I register a "misc"
> device? misc_register() seems to create the relevant sysfs entry.

Yes, that is a good way to get a device, without having to reserve a
number.

thanks,

greg k-h

2004-11-16 19:09:45

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> > So if I only need a single device number should I register a "misc"
> > device? misc_register() seems to create the relevant sysfs entry.
>
> Yes, that is a good way to get a device, without having to reserve a
> number.

No, I think reserving a number is still necessary: there seems to be
only a very small space for dynamically registered misc devices (max
15), so that's not any better than reserving a static one.

I just don't yet see the big picture WRT device number allocation.

So what I'm interested in, is if I get a reserved minor number for the
misc (major=10) device, will I be kicked in the butt (by Linus or
anybody else) like for the /proc approach?

Thanks
Miklos

2004-11-16 19:20:45

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, Nov 16, 2004 at 08:09:10PM +0100, Miklos Szeredi wrote:
>
> > > So if I only need a single device number should I register a "misc"
> > > device? misc_register() seems to create the relevant sysfs entry.
> >
> > Yes, that is a good way to get a device, without having to reserve a
> > number.
>
> No, I think reserving a number is still necessary: there seems to be
> only a very small space for dynamically registered misc devices (max
> 15), so that's not any better than reserving a static one.

So reserve a minor number of the misc class. Ask LANANA for a number
and you should be fine.

> So what I'm interested in, is if I get a reserved minor number for the
> misc (major=10) device, will I be kicked in the butt (by Linus or
> anybody else) like for the /proc approach?

Depends on what your /dev node is trying to do. What is it doing
anyway? Any ioctls? Any weird, non-chardev like things?

Again, inline code would have been nice to see so those of us who live
in our email clients could have reviewed it...

thanks,

greg k-h

2004-11-16 19:27:48

by Jan Engelhardt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>No, I think reserving a number is still necessary: there seems to be
>only a very small space for dynamically registered misc devices (max
>15), so that's not any better than reserving a static one.

15? But include/linux/miscdevice.h lists more than 20 static numbers for
possibly-going-to-be-loaded-modules!


Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-16 19:34:57

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> 15? But include/linux/miscdevice.h lists more than 20 static numbers for
> possibly-going-to-be-loaded-modules!

Yes, minors 0-139 are static ones, and 140-254 are dynamic ones.
Those 20 are all in the static range.

Miklos

2004-11-16 19:34:58

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Depends on what your /dev node is trying to do. What is it doing
> anyway?

Filesystem requests are passed to userspace and the reply is sent back
to the kernel. So it's definitely a character device or socket like
thing.

> Any ioctls? Any weird, non-chardev like things?

Nothing extraordinary. Messages are sent/received with plain read and
write.

> Again, inline code would have been nice to see so those of us who live
> in our email clients could have reviewed it...

Next time I'll try to split it up in manageable parts, and send it
inline.

Thanks,
Miklos
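
The read/write exchange described in the message above can be sketched in userspace. Everything here is an assumption for illustration: the header layouts, the demo_* names and the DEMO_OP_PING opcode are invented, the real FUSE message format is not shown in this thread, and a socketpair stands in for the device node.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire headers, not the real FUSE layout. */
struct demo_in_header {
	uint32_t unique;	/* request id chosen by the sending side */
	uint32_t opcode;	/* which operation is requested */
};

struct demo_out_header {
	uint32_t unique;	/* echoes the request id */
	int32_t error;		/* 0 or a negative errno value */
};

enum { DEMO_OP_PING = 1 };

/* Daemon side: read() one request from fd and write() the reply. */
int demo_serve_one(int fd)
{
	struct demo_in_header in;
	struct demo_out_header out;

	if (read(fd, &in, sizeof(in)) != (ssize_t)sizeof(in))
		return -1;

	out.unique = in.unique;
	out.error = (in.opcode == DEMO_OP_PING) ? 0 : -ENOSYS;

	if (write(fd, &out, sizeof(out)) != (ssize_t)sizeof(out))
		return -1;
	return 0;
}

/* Stand-in for the kernel side: send one request and collect the reply. */
int demo_roundtrip(uint32_t unique, uint32_t opcode,
		   struct demo_out_header *out)
{
	struct demo_in_header in = { unique, opcode };
	int sv[2];
	int ret = -1;

	assert(out != NULL);
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* Datagram sockets keep message boundaries, one message per call. */
	if (write(sv[0], &in, sizeof(in)) == (ssize_t)sizeof(in) &&
	    demo_serve_one(sv[1]) == 0 &&
	    read(sv[0], out, sizeof(*out)) == (ssize_t)sizeof(*out))
		ret = 0;

	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Matching replies to requests through the echoed id is what lets a single descriptor carry several outstanding requests in a design like this; no ioctl is involved anywhere in the exchange.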

2004-11-16 19:43:17

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, Nov 16, 2004 at 08:30:21PM +0100, Miklos Szeredi wrote:
> > Any ioctls? Any weird, non-chardev like things?
>
> Nothing extraordinary. Messages are sent/received with plain read and
> write.

Ok, that sounds acceptable.

> > Again, inline code would have been nice to see so those of us who live
> > in our email clients could have reviewed it...
>
> Next time I'll try to split it up in manageable parts, and send it
> inline.

I, and others, will appreciate that.

thanks,

greg k-h

2004-11-16 19:56:23

by Jan Engelhardt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>If you didn't mistype above this means there is space for 115 and NOT just
>15 dynamic devices and that ought to be plenty for you.
no, but misunderstood. it's clear now.
>
>btw. On a different subject, does fuse allow several user space
>filesystems at the same time or only one?

20:48 io:../fuse-1.9/util # df -Ta
Filesystem    Type 1K-blocks     Used Available Use% Mounted on
[...]
lt-fusexmp    fuse  34337852 16626616  17711236  49% /mnt/fuji
lt-fusexmp    fuse  34337852 16626616  17711236  49% /mnt/mmc
lt-hello      fuse         0        0         0    - /mnt/smc

Yes, seems so.



Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-16 19:50:37

by Anton Altaparmakov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Tue, 16 Nov 2004, Miklos Szeredi wrote:
> > 15? But include/linux/miscdevice.h lists more than 20 static numbers for
> > possibly-going-to-be-loaded-modules!
>
> Yes, minors 0-139 are static ones, and 140-254 are dynamic ones.
> Those 20 are all in the static range.

If you didn't mistype above this means there is space for 115 and NOT just
15 dynamic devices and that ought to be plenty for you.

btw. On a different subject, does fuse allow several user space
filesystems at the same time or only one?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2004-11-16 20:17:17

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> If you didn't mistype above this means there is space for 115 and NOT just
> 15 dynamic devices and that ought to be plenty for you.

But unfortunately I did mistype, and it's really only 15.

> btw. On a different subject, does fuse allow several user space
> filesystems at the same time or only one?

There's no limit on the number of mounted FUSE filesystems. And only
one device is needed, since the mount is bound to the opened file
descriptor.

Miklos

2004-11-17 15:42:50

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> No. Actually, put it in sysfs, and then udev will create your /dev node
> for you automatically. And in sysfs you can put your other stuff
> (version, etc.) which is the proper place for it.

Next question: _where_ to put other stuff? In /proc this has a
logical place for filesystems: /proc/fs/fsname/other_stuff. But
there's no filesystem section in sysfs.

So?

Thanks,
Miklos

2004-11-17 16:59:53

by Nikita Danilov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi writes:
> > No. Actually, put it in sysfs, and then udev will create your /dev node
> > for you automatically. And in sysfs you can put your other stuff
> > (version, etc.) which is the proper place for it.
>
> Next question: _where_ to put other stuff? In /proc this has a
> logical place for filesystems: /proc/fs/fsname/other_stuff. But
> there's no filesystem section in sysfs.

/sys/fs used to exist for some time. Moreover, /sys/fs/foofs/ was added
automagically when foofs file system type was registered. But it was
ultimately removed, because nobody took the time to fix all races
between accessing /sys/fs/foofs/gadget and
umount/filesystem-module-unloading.

Another way is to implement special "control" file-system type (using
fs/libfs.c functions), to be used like

mount -tfoofs /device /mnt/point
mount -tfoo_ctrlfs -o host=/mnt/point /mnt/control-point

Again, nobody took the time to actually do this for any real
file-system, as far as I know.

>
> So?

Go ahead bravely. :)

>
> Thanks,
> Miklos

Nikita.

2004-11-17 17:13:09

by Jan Engelhardt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


>mount -tfoo_ctrlfs -o host=/mnt/point /mnt/control-point

Looks to me like a pollution of the mount table if you do this on a lot of
filesystems.



Jan Engelhardt
--

2004-11-17 16:43:11

by Alan

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Maw, 2004-11-16 at 14:01, Miklos Szeredi wrote:
> > "fuse/version" you have in /proc while it belongs to /proc
> > "fuse/dev" you have in /proc while it belongs to /dev
>
> Well, 'Documentation/devices.txt' says:
>
> THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
> TREE. At Linus' request, no more allocations will be made official
> for Linus' kernel tree; the 3 June 2001 version of this list is the
> official final version of this registry.

This is just to keep Linus happy, every vendor on the planet ignores it
and co-operates with LANANA so that we have a single unified cross
vendor namespace and numbering scheme. The numbering matters a lot less
now with udev but the naming is critical to all the poor app and config
tool authors.

So LANANA is authoritative except for Linus' computer 8)


2004-11-17 17:36:37

by Nikita Danilov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Jan Engelhardt writes:
>
> >mount -tfoo_ctrlfs -o host=/mnt/point /mnt/control-point
>
> Looks to me like a pollution of the mount table if you do this on a lot of
> filesystems.
>

If you have a lot of file-systems your mount table is already polluted.

As they say, in UNIX everything is a file, and in Linux---a filesystem.

>
>
> Jan Engelhardt
> --

Nikita.

2004-11-17 17:41:57

by Jan Engelhardt

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> > >mount -tfoo_ctrlfs -o host=/mnt/point /mnt/control-point
> >
> > Looks to me like a pollution of the mount table if you do this on a lot of
> > filesystems.
>
>If you have a lot of file-systems your mount table is already polluted.

That does not justify polluting it with *_ctrlfs mounts to double the
size it already is.


2004-11-17 18:00:00

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> /sys/fs used to exist for some time. Moreover, /sys/fs/foofs/ was added
> automagically when foofs file system type was registered. But it was
> ultimately removed, because nobody took the time to fix all races
> between accessing /sys/fs/foofs/gadget and
> umount/filesystem-module-unloading.

I don't see why this would be any harder for filesystem code than for
other types of drivers. Maybe someone can enlighten me.

Anyway, I can try to clean it up: remove all the racy bits and keep
what I need (which is mainly just the /sys/fs directory). Where can I
find the most recent version of this?

Thanks,
Miklos

2004-11-17 18:10:58

by Jan Engelhardt

Subject: Re: [Request for inclusion] Filesystem in Userspace


>Unless, of course, by "polluted" you mean that output of "cat
>/proc/self/mounts" becomes longer.

Precisely that. I recall there once was "you're exceeding the maximum number of
filesystems" or such, has that "bug"/"non-feature" been lifted?



Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-17 18:18:04

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> Think about putting the /sys/sysfs entry into the tree before sysfs has
> been fully initialized :)

I'd rather not think about it. It reminds me of Gödel's paradox...

> Look through the 2.5 kernel series, it was in there for a while.
>
> But don't really worry about it, just create /sys/fs/ and use it to put
> your own stuff in it if you want. Don't try to recreate the old, buggy
> stuff :)

OK.

Thanks,
Miklos

2004-11-17 18:17:53

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Wed, Nov 17, 2004 at 06:56:13PM +0100, Miklos Szeredi wrote:
>
> > /sys/fs used to exist for some time. Moreover, /sys/fs/foofs/ was added
> > automagically when foofs file system type was registered. But it was
> > ultimately removed, because nobody took the time to fix all races
> > between accessing /sys/fs/foofs/gadget and
> > umount/filesystem-module-unloading.
>
> I don't see why this would be any harder for filesystem code than for
> other types of drivers. Maybe someone can enlighten me.

Think about putting the /sys/sysfs entry into the tree before sysfs has
been fully initialized :)

There are other fun races that Al Viro pointed out dealing with
superblock lifetimes from what I remember.

> Anyway, I can try to clean it up: remove all the racy bits and keep
> what I need (which is mainly just the /sys/fs directory). Where can I
> find the most recent version of this?

Look through the 2.5 kernel series, it was in there for a while.

But don't really worry about it, just create /sys/fs/ and use it to put
your own stuff in it if you want. Don't try to recreate the old, buggy
stuff :)

Good luck,

greg k-h

2004-11-17 18:25:13

by Nikita Danilov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi writes:
>
 > > /sys/fs used to exist for some time. Moreover, /sys/fs/foofs/ was added
> > automagically when foofs file system type was registered. But it was
> > ultimately removed, because nobody took the time to fix all races
> > between accessing /sys/fs/foofs/gadget and
> > umount/filesystem-module-unloading.
>
> I don't see why this would be any harder for filesystem code than for
> other types of drivers. Maybe someone can enlighten me.
>
> Anyway, I can try to clean it up: remove all the racy bits and keep
> what I need (which is mainly just the /sys/fs directory). Where can I
> find the most recent version of this?

It was removed on 2003.06.05, by the "[fs] Remove kobject support for
filesystems" change-set ([email protected]); you can extract the patch from
bitkeeper.

Reiser4 adds /sys/fs and /sys/fs/reiser4 manually (see kattr.[ch] in its
sources), and uses

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/reiser4-kobject-umount-race.patch
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/reiser4-kobject-umount-race-cleanup.patch

to avoid _some_ races (with umount), but these patches provide no
protection against races with module unloading.

>
> Thanks,
> Miklos

Nikita.

2004-11-17 18:15:17

by Greg KH

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Wed, Nov 17, 2004 at 04:42:36PM +0100, Miklos Szeredi wrote:
> > No. Actually, put it in sysfs, and then udev will create your /dev node
> > for you automatically. And in sysfs you can put your other stuff
> > (version, etc.) which is the proper place for it.
>
> Next question: _where_ to put other stuff? In /proc this has a
> logical place for filesystems: /proc/fs/fsname/other_stuff. But
> there's no filesystem section in sysfs.
>
> So?

Feel free to create /sys/fs/ for you to put your stuff in.

thanks,

greg k-h

2004-11-17 18:07:47

by Nikita Danilov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Jan Engelhardt writes:
> > > >mount -tfoo_ctrlfs -o host=/mnt/point /mnt/control-point
> > >
> > > Looks to me like a pollution of the mount table if you do this on a lot of
> > > filesystems.
> >
> >If you have a lot of file-systems your mount table is already polluted.
>
 > That does not justify polluting it with *_ctlfs
 > to double the size it already is.

"mount-table" (fs/namespace.c:mount_hashtable) is consulted only when
path-resolution crosses dentry marked as mount-point (has non-zero
->d_mounted field), which is rare, and this means that number of
elements in mount_hashtable has little effect on the cost of path-name
resolution.

Unless, of course, by "polluted" you mean that output of "cat
/proc/self/mounts" becomes longer.

Nikita.

2004-11-17 18:56:06

by Al Viro

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Wed, Nov 17, 2004 at 08:58:42PM +0300, Nikita Danilov wrote:
> "mount-table" (fs/namespace.c:mount_hashtable) is consulted only when
> path-resolution crosses dentry marked as mount-point (has non-zero
> ->d_mounted field), which is rare, and this means that number of
> elements in mount_hashtable has little effect on the cost of path-name
> resolution.

Not to mention the fact that hash chains there are considerably
shorter than those in dcache. I would be _very_ surprised if loop in
lookup_mnt() would become a hotspot in profiles.
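
The point of this subthread can be illustrated with a toy model in which the mount table is consulted only for dentries flagged as mount points. This is a minimal sketch, not kernel code: the names `Dentry`, `MountTable`, `walk` and `build_tree` are hypothetical and only loosely mirror `->d_mounted` and `lookup_mnt()`.

```python
from dataclasses import dataclass, field


@dataclass
class Dentry:
    name: str
    children: dict = field(default_factory=dict)
    d_mounted: int = 0  # non-zero when something is mounted on this dentry


class MountTable:
    """Toy stand-in for mount_hashtable; counts how often it is consulted."""

    def __init__(self):
        self.mounts = {}   # id(mount-point dentry) -> root dentry of mounted fs
        self.lookups = 0

    def mount(self, mountpoint, root):
        self.mounts[id(mountpoint)] = root
        mountpoint.d_mounted += 1

    def lookup_mnt(self, dentry):
        self.lookups += 1
        return self.mounts[id(dentry)]


def walk(root, path, table):
    """Resolve path component by component, crossing mount points."""
    dentry = root
    for name in path.strip("/").split("/"):
        dentry = dentry.children[name]
        # The table is touched only when the dentry is marked as a mount point.
        while dentry.d_mounted:
            dentry = table.lookup_mnt(dentry)
    return dentry


def build_tree():
    """Build /usr/bin plus a filesystem mounted on /mnt containing 'file'."""
    root = Dentry("/")
    usr = root.children["usr"] = Dentry("usr")
    usr.children["bin"] = Dentry("bin")
    mnt = root.children["mnt"] = Dentry("mnt")
    fs_root = Dentry("fuse-root")
    fs_root.children["file"] = Dentry("file")
    table = MountTable()
    table.mount(mnt, fs_root)
    return root, table
```

In this model, resolving `/usr/bin` never touches the table, however many entries it holds; only a path that crosses `/mnt` pays for a lookup.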

2004-11-17 19:20:55

by Pavel Machek

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> Please consider adding the FUSE filesystem to the mainline kernel.
>
> FUSE exports the filesystem functionality to userspace. The
> communication interface is designed to be simple, efficient, secure
> and able to support most of the usual filesystem semantics.

Coda should do the job, too... What are advantages of FUSE over Coda?

Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-11-17 19:49:49

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> Coda should do the job, too... What are advantages of FUSE over Coda?

No, it couldn't do the job half as well. You know, I did use Coda,
until I had enough of it. Now look at how many userspace filesystems
were written based on CODA and how many on FUSE.

Coda is very different. You can only read/write whole files in Coda.
It's got a different attribute invalidation model, a different access
checking model. Generally it's much less flexible, which is OK since
it was not designed for this job. On the other hand it has things
which most filesystems don't need (reintegration).

So while on the surface they might seem similar, there's really not
that much in common between them.

There's LUFS which is _much_ closer to FUSE, but it isn't as good
either (well of course, since it isn't written by me ;).

Miklos

2004-11-17 20:03:51

by Mike Waychison

Subject: Re: [Request for inclusion] Filesystem in Userspace

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jan Engelhardt wrote:
>>Unless, of course, by "polluted" you mean that output of "cat
>>/proc/self/mounts" becomes longer.
>
>
> Precisely that. I recall there once was "you're exceeding the maximum number of
> filesystems" or such, has that "bug"/"non-feature" been lifted?
>

Two issues are resolved in current 2.6:

- /proc/mounts used to be limited to a page of data, fixed mid 2.4
- there used to be a limit of pseudo-block backed filesystems in a
system, which was fixed by distros by using multiple majors, fixed
properly in 2.6.4-rc1


- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE: The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBm616dQs4kOxk3/MRAgmbAJ9M9h6/Jp94h9cAr3u167CFLSa/5QCfYng3
w6QNoKtus1zExR7pRXXhVqQ=
=23z1
-----END PGP SIGNATURE-----

2004-11-17 21:48:54

by Bryan Henderson

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>> Well, 'Documentation/devices.txt' says:
>>
>> THE DEVICE REGISTRY IS OFFICIALLY FROZEN FOR LINUS TORVALDS' KERNEL
>> TREE. At Linus' request, no more allocations will be made official
>> for Linus' kernel tree; the 3 June 2001 version of this list is the
>> official final version of this registry.
>
>This is just to keep Linus happy, every vendor on the planet ignores it
>and co-operates with LANANA

LANANA's web site says that the LANANA version of devices.txt went
"officially" in the "2.6 bitkeeper tree," which I have to assume is
Linus', in March 2004. But the LANANA version (from lanana.org) does
still contain the "frozen as of 2001 in Linus' tree" blurb above.

2004-11-17 20:48:37

by Pavel Machek

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> > Coda should do the job, too... What are advantages of FUSE over Coda?
>
> No, it couldn't do the job half as well. You know, I did use Coda,
> until I had enough of it. Now look at how many userspace filesystems
> were written based on CODA and how many on FUSE.

Well, there are filesystems based on NFS, which just plain do not
work... So #filesystems is not quite good argument.

> Coda is very different. You can only read/write whole files in Coda.
> It's got a different attribute invalidation model, a different access
> checking model. Generally it's much less flexible, which is OK since

I know I've asked before... but how is the "fuse-userspace-part
swapped out and memory full of dirty data on fuse" deadlock solved?

Coda has at least the advantage that coda-userspace-part does not need
to be page-locked.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-11-18 08:17:26

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> I know I've asked before... but how is the "fuse-userspace-part
> swapped out and memory full of dirty data on fuse" deadlock solved?

By either

1) not allowing shared writable mappings

2) doing non-blocking asynchronous writepage

In the first case there will never be dirty data, since normal writes
go synchronously through the page cache.

In the second case there is no deadlock, because the memory subsystem
doesn't wait for data to be written. If the filesystem refuses to
write back data in a timely manner, memory will get full and OOM
killer will go to work. Deadlock simply cannot happen.

Miklos
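
Option 1 above amounts to a check at mmap time. The sketch below is only an illustration of that policy in Python: `check_mmap_request` and its `allow_shared_writable` switch are hypothetical names, and returning `-ENODEV` is an assumption about what a filesystem might report, not a statement about FUSE's actual code.

```python
import errno
import mmap


def check_mmap_request(prot, flags, allow_shared_writable=False):
    """Return 0 if the mapping may proceed, or a negative errno value.

    Only the combination "shared AND writable" is refused, because it is
    the only one that lets pages become dirty behind the filesystem's back.
    """
    shared = bool(flags & mmap.MAP_SHARED)
    writable = bool(prot & mmap.PROT_WRITE)
    if shared and writable and not allow_shared_writable:
        return -errno.ENODEV
    return 0
```

Read-only shared mappings and private writable mappings still pass this check, so ordinary uses such as executing a binary from the filesystem remain possible.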

2004-11-18 17:13:31

by Bryan Henderson

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> 2) doing non-blocking asynchronous writepage
>
>In the second case there is no deadlock, because the memory subsystem
>doesn't wait for data to be written. If the filesystem refuses to
>write back data in a timely manner, memory will get full and OOM
>killer will go to work.

I don't see how the OOM killer can help you here. The OOM killer deals
with the system being out of virtual memory; writepage is about freeing
real memory. The real memory allocator will wait forever if it has to for
pageouts to complete so that it can evict a page and free up real memory.

If a pageout requires a user space process to run, and the user space
process requires additional real memory (e.g. in order to swap something
in from swap space) to do the pageout, you can have a deadlock.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2004-11-18 17:18:37

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> I don't see how the OOM killer can help you here. The OOM killer deals
> with the system being out of virtual memory;

What? I think you are confusing something. I'm not an expert, but I
think usually you have lots of virtual memory (4Gbyte per process),
so killing off processes to get more of it makes no sense.

Please correct me if I'm wrong.

Miklos

2004-11-18 17:31:15

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> A normal write is a VFS write() call, I assume. While they're going
> through the page cache, the pages are dirty, right? Is it possible that
> FUSE needs more real memory after dirtying those pages in order to finish
> cleaning them?

It's possible, but I don't see why that's a problem. If it can get
more memory it's OK. If allocation fails, then the write() will fail
with ENOMEM, if OOM killer gets to work and kills the FUSE process,
then write will return with ENOTCONN or something like that.

> What about the 3rd case: private writable mapping? How does that work?

That only reads pages and never writes them. It's just like malloc,
but prefilled with the file contents.

Miklos
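
The difference between private and shared writable mappings can be checked from userspace. The following is a small sketch using Python's standard `mmap` module on a temporary file; the helper name `write_through_mapping` is hypothetical, and the test runs against whatever filesystem holds the temporary directory, not against FUSE.

```python
import mmap
import os
import tempfile


def write_through_mapping(flags):
    """Map a small file, modify the mapping, and report what each side sees.

    Returns (bytes seen through the mapping, bytes read back from the file).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"original")
        with mmap.mmap(fd, 8, flags=flags,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[:8] = b"modified"
            seen_in_mapping = bytes(m[:8])
            if flags == mmap.MAP_SHARED:
                m.flush()  # ask for the dirty page to be written back
        with open(path, "rb") as f:
            on_disk = f.read()
        return seen_in_mapping, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

With `mmap.MAP_PRIVATE` the store lands in a copy-on-write page and the file keeps its old contents, so the filesystem never has to write anything back; with `mmap.MAP_SHARED` the modified page belongs to the file and must eventually be written out.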

2004-11-18 17:38:03

by Bryan Henderson

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> 1) not allowing shared writable mappings
>...
>In the first case there will never be dirty data, since normal writes
>go synchronously through the page cache.

A normal write is a VFS write() call, I assume. While they're going
through the page cache, the pages are dirty, right? Is it possible that
FUSE needs more real memory after dirtying those pages in order to finish
cleaning them?

What about the 3rd case: private writable mapping? How does that work?

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2004-11-18 18:05:53

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> It's possible, but I don't see why that's a problem. If it can get
> more memory it's OK. If allocation fails, then the write() will fail
> with ENOMEM, if OOM killer gets to work and kills the FUSE process,
> then write will return with ENOTCONN or something like that.

Why do you think it would kill the FUSE process? And why do you think
killing _any_ process would make the system come back to life? After all,
memory wasn't filled by process usage, it was filled by dirty FS pages.

I really do believe that user-space filesystems have problems. There's a
reason we tend to do them in kernel space.

But limiting the outstanding writes some way may at least hide the thing.

Linus

2004-11-18 18:26:24

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Why do you think it would kill the FUSE process? And why do you think
> killing _any_ process would make the system come back to life? After all,
> memory wasn't filled by process usage, it was filled by dirty FS pages.

Well, killing the fuse process _will_ make the system come back to
life, since then all the dirty pages belonging to the filesystem will
be discarded.

> I really do believe that user-space filesystems have problems. There's a
> reason we tend to do them in kernel space.

Well, NFS with a network failure has the same problem. It's not the
userspace that's the problem, it's the non-reliability.

> But limiting the outstanding writes some way may at least hide the thing.

Currently shared writable mappings aren't allowed for non-root by
default in FUSE. And since non-mmap writes do not dirty pages in the
long run, it's harder to run out of space with them: you need a new
filedescriptor for each page you want to steal, and you will run out
of them sooner than of pages.

So I believe that FUSE is quite secure in this respect. Please prove
me wrong!

Thanks,
Miklos

2004-11-18 18:30:46

by Jamie Lokier

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Linus Torvalds wrote:
> Why do you think it would kill the FUSE process? And why do you think
> killing _any_ process would make the system come back to life? After all,
> memory wasn't filled by process usage, it was filled by dirty FS pages.
>
> I really do believe that user-space filesystems have problems. There's a
> reason we tend to do them in kernel space.

Are kernel space filesystems immune from this problem? What happens
when they need to kmalloc() in order to write some data?

-- Jamie

2004-11-18 18:36:00

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> Well, killing the fuse process _will_ make the system come back to
> life, since then all the dirty pages belonging to the filesystem will
> be discarded.

They will? Why? They're still mapped into other processes, still dirty.
How do they go away?

> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
>
> Well, NFS with a network failure has the same problem. It's not the
> userspace that's the problem, it's the non-reliability.

No, it _is_ the userspace.

Yes, NFS is unreliable too, but it doesn't have the behaviour that when
the client locks up, the server locks up too. The two aren't "linked", and
they are protected from each other using up too much memory.

In contrast, a fuse process that needs to do IO is _not_ protected from
the clients having eaten up all the memory it needs to do the IO.

Btw, this is not a new issue. This is the _exact_ same issue that "run the
NFS server on the same machine as the client" has. And yes, it did have
problems. People still did it, because it allowed for user-space
filesystem demos.

> Currently shared writable mappings aren't allowed for non-root by
> default in FUSE.

Yes, that's a valid approach.

Linus

2004-11-18 18:51:14

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Why do you think it would kill the FUSE process? And why do you think
> > killing _any_ process would make the system come back to life? After all,
> > memory wasn't filled by process usage, it was filled by dirty FS pages.
> >
> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
>
> Are kernel space filesystems immune from this problem? What happens
> when they need to kmalloc() in order to write some data?

That's why we have GFP_NOFS and other flags (PF_MEMALLOC etc). So yes,
they are "immune" in the sense that they have been inoculated, but not in
the sense that they can't have the bug conceptually.

So the kernel not only keeps a set of reserved pages for atomic
allocations, but also the VM knows not to recurse into a filesystem
operation when the reason for the memory allocation was a low-memory
circumstance. When a filesystem asks for memory in the page-out path, the
VM may still throw out cached pages for that FS, but it won't try to write
them back.

Guys, there is a _reason_ why microkernels suck. This is an example of how
things are _not_ "independent". The filesystems depend on the VM, and the
VM depends on the filesystem. You can't just split them up as if they were
two separate things (or rather: you _can_ split them up, but they still
very much need to know about each other in very intimate ways).

So what do you do? You limit shared dirty pages (inefficient memory use),
or you disallow certain behaviours, or you add tons of new interfaces to
expose essentially the same "every thing that can allocate and is on the
write-out path takes a GFP flag".

User-space filesystems are hard to get right. I'd claim that they are
almost impossible, unless you limit them somehow (shared writable mappings
are the nastiest part - if you don't have those, you can reasonably limit
your problems by limiting the number of dirty pages you accept through
normal "write()" calls).

Linus
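
The GFP_NOFS behaviour described above (clean cached pages may still be dropped, dirty ones are not written back when the allocation comes from the page-out path) can be sketched as a toy reclaim loop. Everything here is an assumption made for illustration: the flag values, the `reclaim` function and the page representation are hypothetical and far simpler than the real VM.

```python
# Hypothetical flag bits; the real kernel values differ.
GFP_IO = 0x2   # allocation may start block I/O
GFP_FS = 0x4   # allocation may call back into filesystem code


def reclaim(pages, gfp_mask, writeback):
    """Try to free every page; return the pages that could not be freed.

    pages:     list of (name, dirty) tuples.
    writeback: callable invoked for each dirty page that is written out.
    """
    remaining = []
    for name, dirty in pages:
        if not dirty:
            continue                      # clean cache can always be dropped
        if gfp_mask & GFP_FS:
            writeback(name)               # safe to re-enter the filesystem
            continue
        remaining.append((name, dirty))   # skip: writing would recurse into the FS
    return remaining
```

A caller that already sits inside the filesystem's write-out path would pass a mask without `GFP_FS` and get back only what clean pages can provide, which is exactly the part a userspace filesystem daemon has no way to request for its own allocations.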

2004-11-18 18:57:03

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Alan Cox wrote:
>
> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
> >
> > But limiting the outstanding writes some way may at least hide the thing.
>
> Possibly dumb question. Is there a reason we can't have a prctl() that
> flips the PF_* flags for a user space daemon in the same way as we do
> for kernel threads that do I/O processing ?

It's more than just PF_MEMALLOC.

And PF_MEMALLOC really is to avoid _recursion_, which is the smallest
problem. It does so by allowing the process to dip into the critical
resources, but that only works if you know that the process is actually
freeing pages right then and there. If you set it willy-nilly, you'll just
run out of pages soon, and you'll be dead.

The GFP_IO and GFP_FS pages are the _real_ protectors. They don't dip into
the (very limited) set of pages, they say "we can still free 90% of
memory, we just have to ignore that dangerous 10%".

And yes, you could somehow expose those as process flags too, and make
people who do GFP_USER or GFP_KERNEL actually look at some process flag
and do the proper masking.

So clearly you _can_ do it. But it requires very intimate knowledge of VM
behaviour or the VM knowing about you.

Linus
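
The idea of exposing the GFP restrictions as process flags, with ordinary GFP_KERNEL-style allocations masked according to the calling task, might look roughly like the sketch below. The flag names `PF_NOIO` and `PF_NOFS`, all numeric values, and the function `effective_gfp` are hypothetical; nothing in this thread says such an interface exists.

```python
# All flag values below are hypothetical and chosen only for illustration.
GFP_WAIT = 0x1
GFP_IO = 0x2
GFP_FS = 0x4
GFP_KERNEL = GFP_WAIT | GFP_IO | GFP_FS

PF_NOIO = 0x1   # task is on the I/O path: no I/O and no FS re-entry
PF_NOFS = 0x2   # task is on the FS write-out path: no FS re-entry


def effective_gfp(gfp_mask, task_flags):
    """Mask an allocation's flags according to the calling task's flags."""
    if task_flags & PF_NOIO:
        return gfp_mask & ~(GFP_IO | GFP_FS)
    if task_flags & PF_NOFS:
        return gfp_mask & ~GFP_FS
    return gfp_mask
```

The sketch only covers the masking step; the hard part raised above, that every allocation made on behalf of the daemon has to honour the mask, is not modelled.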

2004-11-18 19:16:21

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> But still -- if the real memory shortage isn't because there's no place to
> page out to, but rather that the process that's supposed to be writing the
> pages is deadlocked, the OOM killer will not kick in.

I see your point now. I'm still not sure that things can go as far as
to totally deadlock the system. It can stop the writeback from
progressing, but killing the FUSE process still solves the problem.

Miklos

2004-11-18 19:18:47

by Bryan Henderson

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>> A normal write is a VFS write() call, I assume. While they're going
>> through the page cache, the pages are dirty, right? Is it possible that
>> FUSE needs more real memory after dirtying those pages in order to finish
>> cleaning them?
>
>It's possible, but I don't see why that's a problem. If it can get
>more memory it's OK. If allocation fails, then the write() will fail
>with ENOMEM, if OOM killer gets to work and kills the FUSE process,
>then write will return with ENOTCONN or something like that.

The "allocation" is a fetch or store instruction by the FUSE process,
which generates a page fault. To satisfy that, the kernel has to allocate
some real memory. A fetch or store instruction doesn't fail when there's
no real memory available. It just waits for the kernel to make some
available. The kernel does that by picking some less deserving page and
evicting it. That eviction may require a pageout. If the guy who's doing
the fetch or store is the guy who's supposed to do that pageout, you have
a deadlock.

I don't see where in this path the write() has a chance to fail.

Furthermore, it's not right for the write() to fail or for any process to
be killed by the OOM Killer. The system has the resources to complete the
job. It just hasn't scheduled them correctly and thus backed itself into
a corner.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
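
The scenario described above is a cycle in a wait-for graph. As a minimal sketch, the Python below encodes the three waits from the message as graph edges and looks for a cycle; the node labels and the `find_cycle` helper are hypothetical, and the second graph assumes a writer that never needs new memory, which is an idealisation.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: report the loop, closing it on the start node.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# The daemon faults while memory is full of its own filesystem's dirty pages.
userspace_fs = {
    "daemon page fault": ["free page frame"],
    "free page frame": ["pageout of dirty page"],
    "pageout of dirty page": ["daemon page fault"],  # daemon must run to write
}

# Same memory pressure, but the pageout does not depend on the faulting task.
independent_writer = {
    "daemon page fault": ["free page frame"],
    "free page frame": ["pageout of dirty page"],
    "pageout of dirty page": [],
}
```

No step in the first graph is a system call that could return an error, which matches the observation that write() gets no chance to fail along this path.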

2004-11-18 19:23:08

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> Will the clients be allowed to fill up the _whole_ memory with dirty
> pages?

Sure. It's not a situation that is easy to get into, but it's a nasty
case.

> Page writeback will start sooner than that, and then the
> client will not be able to dirty more pages until some are freed.

Ehh - the _CPU_ handles dirtying pages all on its own. The OS never even
knows that a page got dirtied, so "starting writeout early" is not much of
an option.

We actually had (for a short while) code that tracked the dirty bit in
software (ie make it unwritable by default, and take the write fault), but
people showed that that was actually a real performance problem on some
loads.

> BTW, I've never myself seen a deadlock, and I've not had any report of
> it.

Almost nobody uses shared writable mappings. Certainly not on "odd"
things. They are historically used by things like innd for the active
file, by some odd applications that want to do their own memory
management, and by databases. That's pretty much it.

So it's entirely possible that you have never even _seen_ a shared
writable mapping even if you stressed the filesystem very hard. They
really are that rare.

There's a few VM testers out there that do nasty things with writable
shared mappings. You could try them just for fun, but personally, if we
are seriously talking about merging FUSE, I'd actually prefer for writable
mappings to not be supported at all.

It wouldn't be the only filesystem that doesn't support the thing. I think
even NFS didn't support them until I did the pagecache rewrite. Nobody
really complained (well, _very_ few did).

IOW, from a merging standpoint, simple really _is_ better. Even if you
really really want to use exotic features like "direct IO" and writable
mappings some day, let's just put it this way: it's a lot easier to merge
something that has no questions about strange cases, and then _later_ add
in the strange cases, than it is to merge it all on day #1.

I'm a sucker. Ask anybody. I'll accept the exact same patch that I
rejected earlier if you just do it the right way. I'm convinced that some
people actually do it on purpose just for the amusement value ("Look, he
did it _again_. What a doofus!")

Linus

2004-11-18 19:31:30

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> The GFP_IO and GFP_FS pages are the _real_ protectors. They don't dip into
> the (very limited) set of pages, they say "we can still free 90% of
> memory, we just have to ignore that dangerous 10%".

I don't see how this creates more problems for userspace filesystems.
When you clear GFP_IO or GFP_FS for an allocation you are limiting
yourself from freeing some sort of memory. But that will not make it
easier to actually _get_ that memory.

With FUSE the allocation is NOT limited. Deadlock will not happen
since page writeback is non-blocking, so while more FUSE backed pages
can't be written, other filesystem's pages can be written back. The
situation is better not worse.

What am I missing?

Thanks,
Miklos

2004-11-18 19:46:48

by Linus Torvalds

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace



On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> OK, sorry. I'd rephrase it then to say will the system allow _all_
> its pages to be used for file data?

Yup, pretty much.

It's actually even _normal_ behaviour for many of the core users of shared
files. People who really do databases get quite upset if you don't let
them mmap as much memory as they want, because for them, they really tune
their cache sizes for the size of memory, and they think the OS (and
anything else, for that matter) just gets in their way. They want 99% of
memory to be used for the shared mapping, and the remaining 1% for their
code.

(That's a bit extreme, but you get the idea).

Historically, we've often tried to "partition" memory in various ways (ie
"the buffer cache can only grow up to 40% of real memory" etc). It ends up
being good for some things (watermarks etc), but almost every time it ends
up being bad as a hard _limit_. So yes, the kernel tends to let people
do what they think they want to do.

"Give them rope",

Linus

2004-11-18 19:56:09

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> The "allocation" is a fetch or store instruction by the FUSE process,
> which generates a page fault. To satisfy that, the kernel has to allocate
> some real memory. A fetch or store instruction doesn't fail when there's
> no real memory available. It just waits for the kernel to make some
> available. The kernel does that by picking some less deserving page and
> evicting it. That eviction may require a pageout. If the guy who's doing
> the fetch or store is the guy who's supposed to do that pageout, you have
> a deadlock.

OK. I still maintain that this is an impossible situation, but
maybe I'm wrong.

> Furthermore, it's not right for the write() to fail or for any process to
> be killed by the OOM Killer. The system has the resources to complete the
> job. It just hasn't scheduled them correctly and thus backed itself into
> a corner.

Yes, but a kernel based filesystem would be in the same situation.
It's not a problem unique to userspace filesystems. And I think the
kernel is careful enough not to get into the corner. So there's no
problem.

Miklos

2004-11-18 20:10:44

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> It's actually even _normal_ behaviour for many of the core users of shared
> files. People who really do databases get quite upset if you don't let
> them mmap as much memory as they want, because for them, they really tune
> their cache sizes for the size of memory, and they think the OS (and
> anything else, for that matter) just gets in their way. They want 99% of
> memory to be used for the shared mapping, and the remaining 1% for their
> code.

I'll try to write a FUSE deadlocker with the newly learnt info, and
let you know the result.

Thanks,
Miklos

2004-11-18 20:17:26

by Nikita Danilov

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Linus Torvalds writes:

[...]

>
> We actually had (for a short while) code that tracked the dirty bit in
> software (ie make it unwritable by default, and take the write fault), but
> people showed that that was actually a real performance problem on some
> loads.

Dirtiness can be tracked for a fraction of pages (for example, make pte
unwritable when page crosses active/inactive list boundary, or
alike). This will allow kernel to guarantee that there really is _known_
amount of clean pages in the system, without taking a lot of unnecessary
faults.

[... notes on kernel development methodology skipped ...]

>
> Linus

Nikita.
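
The suggestion of tracking dirtiness for only a fraction of pages can be modelled in a few lines. This is a toy sketch under stated assumptions: the classes `Page` and `ToyVM` and their methods are hypothetical, "write-protecting the pte" is reduced to a boolean, and the only rule implemented is that a page is write-protected when it moves to the inactive list while clean.

```python
class Page:
    def __init__(self):
        self.writable = True      # PTE permits writes without a fault
        self.hw_dirty = False     # set silently by the "CPU" on a write
        self.known_dirty = False  # what the "kernel" has been told
        self.inactive = False


class ToyVM:
    def __init__(self, npages):
        self.pages = [Page() for _ in range(npages)]
        self.write_faults = 0

    def deactivate(self, idx):
        """Move a page to the inactive list; write-protect it if it is clean."""
        page = self.pages[idx]
        page.inactive = True
        if page.hw_dirty:
            page.known_dirty = True   # the dirty bit is inspected at this point
        else:
            page.writable = False

    def write(self, idx):
        page = self.pages[idx]
        if not page.writable:
            # Write fault: the kernel learns that the page is being dirtied.
            self.write_faults += 1
            page.known_dirty = True
            page.writable = True
        page.hw_dirty = True

    def known_clean(self):
        """Write-protected pages cannot have been dirtied behind our back."""
        return sum(1 for p in self.pages if not p.writable)
```

Writes to active pages stay fault-free in this model, so the extra cost is paid only for pages that were about to become reclaim candidates, while `known_clean()` gives a lower bound the allocator could rely on.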

2004-11-18 20:27:35

by Elladan

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Thu, Nov 18, 2004 at 10:31:27AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 18 Nov 2004, Miklos Szeredi wrote:
> >
> > Well, killing the fuse process _will_ make the system come back to
> > life, since then all the dirty pages belonging to the filesystem will
> > be discarded.
>
> They will? Why? They're still mapped into other processes, still dirty.
> How do they go away?

If the filesystem a dirty page is mapped against ceases to exist,
wouldn't it make sense to destroy the page? All such user processes can
receive SIGBUS and crash. This is sort of a general problem with a
filesystem on a "removable" device like a USB stick, and it seems the
only sane solution is to blow all the mappings against that filesystem
away.

Of course, it sucks since the final result will be a crash, but mmap()
isn't a reliable interface for accessing media that might go away...

So, a reasonable solution would be to detect when a user fs process
dies, and initiate a forced unmount procedure where you walk all the
pages mapped against that filesystem and blow them away. Similarly for
a case like pulling a USB reader out while it's being written to (you
could nag the user to reinsert, but that might be impossible).

This doesn't solve the deadlock as such except as a sort of
panic-recovery hack, but it seems sensible in general.

-J

2004-11-18 20:29:52

by Miklos Szeredi

Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Ehh - the _CPU_ handles dirtying pages all on its own. The OS never even
> knows that a page got dirtied, so "starting writeout early" is not much of
> an option.

OK, sorry. I'd rephrase it then to say will the system allow _all_
its pages to be used for file data?

> IOW, from a merging standpoint, simple really _is_ better. Even if you
> really really want to use exotic features like "direct IO" and writable
> mappings some day, let's just put it this way: it's a lot easier to merge
> something that has no questions about strange cases, and then _later_ add
> in the strange cases, than it is to merge it all on day #1.

OK, I see your point. And yes, I know writable mappings are rare.

> I'm a sucker. Ask anybody. I'll accept the exact same patch that I
> rejected earlier if you just do it the right way. I'm convinced that some
> people actually do it on purpose just for the amusement value ("Look, he
> did it _again_. What a doofus!")

Actually I did plan to split up FUSE the next time I submit it, so
these extra features can be taken on their own merit.

Thanks,
Miklos

2004-11-18 19:08:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> > Well, killing the fuse process _will_ make the system come back to
> > life, since then all the dirty pages belonging to the filesystem will
> > be discarded.
>
> They will? Why? They're still mapped into other processes, still dirty.
> How do they go away?

Just as if they were written back properly. It makes no sense to keep
pages under writeback around if we know the filesystem is gone for
good.

> In contrast, a fuse process that needs to do IO is _not_ protected from
> the clients having eaten up all the memory it needs to do the IO.

Will the clients be allowed to fill up the _whole_ memory with dirty
pages? Page writeback will start sooner than that, and then the
client will not be able to dirty more pages until some are freed.

BTW, I've never myself seen a deadlock, and I've not had any report of
it. I've been able to deadlock FUSE on 2.4 with a shared writable
mapping and an artificial program that was designed for this, but I
haven't managed this on 2.6.

Maybe someone can help me. Anybody who writes a program that
deadlocks Linux with a FUSE filesystem, gets a medal, and I'll humbly
apologize :)

Thanks,
Miklos

2004-11-18 21:03:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Linus Torvalds <[email protected]> wrote:
>
>
>
> On Thu, 18 Nov 2004, Alan Cox wrote:
> >
> > > I really do believe that user-space filesystems have problems. There's a
> > > reason we tend to do them in kernel space.
> > >
> > > But limiting the outstanding writes some way may at least hide the thing.
> >
> > Possibly dumb question. Is there a reason we can't have a prctl() that
> > flips the PF_* flags for a user space daemon in the same way as we do
> > for kernel threads that do I/O processing ?
>
> It's more than just PF_MEMALLOC.
>
> And PF_MEMALLOC really is to avoid _recursion_, which is the smallest
> problem. It does so by allowing the process to dip into the critical
> resources, but that only works if you know that the process is actually
> freeing pages right then and there. If you set it willy-nilly, you'll just
> run out of pages soon, and you'll be dead.

I've seen one 2.4-based project which had essentially a userspace
blockdevice driver. Marking that special, trusted process PF_MEMALLOC did
indeed fix low-on-memory deadlocks. Obviously it's something one does with
caution, but there are times when it makes sense.

I think there are codepaths which unconditionally turn off PF_MEMALLOC, so
they need to be tweaked to do a save/set/restore operation for it all to
work.
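
As a rough illustration of the save/set/restore idea above, here is a
minimal userspace sketch in C. It is not kernel code: `sketch_task`,
`SKETCH_PF_MEMALLOC` and both helpers are hypothetical names, and the flag
value is arbitrary. The only point is that a code path puts back whatever
state it found instead of clearing the flag unconditionally.

```c
#include <assert.h>

/* Illustrative stand-in for a per-task flag bit; the value is arbitrary. */
#define SKETCH_PF_MEMALLOC 0x0800u

/* Hypothetical, heavily reduced task descriptor. */
struct sketch_task {
    unsigned int flags;
};

/* Set the flag and return its previous state so the caller can restore it. */
static unsigned int sketch_memalloc_save(struct sketch_task *t)
{
    unsigned int old = t->flags & SKETCH_PF_MEMALLOC;

    t->flags |= SKETCH_PF_MEMALLOC;
    return old;
}

/* Put back the saved state rather than clearing the flag unconditionally. */
static void sketch_memalloc_restore(struct sketch_task *t, unsigned int old)
{
    /* Only the one flag bit may be carried in the saved value. */
    assert((old & ~SKETCH_PF_MEMALLOC) == 0);
    t->flags = (t->flags & ~SKETCH_PF_MEMALLOC) | old;
}
```

With helpers like these, a path entered from an already-marked process
leaves the flag set on exit, while a path entered from an unmarked process
clears it again.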

2004-11-18 21:08:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi <[email protected]> wrote:
>
> Maybe someone can help me. Anybody who writes a program that
> deadlocks Linux with a FUSE filesystem

Grab http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz and
learn to drive run-bash-shared-mappings.sh.

> gets a medal

My emedals.com account awaits your contribution ;)

2004-11-18 21:52:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> > Maybe someone can help me. Anybody who writes a program that
> > deadlocks Linux with a FUSE filesystem
>
> Grab http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz and
> learn to drive run-bash-shared-mappings.sh.
>
> > gets a medal
>
> My emedals.com account awaits your contribution ;)

OK. I'll let you know how it goes :)

Miklos

2004-11-18 22:15:09

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Thu, Nov 18, 2004 at 20:51:31 +0100, Miklos Szeredi wrote:
> > The "allocation" is a fetch or store instruction by the FUSE process,
> > which generates a page fault. To satisfy that, the kernel has to allocate
> > some real memory. A fetch or store instruction doesn't fail when there's
> > no real memory available. It just waits for the kernel to make some
> > available. The kernel does that by picking some less deserving page and
> > evicting it. That eviction may require a pageout. If the guy who's doing
> > the fetch or store is the guy who's supposed to do that pageout, you have
> > a deadlock.
>
> OK. I still maintain that this is an impossible situation, but
> maybe I'm wrong.

No. It is a hard-to-see, but real situation.

> > Furthermore, it's not right for the write() to fail or for any process to
> > be killed by the OOM Killer. The system has the resources to complete the
> > job. It just hasn't scheduled them correctly and thus backed itself into
> > a corner.
>
> Yes, but a kernel based filesystem would be in the same situation.
> It's not a problem unique to userspace filesystems. And I think the
> kernel is careful enough not to get into the corner. So there's no
> problem.

The kernel may also have that problem and solves it the
not-exactly-right way -- by unleashing the OOM killer. But FUSE won't
unleash the OOM killer -- it will deadlock the swapper, because the
swapper waits for the fuse process and the fuse process won't run until
the swapper cleans some pages. The swapper does not know that trying to
write out pages is futile, because it cannot tell that this request comes
from the writeout path itself. It does know in the in-kernel case.
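
The circular wait described above can be pictured as a tiny wait-for
graph. The following is only a sketch under simplifying assumptions: two
actors, one "blocked on" edge each, and hypothetical names throughout
(`sketch_has_wait_cycle`, `SKETCH_SWAPPER`, `SKETCH_FS_WRITER`). It models
nothing of the real VM beyond who waits for whom.

```c
#include <assert.h>
#include <stdbool.h>

/* The two actors in this toy model. */
enum { SKETCH_SWAPPER, SKETCH_FS_WRITER, SKETCH_NTASKS };

/*
 * waits_for[i] is the index of the task that task i is blocked on,
 * or -1 if task i can make progress on its own.
 * Returns true if following the edges from 'start' leads back to 'start'.
 */
static bool sketch_has_wait_cycle(const int waits_for[], int n, int start)
{
    int cur = start;

    assert(start >= 0 && start < n);
    for (int steps = 0; steps < n; steps++) {
        cur = waits_for[cur];
        if (cur < 0)
            return false;   /* somebody in the chain can run */
        if (cur == start)
            return true;    /* the chain closed on itself: deadlock */
    }
    return false;
}
```

In the userspace case both edges exist (the swapper waits for the
filesystem process, which waits for the swapper to free a page), so the
check reports a cycle. In the in-kernel case the writer's edge is absent,
because the writeout path knows not to wait on reclaim.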

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-19 00:53:37

by Bryan Henderson

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>but I
>think usually you have lots of virtual memory (4Gbyte per process),
>so killing off processes to get more of it makes no sense.

I think it's fair to say you have 4G of virtual address space per process,
but try to store 4G of information per process in it, and you will
probably find you can't. What's essentially scarce is swap space. Killing
off processes frees up swap space.

It was probably wrong of me to say the OOM killer can't free up real memory,
because where real memory backs virtual memory, the only way to free up
the real memory is to page out the virtual memory or destroy it, and the
OOM killer destroys it.

But still -- if the real memory shortage isn't because there's no place to
page out to, but rather that the process that's supposed to be writing the
pages is deadlocked, the OOM killer will not kick in.

2004-11-18 18:42:07

by Alan

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> I really do believe that user-space filesystems have problems. There's a
> reason we tend to do them in kernel space.
>
> But limiting the outstanding writes some way may at least hide the thing.

Possibly dumb question. Is there a reason we can't have a prctl() that
flips the PF_* flags for a user space daemon in the same way as we do
for kernel threads that do I/O processing ?

2004-11-19 07:02:00

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

>>but I
>>think usually you have lots of virtual memory (4Gbyte per process),
>>so killing off processes to get more of it makes no sense.
>
>I think it's fair to say you have 4G of virtual address space per process,
>but try to store 4G of information per process in it, and you will
>probably find you can't. What's essentially scarce is swap space. Killing
>off processes frees up swap space.

3G in the default case, because there's 1G for kernel space.



Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-19 09:46:41

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> > The GFP_IO and GFP_FS pages are the _real_ protectors. They don't dip into
> > the (very limited) set of pages, they say "we can still free 90% of
> > memory, we just have to ignore that dangerous 10%".
>
> I don't see how this makes more problems to userspace filesystems.
> When you clear GFP_IO or GFP_FS for an allocation you are limiting
> yourself from freeing some sort of memory. But that will not make it
> easier to actually _get_ that memory.
>
> With FUSE the allocation is NOT limited. Deadlock will not happen
> since page writeback is non-blocking, so while more FUSE backed pages
> can't be written, other filesystem's pages can be written back. The
> situation is better not worse.
>
> What am I missing?

I believe the problem is when there are no other filesystems'
pages... and at that point FUSE is worse than kernel filesystems
because you do not have reserved pools to use for freeing memory.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-11-19 11:28:16

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Grab http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz and
> learn to drive run-bash-shared-mappings.sh.

Thanks Andrew, this indeed caused a deadlock. Strangely the deadlock
happens much more easily if 'usemem' is not run in parallel with
'bash-shared-mapping'.

> > gets a medal
>
> My emedals.com account awaits your contribution ;)

The medal is yours!

Apologies to everyone whom I disbelieved, and thanks for enlightening me.

The solution I'm thinking of is along the lines of accounting the number
of writable pages assigned to FUSE filesystems. Limiting this should
solve the deadlock problem. This would only impact performance for
shared writable mappings, which are rare anyway.

Thanks,
Miklos

2004-11-20 12:03:26

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Thu, Nov 18, 2004 at 18:14:15 +0100, Miklos Szeredi wrote:
>
> > I don't see how the OOM killer can help you here. The OOM killer deals
> > with the system being out of virtual memory;
>
> What? I think you are confusing something. I'm not an expert, but I
> think usually you have lots of virtual memory (4Gbyte per process),
> so killing off processes to get more of it makes no sense.
>
> Please correct me if I'm wrong.

YOU are confusing something.

Virtual memory is RAM + swap (+ mmapped files, which behave similarly to
swap).

Virtual address space is a range of addresses that can be assigned to
that virtual memory and used to access it. Each process has 3GiB of
virtual address space at its disposal and the kernel has another 1GiB
mapped into every process, making the total of 4GiB allowed by the CPU
(talking about i386 -- other CPUs can have different ranges).

If you run out of virtual memory, that is, there is no room in RAM or in
swap, then you have to kill some process to free some -- that's what the
OOM killer is about.
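
The distinction can be made concrete with a little arithmetic. This is a
small C sketch with hypothetical helper names; it assumes the i386 3GiB/1GiB
split described above and deliberately ignores mmapped files and every other
detail of real memory accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_GIB (UINT64_C(1) << 30)

/* Address space per process on i386: 3 GiB user + 1 GiB kernel. */
static uint64_t sketch_address_space_bytes(void)
{
    return 3 * SKETCH_GIB + 1 * SKETCH_GIB;
}

/* Virtual memory in the sense used above: RAM plus swap. */
static uint64_t sketch_virtual_memory_bytes(uint64_t ram, uint64_t swap)
{
    assert(ram <= UINT64_MAX - swap);   /* guard against overflow */
    return ram + swap;
}

/* True when the total demand no longer fits into RAM + swap. */
static bool sketch_out_of_virtual_memory(uint64_t demand,
                                         uint64_t ram, uint64_t swap)
{
    return demand > sketch_virtual_memory_bytes(ram, swap);
}
```

For example, on a machine with 512MiB of RAM and 1GiB of swap, a demand of
2GiB fits easily into one process's 3GiB of address space, yet it exceeds
the 1.5GiB of virtual memory that is actually available.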

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-21 07:08:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> > I know I've asked before... but how is the "fuse-userspace-part
> > swapped out and memory full of dirty data on fuse" deadlock solved?
>
> By either
>
> 1) not allowing shared writable mappings
>
> 2) doing non-blocking asynchronous writepage
>
> In the first case there will never be dirty data, since normal writes
> go synchronously through the page cache.

Ok, this one works, I agree... But it will be way slower than coda's
file-backed approach, right?

> In the second case there is no deadlock, because the memory subsystem
> doesn't wait for data to be written. If the filesystem refuses to
> write back data in a timely manner, memory will get full and OOM
> killer will go to work. Deadlock simply cannot happen.

Hmmm, so if userspace part is swapped out and data is dirtied
"too quickly", OOM is practically guaranteed? That is not nice.

What if the userspace daemon is not fast enough to handle writes (going
through a slow network or something)? Is there some mechanism to
throttle back the writer, or will it just OOM?
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-11-21 07:43:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> Ok, this one works, I agree... But it will be way slower than coda's
> file-backed approach, right?

No. In both cases there will be two memory copies:

- from userspace fs to pagecache
- from pagecache to read buffer

And vice versa for write. FUSE will have some overhead of context
switches, but in the optimal case (sequential read), the readahead
code will bundle read requests, and with a 128MByte read, the cost of
a context switch is probably infinitesimal.

> > In the second case there is no deadlock, because the memory subsystem
> > doesn't wait for data to be written. If the filesystem refuses to
> > write back data in a timely manner, memory will get full and OOM
> > killer will go to work. Deadlock simply cannot happen.
>
> Hmmm, so if userspace part is swapped out and data is dirtied
> "too quickly", OOM is practically guaranteed? That is not nice.

Yeah. I imagined that the kernel won't assign all its free pages to
file mappings, but people on the list enlightened me to the contrary.

There seem to be two kinds of solutions:

1) make the userspace part aware of the memory allocation problems

2) limit the number of pages that can be dirtied


I don't really like 1) because that really kills the point of
implementing the filesystem in userspace.

So I would go along the lines of 2). However there is no way to know
when pages are dirtied (there is no fault), so accounting the dirty
pages exactly is not possible. However accounting the _writable_
pages should be possible with no overhead, since there is a fault when
the page of a mapping is first touched.

Limiting these pages for FUSE would mean that the user could do
writable mmap, but with the performance penalty of having a limited
number of pages present in memory. If the userspace FS refuses to
write back data, the filesystem user will be blocked until the
writeback is completed. No deadlock (hopefully).
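
The accounting idea can be sketched in a few lines of userspace C. This is
not the FUSE implementation: `sketch_wp_account` and both functions are
hypothetical, locking is ignored, and "blocking the faulting process" is
reduced to returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter of pages currently mapped writable, with a cap. */
struct sketch_wp_account {
    unsigned long writable;   /* pages currently accounted as writable */
    unsigned long limit;      /* maximum allowed at any one time */
};

/*
 * Called on the first write fault on a page. Returns true if the page may
 * be made writable now, false if the faulting process has to wait until
 * some other page has been written back.
 */
static bool sketch_account_writable(struct sketch_wp_account *a)
{
    if (a->writable >= a->limit)
        return false;
    a->writable++;
    return true;
}

/* Called once a page has been written back and write-protected again. */
static void sketch_unaccount_writable(struct sketch_wp_account *a)
{
    assert(a->writable > 0);
    a->writable--;
}
```

The cost shows up exactly where the text says it would: once the limit is
reached, every further write fault has to wait for a writeback to finish.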

Miklos

2004-11-21 07:51:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> And vice versa for write. FUSE will have some overhead of context
> switches, but in the optimal case (sequential read), the readahead
> code will bundle read requests, and with a 128MByte read, the cost of
                                             ^^^^^^^^
I meant 128kbyte of course. We don't have that much memory yet :)

Miklos

2004-11-21 09:51:42

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Sun, Nov 21, 2004 at 08:42:55 +0100, Miklos Szeredi wrote:
>
> > Ok, this one works, I agree... But it will be way slower than coda's
> > file-backed approach, right?
>
> No. In both cases there will be two memory copies:
>
> - from userspace fs to pagecache
> - from pagecache to read buffer
>
> And vice versa for write. FUSE will have some overhead of context
> switches, but in the optimal case (sequential read), the readahead
> code will bundle read requests, and with a 128MByte read, the cost of
> a context switch is probably infinitesimal.

But you won't bundle the *write* requests. And that's what will be
painfully slow.

Now the file-backed approach is the only one that will work for writes
without swapper deadlock.

> > > In the second case there is no deadlock, because the memory subsystem
> > > doesn't wait for data to be written. If the filesystem refuses to
> > > write back data in a timely manner, memory will get full and OOM
> > > killer will go to work. Deadlock simply cannot happen.
> >
> > Hmmm, so if userspace part is swapped out and data is dirtied
> > "too quickly", OOM is practically guaranteed? That is not nice.
>
> Yeah. I imagined that the kernel won't assign all its free pages to
> file mappings, but people on the list enlightened me to the contrary.
>
> There seem to be two kinds of solutions:
>
> 1) make the userspace part aware of the memory allocation problems
>
> 2) limit the number of pages that can be dirtied

3) Use a real disk file to back the page cache. Exactly like coda
does it.

> I don't really like 1) because that really kills the point of
> implementing the filesystem in userspace.
>
> So I would go along the lines of 2). However there is no way to know
> when pages are dirtied (there is no fault), so accounting the dirty
> pages exactly is not possible. However accounting the _writable_
> pages should be possible with no overhead, since there is a fault when
> the page of a mapping is first touched.
>
> Limiting these pages for FUSE would mean that the user could do
> writable mmap, but with the performance penalty of having a limited
> number of pages present in memory. If the userspace FS refuses to
> write back data, the filesystem user will be blocked until the
> writeback is completed. No deadlock (hopefully).

I would not be so sure about this. The deadlock is caused by the fact
that such pages may exist, not by their number. Limiting the number will
only decrease the probability the deadlock will happen, but won't make
it go away.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-21 10:32:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> But you won't bundle the *write* requests. And that's what will be
> painfully slow.

Well, painfully slow is just about 100Mbytes/sec on my 1GHz C3
machine. Be real.


> 3) Use a real disk file to back the page cache. Exactly like coda
> does it.

Yes, that could be done. Not exactly like coda, because the
filesystem would have no way of knowing exactly _what_ was written.
For many applications it's _way_ too inefficient to write back the
whole file on every little change. And writing back the file on
release would mean a) big latency b) inconsistency between the backing
filesystem and the virtual one.

Actually latency _is_ the major problem with the CODA like interface.
Try doing a 'less bigfile' with an sftp filesystem based on CODA and
FUSE, and you'll see what I mean. It will _feel_ a hell of a lot
slower with CODA, just because you'd have to wait for the whole file
to download to get the first byte.

> I would not be so sure about this. The deadlock is caused by the fact
> that such pages may exist, not by their number. Limiting the number will
> only decrease the probability the deadlock will happen, but won't make
> it go away.

No. The problem _is_ caused by their number, because it will only
happen if pages cannot be freed in any other way, only by doing
writeback to the userspace filesystem. If there are pages which _can_
be freed other ways, then deadlock won't happen.

Now if you limit the total number of pages that FUSE can use for
writable memory mappings to 10% of the total memory in the machine,
you can be pretty sure, that the remaining 90% will not be filled with
unpageable memory (otherwise you'd be in pretty big trouble anyway).

Miklos

2004-11-21 10:41:07

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Sun, Nov 21, 2004 at 11:31:58 +0100, Miklos Szeredi wrote:
> > But you won't bundle the *write* requests. And that's what will be
> > painfully slow.
>
> Well, painfully slow is just about 100Mbytes/sec on my 1GHz C3
> machine. Be real.

Measured under what conditions?

> > 3) Use a real disk file to back the page cache. Exactly like coda
> > does it.
>
> Yes, that could be done. Not exactly like coda, because the
> filesystem would have no way of knowing exactly _what_ was written.
> For many applications it's _way_ too inefficient to write back the
> whole file on every little change. And writing back the file on
> release would mean a) big latency b) inconsistency between the backing
> filesystem and the virtual one.
>
> Actually latency _is_ the major problem with the CODA like interface.
> Try doing a 'less bigfile' with an sftp filesystem based on CODA and
> FUSE, and you'll see what I mean. It will _feel_ a hell of a lot
> slower with CODA, just because you'd have to wait for the whole file
> to download to get the first byte.
>
> > I would not be so sure about this. The deadlock is caused by the fact
> > that such pages may exist, not by their number. Limiting the number will
> > only decrease the probability the deadlock will happen, but won't make
> > it go away.
>
> No. The problem _is_ caused by their number, because it will only
> happen if pages cannot be freed in any other way, only by doing
> writeback to the userspace filesystem. If there are pages which _can_
> be freed other ways, then deadlock won't happen.
>
> Now if you limit the total number of pages that FUSE can use for
> writable memory mappings to 10% of the total memory in the machine,
> you can be pretty sure, that the remaining 90% will not be filled with
> unpageable memory (otherwise you'd be in pretty big trouble anyway).

There may be only one dirty page in the whole system and it may be the
one backed by FUSE. Now yes, if it was not backed by FUSE, I would be in
trouble -- but the OOM killer would get me out. It will *NOT* get me out
with fuse, because it thinks the page will be cleaned, which it won't.

Thus by limiting the number of pages, you decrease the probability that
the deadlock will happen. You may even get to reasonably small numbers.
But you can't get to 0. The deadlock will still be there, it will just
be unlikely.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-21 11:29:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> Measured under what conditions?

dorka:/home/miko/fuse-bugfix# touch /tmp/fusenull
dorka:/home/miko/fuse-bugfix# example/null /tmp/fusenull
dorka:/home/miko/fuse-bugfix# time dd if=/dev/zero of=/tmp/fusenull bs=4096 count=262144
262144+0 records in
262144+0 records out
1073741824 bytes transferred in 12.591578 seconds (85274604 bytes/sec)

real 0m12.662s
user 0m0.189s
sys 0m8.156s
dorka:/home/miko/fuse-bugfix# fusermount -u /tmp/fusenull

fusenull is a /dev/null like filesystem implemented with the libfuse
API (see fuse sources).
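
For reference, the figure that dd printed can be rechecked from the numbers
in the transcript: 262144 blocks of 4096 bytes are 1073741824 bytes, and
dividing by 12.591578 seconds gives about 85.3 million bytes/sec, or roughly
81 MiB/sec. A small C sketch of that arithmetic, with hypothetical helper
names:

```c
#include <assert.h>

/* Total bytes moved by dd for a given block size and block count. */
static unsigned long long sketch_total_bytes(unsigned long long block_size,
                                             unsigned long long count)
{
    return block_size * count;
}

/* Average throughput in bytes per second over the elapsed time. */
static double sketch_throughput_bps(unsigned long long block_size,
                                    unsigned long long count,
                                    double seconds)
{
    assert(seconds > 0.0);
    return (double)sketch_total_bytes(block_size, count) / seconds;
}
```

Calling `sketch_throughput_bps(4096, 262144, 12.591578)` reproduces the rate
reported above to within rounding.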

> There may be only one dirty page in the whole system and it may be the
> one backed by FUSE. Now yes, if it was not backed by FUSE, I would be in
> trouble -- but the OOM killer would get me out. It will *NOT* get me out
> with fuse, because it thinks the page will be cleaned, which it won't.

OK, I see your point. But can't the memory subsystem be taught that
those pages are not guaranteed to be written back in a limited time?

Miklos

2004-11-21 11:58:01

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Sun, 21 Nov 2004, Miklos Szeredi wrote:
> OK, I see your point. But can't the memory subsystem be taught that
> those pages are not guaranteed to be written back in a limited time?

It already is. Your address space ->writepage can do

	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;

And that is fine. NTFS does this. As does Reiserfs I believe.

For NTFS I do it exactly when I hit -ENOMEM: I don't have enough
memory to complete the writepage, so I abort the write and redirty the
page so it gets tried again at a later time when more memory is freed. The
writeback control (wbc) ensures the VM doesn't just keep calling us trying
to clean the page to free it. It knows it is pointless so it gives up.
The OOM killer can then kill some other app which will free memory, and
then the writepage will be retried and it will succeed. Now I know
the fuse fs can be swapped out but why that would lead to a deadlock I
can't see. There always is something else to kill to free memory so the
fs can be swapped back in. And if the fs is killed surely all its pages
will be invalidated and thrown away by fuse, no?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2004-11-21 12:05:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> It already is. Your address space ->writepage can do
>
> redirty_page_for_writepage(wbc, page);
> unlock_page(page);
> return 0;
>
> And that is fine. NTFS does this. As does Reiserfs I believe.

As does fuse. Actually there might be a very limited number of pages
in transit (10 per mount currently), but above that writepage will not
attempt to send any more requests. I don't see what effect these
in-transit pages have on triggering the OOM killer.
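
The behaviour described here (a small cap on in-transit pages, and a
redirty for everything beyond it) can be modelled in userspace C. This is a
simplified sketch, not the FUSE source: the structures and functions are
hypothetical, the page state is reduced to two booleans, and the real
redirty_page_for_writepage() and unlock_page() calls are only mimicked by
assignments.

```c
#include <assert.h>
#include <stdbool.h>

/* Cap on requests in transit per mount, as quoted in the message above. */
#define SKETCH_MAX_IN_FLIGHT 10

struct sketch_mount {
    unsigned int in_flight;   /* writeback requests sent but not finished */
};

struct sketch_page {
    bool dirty;
    bool locked;
};

enum sketch_wp_result { SKETCH_WP_SENT, SKETCH_WP_REDIRTIED };

/* Toy writepage: send the page if below the cap, otherwise redirty it. */
static enum sketch_wp_result sketch_writepage(struct sketch_mount *m,
                                              struct sketch_page *p)
{
    assert(p->locked);   /* writepage is entered with the page locked */

    if (m->in_flight >= SKETCH_MAX_IN_FLIGHT) {
        p->dirty = true;     /* mimics redirty_page_for_writepage() */
        p->locked = false;   /* mimics unlock_page() */
        return SKETCH_WP_REDIRTIED;
    }

    m->in_flight++;          /* request handed to the userspace daemon */
    p->dirty = false;
    p->locked = false;
    return SKETCH_WP_SENT;
}

/* Called when the daemon has answered one write request. */
static void sketch_request_done(struct sketch_mount *m)
{
    assert(m->in_flight > 0);
    m->in_flight--;
}
```

In this model the eleventh page comes back redirtied and unlocked without
waiting for anything, and it is accepted again only after one of the
outstanding requests has completed.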

> For NTFS I do it exactly when I get to -ENOMEM so that I don't have enough
> memory to complete the writepage so I abort the write and redirty the page
> so it gets tried again at a later time when more memory is freed. The
> writeback control (wbc) ensures the VM doesn't just keep calling us trying
> to clean the page to free it. It knows it is pointless so it gives up.
> The OOM killer can then kill some other app which will free memory, and
> then the writepage will be retried and it will succeed. Now I know
> the fuse fs can be swapped out but why that would lead to a deadlock I
> can't see. There always is something else to kill to free memory so the
> fs can be swapped back in. And if the fs is killed surely all its pages
> will be invalidated and thrown away by fuse, no?

They will.

Miklos

2004-11-21 18:35:12

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> So I would go along the lines of 2). However there is no way to know
> when pages are dirtied (there is no fault), so accounting the dirty
> pages exactly is not possible. However accounting the _writable_
> pages should be possible with no overhead, since there is a fault when
> the page of a mapping is first touched.

Ugh, this is going to be "interesting". Perhaps it can have little overhead,
but hacking pagefault handlers is going to be hard.
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-11-23 04:32:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> Ugh, this is going to be "interesting". Perhaps it can have little
> overhead, but hacking pagefault handlers is going to be hard.

Well yes. The third option (mentioned by Jan Hudec) of backing dirty
pages with a disk file is also appealing, but does not seem to me any
simpler.

Miklos


2004-11-24 06:17:33

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi Andrew,

On Thursday 18 November 2004 15:57, Andrew Morton wrote:
> I've seen one 2.4-based project which had essentially a userspace
> blockdevice driver. Marking that special, trusted process PF_MEMALLOC did
> indeed fix low-on-memory deadlocks. Obviously it's something one does with
> caution, but there are times when it makes sense.

Like the cluster stack, unless we're happy with inhaling all the membership,
failover, fencing and etc code into the kernel.

> I think there are codepaths which unconditionally turn off PF_MEMALLOC, so
> they need to be tweaked to do a save/set/restore operation for it all to
> work.

The only one I spotted is in dm-ioctl.c. We get away with the one in
page_alloc.c by branching around it in PF_MEMALLOC mode.

Regards,

Daniel

2004-11-24 12:15:46

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Alan Cox wrote:

>>I really do believe that user-space filesystems have problems. There's a
>>reason we tend to do them in kernel space.
>>
>>But limiting the outstanding writes some way may at least hide the thing.
>>
>>
>
>Possibly dumb question. Is there a reason we can't have a prctl() that
>flips the PF_* flags for a user space daemon in the same way as we do
>for kernel threads that do I/O processing ?
>
>
>
http://lkml.org/lkml/2004/7/26/68

discusses a userspace filesystem (implemented as a userspace nfs server
mounted on a loopback nfs mount), the problem, a solution (exactly your
suggestion), and a more generic solution.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-11-24 18:43:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> http://lkml.org/lkml/2004/7/26/68
>
> discusses a userspace filesystem (implemented as a userspace nfs server
> mounted on a loopback nfs mount), the problem, a solution (exactly your
> suggestion), and a more generic solution.

Thanks for the pointer, very interesting read.

However, I don't like the idea that the userspace filesystem must
cooperate with the kernel in this regard. With this you lose one of
the advantages of doing filesystem in userspace: namely that you can
be sure, that anything you do cannot bring the system down.

And I firmly believe that this can be done without having to special
case filesystem serving processes.

There are already "strange" filesystems in the kernel which cannot
really get rid of dirty data. I'm thinking of tmpfs and ramfs.
Neither of them are prone to deadlock, though both of them are "worse
off" than a userspace filesystem, in the sense that they have not even
the remotest chance of getting rid of the dirty data.

Of course, implementing this is probably not trivial. But I don't see
it as a theoretical problem as Linus does.

Is there something which I'm missing here?

Thanks,
Miklos

2004-11-24 23:00:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Both are limited in the number of pages they can dirty by the number
> of pages available. If you are willing to limit your filesystem the same
> way you enjoy the same benefit.

This limitation will in the worst case cause a performance problem,
but userspace filesystems will have inferior performance to in-kernel
filesystems anyway. So I don't regard this as a problem.

Miklos

2004-11-25 07:30:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> > There are already "strange" filesystems in the kernel which cannot
> > really get rid of dirty data. I'm thinking of tmpfs and ramfs.
> > Neither of them are prone to deadlock, though both of them are "worse
> > off" than a userspace filesystem, in the sense that they have not even
> > the remotest chance of getting rid of the dirty data.
> >
> > Of course, implementing this is probably not trivial. But I don't see
> > it as a theoretical problem as Linus does.
> >
> > Is there something which I'm missing here?
>
> But they KNOW that they won't be able to get rid of the dirty data. But
> fuse does not.

Why not? I can set bdi->memory_backed to 1 just like ramfs, implement
my own writeback thread, and voila, no deadlock.

Of course I believe, that it's probably easier to tweak the page cache
to teach it that fuse pages _can_ be written back, but not reliably
like a disk filesystem. And there's the small problem of limiting the
number of writable pages allocated to FUSE.

Miklos

2004-11-25 07:48:50

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Thu, Nov 25, 2004 at 08:29:58 +0100, Miklos Szeredi wrote:
>
> > > There are already "strange" filesystems in the kernel which cannot
> > > really get rid of dirty data. I'm thinking of tmpfs and ramfs.
> > > Neither of them are prone to deadlock, though both of them are "worse
> > > off" than a userspace filesystem, in the sense that they have not even
> > > the remotest chance of getting rid of the dirty data.
> > >
> > > Of course, implementing this is probably not trivial. But I don't see
> > > it as a theoretical problem as Linus does.
> > >
> > > Is there something which I'm missing here?
> >
> > But they KNOW that they won't be able to get rid of the dirty data. But
> > fuse does not.
>
> Why not? I can set bdi->memory_backed to 1 just like ramfs, implement
> my own writeback thread, and voila, no deadlock.

Yes, you can. You just have to take care never to occupy too much
memory.

> Of course I believe, that it's probably easier to tweak the page cache
> to teach it that fuse pages _can_ be written back, but not reliably
> like a disk filesystem. And there's the small problem of limiting the
> number of writable pages allocated to FUSE.

It's not that easy. How do you tell when the page is no longer likely to
get cleaned?

The file backing would be easier, but to be really easy, the interface
would be a bit different (and actually simpler, since it would need no
data channel, just a control one).

The trick is, that the coda file-granularity interface is not that hard
to extend to page-granularity. Several filesystems allow "files with
holes". So the fuse process could just touch a file and truncate it to
desired length on open. Then kernel would tell it which pages it wants
and the process would acknowledge when they are actually filled. For
write, kernel would just notify the process of dirty ranges and what --
and when -- the process does with that is not kernel's business.
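
A rough sketch of the control-only exchange described above, in Python
for brevity. The message names (WantPages, PagesFilled, DirtyRange) and
the BackingFile helper are hypothetical and are not part of coda or
FUSE; the page size is assumed. Only the os and tempfile calls are real.

```python
import os
import tempfile
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the sketch


@dataclass
class WantPages:      # kernel -> server: please fill these pages
    first: int
    count: int


@dataclass
class PagesFilled:    # server -> kernel: these pages now hold data
    first: int
    count: int


@dataclass
class DirtyRange:     # kernel -> server: these pages were written
    first: int
    count: int


class BackingFile:
    """Sparse local file standing in for one file of the userspace fs."""

    def __init__(self, path, size):
        self.path = path
        # "touch a file and truncate it to desired length on open"
        with open(path, "wb") as f:
            f.truncate(size)

    def handle(self, msg, fetch):
        """Fill the requested pages from fetch(page_no), then acknowledge."""
        with open(self.path, "r+b") as f:
            for page in range(msg.first, msg.first + msg.count):
                f.seek(page * PAGE_SIZE)
                f.write(fetch(page))
        return PagesFilled(msg.first, msg.count)


def demo():
    """Fill pages 2 and 3 of a 16-page backing file and read page 2 back."""
    with tempfile.TemporaryDirectory() as tmp:
        backing = BackingFile(os.path.join(tmp, "backing"), 16 * PAGE_SIZE)
        ack = backing.handle(WantPages(first=2, count=2),
                             fetch=lambda page: bytes([page]) * PAGE_SIZE)
        with open(backing.path, "rb") as f:
            f.seek(2 * PAGE_SIZE)
            data = f.read(PAGE_SIZE)
        return ack, os.path.getsize(backing.path), data[:4]
```

No file data crosses the channel here, only page ranges. DirtyRange is
defined but deliberately left unhandled: what the server does with it,
and when, is its own business.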

The legacy interface should still be easy to support in a library.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-25 09:53:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> > Why not? I can set bdi->memory_backed to 1 just like ramfs, implement
> > my own writeback thread, and voila, no deadlock.
>
> Yes, you can. Just you have to take care never to occupy too much
> memory.

Yes, this is the hard part IMO, because it should be done in a way
that when FUSE wants a new writable page and it's over the limit, it
should block until it goes under the limit. Not trivial at all.
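
A toy model of that blocking behaviour, in Python rather than kernel
code. DirtyPageLimiter and its method names are hypothetical; the real
difficulty (where in the page cache to hook this) is not addressed.

```python
import threading


class DirtyPageLimiter:
    """Blocks writers once `limit` pages are dirty (toy model)."""

    def __init__(self, limit):
        self.limit = limit
        self.dirty = 0
        self.cond = threading.Condition()

    def get_writable_page(self, timeout=None):
        """Account one new dirty page, waiting while the limit is reached.

        Returns False if the timeout expires before there is room.
        """
        with self.cond:
            ok = self.cond.wait_for(lambda: self.dirty < self.limit, timeout)
            if ok:
                self.dirty += 1
            return ok

    def page_written_back(self):
        """Called when the userspace filesystem has cleaned one page."""
        with self.cond:
            self.dirty -= 1
            self.cond.notify()
```

A writer that hits the limit sleeps until page_written_back() is called
from another thread; it never fails and never dirties more than the
limit allows.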

> > Of course I believe, that it's probably easier to tweak the page cache
> > to teach it that fuse pages _can_ be written back, but not reliably
> > like a disk filesystem. And there's the small problem of limiting the
> > number of writable pages allocated to FUSE.
>
> It's not that easy. How do you tell when the page is no longer likely to
> get cleaned?

Tell the page cache it's never going to get cleaned, but call the
writepage() anyway. It's sort of splitting the bdi->memory_backed
flag into two: dont_count_as_dirty and dont_writeback. Ramfs, etc
would set both, FUSE would set only dont_count_as_dirty. See what I'm
thinking of?
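
Spelled out as a small Python model (not kernel code). The two flag
names come from the paragraph above; BdiFlags and the helper functions
are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BdiFlags:
    dont_count_as_dirty: bool  # pages are not added to the dirty counter
    dont_writeback: bool       # writepage() is never called for the pages


RAMFS = BdiFlags(dont_count_as_dirty=True, dont_writeback=True)
FUSE = BdiFlags(dont_count_as_dirty=True, dont_writeback=False)
DISK = BdiFlags(dont_count_as_dirty=False, dont_writeback=False)


def on_page_dirtied(flags, global_dirty):
    """Return the global dirty count after one more page is dirtied."""
    return global_dirty if flags.dont_count_as_dirty else global_dirty + 1


def should_call_writepage(flags):
    """Writeback is still attempted unless the flag forbids it."""
    return not flags.dont_writeback
```

With this split, FUSE pages never make anyone wait on the dirty
counters, yet writepage() is still called for them.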

> The file backing would be easier, but to be really easy, the interface
> would be a bit different (and actually simpler, since it would need no
> data channel, just a control one).

That's not a problem.

> The trick is, that the coda file-granularity interface is not that hard
> to extend to page-granularity. Several filesystems allow "files with
> holes". So the fuse process could just touch a file and truncate it to
> desired length on open. Then kernel would tell it which pages it wants
> and the process would acknowledge when they are actually filled. For
> write, kernel would just notify the process of dirty ranges and what --
> and when -- the process does with that is not kernel's business.

Yes, this would be a nice interface. How would you solve the problem
of freeing the space which is no longer needed?

Imagine that you have a 4G virtual file on a FUSE filesystem, and some
process is reading that file sequentially. But there's only 10MByte
of free space on the local disk. With the above method you'd never be
able to read the whole file. Do you see the problem?
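
To put numbers on it, here is a small Python sketch. The function name
is made up, 4G and 10MByte are taken as binary units, and a 4096-byte
page is assumed. The can_free switch stands for some way of releasing
already-read pages from the backing file, which is exactly the piece
that is missing.

```python
PAGE = 4096  # assumed page size


def sequential_read(file_size, free_space, can_free):
    """Return how many bytes are read before the backing store fills up."""
    used = 0
    read = 0
    while read < file_size:
        if used + PAGE > free_space:
            if not can_free:
                break      # the backing file cannot grow any further
            used = 0       # assumed: already-read pages are released
        used += PAGE
        read += PAGE
    return read
```

Without a way to free space the reader stops after 10MByte; with one it
gets through all 4G while never holding more than 10MByte locally.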

Miklos

2004-11-26 22:01:46

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Wed, Nov 24, 2004 at 14:05:51 +0100, Miklos Szeredi wrote:
> > http://lkml.org/lkml/2004/7/26/68
> >
> > discusses a userspace filesystem (implemented as a userspace nfs server
> > mounted on a loopback nfs mount), the problem, a solution (exactly your
> > suggestion), and a more generic solution.
>
> Thanks for the pointer, very interesting read.
>
> However, I don't like the idea that the userspace filesystem must
> cooperate with the kernel in this regard. With this you lose one of
> the advantages of doing filesystem in userspace: namely that you can
> be sure, that anything you do cannot bring the system down.
>
> And I firmly believe that this can be done without having to special
> case filesystem serving processes.
>
> There are already "strange" filesystems in the kernel which cannot
> really get rid of dirty data. I'm thinking of tmpfs and ramfs.
> Neither of them are prone to deadlock, though both of them are "worse
> off" than a userspace filesystem, in the sense that they have not even
> the remotest chance of getting rid of the dirty data.
>
> Of course, implementing this is probably not trivial. But I don't see
> it as a theoretical problem as Linus does.
>
> Is there something which I'm missing here?

But they KNOW that they won't be able to get rid of the dirty data. But
fuse does not.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2004-11-27 04:00:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> > Of course I believe, that it's probably easier to tweak the page cache
> > to teach it that fuse pages _can_ be written back, but not reliably
> > like a disk filesystem. And there's the small problem of limiting the
> > number of writable pages allocated to FUSE.
>
> It's not that easy. How do you tell when the page is no longer likely to
> get cleaned?
>
> The file backing would be easier, but to be really easy, the interface
> would be a bit different (and actually simpler, since it would need no
> data channel, just a control one).
>
> The trick is, that the coda file-granularity interface is not that hard
> to extend to page-granularity. Several filesystems allow "files with
> holes". So the fuse process could just touch a file and truncate it to
> desired length on open. Then kernel would tell it which pages it wants
> and the process would acknowledge when they are actually filled. For
> write, kernel would just notify the process of dirty ranges and what --
> and when -- the process does with that is not kernel's business.

Well, it would work fine and nice... until you want to export file
with holes with fuse. That probably could be made to work, but...

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-11-27 17:08:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

On Fri, 19 Nov 2004, Miklos Szeredi wrote:

> The solution I'm thinking is along the lines of accounting the number
> of writable pages assigned to FUSE filesystems. Limiting this should
> solve the deadlock problem. This would only impact performance for
> shared writable mappings, which are rare anyway.

Note that NFS, and any filesystems on iSCSI or g/e/nbd block
devices have the exact same problem. To explain why this is
the case, let's start with the VM allocation and pageout
thresholds:

pages_min ------------------

GFP_ATOMIC ------------------

PF_MEMALLOC ------------------

0 ------------------

When writing out a dirty page, the pageout code is allowed
to allocate network buffers down to the PF_MEMALLOC boundary.

However, when receiving the ACK network packets from the server,
the network stack is only allowed to allocate memory down to the
GFP_ATOMIC watermark.

This means it is relatively easy to get the system to deadlock,
under a heavy shared mmap workload. Limiting the number of
simultaneous writeouts might make the problem harder to trigger,
but is still no solution since the network layer could exhaust
its allowed memory for other packets, and never get around to
processing the ACKs for the pageout related network traffic!
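
The asymmetry can be shown with a toy Python model. The page counts
below are invented and only their ordering follows the diagram above;
can_allocate() and writeout_round_trip() are hypothetical helpers, not
kernel functions.

```python
# Illustrative thresholds in pages; only their ordering matters.
PAGES_MIN = 256
GFP_ATOMIC_MARK = 128
PF_MEMALLOC_MARK = 32

FLOOR = {
    "normal": PAGES_MIN,
    "atomic": GFP_ATOMIC_MARK,     # e.g. the network receive path
    "memalloc": PF_MEMALLOC_MARK,  # e.g. the pageout code
}


def can_allocate(free_pages, context, pages=1):
    """True if `context` may take `pages` without going below its floor."""
    return free_pages - pages >= FLOOR[context]


def writeout_round_trip(free_pages):
    """Send one dirty page, then try to receive the server's ACK."""
    if not can_allocate(free_pages, "memalloc"):
        return "cannot send"
    free_pages -= 1  # buffer for the outgoing write
    if not can_allocate(free_pages, "atomic"):
        return "sent, ACK dropped"  # the page can never be marked clean
    return "sent and acknowledged"
```

Anywhere between the two lower marks the write goes out but its ACK
cannot be received, so the page stays dirty and no memory is freed.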

I have a solution in mind, but it's not pretty. It might be safe
now that DaveM no longer travels. Still have to come up with a
way to avoid being maimed by the other network developers, though...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-11-27 17:17:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> >The solution I'm thinking is along the lines of accounting the number
> >of writable pages assigned to FUSE filesystems. Limiting this should
> >solve the deadlock problem. This would only impact performance for
> >shared writable mappings, which are rare anyway.
>
> Note that NFS, and any filesystems on iSCSI or g/e/nbd block
> devices have the exact same problem. To explain why this is
> the case, let's start with the VM allocation and pageout
> thresholds:
>
> pages_min ------------------
>
> GFP_ATOMIC ------------------
>
> PF_MEMALLOC ------------------
>
> 0 ------------------
>
> When writing out a dirty page, the pageout code is allowed
> to allocate network buffers down to the PF_MEMALLOC boundary.
>
> However, when receiving the ACK network packets from the server,
> the network stack is only allowed to allocate memory down to the
> GFP_ATOMIC watermark.

Ouch... Shame on me, I never realized that this is a problem. I knew
that swapping over nbd does not work, but I did not realize that
writeout is as critical as swapping...

:-(. This means that read/write nbd is a pretty bad idea. I wonder why
it is not broken for people? Probably no one uses such a large amount of
data over nbd...

Can you describe that solution? You can do it anonymously if you want
to ;-)))).
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-11-30 18:47:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi wrote:

>>http://lkml.org/lkml/2004/7/26/68
>>
>>discusses a userspace filesystem (implemented as a userspace nfs server
>>mounted on a loopback nfs mount), the problem, a solution (exactly your
>>suggestion), and a more generic solution.
>>
>>
>
>Thanks for the pointer, very interesting read.
>
>However, I don't like the idea that the userspace filesystem must
>cooperate with the kernel in this regard. With this you lose one of
>the advantages of doing filesystem in userspace: namely that you can
>be sure, that anything you do cannot bring the system down.
>
>
I don't like it either.

>And I firmly believe that this can be done without having to special
>case filesystem serving processes.
>
>
I firmly believe the opposite. When a userspace process calls the kernel
which (directly or indirectly) calls the userspace filesystem, the
filesystem must have elevated privileges, or it can deadlock when
calling back in.

>There are already "strange" filesystems in the kernel which cannot
>really get rid of dirty data. I'm thinking of tmpfs and ramfs.
>Neither of them are prone to deadlock, though both of them are "worse
>off" than a userspace filesystem, in the sense that they have not even
>the remotest chance of getting rid of the dirty data.
>
>
>
As others have mentioned, they are limited in the number of pages they
are allowed to dirty.

>Of course, implementing this is probably not trivial. But I don't see
>it as a theoretical problem as Linus does.
>
>
>
I don't see a theoretical problem, just some practical ones.

All can be overcome IMO, and it would be well worth it, too.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-11-30 19:18:00

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> >And I firmly believe that this can be done without having to special
> >case filesystem serving processes.
> >
> >
> I firmly believe the opposite. When a userspace process calls the kernel
> which (directly or indirectly) calls the userspace filesystem, the
> filesystem must have elevated privileges, or it can deadlock when
> calling back in.

No.

Just by making the filesystem not contribute to the dirty counters
(like ramfs), the deadlock problem can be solved. In this case you
simply could not get a deadlock, because all those dirty pages
produced by the filesystem look just like normal allocations, which
simply cannot be touched when the userspace filesystem wants to
allocate more memory.

In this case the allocation would just fail. Deadlock doesn't happen,
though the filesystem wasn't given any elevated privileges.

Of course this isn't a good situation: the memory is filled in with
dirty data of the filesystem which it cannot write back. All sorts of
programs might fail because of the OOM situation, and the system could
even crash.

The hard part is ensuring that this doesn't happen, but that has
nothing to do with the userspace part of the filesystem.

> As others have mentioned, they are limited in the number of pages they
> are allowed to dirty.

I looked at ramfs, it isn't even limited. You can easily crash your
system just by filling it up with data, but no deadlock will happen.
Same situation as described above. Nothing strange about this, so I
don't really understand why people so vehemently believe that the
"userspace filesystem deadlock" phenomenon cannot be solved.

> I don't see a theoretical problem, just some practical ones.
>
> All can be overcome IMO, and it would be well worth it, too.

We are in agreement then :)

Thanks,
Miklos

2004-11-30 20:04:11

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi wrote:

>>>And I firmly believe that this can be done without having to special
>>>case filesystem serving processes.
>>>
>>>
>>>
>>>
>>I firmly believe the opposite. When a userspace process calls the kernel
>>which (directly or indirectly) calls the userspace filesystem, the
>>filesystem must have elevated privileges, or it can deadlock when
>>calling back in.
>>
>>
>
>No.
>
>Just by making the filesystem not contribute to the dirty counters
>(like ramfs), the deadlock problem can be solved. In this case you
>simply could not get a deadlock, because all those dirty pages
>produced by the filesystem look just like normal allocations, which
>simply cannot be touched when the userspace filesystem wants to
>allocate more memory.
>
>In this case the allocation would just fail. Deadlock doesn't happen,
>though the filesystem wasn't given any elevated privileges.
>
>Of course this isn't a good situation: the memory is filled in with
>dirty data of the filesystem which it cannot write back. All sorts of
>
>
You're describing the deadlock here: all memory is full, no process
which allocates memory can make any progress. This is not a true oom
situation: there can be plenty of memory in dirty pagecache which we
could reclaim if we had that tiny bit of reserve memory.

>I looked at ramfs, it isn't even limited. You can easily crash your
>system just by filling it up with data, but no deadlock will happen.
>
>
Right. But ramfs doesn't call a userspace process which calls the kernel
right back.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-11-30 21:17:32

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> you're describing the deadlock here: all memory is full, no process
> which allocates memory can make any progress.

Yes, they can: the allocation will fail, the function will return -ENOMEM,
malloc will return NULL, pagefault will fail with OOM. This is
progress, though not the best sort. It is most certainly _not_ a
deadlock.

> This is not a true oom situation: there can be plenty of memory in
> dirty pagecache which we could reclaim if we had that tiny bit of
> reserve memory.

The amount of reserved memory that would be needed depends upon the
filesystem. Some filesystems would need only very little to be able
to free some memory, some would need a lot (e.g. a bzip2 compressing
filesystem). There's no magic solution with reserving memory.

And this is not unique to userspace filesystems, as Rik van Riel
pointed out earlier, network filesystems are also prone to deadlock:

http://lkml.org/lkml/2004/11/27/81

> >I looked at ramfs, it isn't even limited. You can easily crash your
> >system just by filling it up with data, but no deadlock will happen.
> >
> >
> Right. But ramfs doesn't call a userspace process which calls the kernel
> right back.

Doesn't matter one little whit. The end result is the same: Out Of
Memory, which is _not_ equivalent to deadlock. Please think it over.

Miklos

2004-11-30 21:38:21

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi wrote:

>>you're describing the deadlock here: all memory is full, no process
>>which allocates memory can make any progress.
>>
>>
>
>Yes, they can: the allocation will fail, the function will return -ENOMEM,
>malloc will return NULL, pagefault will fail with OOM. This is
>progress, though not the best sort. It is most certainly _not_ a
>deadlock.
>
>
>
Looks like we are in a deadlock here :)

However you choose to call it, it is unacceptable IMO.

>>This is not a true oom situation: there can be plenty of memory in
>>dirty pagecache which we could reclaim if we had that tiny bit of
>>reserve memory.
>>
>>
>
>The amount of reserved memory that would be needed depends upon the
>filesystem. Some filesystems would need only very little to be able
>to free some memory, some would need a lot (e.g. a bzip2 compressing
>filesystem). There's no magic solution with reserving memory.
>
>
So the userspace filesystem would pass that amount to the kernel. It's
not pretty, but it is workable.

>And this is not unique to userspace filesystems, as Rik van Riel
>pointed out earlier, network filesystems are also prone to deadlock:
>
>http://lkml.org/lkml/2004/11/27/81
>
>
>
This looks like a bug to me. Maybe jiggling the thresholds would help.

>>>I looked at ramfs, it isn't even limited. You can easily crash your
>>>system just by filling it up with data, but no deadlock will happen.
>>>
>>>
>>>
>>>
>>Right. But ramfs doesn't call a userspace process which calls the kernel
>>right back.
>>
>>
>
>Doesn't matter one little whit. The end result is the same: Out Of
>Memory, which is _not_ equivalent to deadlock. Please think it over.
>
>
The situation with userspace filesystems is:

- some process allocates memory, blocking on kswapd as memory is full
- kswapd calls userspace filesystem to free memory
- userspace filesystem calls kernel, which allocates memory and blocks
  on kswapd
- eventually all processes in the system block on kswapd

I have observed (and fixed) this on a real system.
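
The steps listed above form a cycle of waiters, which a few lines of
Python can make explicit. This is only a wait-for graph over made-up
node names, not a model of the actual VM code.

```python
def find_cycle(waits_for, start):
    """Follow 'X waits for Y' edges from start; return the cycle, if any."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ends, so nobody waits forever
    return path[path.index(node):] + [node]


# Edges taken from the steps listed above.
userspace_fs = {
    "process": "kswapd",    # allocation blocks on reclaim
    "kswapd": "fs server",  # reclaim needs the page written back
    "fs server": "kswapd",  # the server's own allocation blocks too
}

# ramfs pages are never handed to anyone for writeback.
ramfs = {"process": "kswapd"}
```

find_cycle(userspace_fs, "process") returns the kswapd / fs server
loop, while the ramfs graph has no cycle at all.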

With ramfs, once it accounts for memory, there would be no deadlock and
no oom.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-11-30 21:54:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Hi!

> >Of course, implementing this is probably not trivial. But I don't see
> >it as a theoretical problem as Linus does.
>
> I don't see a theoretical problem, just some practical ones.
>
> All can be overcome IMO, and it would be well worth it, too.

Well, coda does overcome this one, so it is possible :-).

Of course, coda works on whole files, which has other
implications. Yes, I'd very much like to see this solved. I was
developing/using uservfs for quite long and it is very nice for the
user.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-11-30 21:59:22

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

> Looks like we are in a deadlock here :)
>
> However you choose to call it, it is unacceptable IMO.

That I can agree with. I never said that this was a full solution,
only that this shows that the deadlock is not inherent in userspace
filesystems.

> So the userspace filesystem would pass that amount to the kernel. It's
> not pretty, but it is workable.

Not pretty is an understatement IMO.

> >And this is not unique to userspace filesystems, as Rik van Riel
> >pointed out earlier, network filesystems are also prone to deadlock:
> >
> >http://lkml.org/lkml/2004/11/27/81
> >
> >
> >
> This looks like a bug to me. Maybe jiggling the thresholds would help.

Yes, and it's the jiggling I want to avoid.

> The situation with userspace filesystems is:
>
> some process allocates memory, blocking on kswapd as memory is full
> kswapd calls userspace filesystem to free memory
> userspace filesystem calls kernel, which allocates memory and blocks
> on kswapd
> eventually all processes in the system block on kswapd
>
> I have observed (and fixed) this on a real system.

I have observed it too (not yet fixed, but working on it). But
realize that my proposal would exempt userspace filesystem pages from
being blocked on by kswapd. That's a fundamental difference.

Since you don't believe me, I'll have to make an implementation, so
you can experiment with it. And if you'll still be able to cause a
deadlock, I'll subject myself to extreme repentance, and promise never
to touch an operating system ever again :)

> with ramfs, once it accounts for memory, there would be no deadlock and
> no oom.

And once fuse accounts for memory there will be no deadlock and no oom.
See the symmetry?

Miklos

2004-11-30 23:01:51

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace

Miklos Szeredi wrote:

>I have observed it too (not yet fixed, but working on it). But
>realize that my proposal would exempt userspace filesystem pages from
>being blocked on by kswapd. That's a fundamental difference.
>
>Since you don't believe me, I'll have to make an implementation, so
>you can experiment with it. And if you'll still be able to cause a
>deadlock, I'll subject myself to extreme repentance, and promise never
>to touch an operating system ever again :)
>
>
>
>>with ramfs, once it accounts for memory, there would be no deadlock and
>>no oom.
>>
>>
>
>And once fuse accounts for memory there will be no deadlock and no oom.
>See the symmetry?
>
>
>
If you plan on partitioning system memory into non-fuse and fuse
memory, yes, that could work. But it's horribly inflexible - right now
memory is balanced dynamically according to actual use. You may also
have a hard time with mmap.

My proposal (with the per-process allocation thresholds) only reserves
a small amount of memory to the fuse(s), with the rest allocated
dynamically using the normal kernel policies, with no special
restrictions on fuse.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-11-30 23:26:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace


> If you plan on partitioning system memory into non-fuse and fuse
> memory, yes, that could work. but it's horribly inflexible - right now
> memory is balanced dynamically according to actual use.

No partitioning is needed. If fuse doesn't consume too much memory
for dirty data buffers that memory is free to use for other purposes.

But fuse would be limited in the number of pages which it can use for
dirty buffers exactly to prevent it from causing OOM.

> you may also have a hard time with mmap.

What sort? You can mmap a 4G file on a machine with 32M memory. More
memory can improve performance, of course, but otherwise the amount of
memory doesn't matter.

> my proposal (with the per-process allocation thresholds) only reserves
> a small amount of memory to the fuse(s), with the rest allocated
> dynamically using the normal kernel policies, with no special
> restrictions on fuse.

Yes, but what you reserve (which may be large for some filesystems) is
totally unusable memory except in the special case of helping writeout
in a low-memory situation, while in my solution the rest of the system
is not limited, only the fuse filesystem.

There's not that much difference between what we are saying, but as I
said, I detest the thought that the filesystem process has to be
special, and I'm prepared to give up some flexibility and performance
for this.

Miklos