2014-04-15 14:00:16

by Emmanuel Colbus

Subject: [RFC][1/11][MANUX] Kernel compatibility : ext2

With regard to the adaptations I've made, the most notable ones apply
to ext2. Here are the choices I've made:

- I've assigned myself identifier 5 as creator OS in the superblock.
Is that okay? (The OS-dependent fields currently have the same
interpretation as under Linux, but I have still chosen to take a
separate identifier, in case that changes.)
- I've taken identifier 0x08 as a new read-only compatible feature
named "ext2l". I know from experience that the Linux kernel accepts it,
but as for you, do you have any objection?
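
As a rough illustration of the two superblock fields involved (this is not MANUX code), the sketch below reads s_creator_os and the feature words from a raw ext2 superblock and applies the usual rule that an unknown read-only compatible bit forces a read-only mount. The field offsets are those of the standard ext2 superblock; the constant names, the helper functions, and the values 5 and 0x08 simply restate the proposal above and are not official assignments.

```python
import struct

EXT2_SUPER_MAGIC = 0xEF53
CREATOR_OS_MANUX = 5      # identifier proposed in this mail, not an official value
RO_COMPAT_EXT2L = 0x08    # bit proposed in this mail for "ext2l"


def parse_superblock(sb: bytes) -> dict:
    """Extract the creator OS and feature words from a raw ext2 superblock."""
    (magic,) = struct.unpack_from("<H", sb, 56)          # s_magic
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError("not an ext2 superblock")
    (creator_os,) = struct.unpack_from("<I", sb, 72)     # s_creator_os
    # s_feature_compat, s_feature_incompat, s_feature_ro_compat
    compat, incompat, ro_compat = struct.unpack_from("<III", sb, 92)
    return {"creator_os": creator_os, "compat": compat,
            "incompat": incompat, "ro_compat": ro_compat}


def mount_mode(fields: dict, known_ro_compat: int) -> str:
    """Any ro_compat bit the implementation does not know forces read-only."""
    unknown = fields["ro_compat"] & ~known_ro_compat
    return "ro" if unknown else "rw"


def make_superblock(creator_os: int, ro_compat: int) -> bytes:
    """Build a minimal, otherwise empty superblock for experimentation."""
    sb = bytearray(1024)
    struct.pack_into("<H", sb, 56, EXT2_SUPER_MAGIC)
    struct.pack_into("<I", sb, 72, creator_os)
    struct.pack_into("<I", sb, 100, ro_compat)
    return bytes(sb)
```

With these helpers, an implementation that only knows ro_compat bits 0x01 to 0x04 would mount an "ext2l" filesystem read-only, while one that also knows 0x08 would mount it read-write.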


Now, let's go to what is likely going to be the most controversial
development.

In my operating system, I needed an additional piece of information
in directory inodes: the last time the directory's *PARENT*
changed. Of course, in my little personal "ext2l" partitions, I can do
this with no problem, and the same is true of the partitions I mark with
my OS as creator OS. The issue is that I also needed it in other
partitions, including Linux-created ext2 ones. Thus, I have used the
osd1 field for this, including in cases where I'm *not* the creator OS.
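
To make the location of that field concrete, here is a minimal sketch that reads and writes osd1 in a raw 128-byte ext2 inode. The offset (36, right after i_flags) is that of the classic on-disk inode; treating the value as a 32-bit Unix timestamp, and the function names, are assumptions made for illustration only.

```python
import struct

INODE_SIZE = 128
OSD1_OFFSET = 36   # osd1 follows i_flags in the classic on-disk ext2 inode


def get_parent_change_time(raw_inode: bytes) -> int:
    """Read osd1, interpreted here (assumption) as a 32-bit Unix timestamp."""
    (value,) = struct.unpack_from("<I", raw_inode, OSD1_OFFSET)
    return value


def set_parent_change_time(raw_inode: bytearray, when: int) -> None:
    """Record the time at which the directory's parent last changed."""
    struct.pack_into("<I", raw_inode, OSD1_OFFSET, when & 0xFFFFFFFF)
```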

Well, at this point, you're likely to say: "Hey, you can't do that,
this isn't allowed!". Yes, I know, and I will certainly have to
give up the ability to mount Hurd-created ext2 partitions in read-write
mode. However, before you start throwing random (and non-random)
things at me, I would like to mention two things:

1) Except for the Hurd, no OS has ever made use of this field in 20
years or so;
2) I don't actually care whether this data is *kept* by any other operating
system. That is, if Linux always smashes it to 0, I won't complain - in
fact, this would be optimal behaviour. That's because this
information is only useful *during* my operations, between boot and
shutdown, not *across* boots.

So my third question is : as far as you're concerned, is this an
acceptable behaviour?

(By the way, if you have issues with it, I can propose a solution.
Initially, I simply thought I could take a new bit in the read-write
compatible functions, and then mark all the filesystems I would use this
way with this bit. However, I noticed this wouldn't work, because if
Linux suddenly decided to make use of this field, it would need a way to
tell my kernel about this, so we would also need to choose a second bit
to mean "This filesystem actually uses the osd1 field, don't touch it".
However, once this is done, since I don't care that my own data in this
field be preserved, the first bit would become useless... Thus, the
solution would simply be that you choose an unused bit in the read-write
compatible functions to mean "leave the osd1 field alone!", so that you
can set it if you ever decide to make use of this field; and that I
simply test it and refuse to mount these partitions.)
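
The single-bit variant proposed in the parenthesis above boils down to one test at mount time. The sketch below is purely hypothetical: the bit value 0x8000 and both names are invented for illustration, and no such flag is defined for ext2, ext3 or ext4.

```python
COMPAT_OSD1_IN_USE = 0x8000   # hypothetical bit; not a defined ext2/3/4 flag


def manux_may_mount(feature_compat: int) -> bool:
    """Refuse filesystems whose owner has declared that osd1 holds live data."""
    return not (feature_compat & COMPAT_OSD1_IN_USE)
```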

Thank you,

Emmanuel


2014-04-15 20:05:06

by Theodore Ts'o

Subject: Re: [RFC][1/11][MANUX] Kernel compatibility : ext2

On Tue, Apr 15, 2014 at 03:42:43PM +0200, Emmanuel Colbus wrote:
> The issue is that I also needed it in other
> partitions, including linux-created ext2 ones. Thus, I have used the
> osd1 field for this, including in cases where I'm *not* the creator OS.
>
> Well, at this point, you're likely to say : "Hey, you can't do that,
> this isn't allowed!"......

It seems really strange that you are asking for permission here. You're
doing your own operating system, so you don't need to ask permission.
You can do whatever you want. The flip side is that even if it's "ok"
for now, we could make changes in the future that might break
assumptions that you are making today. And if that happens, you
shouldn't expect that we will do anything for your convenience, just
because someone in the past said, "I give you my permission".

> 1) Except for the Hurd, no OS has ever made use of this field in 20
> years or so;

In this specific case, the osd1 field is in fact used by ext4, as the
i_version field, which is required for NFSv4 support. The ext4 file
system is a superset of ext2, and in fact some distributions have
started shipping with a configuration which allows the ext4 file
system code to mount ext2 and ext3 file systems.

> (By the way, if you have issues with it, I can propose a solution.
> Initially, I simply thought I could take a new bit in the read-write
> compatible functions, and then mark all the filesystems I would use this
> way with this bit. However, I noticed this wouldn't work, because if
> Linux suddenly decided to make use of this field, it would need a way to
> tell my kernel about this, so we would also need to choose a second bit
> to mean "This filesystem actually uses the osd1 field, don't touch it".
> However, once this is done, since I don't care that my own data in this
> field be preserved, the first bit would become useless... Thus, the
> solution would simply be that you choose an unused bit in the read-write
> compatible functions to mean "leave the osd1 field alone!", so that you
> can set it if you ever decide to make use of this field; and that I
> simply test it and refuse to mount these partitions.)

Um, no. The Linux implementation gets to use any unclaimed fields in
the inode or the superblock, and we're not necessarily going to go out
of our way and add extra complexity to the ext4 kernel code, just to
accommodate every single random OS developer who decides they want to
go off and do their own thing. This is a strategy which simply
doesn't scale. Can you imagine what might happen if people start
coming out of the woodwork demanding special accommodation for Tomix,
Dikux, and Harrix?

If you want to start off by cloning our code or our design, you're
completely free to do that. That's what an open source license is all
about. But you don't get to dictate to the upstream that they make
changes to accommodate MANUX extensions. If you want to try to
negotiate with us --- maybe, although there are some fields, such as
inode fields, which are extremely precious, where there would have to be a
really, really good reason. So you'll need to tell us what it is that
you want the extra field for, and why it would be to the benefit of
the greater ext4 community that we accommodate you.

Best regards,

- Ted

2014-04-15 21:50:46

by Emmanuel Colbus

Subject: Re: [RFC][1/11][MANUX] Kernel compatibility : ext2

On 15/04/2014 22:04, Theodore Ts'o wrote:
> On Tue, Apr 15, 2014 at 03:42:43PM +0200, Emmanuel Colbus wrote:
>> The issue is that I also needed it in other
>> partitions, including linux-created ext2 ones. Thus, I have used the
>> osd1 field for this, including in cases where I'm *not* the creator OS.
>>
>> Well, at this point, you're likely to say : "Hey, you can't do that,
>> this isn't allowed!"......
>
> It seems really strange that you are asking for permission here. You're
> doing your own operating system, so you don't need to ask permission.
> You can do whatever you want. The flip side is that even if it's "ok"
> for now, we could make changes in the future that might break
> assumptions that you are making today. And if that happens, you
> shouldn't expect that we will do anything for your convenience, just
> because someone in the past said, "I give you my permission".

Well, of course you can't *disallow* it, but I think it's better for me
to at least hear your opinion on it.

>
>> 1) Except for the Hurd, no OS has ever made use of this field in 20
>> years or so;
>
> In this specific case, the osd1 field is in fact used by ext4, as the
> i_version field, which is required for NFSv4 support. The ext4 file
> system is a superset of ext2, and in fact some distributions have
> started shipping with a configuration which allows the ext4 file
> system code to mount ext2 and ext3 file systems.

Ah, I didn't know about this (ext4 being used to mount ext2).

By the way, just asking : if this is an NFS version field, is the
content of this field significant when no NFS has ever been used to
export this filesystem?

(Yes, I understand that I'm taking risks here, AND that your answer
won't be definitive...)

>
>> (By the way, if you have issues with it, I can propose a solution.
>> Initially, I simply thought I could take a new bit in the read-write
>> compatible functions, and then mark all the filesystems I would use this
>> way with this bit. However, I noticed this wouldn't work, because if
>> Linux suddenly decided to make use of this field, it would need a way to
>> tell my kernel about this, so we would also need to choose a second bit
>> to mean "This filesystem actually uses the osd1 field, don't touch it".
>> However, once this is done, since I don't care that my own data in this
>> field be preserved, the first bit would become useless... Thus, the
>> solution would simply be that you choose an unused bit in the read-write
>> compatible functions to mean "leave the osd1 field alone!", so that you
>> can set it if you ever decide to make use of this field; and that I
>> simply test it and refuse to mount these partitions.)
>
> Um, no. The Linux implementation gets to use any unclaimed fields in
> the inode or the superblock, and we're not necessarily going to go out
> of our way and add extra complexity to the ext4 kernel code, just to
> accommodate every single random OS developer who decides they want to
> go off and do their own thing. This is a strategy which simply
> doesn't scale. Can you imagine what might happen if people start
> coming out of the woodwork demanding special accommodation for Tomix,
> Dikux, and Harrix?

Yes, I understand. Sadly, that's something I was fearing...

>
> If you want to start off by cloning our code or our design, you're
> completely free to do that. That's what an open source license is all
> about. But you don't get to dictate to the upstream that they make
> changes to accommodate MANUX extensions. If you want to try to
> negotiate with us --- maybe, although there are some fields, such as
> inode fields, which are extremely precious, where there would have to be a
> really, really good reason. So you'll need to tell us what it is that
> you want the extra field for, and why it would be to the benefit of
> the greater ext4 community that we accommodate you.

Absolutely. You're the ones who have a serious OS here, and I was only
asking you about this - undoubtedly - weird thing I had done, certainly
not attempting to dictate anything to you. I asked, and I understand that
you're not exactly giving me a green light here (to put it mildly :-) ).

And, if you felt that I was attempting to *dictate* anything to you, I'm
deeply sorry, and I would like to offer you my apologies. I would like
to assure you that I had absolutely no intent to do anything like this,
and that when I wrote "I can propose a solution", I actually meant it
*only* as a proposal, and that I've fully registered your rejection
of it.



Oh, by the way, you said: "So you'll need to tell us what it is that
you want the extra field for, and why it would be to the benefit of
the greater ext4 community that we accommodate you."

Well, I don't have much hope that you'll accommodate me, but since I
don't see how I could do any harm by telling you what I'm doing with
this field, I'll do it.

My OS heavily uses chroots for security purposes (these are not true
Linux-like chroots, but this isn't relevant). One of the issues with
chroots is that one can escape from them, simply by having one process
open a fd on a directory, another one move that directory inside a
second directory located outside of the first process' chroot, and then
having the first process perform fchdir(fd); followed by enough
chdir(".."); calls, or something of the like. To prevent this, I decided
to store in each directory the time of the last change of its ".." entry.
This way, whenever a process tries to perform such an action, I check
whether this time is later than or equal to the creation time of its
chroot. If it is, I perform additional safety checks to ensure the
directory is actually still within the chroot, and deny the operation if
it is not.
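
The check described above can be modelled in a few lines. This is only a sketch of the idea on an in-memory tree, not the MANUX implementation: the Dir class, the function names and the integer clock are all assumptions, and parent_changed_at stands in for the value kept in osd1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Dir:
    name: str
    parent: Optional["Dir"] = None
    parent_changed_at: int = 0   # stand-in for the timestamp kept in osd1


def move(d: Dir, new_parent: Dir, now: int) -> None:
    """Re-parent a directory and record when its '..' entry changed."""
    d.parent = new_parent
    d.parent_changed_at = now


def is_within(d: Optional[Dir], root: Dir) -> bool:
    """Slow path: walk up the tree and look for the chroot's root."""
    while d is not None:
        if d is root:
            return True
        d = d.parent
    return False


def resolve_dotdot(d: Dir, root: Dir, chroot_created_at: int) -> Dir:
    """Return the directory reached by '..' from d, or deny the operation."""
    if d is root:
        return root                      # '..' at the chroot's root stays put
    if d.parent_changed_at < chroot_created_at:
        return d.parent or d             # fast path: not moved since the chroot exists
    if is_within(d, root):
        return d.parent or d             # moved, but still inside the chroot
    raise PermissionError("directory was moved out of the chroot")
```

The fast path costs one comparison per ".." step; the walk up the tree is only paid for directories that were moved after the chroot was created.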

It seems that the Capsicum developers have encountered the same problem,
and they decided to solve it by simply disallowing any operation on
".."; so I would like to mention that this solution is, in my opinion,
an alternative that allows higher flexibility.


Alright, then. Here's what I plan to do :
- In the short term, I'm going to continue with what I'm currently doing
with ext2 filesystems, but warn my users against mounting such a
filesystem in read-write mode if they're also mounting it with ext4 and
exporting it with NFS;
- When I implement a generic solution for this problem, I'll simply use
it for ext2 too, except perhaps for my ext2l partitions. I mean, I
already knew I couldn't count on it forever, simply because there are
far too many filesystems out there; all I was hoping was that you would
tell me this wasn't an issue for the good old ext2. I understand this
isn't the case.

>
> Best regards,
>
> - Ted
>

Respectfully,

Emmanuel

2014-04-15 22:27:52

by Theodore Ts'o

Subject: Re: [RFC][1/11][MANUX] Kernel compatibility : ext2

On Tue, Apr 15, 2014 at 11:47:56PM +0200, Emmanuel Colbus wrote:
>
> By the way, just asking : if this is an NFS version field, is the
> content of this field significant when no NFS has ever been used to
> export this filesystem?

No. But ext4 may end up changing the value of the i_version field,
from one non-zero value to some other non-zero value.

> My OS heavily uses chroots for security purposes (these are not true
> Linux-like chroots, but this isn't relevant). One of the issues of
> chroots is that one can escape from them, by simply having one process
> open a fd towards a directory, another one move the directory inside a
> second directory located outside of the first process' chroot, and then
> have the first process perform enough fchdir(fd); chdir(".."); or something of
> the like.

If there is a process outside of the chroot which is
cooperating to help someone break out of the chroot, that means you
have a bad actor who is outside of the chroot already. So why bother
worrying about this case?

The more interesting way to break out of a chroot, which doesn't
require a 2nd process to help you escape, is to chroot while inside a
chroot:

http://www.bpfh.net/simes/computing/chroot-break.html

And if you care about this problem, Linux has a much more general
solution using mount namespaces. FreeBSD has its own solution
involving restrictions on chroot:

http://www.freebsd.org/cgi/man.cgi?query=chroot&sektion=2&apropos=0&manpath=FreeBSD+4.0-RELEASE


> Alright, then. Here's what I plan to do :
> - In the short term, I'm going to continue with what I'm currently doing
> with ext2 filesystems, but warn my users against mounting such a
> filesystem in read-write mode if they're also mounting it with ext4 and
> exporting it with NFS;

The main issue is: what is the goal of creating your own OS? If
it's for your own education, that's great. Have fun, it's a great way
to learn. If you're going to actually try to market this to other
users, you should make sure you understand how much effort it takes to
support a new file system, let alone a new operating system. Hurd
tried to go down a path somewhat like yours, and it's taken them
years, and the result from a performance point of view is still pretty
bad. Keep in mind that ext2 has many limitations, including crash
recovery, performance, and scalability.

If you are planning on creating a production quality OS using ext2 as
its base, it does seem a little naive, though.

Regards,

- Ted

2014-04-16 02:15:06

by Emmanuel Colbus

Subject: Re: [RFC][1/11][MANUX] Kernel compatibility : ext2

On 16/04/2014 00:27, Theodore Ts'o wrote:
> On Tue, Apr 15, 2014 at 11:47:56PM +0200, Emmanuel Colbus wrote:
>> My OS heavily uses chroots for security purposes (these are not true
>> Linux-like chroots, but this isn't relevant). One of the issues of
>> chroots is that one can escape from them, by simply having one process
>> open a fd towards a directory, another one move the directory inside a
>> second directory located outside of the first process' chroot, and then
>> have the first process perform enough fchdir(fd); chdir(".."); or something of
>> the like.
>
> If there is a process outside of the chroot which is
> cooperating to help someone break out of the chroot, that means you
> have a bad actor who is outside of the chroot already. So why bother
> worrying about this case?

Well, in my OS, every process is chrooted. Thus, in this case, the bad
actor is trying to help another bad actor leave its chroot to access the
whole machine; which is what I'm trying to prevent.

>
> The more interesting way to break out of a chroot, which doesn't
> require a 2nd process to help you escape, is to chroot while inside a
> chroot:
>
> http://www.bpfh.net/simes/computing/chroot-break.html
>
> And if you care about this problem, Linux has a much more general
> solution using mount namespaces. FreeBSD has its own solution
> involving restrictions on chroot:
>
> http://www.freebsd.org/cgi/man.cgi?query=chroot&sektion=2&apropos=0&manpath=FreeBSD+4.0-RELEASE

I also have my solution. You see, although I talk about them as
"chroots", these constructions are substantially different from the
chroots mandated by POSIX; and in fact, I support neither POSIX's chroot
nor Linux's chroot() syscall. (Yes, I understand this is both a Linux and
a POSIX incompatibility, and I fully acknowledge it. And I'm deeply
sorry about this, but I believe I will not be able to change it, because
this call simply doesn't seem to be compatible with my security model.
Unless and until I find a way to put a POSIX chroot on top of my totally
and definitely unPOSIX ones, that is.)

So, my own syscall is named xchroot(), and it simply defeats this
attempt by, as a side effect, CLOSING all the file descriptors of the
caller that refer to directories, and setting the new current working
directory of the caller to the new root.

(Well, in fact, the caller can ask it to keep some of these descriptors,
but to do so, it has to indicate where it wants the corresponding
directories to appear in its new chroot. So this escape route is a dead end.)
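
The side effects of xchroot() described above can be sketched on a toy process model. Everything here is an assumption made for illustration (the Process and OpenFile classes, the keep mapping, the use of real paths as strings); it only mirrors the stated behaviour: directory descriptors are closed unless the caller names a place for them in /... , and the working directory becomes the new root.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OpenFile:
    path: str
    is_dir: bool


@dataclass
class Process:
    root: str
    cwd: str
    fds: Dict[int, OpenFile] = field(default_factory=dict)
    dotdotdot: Dict[str, str] = field(default_factory=dict)  # name in /... -> real path


def xchroot(proc: Process, new_root: str,
            keep: Optional[Dict[int, str]] = None) -> None:
    """Toy model: change root, closing directory fds the caller did not pin."""
    keep = keep or {}
    for fd, f in list(proc.fds.items()):
        if not f.is_dir:
            continue                            # regular files are left open
        if fd in keep:
            proc.dotdotdot[keep[fd]] = f.path   # kept, but visible at a /... name
        else:
            del proc.fds[fd]                    # closed as a side effect
    proc.root = new_root
    proc.cwd = new_root                         # cwd is reset to the new root
```

Because no directory descriptor survives without an explicit location inside the new root, the classic "keep a fd open, then chroot again" escape has nothing to hold on to in this model.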


~~~~~~~~~~~~~~~~~~~~~

To make things clearer, my goal was to create a Linux-compatible
operating system able to withstand userspace zero-day attacks.
*Withstand* being a key word here, as I did not intend to *prevent*
them: all I wanted was to prevent, as much as possible, an exploit in
an application from giving access to the rest of the computer. (As an
example, the recent Heartbleed exploit is something my design would
have done exactly *nothing* against.)

To do this, I've introduced several new elements :

- Directory hardlinks. I mentioned them before.
- Rootlinks. These are asymmetric links within the filesystem that
associate a regular file with a directory. They only exist within my
ext2l partitions. (Again, it's a re-use of the fragment address field.)
- A new virtual filesystem, named /... (slash-triple-point). Any process
can access it, even if it doesn't appear in its root, and its content
depends upon the identity of the process that looks into it (well, more
exactly, upon the identity of the xchroot it's running within).
- A new syscall, xchroot(), that allows the use of the aforementioned
creations (rootlinks and /... ). That is, thanks to it, a process can:
  - ask the kernel to chroot it in the directory that's targeted
  by the rootlink associated with the file this process is currently
  executing. (Of course, if there's no such thing, this call simply
  fails.)
  - ask the kernel to put new files within its /... filesystem. That is,
  it can tell the kernel "put the file ./my_file.txt in my /...
  filesystem, under the name /.../rocknroll/blues.dat." The kernel will
  comply, and the process will then be able to access the file
  through this new name;
  - and of course, ask it to remove some or all of the files within
  its own /... , and ask to be chrooted in a directory identified by
  its name.
- Little wrappers, called launchers, that are used to launch all the
programs outside one's xchroot.
- A new packaging system, that allows the use of directory hardlinks and
the construction of these launchers using a reasonably secure format
(because I also wanted to fortify the system against deliberately
hostile packages).
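
One way to picture the /... filesystem from the list above is as a per-xchroot lookup table consulted before normal path resolution. The sketch below is a simplification under stated assumptions: paths are absolute strings, the table maps full /... names to real paths, listing /... itself is not modelled, and the function name and example paths are invented.

```python
from typing import Dict


def resolve(path: str, root: str, dotdotdot: Dict[str, str]) -> str:
    """Map a path seen by a process to a real path (toy model)."""
    if path == "/..." or path.startswith("/.../"):
        # /... is answered from the caller's own table, whatever its root is.
        if path not in dotdotdot:
            raise FileNotFoundError(path)
        return dotdotdot[path]
    # Everything else is looked up below the process' root directory.
    return root.rstrip("/") + path
```

Two processes sharing the same root but holding different tables would see different content under /... , which matches the "depends upon the identity of the xchroot" behaviour described above.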

Also, all the user's home directories are made so that, to their shell,
they look like they are within /... . My own $HOME variable, for
example, is /.../home/ecolbus .

Then, things happen this way. At installation, the packaging system
looks up which files and directories the new package needs,
cryptographically checks that it has been allowed to declare such
dependencies (well, actually, I haven't done this part yet... But that's
what it ought to do), and installs them using hardlinks towards the
actual files. Then, it looks for the file required to build the little
launcher, and feeds it to the launcher builder, which builds and
installs it. Then, the packaging system creates hardlinks towards the
launcher in the roots of all the programs that have a dependency on the
first one.

When another program (say, the user's shell) tries to call the
newly-installed one (let's call it /bin/cat), it looks up its $PATH as
usual, and finds it. Except it actually only found the launcher, not the
true program, so it's the launcher that gets called.

The launcher then parses its command line, and determines which of
its arguments are true files, and which ones are simply options (like
-n). It then calls xchroot(), asking the kernel to put the required
files, if any, in its /... partition, AND to remove all the others, AND
to change its root to the directory that's targeted by the rootlink
associated with the launcher's file.

The kernel dutifully performs its work, making the files available
within /... and changing its caller's root directory to the required
directory, which turns out to be the true cat's root directory. After
this, the launcher performs an execve() on cat's true binary.
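
The launcher's first step can be sketched as follows. The thread does not say how a launcher tells files from options, so the rule used here (anything not starting with "-" that passes an existence test is a file) is an assumption, as are the function names and the shape of the request dictionary.

```python
from typing import Callable, List, Tuple


def split_args(argv: List[str],
               is_file: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Separate option-like arguments from arguments naming real files."""
    options, files = [], []
    for arg in argv[1:]:
        if not arg.startswith("-") and is_file(arg):
            files.append(arg)
        else:
            options.append(arg)
    return options, files


def build_request(argv: List[str], is_file: Callable[[str], bool]) -> dict:
    """Describe what the launcher would ask of xchroot() (hypothetical shape)."""
    _, files = split_args(argv, is_file)
    return {
        "expose": files,              # files to make visible in /...
        "remove_others": True,        # drop everything else from /...
        "new_root": "rootlink-of-current-executable",
    }
```

After such a request succeeds, the launcher would call execve() on the real binary, as the paragraph above describes.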

The real cat then performs its operations as usual, opening the file within /...
as its command line requests, doing its work on it, and then exiting.

This way, the program has been able to perform its operation without
either the caller or the called processes ever having access to each
other's root. Also, even if there had been a zero-day exploit within the
called process (cat), the exploit would only have gained access to cat's
root directory and the file that contained it, which is completely useless.

(Of course, if the user had specified more than one file on the command
line, then it could have accessed the other ones. But I still consider
that this is a very serious mitigation compared to giving full access to
the user's entire homedir).

Also, as a nice side effect, even if the caller and the called had had
incompatible dependencies (say one depends on the glibc, the other on
the uclibc), this would have been unproblematic, and no process would
have noticed it.


Next, to answer the most obvious questions :

- How do you determine which file a process is executing?
I simply remember this information from the time of its last execve().

- Are the launchers statically compiled?
Of course they are, otherwise the library attack would be quite obvious :-).

- Do two processes that share the same root directory also share the
same /... files?
No, unless neither of them has called xchroot() successfully since they
were separated by fork().


- What about chroot() privileges?
Since using it can't lead to an escape, my xchroot() syscall is
completely unprivileged.


- What about directory-referring fd transfer using UNIX sockets?
This isn't handled yet, but when it is, these will either be
disallowed or implemented in such a way that they can't be used for an
escape.


- Why put the manipulation of /... and of the process' root
directory within the same syscall?

That's because of the use case of "hey, dear kernel, I have this file
descriptor that refers to a directory, and I would like to keep it after
my root has been changed. Could you please refrain from closing it, and
instead make it appear in my /... , as /.../what/ever ?"

Since the closure of the file descriptors that refer to directories has
to be a side effect of the syscall that changes the process' root, BUT
the request to make it appear within /... can only appear within a
syscall that alters /... , there had to be a single syscall that
performed both operations.

That being said, it is possible to use this syscall to only perform one
of these operations at a time.


- What if somebody tries to do mkdir("..."); within somebody else's root ?
Nice try, but I've decided to disallow the use of this name in my ext2l
partitions. I don't think this violates any standard, because in fact,
if I'm not mistaken, even the "." and ".." entries aren't specified, so
I don't see why adding "..." to the list of disallowed file names would
be an issue. (In fact, I've decided to mark as reserved *any* file name
that contains no character different from '.'; the other ones being
reserved for extensions. Not that I had any intent to use them, though;
I just thought it would be nice to have such reserved names.)
This might be a little Linux ABI breakage, though. Sorry.
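
The reserved-name rule stated above ("any file name that contains no character different from '.'") is a one-line predicate. The function name is invented for this sketch, and the empty string is excluded because it is not a valid file name in the first place.

```python
def is_reserved_name(name: str) -> bool:
    """True for ".", "..", "..." and any longer name made only of dots."""
    return len(name) > 0 and set(name) == {"."}
```

A mkdir("...") or creat("....") on an ext2l partition would be rejected by a check of this kind, while names such as ".profile" remain available.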


- Does this have a cost?

Oh, yes:
- the launchers slightly slow down the system. Not by much, since
they perform very few syscalls, but the cost is obviously higher than 0. It's
less than the glibc's initialization, though.
- more importantly, using this functionality requires a
unified package system, so that's so much lost in terms of adaptability.
- also, my OS can only use my ext2l partitions as root partition.
- far, far, far more importantly than all of this, this breaks POSIX.
And I know it, and take full responsibility for it. And it's not only
about the chroot() syscall.

The big, BIG issue happens when an application does something like:
"ls /bin/true".

Since ls has no dependency on /bin/true, there is no such file in ls'
root directory. Since files can only be transferred within the /...
filesystem, this operation *cannot* succeed.

Fortunately, my launchers have special code that allows them to try to
deal with this situation. What they do is :
- they choose a new name, different from all the existing ones, and
make the requested file appear in a directory that carries this new name
in /... ;
- and they *alter* their command line, so that it seems like they have
been asked to work on the new name.

The result is that the caller sees something like this :

$ ls /bin/true
/.../??/true
$
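
The rewriting step that produces this output can be sketched as below. How a launcher really picks the fresh directory name is not stated in the thread; the "x0", "x1", ... scheme, the function name and the returned mapping are assumptions made for this illustration only.

```python
import posixpath
from typing import Dict, Iterable, List, Tuple


def rewrite_args(argv: List[str],
                 existing_names: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Give every non-/... file argument an alias under a fresh /... directory."""
    used = set(existing_names)
    exposures: Dict[str, str] = {}     # alias inside /... -> original path
    new_argv = [argv[0]]
    counter = 0
    for arg in argv[1:]:
        if arg.startswith("-") or arg.startswith("/.../"):
            new_argv.append(arg)       # options and /... paths pass through
            continue
        while f"x{counter}" in used:   # pick a name not yet present in /...
            counter += 1
        name = f"x{counter}"
        used.add(name)
        alias = f"/.../{name}/{posixpath.basename(arg)}"
        exposures[alias] = arg
        new_argv.append(alias)
    return new_argv, exposures
```

The called program then only ever sees the alias, which is why its output shows a /... path instead of the one the user typed.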

This is a clear-cut POSIX violation, and as I said, I take full
responsibility for it. That's because I consider that security is about
tradeoffs, and that the loss of standardization and compatibility that
this implies is more than offset by the gains in terms of security that
this architecture provides. That's my opinion, and I'm sticking to it.

All I could do was to put everybody's homedir within /... , so that this
use case remains marginal - limited to operations performed by an
application on a file upon which it has a dependency and manual commands
by the user.

~~~~~~~~~~~~~~~~~~~


>
>
>> Alright, then. Here's what I plan to do :
>> - In the short term, I'm going to continue with what I'm currently doing
>> with ext2 filesystems, but warn my users against mounting such a
>> filesystem in read-write mode if they're also mounting it with ext4 and
>> exporting it with NFS;
>
> The main issue is: what is the goal of creating your own OS? If
> it's for your own education, that's great. Have fun, it's a great way
> to learn. If you're going to actually try to market this to other
> users, you should make sure you understand how much effort it takes to
> support a new file system, let alone a new operating system. Hurd
> tried to go down a path somewhat like yours, and it's taken them
> years, and the result from a performance point of view is still pretty
> bad. Keep in mind that ext2 has many limitations, including crash
> recovery, performance, and scalability.

Well, when I started doing this OS, this was exactly my goal: self
education. I have to say it worked well. And then, I also decided to try
my ideas about strong security, so I tried them. I had many failures,
and it took me a very long time to find a workable concept,
but in the end, I got the whole thing working and self-hosted. And then
I noticed that, much to my surprise, I had actually succeeded in
getting a minimal working OS with these security features, so I decided to
publish it.

>
> If you are planning on creating a production quality OS using ext2 as
> its base, it does seem a little naive, though.

Hmmm... No, I understand journaling will eventually be needed. But
then, until this happens, I have time...

Emmanuel