Hi folks!
The Linux Plumbers conference is next week! The filesystems mini
conference is next Tuesday, 21 September, starting at 14:00 UTC:
https://linuxplumbersconf.org/event/11/sessions/111/#20210921
(it's the light green column second from the right)
As most of you are probably aware, LSFMM has been cancelled for a second
year in a row, which leaves LPC as the only developer focused gathering
this year. This year's conference is virtual, like last year. If you'd
like to participate, it's not too late to register ($50 USD):
https://linuxplumbersconf.org/event/11/page/112-attend
---
The first session will be run by Matthew Wilcox, who will say something
about the ongoing folio work, assuming Linus isn't going to pull it for
5.15. We'll see where things are in a week.
Christian Brauner will run the second session, discussing what idmapped
filesystem mounts are for and the current status of supporting more
filesystems.
---
Next up will be a session run by me about both of the atomic file write
features that have been proposed over the years and requested by various
enterprisey users.  The first proposal refers to direct atomic file
writes to storage hardware. I /think/ this mostly involves enabling the
relevant plumbing in the block layer and then wiring up iomap/xfs to use
it. Possibly also a new pwritev2 flag or file mode or something.
(Christoph did this in 2019: https://lwn.net/Articles/789600/ )
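For the hardware path, the user-visible knob would presumably be one more
pwritev2() flag.  A minimal sketch, where RWF_ATOMIC is hypothetical (the
2019 proposal hadn't landed) and the real RWF_SYNC flag stands in to show
where it would go:

```c
/*
 * Sketch of an atomic write submission via pwritev2().  RWF_ATOMIC is
 * hypothetical here; RWF_SYNC is a real flag and stands in to show
 * where the new flag would be OR'ed into the per-call flags.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t write_block(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	/* A hypothetical RWF_ATOMIC would be OR'ed into the flags here. */
	return pwritev2(fd, &iov, 1, off, RWF_SYNC);
}
```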
The /other/ atomic file write feature, of course, is my longstanding RFC
to add a VFS call that enables atomic swapping of file contents, which
enables a program to reflink a file's contents to an O_TMPFILE file,
write some changes to the tempfile, and then swap /all/ the changed
blocks back to the original file. This call would be durable even if
the system goes down. The feature is actually the brainchild of the
online filesystem repair effort, but I figured it wasn't so hard to
extend a tendril to userspace to make it more generally useful.
https://lwn.net/Articles/818977/
https://lore.kernel.org/linux-fsdevel/161723932606.3149451.12366114306150243052.stgit@magnolia/
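A rough userspace sketch of that flow: the reflink step uses the real
FICLONE ioctl, while the final swap is the proposed VFS call from the RFC
and therefore appears only as a comment:

```c
/*
 * Sketch of the clone-modify-swap flow: reflink the file into an
 * O_TMPFILE, write changes there, then (in the proposal) atomically
 * exchange the changed blocks back.  FICLONE is real; the exchange
 * call is the unmerged RFC, so it is only a comment here.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns an fd for a tempfile sharing src_fd's blocks, or -1. */
int stage_update(const char *dir, int src_fd)
{
	int tmp_fd = open(dir, O_TMPFILE | O_RDWR, 0600);

	if (tmp_fd < 0)
		return -1;

	/* Reflink: share all of src_fd's blocks with the tempfile. */
	if (ioctl(tmp_fd, FICLONE, src_fd) < 0) {
		close(tmp_fd);
		return -1;
	}

	/*
	 * ... write the changes into tmp_fd here, then the proposed VFS
	 * call would swap the changed blocks back into the original,
	 * e.g. ioctl(src_fd, FIEXCHANGE_RANGE, &args) in the RFC.
	 */
	return tmp_fd;
}
```

Whether O_TMPFILE and FICLONE succeed depends on the filesystem backing
`dir`; only reflink-capable filesystems (xfs with reflink, btrfs) can
share the blocks.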
---
Allison will be running the fourth session about our current progress
towards enabling sysadmins to shrink filesystems, and what pieces we're
going to need to clear space from the end of the filesystem and then
reduce the size. FWIW Dave has been working on an inode relocation
("reno") tool, and I've been working on a free space defragmenter that
kind of barely works. Originally this was an XFS-focused discussion, but
it seems that Andreas still remembers the last time someone tried to add
it to ext4.
---
The fifth session discusses the proliferation of cloud storage technologies and
the new failure modes that Ted and I have observed with these devices.
I think Ted had a few things to say about cloud devices, and due to the
repeated requests I think it would be worth polling the audience to find
out if they'd like filesystems to be more resilient (when possible) in
the face of not-totally-reliable storage. Obviously everyone wants that
as a broad goal, but what should we pitch? Metadata IO retries?
Restoring lost information from a mirror? Online repair?
---
The final session will be a presentation about the XFS roadmap for 2022.
I'll start with a recap of the new features from last year's LTS that
have been maturing this year, and which pieces have landed for this
year's LTS kernel.
I /hope/ that this will attract a conversation between (x)fs developers
and real application developers about what features could be coming down
the pipeline and which features they would be most interested in.
---
To all the XFS developers: it has been a very long time since I've seen
all your faces! I would love to have a developer BOF of some kind to
see you all again, and to introduce Catherine Hoang (our newest
addition) to the group.
If nobody else shows up to the roadmap we could do it there, but I'd
like to have /some/ kind of venue for everyone who doesn't find the
timeslots convenient (i.e. Dave and Chandan). This doesn't have to take
a long time -- even a 15 minute meet and greet to help everyone
(re)associate names with faces would go a long way towards feeling
normal(ish) again. ;)
--D
On 16 Sep 2021 at 07:09, Darrick J. Wong wrote:
> Hi folks!
>
> The Linux Plumbers conference is next week! The filesystems mini
> conference is next Tuesday, 21 September, starting at 14:00 UTC:
>
<snip>
>
> To all the XFS developers: it has been a very long time since I've seen
> all your faces! I would love to have a developer BOF of some kind to
> see you all again, and to introduce Catherine Hoang (our newest
> addition) to the group.
>
> If nobody else shows up to the roadmap we could do it there, but I'd
> like to have /some/ kind of venue for everyone who don't find the
> timeslots convenient (i.e. Dave and Chandan). This doesn't have to take
> a long time -- even a 15 minute meet and greet to help everyone
> (re)associate names with faces would go a long way towards feeling
> normal(ish) again. ;)
14:00 UTC maps to 19:30 for me. I am fine with this time slot.
--
chandan
Hi!
We did a small update to the schedule:
> Christian Brauner will run the second session, discussing what idmapped
> filesystem mounts are for and the current status of supporting more
> filesystems.
We have extended this session as we'd like to discuss and get some feedback
from users about project quotas and project ids:
Project quotas were originally mostly a collaborative feature and later
got used by some container runtimes to limit the space used on a
filesystem shared by multiple containers. As a result, the current
semantics of project quotas are somewhat surprising, and the handling of
project ids is not consistent among filesystems. The main two contending
points are:
1) Currently the inode owner can set the project id of the inode to any
arbitrary number if they are in init_user_ns. They cannot change the
project id at all in other user namespaces.
2) Should project IDs be mapped in user namespaces or not? The user
namespace code does implement the mapping, and the VFS quota code maps
project ids when using them. However, e.g. XFS does not map project IDs
in the calls that set them on the inode. Among other things this results
in some funny errors if you set the project ID to (unsigned)-1.
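For reference, the userspace path in question is the FS_IOC_FSGETXATTR /
FS_IOC_FSSETXATTR ioctl pair; whether the kernel translates fsx_projid
through the caller's user namespace on this path is exactly the
inconsistency above. A minimal sketch:

```c
/*
 * How userspace changes a project id today: read the fsxattr, change
 * fsx_projid, write it back.  Whether fsx_projid gets mapped through
 * the caller's user namespace here is the inconsistency under
 * discussion.
 */
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

int set_projid(int fd, unsigned int projid)
{
	struct fsxattr fa;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fa) < 0)
		return -1;
	/* (unsigned)-1 is the reserved "invalid" id the thread mentions. */
	fa.fsx_projid = projid;
	return ioctl(fd, FS_IOC_FSSETXATTR, &fa);
}
```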
In the session we'd like to get feedback on how project quotas and
project ids are used or could be used, so that we can define common
semantics and make the code consistently follow these rules.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Let me also post Amir's thoughts on this from a private thread:
On Fri 17-09-21 10:30:43, Jan Kara wrote:
> We did a small update to the schedule:
>
> > Christian Brauner will run the second session, discussing what idmapped
> > filesystem mounts are for and the current status of supporting more
> > filesystems.
>
> We have extended this session as we'd like to discuss and get some feedback
> from users about project quotas and project ids:
>
> Project quotas were originally mostly a collaborative feature and later got
> used by some container runtimes to implement limitation of used space on a
> filesystem shared by multiple containers. As a result current semantics of
> project quotas are somewhat surprising and handling of project ids is not
> consistent among filesystems. The main two contending points are:
>
> 1) Currently the inode owner can set project id of the inode to any
> arbitrary number if he is in init_user_ns. It cannot change project id at
> all in other user namespaces.
>
> 2) Should project IDs be mapped in user namespaces or not? User namespace
> code does implement the mapping, VFS quota code maps project ids when using
> them. However e.g. XFS does not map project IDs in its calls setting them
> in the inode. Among other things this results in some funny errors if you
> set project ID to (unsigned)-1.
>
> In the session we'd like to get feedback how project quotas / ids get used
> / could be used so that we can define the common semantics and make the
> code consistently follow these rules.
I think that legacy projid semantics might not be a perfect fit for
container isolation requirements. I added project quota support to docker
at the time because it was handy and it did the job of limiting and
querying disk usage of containers with an overlayfs storage driver.
With the btrfs storage driver, subvolumes are used to create that isolation.
The TREE_ID proposal [1] got me thinking that it is not so hard to
implement "tree id" as an extension of, or in addition to, project id.
The semantics of "tree id" would be:
1. tree id is a quota entity accounting inodes and blocks
2. tree id can be changed only on an empty directory
3. tree id can be set to TID only if quota inode usage of TID is 0
4. tree id is always inherited from parent
5. No rename() or link() across tree id (clone should be possible)
AFAIK btrfs subvol meets all the requirements of "tree id".
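The five rules above are simple enough to state as predicates. A toy
sketch with made-up helper names (no on-disk format or VFS wiring
implied):

```c
/*
 * Toy predicates for the "tree id" rules above.  Helper names and
 * types are hypothetical; no on-disk format or VFS wiring is implied.
 */
#include <stdbool.h>
#include <stdint.h>

/* Rule 5: rename()/link() may not cross a tree id boundary. */
bool tree_cross_op_allowed(uint32_t src_tid, uint32_t dst_tid)
{
	return src_tid == dst_tid;
}

/*
 * Rules 2+3: a directory's tree id may change only while the directory
 * is empty, and only to a TID whose quota inode usage is currently 0.
 */
bool tree_set_allowed(bool dir_is_empty, uint64_t tid_inode_usage)
{
	return dir_is_empty && tid_inode_usage == 0;
}
```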
Implementing tree id in ext4/xfs could be done by adding a new field to
inode on-disk format and a new quota entity to quota on-disk format and
quotatools.
An alternative simpler way is to repurpose project id and project quota:
* Add filesystem feature projid-is-treeid
* The feature can be enabled on fresh mkfs or after fsck verifies "tree id"
rules are followed for all usage of projid
* Once the feature is enabled, filesystem enforces the new semantics
about setting projid and projid_inherit
This might be a good option if there is little intersection between
systems that need to use the old project semantics and systems
that would rather have the tree id semantics.
I think that with the "tree id" semantics, the user_ns/idmapped
questions become easier to answer.
Allocating tree id ranges per userns to avoid exhausting the tree id
namespace is a very similar problem to allocating uids per userns.
Thanks,
Amir.
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 17-09-21 10:36:08, Jan Kara wrote:
> Let me also post Amir's thoughts on this from a private thread:
And now I'm actually replying to Amir :-p
> On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > We did a small update to the schedule:
> >
> > > Christian Brauner will run the second session, discussing what idmapped
> > > filesystem mounts are for and the current status of supporting more
> > > filesystems.
> >
> > We have extended this session as we'd like to discuss and get some feedback
> > from users about project quotas and project ids:
> >
> > Project quotas were originally mostly a collaborative feature and later got
> > used by some container runtimes to implement limitation of used space on a
> > filesystem shared by multiple containers. As a result current semantics of
> > project quotas are somewhat surprising and handling of project ids is not
> > consistent among filesystems. The main two contending points are:
> >
> > 1) Currently the inode owner can set project id of the inode to any
> > arbitrary number if he is in init_user_ns. It cannot change project id at
> > all in other user namespaces.
> >
> > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > code does implement the mapping, VFS quota code maps project ids when using
> > them. However e.g. XFS does not map project IDs in its calls setting them
> > in the inode. Among other things this results in some funny errors if you
> > set project ID to (unsigned)-1.
> >
> > In the session we'd like to get feedback how project quotas / ids get used
> > / could be used so that we can define the common semantics and make the
> > code consistently follow these rules.
>
> I think that legacy projid semantics might not be a perfect fit for
> container isolation requirements. I added project quota support to docker
> at the time because it was handy and it did the job of limiting and
> querying disk usage of containers with an overlayfs storage driver.
>
> With btrfs storage driver, subvolumes are used to create that isolation.
> The TREE_ID proposal [1] got me thinking that it is not so hard to
> implement "tree id" as an extention or in addition to project id.
>
> The semantics of "tree id" would be:
> 1. tree id is a quota entity accounting inodes and blocks
> 2. tree id can be changed only on an empty directory
> 3. tree id can be set to TID only if quota inode usage of TID is 0
> 4. tree id is always inherited from parent
> 5. No rename() or link() across tree id (clone should be possible)
>
> AFAIK btrfs subvol meets all the requirements of "tree id".
>
> Implementing tree id in ext4/xfs could be done by adding a new field to
> inode on-disk format and a new quota entity to quota on-disk format and
> quotatools.
>
> An alternative simpler way is to repurpose project id and project quota:
> * Add filesystem feature projid-is-treeid
> * The feature can be enabled on fresh mkfs or after fsck verifies "tree id"
> rules are followed for all usage of projid
> * Once the feature is enabled, filesystem enforces the new semantics
> about setting projid and projid_inherit
>
> This might be a good option if there is little intersection between
> systems that need to use the old project semantics and systems
> that would rather have the tree id semantics.
Yes, I actually think that having both tree-id and project-id on a
filesystem would be too confusing. And I'm not aware of realistic usecases.
I've heard only of people wanting the current semantics (although these
were more of the kind: "sometime in the past people used the feature like
this") and people complaining that the current semantics are not useful
for them. This was discussed e.g. on the ext4 list [2].
> I think that with the "tree id" semantics, the user_ns/idmapped
> questions become easier to answer.
> Allocating tree id ranges per userns to avoid exhausting the tree id
> namespace is a very similar problem to allocating uids per userns.
It still depends on how exactly tree ids get used - if you want to use them to
limit space usage of a container, you still have to forbid changing of tree
ids inside the container, don't you?
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
[2] https://lore.kernel.org/linux-ext4/[email protected]
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Sep 17, 2021 at 12:38 PM Jan Kara <[email protected]> wrote:
>
> On Fri 17-09-21 10:36:08, Jan Kara wrote:
> > Let me also post Amir's thoughts on this from a private thread:
>
> And now I'm actually replying to Amir :-p
>
> > On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > > We did a small update to the schedule:
> > >
> > > > Christian Brauner will run the second session, discussing what idmapped
> > > > filesystem mounts are for and the current status of supporting more
> > > > filesystems.
> > >
> > > We have extended this session as we'd like to discuss and get some feedback
> > > from users about project quotas and project ids:
> > >
> > > Project quotas were originally mostly a collaborative feature and later got
> > > used by some container runtimes to implement limitation of used space on a
> > > filesystem shared by multiple containers. As a result current semantics of
> > > project quotas are somewhat surprising and handling of project ids is not
> > > consistent among filesystems. The main two contending points are:
> > >
> > > 1) Currently the inode owner can set project id of the inode to any
> > > arbitrary number if he is in init_user_ns. It cannot change project id at
> > > all in other user namespaces.
> > >
> > > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > > code does implement the mapping, VFS quota code maps project ids when using
> > > them. However e.g. XFS does not map project IDs in its calls setting them
> > > in the inode. Among other things this results in some funny errors if you
> > > set project ID to (unsigned)-1.
> > >
> > > In the session we'd like to get feedback how project quotas / ids get used
> > > / could be used so that we can define the common semantics and make the
> > > code consistently follow these rules.
> >
> > I think that legacy projid semantics might not be a perfect fit for
> > container isolation requirements. I added project quota support to docker
> > at the time because it was handy and it did the job of limiting and
> > querying disk usage of containers with an overlayfs storage driver.
> >
> > With btrfs storage driver, subvolumes are used to create that isolation.
> > The TREE_ID proposal [1] got me thinking that it is not so hard to
> > implement "tree id" as an extention or in addition to project id.
> >
> > The semantics of "tree id" would be:
> > 1. tree id is a quota entity accounting inodes and blocks
> > 2. tree id can be changed only on an empty directory
> > 3. tree id can be set to TID only if quota inode usage of TID is 0
> > 4. tree id is always inherited from parent
> > 5. No rename() or link() across tree id (clone should be possible)
> >
> > AFAIK btrfs subvol meets all the requirements of "tree id".
> >
> > Implementing tree id in ext4/xfs could be done by adding a new field to
> > inode on-disk format and a new quota entity to quota on-disk format and
> > quotatools.
> >
> > An alternative simpler way is to repurpose project id and project quota:
> > * Add filesystem feature projid-is-treeid
> > * The feature can be enabled on fresh mkfs or after fsck verifies "tree id"
> > rules are followed for all usage of projid
> > * Once the feature is enabled, filesystem enforces the new semantics
> > about setting projid and projid_inherit
> >
> > This might be a good option if there is little intersection between
> > systems that need to use the old project semantics and systems
> > that would rather have the tree id semantics.
>
> Yes, I actually think that having both tree-id and project-id on a
> filesystem would be too confusing. And I'm not aware of realistic usecases.
> I've heard only of people wanting current semantics (although these we more
> of the kind: "sometime in the past people used the feature like this") and
> the people complaining current semantics is not useful for them. This was
> discussed e.g. in ext4 list [2].
>
> > I think that with the "tree id" semantics, the user_ns/idmapped
> > questions become easier to answer.
> > Allocating tree id ranges per userns to avoid exhausting the tree id
> > namespace is a very similar problem to allocating uids per userns.
>
> It still depends how exactly tree ids get used - if you want to use them to
> limit space usage of a container, you still have to forbid changing of tree
> ids inside the container, don't you?
>
Yes.
This is where my view of userns becomes hazy (so pulling Christian into
the discussion), but in general I think that this use case would be similar
to the concept of single uid container - the range of allowed tree ids that
is allocated for the container in that case is a single tree id.
I understand that the next question would be about nesting subtree quotas
and I don't have a good answer to that question.
Are btrfs subvolumes nested w.r.t. capacity limits? I don't think that they are.
Thanks,
Amir.
On Fri, Sep 17, 2021 at 01:23:08PM +0300, Amir Goldstein wrote:
> On Fri, Sep 17, 2021 at 12:38 PM Jan Kara <[email protected]> wrote:
> >
> > On Fri 17-09-21 10:36:08, Jan Kara wrote:
> > > Let me also post Amir's thoughts on this from a private thread:
> >
> > And now I'm actually replying to Amir :-p
> >
> > > On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > > > We did a small update to the schedule:
> > > >
> > > > > Christian Brauner will run the second session, discussing what idmapped
> > > > > filesystem mounts are for and the current status of supporting more
> > > > > filesystems.
> > > >
> > > > We have extended this session as we'd like to discuss and get some feedback
> > > > from users about project quotas and project ids:
> > > >
> > > > Project quotas were originally mostly a collaborative feature and later got
> > > > used by some container runtimes to implement limitation of used space on a
> > > > filesystem shared by multiple containers. As a result current semantics of
> > > > project quotas are somewhat surprising and handling of project ids is not
> > > > consistent among filesystems. The main two contending points are:
> > > >
> > > > 1) Currently the inode owner can set project id of the inode to any
> > > > arbitrary number if he is in init_user_ns. It cannot change project id at
> > > > all in other user namespaces.
> > > >
> > > > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > > > code does implement the mapping, VFS quota code maps project ids when using
> > > > them. However e.g. XFS does not map project IDs in its calls setting them
> > > > in the inode. Among other things this results in some funny errors if you
> > > > set project ID to (unsigned)-1.
> > > >
> > > > In the session we'd like to get feedback how project quotas / ids get used
> > > > / could be used so that we can define the common semantics and make the
> > > > code consistently follow these rules.
> > >
> > > I think that legacy projid semantics might not be a perfect fit for
> > > container isolation requirements. I added project quota support to docker
> > > at the time because it was handy and it did the job of limiting and
> > > querying disk usage of containers with an overlayfs storage driver.
> > >
> > > With btrfs storage driver, subvolumes are used to create that isolation.
> > > The TREE_ID proposal [1] got me thinking that it is not so hard to
> > > implement "tree id" as an extention or in addition to project id.
> > >
> > > The semantics of "tree id" would be:
> > > 1. tree id is a quota entity accounting inodes and blocks
> > > 2. tree id can be changed only on an empty directory
> > > 3. tree id can be set to TID only if quota inode usage of TID is 0
> > > 4. tree id is always inherited from parent
> > > 5. No rename() or link() across tree id (clone should be possible)
> > >
> > > AFAIK btrfs subvol meets all the requirements of "tree id".
> > >
> > > Implementing tree id in ext4/xfs could be done by adding a new field to
> > > inode on-disk format and a new quota entity to quota on-disk format and
> > > quotatools.
> > >
> > > An alternative simpler way is to repurpose project id and project quota:
> > > * Add filesystem feature projid-is-treeid
> > > * The feature can be enabled on fresh mkfs or after fsck verifies "tree id"
> > > rules are followed for all usage of projid
> > > * Once the feature is enabled, filesystem enforces the new semantics
> > > about setting projid and projid_inherit
I'd probably just repurpose the project quota mechanism, which means
that the xfs treeid is really just project quotas with somewhat
different behavior rules that are tailored to modern adversarial usage
models. ;)
IIRC someone asked for some sort of change like this on the xfs list
some years back. If memory serves, they wanted to prevent non-admin
userspace from changing project ids, even in the regular user ns? It
never got as far as a formal proposal though.
I could definitely see a use case for letting admin processes in a
container change project ids among only the projids that are idmapped
into the namespace.
> > >
> > > This might be a good option if there is little intersection between
> > > systems that need to use the old project semantics and systems
> > > that would rather have the tree id semantics.
> >
> > Yes, I actually think that having both tree-id and project-id on a
> > filesystem would be too confusing. And I'm not aware of realistic usecases.
> > I've heard only of people wanting current semantics (although these we more
> > of the kind: "sometime in the past people used the feature like this") and
> > the people complaining current semantics is not useful for them. This was
> > discussed e.g. in ext4 list [2].
> >
> > > I think that with the "tree id" semantics, the user_ns/idmapped
> > > questions become easier to answer.
> > > Allocating tree id ranges per userns to avoid exhausting the tree id
> > > namespace is a very similar problem to allocating uids per userns.
> >
> > It still depends how exactly tree ids get used - if you want to use them to
> > limit space usage of a container, you still have to forbid changing of tree
> > ids inside the container, don't you?
> >
>
> Yes.
> This is where my view of userns becomes hazy (so pulling Christain into
> the discussion), but in general I think that this use case would be similar
> to the concept of single uid container - the range of allowed tree ids that
> is allocated for the container in that case is a single tree id.
>
> I understand that the next question would be about nesting subtree quotas
> and I don't have a good answer to that question.
>
> Are btrfs subvolume nested w.r.t. capacity limit? I don't think that they are.
One thing that someone on #btrfs pointed out to me -- unlike ext4 and
xfs project quotas, where the statvfs output reflects the project quota
limits, btrfs qgroups don't do that. Software that tries to trim its
preallocations when "space" gets low (e.g. journald) then fails to
react, and /var/log can fill up.
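For context, the check such software performs is just statvfs(). A
minimal sketch of a journald-style free-space probe (helper name is made
up):

```c
/*
 * What a journald-style daemon consults before trimming its
 * preallocations: statvfs().  ext4/xfs clamp these fields to the
 * project quota limits; btrfs qgroups reportedly do not, so the check
 * never fires even as the qgroup fills up.
 */
#include <sys/statvfs.h>

/* Fraction of space still available to unprivileged users, or -1.0. */
double free_fraction(const char *path)
{
	struct statvfs sv;

	if (statvfs(path, &sv) < 0 || sv.f_blocks == 0)
		return -1.0;
	return (double)sv.f_bavail / (double)sv.f_blocks;
}
```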
Granted, it's btrfs quotas, which they say still aren't production
ready, and I have no idea, so ... <shrug>
--D
>
> Thanks,
> Amir.
On Thu, Sep 16, 2021 at 05:38:38PM +0530, Chandan Babu R wrote:
> On 16 Sep 2021 at 07:09, Darrick J. Wong wrote:
> > Hi folks!
> >
> > The Linux Plumbers conference is next week! The filesystems mini
> > conference is next Tuesday, 21 September, starting at 14:00 UTC:
> >
>
> <snip>
>
> >
> > To all the XFS developers: it has been a very long time since I've seen
> > all your faces! I would love to have a developer BOF of some kind to
> > see you all again, and to introduce Catherine Hoang (our newest
> > addition) to the group.
> >
> > If nobody else shows up to the roadmap we could do it there, but I'd
> > like to have /some/ kind of venue for everyone who don't find the
> > timeslots convenient (i.e. Dave and Chandan). This doesn't have to take
> > a long time -- even a 15 minute meet and greet to help everyone
> > (re)associate names with faces would go a long way towards feeling
> > normal(ish) again. ;)
>
> 14:00 UTC maps to 19:30 for me. I am fine with this time slot.
It maps to 12-4am for me. Not really practical :/
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Sep 17, 2021 at 09:12:17AM -0700, Darrick J. Wong wrote:
> On Fri, Sep 17, 2021 at 01:23:08PM +0300, Amir Goldstein wrote:
> > On Fri, Sep 17, 2021 at 12:38 PM Jan Kara <[email protected]> wrote:
> > >
> > > On Fri 17-09-21 10:36:08, Jan Kara wrote:
> > > > Let me also post Amir's thoughts on this from a private thread:
> > >
> > > And now I'm actually replying to Amir :-p
> > >
> > > > On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > > > > We did a small update to the schedule:
> > > > >
> > > > > > Christian Brauner will run the second session, discussing what idmapped
> > > > > > filesystem mounts are for and the current status of supporting more
> > > > > > filesystems.
> > > > >
> > > > > We have extended this session as we'd like to discuss and get some feedback
> > > > > from users about project quotas and project ids:
> > > > >
> > > > > Project quotas were originally mostly a collaborative feature and later got
> > > > > used by some container runtimes to implement limitation of used space on a
> > > > > filesystem shared by multiple containers. As a result current semantics of
> > > > > project quotas are somewhat surprising and handling of project ids is not
> > > > > consistent among filesystems. The main two contending points are:
> > > > >
> > > > > 1) Currently the inode owner can set project id of the inode to any
> > > > > arbitrary number if he is in init_user_ns. It cannot change project id at
> > > > > all in other user namespaces.
> > > > >
> > > > > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > > > > code does implement the mapping, VFS quota code maps project ids when using
> > > > > them. However e.g. XFS does not map project IDs in its calls setting them
> > > > > in the inode. Among other things this results in some funny errors if you
> > > > > set project ID to (unsigned)-1.
> > > > >
> > > > > In the session we'd like to get feedback how project quotas / ids get used
> > > > > / could be used so that we can define the common semantics and make the
> > > > > code consistently follow these rules.
> > > >
> > > > I think that legacy projid semantics might not be a perfect fit for
> > > > container isolation requirements. I added project quota support to docker
> > > > at the time because it was handy and it did the job of limiting and
> > > > querying disk usage of containers with an overlayfs storage driver.
> > > >
> > > > With btrfs storage driver, subvolumes are used to create that isolation.
> > > > The TREE_ID proposal [1] got me thinking that it is not so hard to
> > > > implement "tree id" as an extention or in addition to project id.
> > > >
> > > > The semantics of "tree id" would be:
> > > > 1. tree id is a quota entity accounting inodes and blocks
> > > > 2. tree id can be changed only on an empty directory
Hmmm. So once it's created, it can't be changed without first
deleting all the data in the tree?
> > > > 3. tree id can be set to TID only if quota inode usage of TID is 0
What does this mean? Defining behaviour/semantics in terms of its
implementation is ambiguous and open to interpretation.
I *think* the intent here is that tree ids are unique and can only
be applied to a single tree, but...
And, of course, what happens if we have multiple filesystems? Tree
IDs are no longer globally unique across the system, right?
> > > > 4. tree id is always inherited from parent
What happens as we traverse mount points within a tree? If the quota
applies to directory trees, then there are going to be directory
tree constructs that don't obviously follow this behaviour. e.g.
bind mounts from one directory tree to another, both having
different tree IDs.
Which then makes me question: are inodes and inode flags the right
place to track and propagate these tree IDs? Isn't the tree ID as
defined here a property of the path structure rather than a property
of the inode? Should we actually be looking at a new directory tree
ID tracking behaviour at, say, the vfs-mount+dentry level rather
than the inode level?
> > > > 5. No rename() or link() across tree id (clone should be possible)
The current directory tree quotas disallow this because of
implementation difficulties (e.g. avoiding recursive chproj inside
the kernel as part of rename()) and so punt operations that are too
difficult to do in the kernel back to userspace. They are not
intended to implement container boundaries in any way, shape or
form. Container boundaries need to use a proper access control
mechanism, not rely on side effects of difficult-to-implement low
level accounting mechanisms to provide access restriction.
Indeed, why do we want to place restrictions on moving things across
trees if the filesystem can actually do so correctly? Hence I think
this is somewhat inappropriately commingling container access
restrictions with usage accounting....
I'm certain there will be filesystems that do disallow rename and
link to punt the problem back up to userspace, but that's an
implementation detail to ensure accounting for the data movement to
a different tree is correct and not a behavioural or access
restriction...
> > > > AFAIK btrfs subvol meets all the requirements of "tree id".
> > > >
> > > > Implementing tree id in ext4/xfs could be done by adding a new field to
> > > > inode on-disk format and a new quota entity to quota on-disk format and
> > > > quotatools.
> > > >
> > > > An alternative simpler way is to repurpose project id and project quota:
> > > > * Add filesystem feature projid-is-treeid
> > > > * The feature can be enabled on fresh mkfs or after fsck verifies "tree id"
> > > > rules are followed for all usage of projid
> > > > * Once the feature is enabled, filesystem enforces the new semantics
> > > > about setting projid and projid_inherit
>
> I'd probably just repurpose the project quota mechanism, which means
> that the xfs treeid is really just project quotas with somewhat
> different behavior rules that are tailored to modern adversarial usage
> models. ;)
Potentially, yes, though I'm not yet convinced a "tree quota" is
actually something we should track at an individual inode
level...
> IIRC someone asked for some sort of change like this on the xfs list
> some years back. If memory serves, they wanted to prevent non-admin
> userspace from changing project ids, even in the regular user ns? It
> never got as far as a formal proposal though.
>
> I could definitely see a use case for letting admin processes in a
> container change project ids among only the projids that are idmapped
> into the namespace.
Yup, all we need is a solid definition of how it will work, but
that's always been the point where silence has fallen on the
discussion.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, Sep 18, 2021 at 08:11:24AM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 05:38:38PM +0530, Chandan Babu R wrote:
> > On 16 Sep 2021 at 07:09, Darrick J. Wong wrote:
> > > Hi folks!
> > >
> > > The Linux Plumbers conference is next week! The filesystems mini
> > > conference is next Tuesday, 21 September, starting at 14:00 UTC:
> > >
> >
> > <snip>
> >
> > >
> > > To all the XFS developers: it has been a very long time since I've seen
> > > all your faces! I would love to have a developer BOF of some kind to
> > > see you all again, and to introduce Catherine Hoang (our newest
> > > addition) to the group.
> > >
> > > If nobody else shows up to the roadmap we could do it there, but I'd
> > > like to have /some/ kind of venue for everyone who doesn't find the
> > > timeslots convenient (i.e. Dave and Chandan). This doesn't have to take
> > > a long time -- even a 15 minute meet and greet to help everyone
> > > (re)associate names with faces would go a long way towards feeling
> > > normal(ish) again. ;)
> >
> > 14:00 UTC maps to 19:30 for me. I am fine with this time slot.
>
> It maps to 12-4am for me. Not really practical :/
FWIW I'll try to commandeer one of the LPC hack rooms late Tuesday
evening (say around 0200 UTC) if people are interested in XFS office
hours?
(They also don't need to be on LPC's BBB instance; I /can/ host largeish
meetings on Zoom courtesy of $employer...)
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
On Sat, Sep 18, 2021 at 2:15 AM Dave Chinner <[email protected]> wrote:
>
> On Fri, Sep 17, 2021 at 09:12:17AM -0700, Darrick J. Wong wrote:
> > On Fri, Sep 17, 2021 at 01:23:08PM +0300, Amir Goldstein wrote:
> > > On Fri, Sep 17, 2021 at 12:38 PM Jan Kara <[email protected]> wrote:
> > > >
> > > > On Fri 17-09-21 10:36:08, Jan Kara wrote:
> > > > > Let me also post Amir's thoughts on this from a private thread:
> > > >
> > > > And now I'm actually replying to Amir :-p
> > > >
> > > > > On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > > > > > We did a small update to the schedule:
> > > > > >
> > > > > > > Christian Brauner will run the second session, discussing what idmapped
> > > > > > > filesystem mounts are for and the current status of supporting more
> > > > > > > filesystems.
> > > > > >
> > > > > > We have extended this session as we'd like to discuss and get some feedback
> > > > > > from users about project quotas and project ids:
> > > > > >
> > > > > > Project quotas were originally mostly a collaborative feature and later got
> > > > > > used by some container runtimes to implement limitation of used space on a
> > > > > > filesystem shared by multiple containers. As a result current semantics of
> > > > > > project quotas are somewhat surprising and handling of project ids is not
> > > > > > consistent among filesystems. The main two contending points are:
> > > > > >
> > > > > > 1) Currently the inode owner can set project id of the inode to any
> > > > > > arbitrary number if he is in init_user_ns. It cannot change project id at
> > > > > > all in other user namespaces.
> > > > > >
> > > > > > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > > > > > code does implement the mapping, VFS quota code maps project ids when using
> > > > > > them. However e.g. XFS does not map project IDs in its calls setting them
> > > > > > in the inode. Among other things this results in some funny errors if you
> > > > > > set project ID to (unsigned)-1.
> > > > > >
> > > > > > In the session we'd like to get feedback how project quotas / ids get used
> > > > > > / could be used so that we can define the common semantics and make the
> > > > > > code consistently follow these rules.
> > > > >
> > > > > I think that legacy projid semantics might not be a perfect fit for
> > > > > container isolation requirements. I added project quota support to docker
> > > > > at the time because it was handy and it did the job of limiting and
> > > > > querying disk usage of containers with an overlayfs storage driver.
> > > > >
> > > > > With btrfs storage driver, subvolumes are used to create that isolation.
> > > > > The TREE_ID proposal [1] got me thinking that it is not so hard to
> > > > > implement "tree id" as an extension or in addition to project id.
> > > > >
> > > > > The semantics of "tree id" would be:
> > > > > 1. tree id is a quota entity accounting inodes and blocks
> > > > > 2. tree id can be changed only on an empty directory
>
> Hmmm. So once it's created, it can't be changed without first
> deleting all the data in the tree?
That is correct.
Similar to fscrypt_ioctl_set_policy().
>
> > > > > 3. tree id can be set to TID only if quota inode usage of TID is 0
>
> What does this mean? Defining behaviour/semantics in terms of its
> implementation is ambiguous and open to interpretation.
>
You are right. Let me give it a shot:
For the current use of project quotas in containers as a way to
limit disk usage of a container instance, container managers
assign a project id with project quota to some directory that is
going to be used as the root of a bind mount point inside the
container (e.g. /home).
With xfs, that means that a user inside the container gets the df
report on the bind mount based on xfs_qm_statvfs() which provides
the project quota usage, which is intended to reflect the container's
disk usage on the xfs filesystem that is shared among containers.
In practice, the container's disk usage may be contaminated by
usage of other unrelated subtrees or solo files that have been
assigned the same project id on purpose or by mistake.
The current permission model for changing project id does
not always align with how container users should be allowed
to manipulate their own reported disk usage and affect the
disk usage reported or allowed to other containers.
The proposed alternative semantics for project ids and quotas
will allow container managers better control over the disk usage
observed by container users when project quotas are used, by making
sure that project id represents a single fully connected subtree in
the context of a single filesystem.
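To make the rules concrete, here is a toy model of the proposed semantics
(plain Python, purely illustrative; the class and method names are invented
and bear no relation to any kernel implementation):

```python
class TreeQuotaFS:
    """Toy model of the proposed per-filesystem subtree ("tree id") rules."""

    def __init__(self):
        self.tree_id = {"/": 0}      # path -> tree id
        self.children = {"/": set()} # path -> child names
        self.inode_usage = {}        # tree id -> inode count (rule 1)

    def set_tree_id(self, path, tid):
        # Rule 2: only an empty directory may change its tree id.
        if self.children[path]:
            raise PermissionError("directory not empty")
        # Rule 3: a tree id may be claimed only while its inode usage
        # is 0, which guarantees a single root per tree id.
        if self.inode_usage.get(tid, 0) > 0:
            raise PermissionError("tree id already rooted elsewhere")
        old = self.tree_id[path]
        self.inode_usage[old] -= 1   # transfer this inode's accounting
        self.tree_id[path] = tid
        self.inode_usage[tid] = 1

    def create(self, parent, name):
        # Rule 4: a new inode always inherits the parent's tree id.
        path = parent.rstrip("/") + "/" + name
        tid = self.tree_id[parent]
        self.tree_id[path] = tid
        self.children.setdefault(path, set())
        self.children[parent].add(name)
        self.inode_usage[tid] = self.inode_usage.get(tid, 0) + 1
        return path

    def rename(self, src, dst_parent):
        # Rule 5: no rename() across tree id boundaries.
        if self.tree_id[src] != self.tree_id[dst_parent]:
            raise PermissionError("cross-tree rename not allowed")

fs = TreeQuotaFS()
fs.create("/", "containers")
c1 = fs.create("/containers", "c1")
fs.set_tree_id(c1, 42)               # empty dir becomes root of tree 42
f = fs.create(c1, "data.img")        # inherits tree id 42
```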
> I *think* the intent here is that tree ids are unique and can only
> be applied to a single tree, but...
>
You are correct. That is what it means.
Subtree members on a *single filesystem* are all descendants
of a single root.
The subtree root is a directory and it is the only member of the subtree
whose (sb) parent is not a member of the subtree.
> And, of course, what happens if we have multiple filesystems? tree
> IDs are no longer globally unique across the system, right?
>
No requirement for tree id to be unique across the system.
<st_dev, tree_id> should be unique across the system.
> > > > > 4. tree id is always inherited from parent
>
> What happens as we traverse mount points within a tree? If the quota
> applies to directory trees, then there are going to be directory
> tree constructs that don't obviously follow this behaviour. e.g.
> bind mounts from one directory tree to another, both having
> different tree IDs.
We want nothing to do with that.
Quotas and disk usage have always been a filesystem property.
I see no reason to extend them beyond a single filesystem boundary.
Just because Neil chose to use the term "tree" to represent those
entities, does not mean that they are related to mount trees.
If the term is confusing then we can use a different term.
I personally prefer the term "subtree" to represent those entities.
Mind you, the STATX_TREE_ID proposal and the project id proposal
that I derived from it are based on current btrfs subvol semantics.
One difference between xfs/ext4 subtree quota and btrfs subvol
is that subvol is also an isolated inode number namespace.
Another difference is that currently, btrfs subvol has a unique
st_dev, but Neil has proposed some designs to change that.
In the long term, userspace tools could gain an option
-xx --one-file-system-subtree (or something) and learn
how to stay within subtree boundaries.
>
> Which then makes me question: are inodes and inode flags the right
> place to track and propagate these tree IDs? Isn't the tree ID as
> defined here a property of the path structure rather than a property
> of the inode? Should we actually be looking at a new directory tree
> ID tracking behaviour at, say, the vfs-mount+dentry level rather
> than the inode level?
>
That was not my intention with this proposal.
> > > > > 5. No rename() or link() across tree id (clone should be possible)
>
> The current directory tree quotas disallow this because of
> implementation difficulties (e.g. avoiding recursive chproj inside
> the kernel as part of rename()) and so would punt operations that are
> too difficult to do in the kernel back to userspace. They are not
> intended to implement container boundaries in any way, shape or
> form. Container boundaries need to use a proper access control
> mechanism, not rely on side effects of difficult-to-implement low
> level accounting mechanisms to provide access restriction.
>
> Indeed, why do we want to place restrictions on moving things across
> trees if the filesystem can actually do so correctly? Hence I think
> this is somewhat inappropriately commingling container access
> restrictions with usage accounting....
>
> I'm certain there will be filesystems that do disallow rename and
> link to punt the problem back up to userspace, but that's an
> implementation detail to ensure accounting for the data movement to
> a different tree is correct and not a behavioural or access
> restriction...
>
I did not list the no-cross rename()/link() requirement because
of past project id behavior or because of filesystem implementation
challenges.
I listed it because of the single-fully-connected-subtree requirement.
rename()/link() across subtree boundaries would cause multiple roots
of the same subtree id.
Sorry, Dave, for not writing a mathematical definition of a
singly-connected-subtree-whatever, but by now,
I hope you all understand very well what this beast is.
It is really not that complicated (?).
The reason I started the discussion with implementation rules
is *because* they are so simple to understand, so I figured
everybody could understand what the result is, but I realize
that my initial proposal was missing the wider context.
Is anything not clear now?
Thanks,
Amir.
On Fri, 2021-09-17 at 16:50 -0700, Darrick J. Wong wrote:
> FWIW I'll try to commandeer one of the LPC hack rooms late Tuesday
> evening (say around 0200 UTC) if people are interested in XFS office
> hours?
The BBB Hack rooms of LPC will be available to all conference
participants at any hour. However, we contracted with a third party
for the live streaming, so that will only happen within conference
hours. We still have the streaming to YouTube infrastructure we used
last year, if you really want the hackroom streamed, but a member of
the LPC programme committee will need to be on hand to look after it
for you. Part of our contract this year was a live stream to China
(the Great Firewall blocks YouTube) but our old stream infrastructure
can't reach that endpoint.
We can also make available BoF rooms that are open to all comers (the
hack rooms are only open to registered conference attendees) since I
know people who can't attend for timezone reasons won't want to stump
up the attendee fee. Please email [email protected] to
arrange a room if you want to do this. Note that all rooms can be
recorded and the BBB recording made available to anyone (it's in BBB
format though, so can't be uploaded to YouTube).
Regards,
James
On Sat, Sep 18, 2021 at 10:44:02AM +0300, Amir Goldstein wrote:
> On Sat, Sep 18, 2021 at 2:15 AM Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Sep 17, 2021 at 09:12:17AM -0700, Darrick J. Wong wrote:
> > > On Fri, Sep 17, 2021 at 01:23:08PM +0300, Amir Goldstein wrote:
> > > > On Fri, Sep 17, 2021 at 12:38 PM Jan Kara <[email protected]> wrote:
> > > > >
> > > > > On Fri 17-09-21 10:36:08, Jan Kara wrote:
> > > > > > Let me also post Amir's thoughts on this from a private thread:
> > > > >
> > > > > And now I'm actually replying to Amir :-p
> > > > >
> > > > > > On Fri 17-09-21 10:30:43, Jan Kara wrote:
> > > > > > > We did a small update to the schedule:
> > > > > > >
> > > > > > > > Christian Brauner will run the second session, discussing what idmapped
> > > > > > > > filesystem mounts are for and the current status of supporting more
> > > > > > > > filesystems.
> > > > > > >
> > > > > > > We have extended this session as we'd like to discuss and get some feedback
> > > > > > > from users about project quotas and project ids:
> > > > > > >
> > > > > > > Project quotas were originally mostly a collaborative feature and later got
> > > > > > > used by some container runtimes to implement limitation of used space on a
> > > > > > > filesystem shared by multiple containers. As a result current semantics of
> > > > > > > project quotas are somewhat surprising and handling of project ids is not
> > > > > > > consistent among filesystems. The main two contending points are:
> > > > > > >
> > > > > > > 1) Currently the inode owner can set project id of the inode to any
> > > > > > > arbitrary number if he is in init_user_ns. It cannot change project id at
> > > > > > > all in other user namespaces.
> > > > > > >
> > > > > > > 2) Should project IDs be mapped in user namespaces or not? User namespace
> > > > > > > code does implement the mapping, VFS quota code maps project ids when using
> > > > > > > them. However e.g. XFS does not map project IDs in its calls setting them
> > > > > > > in the inode. Among other things this results in some funny errors if you
> > > > > > > set project ID to (unsigned)-1.
> > > > > > >
> > > > > > > In the session we'd like to get feedback how project quotas / ids get used
> > > > > > > / could be used so that we can define the common semantics and make the
> > > > > > > code consistently follow these rules.
> > > > > >
> > > > > > I think that legacy projid semantics might not be a perfect fit for
> > > > > > container isolation requirements. I added project quota support to docker
> > > > > > at the time because it was handy and it did the job of limiting and
> > > > > > querying disk usage of containers with an overlayfs storage driver.
> > > > > >
> > > > > > With btrfs storage driver, subvolumes are used to create that isolation.
> > > > > > The TREE_ID proposal [1] got me thinking that it is not so hard to
> > > > > > implement "tree id" as an extension or in addition to project id.
> > > > > >
> > > > > > The semantics of "tree id" would be:
> > > > > > 1. tree id is a quota entity accounting inodes and blocks
> > > > > > 2. tree id can be changed only on an empty directory
> >
> > Hmmm. So once it's created, it can't be changed without first
> > deleting all the data in the tree?
>
> That is correct.
> Similar to fscrypt_ioctl_set_policy().
>
> >
> > > > > > 3. tree id can be set to TID only if quota inode usage of TID is 0
> >
> > What does this mean? Defining behaviour/semantics in terms of its
> > implementation is ambiguous and open to interpretation.
> >
>
> You are right. Let me give it a shot:
>
> For the current use of project quotas in containers as a way to
> limit disk usage of a container instance, container managers
> assign a project id with project quota to some directory that is
> going to be used as the root of a bind mount point inside the
> container (e.g. /home).
>
> With xfs, that means that a user inside the container gets the df
> report on the bind mount based on xfs_qm_statvfs() which provides
> the project quota usage, which is intended to reflect the container's
> disk usage on the xfs filesystem that is shared among containers.
>
> In practice, the container's disk usage may be contaminated by
> usage of other unrelated subtrees or solo files that have been
> assigned the same project id on purpose or by mistake.
In practice, this doesn't happen because the admin infrastructure
assigns unique project IDs to each container subtree. We don't allow
users to screw around with project IDs when this accounting
mechanism is in use for precisely this reason.
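i.e. the management layer does something like the following (a toy Python
sketch, purely illustrative; the xfs_quota invocations in the comments are
what real tooling runs, while the class and names here are invented):

```python
import itertools

class ProjidAllocator:
    """Toy sketch of admin-side project id management: each container
    subtree gets a never-reused project id, so one container's usage
    cannot be contaminated by another's."""

    def __init__(self, start=1000):
        self._ids = itertools.count(start)  # monotonic, never reused
        self.assigned = {}                  # container name -> projid

    def setup_container(self, name):
        projid = next(self._ids)            # unique by construction
        self.assigned[name] = projid
        # Real tooling would then run something like:
        #   xfs_quota -x -c "project -s -p <dir> <projid>" <mnt>
        #   xfs_quota -x -c "limit -p bhard=10g <projid>" <mnt>
        return projid

alloc = ProjidAllocator()
a = alloc.setup_container("web")
b = alloc.setup_container("db")
assert a != b   # no two containers ever share a project id
```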
> The current permission model for changing project id does
> not always align with how container users should be allowed
> to manipulate their own reported disk usage and affect the
> disk usage reported or allowed to other containers.
What does this mean? Container users don't control how disk usage
within their container is reported. That is entirely controlled by
what the kernel returns from statfs() for a given path...
> The proposed alternative semantics for project ids and quotas
> will allow container managers better control over the disk usage
> observed by container users when project quotas are used, by making
> sure that project id represents a single fully connected subtree in
> the context of a single filesystem.
That's something that the admin interface *should* already be doing
with project quotas being used as directory quotas for container
accounting.
I don't see where the treeid quota proposal changes anything
material in the way containers and project quota based directory
quotas actually work together to provide and enforce admin defined
per-container space usage restrictions.
> > I *think* the intent here is that tree ids are unique and can only
> > be applied to a single tree, but...
> >
> You are correct. That is what it means.
> Subtree members on a *single filesystem* are all descendants
> of a single root.
> The subtree root is a directory and it is the only member of the subtree
> whose (sb) parent is not a member of the subtree.
>
> > And, of course, what happens if we have multiple filesystems? tree
> > IDs are no longer globally unique across the system, right?
>
> No requirement for tree id to be unique across the system.
> <st_dev, tree_id> should be unique across the system.
Haven't we learnt that lesson from btrfs and subvols yet? Please
don't use st_dev as part of a unique filesystem object identifier.
We have UUIDs in filesystems for that purpose.
> > > > > > 4. tree id is always inherited from parent
> >
> > What happens as we traverse mount points within a tree? If the quota
> > applies to directory trees, then there are going to be directory
> > tree constructs that don't obviously follow this behaviour. e.g.
> > bind mounts from one directory tree to another, both having
> > different tree IDs.
>
> We want nothing to do with that.
Yet we have to consider it, because it can introduce cycles
in the directed, uni-directional graph that this tree-id
proposal is based on.
> Quotas and disk usage have always been a filesystem property.
> I see no reason to extend them beyond a single filesystem boundary.
Paths built from directory trees are a VFS property, not a
filesystem property. Subtrees are a path-based mechanism, not an
inode based mechanism.
Indeed, bind mounts can be within a single filesystem. There's nothing
stopping users from binding part of one treeid's subtree into
another treeid's subtree all within the one filesystem. Is the
filesystem aware of this? Hell no - this is all done at the
vfs_mount level.
> Just because Neil chose to use the term "tree" to represent those
> entities, does not mean that they are related to mount trees.
> If the term is confusing then we can use a different term.
> I personally prefer the term "subtree" to represent those entities.
I'm not looking at a "tree". I'm looking at the directory hierarchy
the VFS presents to users and how that maps to individual inodes in
a filesystem.
If we want functionality that defines an exclusive path based
subtree, behaviour and properties need to be defined, enforced and
propagated at the vfsmount level, not at the filesystem inode level.
The persistent information needed for this construct might be stored
in an inode that is used as the root of that subtree, but other than
that the functionality needs to be defined and managed at the path
level...
> Mind you, the STATX_TREE_ID proposal and the project id proposal
> that I derived from it are based on current btrfs subvol semantics.
Which doesn't actually have anything to do with exclusive accounting
for directory trees. It has to do with identifying inodes on
subvolumes uniquely, not whether the _paths to any specific inode_
are unique and exclusive.
> One difference between xfs/ext4 subtree quota and btrfs subvol
> is that subvol is also an isolated inode number namespace.
> Another difference is that currently, btrfs subvol has a unique
> st_dev, but Neil has proposed some designs to change that.
It seems to me that this is conflating unrelated issues.
> > Which then makes me question: are inodes and inode flags the right
> > place to track and propagate these tree IDs? Isn't the tree ID as
> > defined here a property of the path structure rather than a property
> > of the inode? Should we actually be looking at a new directory tree
> > ID tracking behaviour at, say, the vfs-mount+dentry level rather
> > than the inode level?
>
> That was not my intention with this proposal.
Sure, but it's the basic problem with the tree ID proposal - tree id
based quotas require exclusive, single path access to inodes for
correct behaviour, and *filesystems* are not aware of VFS path
constructs.
> > > > > > 5. No rename() or link() across tree id (clone should be possible)
> >
> > The current directory tree quotas disallow this because of
> > implementation difficulties (e.g. avoiding recursive chproj inside
> > the kernel as part of rename()) and so would punt the operations too
> > difficult to do in the kernel back to userspace. They are not
> > intended to implement container boundaries in any way, shape or
> > form. Container boundaries need to use a proper access control
> > mechanism, not rely on side effects of difficult-to-implement low
> > level accounting mechanisms to provide access restriction.
> >
> > Indeed, why do we want to place restrictions on moving things across
> > trees if the filesystem can actually do so correctly? Hence I think
> > this is somewhat inappropriately commingling container access
> > restrictions with usage accounting....
> >
> > I'm certain there will be filesystems that do disallow rename and
> > link to punt the problem back up to userspace, but that's an
> > implementation detail to ensure accounting for the data movement to
> > a different tree is correct and not a behavioural or access
> > restriction...
> >
>
> I did not list the no-cross rename()/link() requirement because
> of past project id behavior or because of filesystem implementation
> challenges.
>
> I listed it because of the single-fully-connected-subtree requirement.
> rename()/link() across subtree boundaries would cause multiple roots
> of the same subtree id.
Why would rename even do that? The tree id would change with a move
to the new location. i.e. rename from tree X to tree Y is just "sub
from quota X, add to quota Y". That is, the -tree id of the inode
changes- with a move to a different sub-tree. If you can do a manual
"copy from tree X to Y, unlink from X", then there is no valid
reason for explicitly preventing rename() from being used to do that
same operation. Whether a filesystem can implement that operation in
rename is another matter, but it's not a reason for preventing
rename across subtrees that exist in the same filesystem.
Remember, quotas are for accounting resource usage, not for
restricting access or controlling what operations can be performed
on user data. That's what permissions are for. IOWs, if we want
"tree ids" to be able to prevent moving data from one sub tree to
another, we need to add those as access controls to the VFS, not
hide them deep down in the quota implementations in each filesystem.
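To illustrate that accounting, a toy model of rename-as-quota-transfer
(userspace Python with invented names; nothing to do with the actual
kernel quota code):

```python
class TreeQuota:
    """Toy model: rename across trees is just 'subtract from X, add to Y'.

    No access restriction is involved; the inode's tree id changes with
    the move and only the resource accounting is transferred."""

    def __init__(self):
        self.blocks = {}  # tree id -> blocks accounted

    def add(self, tid, nblocks):
        self.blocks[tid] = self.blocks.get(tid, 0) + nblocks

    def rename(self, src_tid, dst_tid, nblocks):
        # Pure accounting transfer: sub from quota X, add to quota Y.
        self.blocks[src_tid] -= nblocks
        self.blocks[dst_tid] = self.blocks.get(dst_tid, 0) + nblocks

q = TreeQuota()
q.add(1, 100)        # tree 1 owns a 100-block file
q.rename(1, 2, 100)  # mv across trees: accounting follows the file
assert q.blocks == {1: 0, 2: 100}
```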
> Sorry, Dave, for not writing a mathematical definition of a
> singly-connected-subtree-whatever, but by now,
> I hope you all understand very well what this beast is.
> It is really not that complicated (?).
It's not very complicated, but it is flawed. The premise is based on
a tree id that provides path based access restrictions but is
implemented at a layer that knows nothing about VFS paths or how
those paths are constructed by the user.
It really seems like treeid should be a property of a vfs mount,
not filesystem inodes...
> The reason I started the discussion with implementation rules
> is *because* they are so simple to understand, so I figured
Please don't assume that "simple rules" means the concept is simple
or that the implementation is simple. "Simple" often means "this
hasn't been thought through fully yet" and it's not until somebody
else starts trying to understand why it is "simple" that all the
"not simple" issues with the proposal are documented...
Cheers,
Dave.
--
Dave Chinner
[email protected]