Hi,
This is (yet another) repost of the patch series that fixes the
existing BLKZEROOUT ioctl to invalidate the page cache if the zeroing
command to the underlying device succeeds. This patch is against
4.5-rc6 and hasn't changed much in months.
The new BLKZEROOUT ioctl has the same semantics as the old one, but it
invalidates the page cache to prevent surprising results, just like
how dio writes invalidate page cache.
I've incorporated all the feedback I've received into these patches,
but haven't heard yea or nay or anything at all from the maintainer.
Will someone please pick this up for 4.6?
Comments and questions are, as always, welcome.
--D
Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
returning stale cache contents at a later time.
v5: Refactor the 4.4 refactoring of the ioctl code into separate functions.
Split the page invalidation and the new ioctl into separate patches.
Signed-off-by: Darrick J. Wong <[email protected]>
---
block/ioctl.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index d8996bb..c6eb462 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -226,7 +226,9 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
unsigned long arg)
{
uint64_t range[2];
- uint64_t start, len;
+ struct address_space *mapping;
+ uint64_t start, end, len;
+ int ret;
if (!(mode & FMODE_WRITE))
return -EBADF;
@@ -236,18 +238,33 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
start = range[0];
len = range[1];
+ end = start + len - 1;
if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;
- start >>= 9;
- len >>= 9;
-
- if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+ if (end >= (uint64_t)i_size_read(bdev->bd_inode))
+ return -EINVAL;
+ if (end < start)
return -EINVAL;
- return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+ /* Invalidate the page cache, including dirty pages */
+ mapping = bdev->bd_inode->i_mapping;
+ truncate_inode_pages_range(mapping, start, end);
+
+ ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+ false);
+ if (ret)
+ return ret;
+
+ /*
+ * Invalidate again; if someone wandered in and dirtied a page,
+ * the caller will be given -EBUSY.
+ */
+ return invalidate_inode_pages2_range(mapping,
+ start >> PAGE_CACHE_SHIFT,
+ end >> PAGE_CACHE_SHIFT);
}
static int put_ushort(unsigned long arg, unsigned short val)
Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page. This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters. So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior. start and length have the same meaning
as in BLKZEROOUT.
This new ioctl also invalidates the page cache correctly on account
of the previous patch in the series.
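For illustration, a minimal userspace sketch of invoking the new call
(assuming the uapi additions below land as posted; error handling is
trimmed and the helper name is made up):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* struct blkzeroout2, BLKZEROOUT2 */

/* Zero the first 1MB of a block device, allowing a zeroing discard. */
int zero_first_mb(const char *dev)
{
	struct blkzeroout2 p;
	int ret, fd = open(dev, O_WRONLY);

	if (fd < 0)
		return -1;
	memset(&p, 0, sizeof(p));	/* the padding fields must be zero */
	p.start = 0;
	p.length = 1024 * 1024;		/* must be a multiple of 512 */
	p.flags = BLKZEROOUT2_DISCARD_OK;
	ret = ioctl(fd, BLKZEROOUT2, &p);
	close(fd);
	return ret;
}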
v3: Add extra padding for future expansion, and check the padding is zero.
v4: Check the start/len arguments for overflows prior to feeding the page
cache bogus numbers (that it'll ignore anyway).
v5: Refactor the 4.4 refactoring of the ioctl code into separate functions.
Separate patches for invalidation and new ioctl.
Signed-off-by: Darrick J. Wong <[email protected]>
---
block/ioctl.c | 57 ++++++++++++++++++++++++++++++++++++-----------
include/uapi/linux/fs.h | 9 +++++++
2 files changed, 53 insertions(+), 13 deletions(-)
diff --git a/block/ioctl.c b/block/ioctl.c
index c6eb462..5567466 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -222,24 +222,20 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, flags);
}
-static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
- unsigned long arg)
+static int __blk_ioctl_zeroout(struct block_device *bdev,
+ unsigned long long start,
+ unsigned long long len,
+ unsigned int flags)
{
- uint64_t range[2];
struct address_space *mapping;
- uint64_t start, end, len;
+ unsigned long long end;
+ bool discard = false;
int ret;
- if (!(mode & FMODE_WRITE))
- return -EBADF;
-
- if (copy_from_user(range, (void __user *)arg, sizeof(range)))
- return -EFAULT;
-
- start = range[0];
- len = range[1];
end = start + len - 1;
+ if (flags & ~BLKZEROOUT2_DISCARD_OK)
+ return -EINVAL;
if (start & 511)
return -EINVAL;
if (len & 511)
@@ -253,8 +249,10 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
mapping = bdev->bd_inode->i_mapping;
truncate_inode_pages_range(mapping, start, end);
+ if (flags & BLKZEROOUT2_DISCARD_OK)
+ discard = true;
ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
- false);
+ discard);
if (ret)
return ret;
@@ -267,6 +265,37 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
end >> PAGE_CACHE_SHIFT);
}
+static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
+ unsigned long arg)
+{
+ uint64_t range[2];
+
+ if (!(mode & FMODE_WRITE))
+ return -EBADF;
+
+ if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+ return -EFAULT;
+
+ return __blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+}
+
+static int blk_ioctl_zeroout2(struct block_device *bdev, fmode_t mode,
+ unsigned long arg)
+{
+ struct blkzeroout2 p;
+
+ if (!(mode & FMODE_WRITE))
+ return -EBADF;
+
+ if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+ return -EFAULT;
+
+ if (p.padding || p.padding2)
+ return -EINVAL;
+
+ return __blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
+}
+
static int put_ushort(unsigned long arg, unsigned short val)
{
return put_user(val, (unsigned short __user *)arg);
@@ -560,6 +589,8 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
BLKDEV_DISCARD_SECURE);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
+ case BLKZEROOUT2:
+ return blk_ioctl_zeroout2(bdev, mode, arg);
case HDIO_GETGEO:
return blkdev_getgeo(bdev, argp);
case BLKRAGET:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 149bec8..4c7a376 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -222,6 +222,15 @@ struct fsxattr {
#define BLKSECDISCARD _IO(0x12,125)
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+ __u64 start;
+ __u64 length;
+ __u32 flags;
+ __u32 padding;
+ __u64 padding2;
+};
+#define BLKZEROOUT2_DISCARD_OK 1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
#define BLKDAXGET _IO(0x12,129)
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
On Tuesday 01 March 2016 20:09:32 Darrick J. Wong wrote:
> This is (yet another) repost of the patch series that fixes the
> existing BLKZEROOUT ioctl to invalidate the page cache if the zeroing
> command to the underlying device succeeds. This patch is against
> 4.5-rc6 and hasn't changed much in months.
>
> The new BLKZEROOUT ioctl has the same semantics as the old one, but it
> invalidates the page cache to prevent surprising results, just like
> how dio writes invalidate page cache.
>
> I've incorporated all the feedback I've received into these patches,
> but haven't heard yea or nay or anything at all from the maintainer.
> Will someone please pick this up for 4.6?
>
> Comments and questions are, as always, welcome.
I'm missing the background on this, just saw the patch fly by,
so sorry if this has been asked before:
Why do you want to invalidate the cache? Is this to save RAM
or is something else going to write here and you have to invalidate
it for correctness?
If you just want to save RAM, would it be possible to instead
point the page cache to the empty zero page to speed up subsequent
reads? Maybe that just causes more complexity than it helps.
Arnd
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
On Wed, Mar 02, 2016 at 10:15:02AM +0100, Arnd Bergmann wrote:
> I'm missing the background on this, just saw the patch fly by,
> so sorry if this has been asked before:
>
> Why do you want to invalidate the cache? Is this to save RAM
> or is something else going to write here and you have to invalidate
> it for correctness?
BLKZEROOUT can be thought of as a direct I/O operation - it issues a
WRITE SAME (or similar) command straight to the device to zero
the specified range. If there was cached data for this range it would
now be stale.
On Wednesday 02 March 2016 01:44:16 Christoph Hellwig wrote:
> On Wed, Mar 02, 2016 at 10:15:02AM +0100, Arnd Bergmann wrote:
> > I'm missing the background on this, just saw the patch fly by,
> > so sorry if this has been asked before:
> >
> > Why do you want to invalidate the cache? Is this to save RAM
> > or is something else going to write here and you have to invalidate
> > it for correctness?
>
> BLKZEROOUT can be thought of as a direct I/O operation - it issues a
> WRITE SAME (or similar) command straight to the device to zero
> the specified range. If there was cached data for this range it would
> now be stale.
Ok, got it. Thanks,
Arnd
On Tue, Mar 1, 2016 at 8:09 PM, Darrick J. Wong <[email protected]> wrote:
> Create a new ioctl to expose the block layer's newfound ability to
> issue either a zeroing discard, a WRITE SAME with a zero page, or a
> regular write with the zero page. This BLKZEROOUT2 ioctl takes
> {start, length, flags} as parameters. So far, the only flag available
> is to enable the zeroing discard part -- without it, the call invokes
> the old BLKZEROOUT behavior. start and length have the same meaning
> as in BLKZEROOUT.
NAK, just based on annoyance with the randomness of this interface:
- without describing what the point of the new flag and lots of extra
expansion room is, this should never be merged. We don't add padding
just randomly.
- Somewhat related to that: the flags are checked for zero, but the
random expansion room isn't. So not only are the random expansion
fields not explained, they will contain random garbage in the future.
- why is that "FMODE_WRITE" check not in the common code, but duplicated?
it all seems very ad-hoc. It makes a big deal about that idiotic
"discard" behavior, which is entirely pointless.
Who cares about that discard behavior anyway? Why is it set to "false"
in the current version of BLKZEROOUT in the first place? Why do we do
that WRITE_SAME without questioning it, but "discard" is somehow so
special that it has a flag, and it's turned off by default?
If this is some "security issue" where somebody believes that discard
is not secure, then those people are full of shit. Discard and
write-same have exactly the same semantics - they may just unmap the
target range with the guarantee that you'll get zeroes on read.
So quite frankly, right now it seems that
(a) the *only* excuse for this patch is that people want to use "discard"
(b) the reason we already don't use "discard" for the old BLKZEROOUT
is very questionable
(c) any future possible use of flags is not described and is questionable
You'll find people who think that "write-same with some non-zero
pattern" would be a real over-write and good for security. Those
people will then argue that a sane extension would be to make that
pattern part of the future expansion of BLKZEROOUT2. And those people
are full of shit. Write-same with a non-zero pattern may well be just
a discard with the pattern set in another table.
So the whole patch looks pointless.
Why isn't the patch just "change false to true in blk_ioctl_zeroout()
when it calls blkdev_issue_zeroout()".
No new interface, no new random padding, just a simple "it makes
absolutely no sense to not allow discard".
Linus
On Wed, Mar 02, 2016 at 10:52:01AM -0800, Linus Torvalds wrote:
> On Tue, Mar 1, 2016 at 8:09 PM, Darrick J. Wong <[email protected]> wrote:
> > Create a new ioctl to expose the block layer's newfound ability to
> > issue either a zeroing discard, a WRITE SAME with a zero page, or a
> > regular write with the zero page. This BLKZEROOUT2 ioctl takes
> > {start, length, flags} as parameters. So far, the only flag available
> > is to enable the zeroing discard part -- without it, the call invokes
> > the old BLKZEROOUT behavior. start and length have the same meaning
> > as in BLKZEROOUT.
>
> NAK, just based on annoyance with the randomness of this interface:
>
> - without describing what the point of the new flag and lots of extra
The new flag means "if the device supports discard and that discard zeroes data
then it's ok to use discard". The difference between discard/unmap and
write-same is in thin-provisioned storage arrays -- UNMAP can release backing
store, whereas WRITE SAME ensures that something's been written to media.
So by default, BLKZEROOUT2 means "ensure subsequent reads return zeroes and
make sure there's real space waiting for the next time I write to this".
Passing in the flag changes that to "ensure subsequent reads return zeroes and
I don't care what happens to the backing store".
I'll grant that this distinction could be clarified in the header file,
though anyone familiar enough with WS/UNMAP to want to use them from userspace
ought to know that already.
> expansion room is, this should never be merged. We don't add padding
> just randomly.
Oh yes we do. Adding required-zero padding to allow for future increases of
the expressiveness of an ioctl is very common.
$ egrep -rn '(reserved|padding).*;' include/uapi/ | wc -l
564
The size of the ioctl structure is embedded in the ioctl number definition,
so we might as well reserve a lot of space ahead of time. Better that than
having to declare new ioctl numbers every time someone needs to add something.
Both ext4 and XFS perform this sort of future proofing.
> - Somewhat related to that: the flags are checked for zero, but the
> random expansion room isn't. So not only are the random expansion
> fields not explained, they will contain random garbage in the future.
Wrong. The padding fields /are/ checked, and the ioctl returns EINVAL
if they aren't zero:
static int blk_ioctl_zeroout2(struct block_device *bdev, fmode_t mode,
unsigned long arg)
{
...
if (p.padding || p.padding2)
return -EINVAL;
...
}
> - why is that "FMODE_WRITE" check not in the common code, but duplicated?
The old BLKZEROOUT checked FMODE_WRITE before copying the ioctl data from
userspace, so the new BLKZEROOUT2 will behave the same way when possible to
minimize porting hassle for userland.
> it all seems very ad-hoc. It makes a big deal about that idiotic
> "discard" behavior, which is entirely pointless.
No. This is not about enabling use of "that idiotic discard behavior", for
that there's BLKDISCARD. This ioctl does NOT use the handwavy old TRIM
advisory request thing that could return "fuzzy wuzzy" without violating the
specs.
BLKZEROOUT is about telling a device "after this call I want subsequent reads
to return zeroes". The first patch fixes the problem that the pagecache isn't
invalidated when we tell the device that we want to be able to read zeroes.
Even if it behaves according to spec (and even if it doesn't) without that
patch, userland programs can end up reading stale data out of the kernel. That
first patch is purely a bug fix.
However, even once the cache coherence problem is fixed, there's still the
problem that the old BLKZEROOUT didn't maintain page cache coherence and
userspace has no way to figure out whether BLKZEROOUT on a given system
actually will. The reason for creating a new ioctl is to establish an ioctl
that guarantees a page cache flush or returns an error code. While we're
defining a new number, we might as well allow for userspace to control the
discard parameter to blkdev_issue_zeroout(), and reserve more space in the
structure than we think we need.
> Who cares about that discard behavior anyway? Why is it set to "false"
> in the current version of BLKZEROOUT in the first place? Why do we do
> that WRITE_SAME without questioning it, but "discard" is somehow so
> special that it has a flag, and it's turned off by default?
Some users want to use the SCSI command (WRITE SAME) that won't return
GOOD status until blocks full of zeroes have been committed to stable storage
that can be rewritten, and other users are fine with a zeroing TRIM/UNMAP which
can release the storage and DMA back pages of zeroes without committing
any stable storage to future rewrites.
> If this is some "security issue" where somebody believes that discard
> is not secure, then those people are full of shit. Discard and
> write-same have exactly the same semantics - they may just unmap the
> target range with the guarantee that you'll get zeroes on read.
No. The point is that mkfs can zero out filesystem metadata blocks with
confidence that a subsequent re-read will always return zeroes. We tried
making mke2fs use BLKZEROOUT and the cache coherence problem bit us in the
arse.
If you care about securely deleting data, throw it into the sun.
> So quite frankly, right now it seems that
>
> (a) the *only* excuse for this patch is that people want to use "discard"
>
> (b) the reason we already don't use "discard" for the old BLKZEROOUT
> is very questionable
See above.
> (c) any future possible use of flags is not described and is questionable
How? If any of the other flag bits are set, it'll return -EINVAL. Or do you
mean that you think there will never be a need for any other flags? Or are
you complaining that you think there's too much padding and stuff in the
structure?
> You'll find people who think that "write-same with some non-zero
> pattern" would be a real over-write and good for security. Those
> people will then argue that a sane extension would be to make that
> pattern part of the future expansion of BLKZEROOUT2. And those people
> are full of shit. Write-same with a non-zero pattern may well be just
> a discard with the pattern set in another table.
>
> So the whole patch looks pointless.
I disagree, obviously.
> Why isn't the patch just "change false to true in blk_ioctl_zeroout()
> when it calls blkdev_issue_zeroout()".
Some people might want real storage blocks full of zeroes backing the LBAs they
just wrote out, others might not care.
--D
>
> No new interface, no new random padding, just a simple "it makes
> absolutely no sense to not allow discard".
>
> Linus
On Wed, Mar 2, 2016 at 2:56 PM, Darrick J. Wong <[email protected]> wrote:
>
> Oh yes we do. Adding required-zero padding to allow for future increases of
> the expressiveness of an ioctl is very common.
>
> $ egrep -rn '(reserved|padding).*;' include/uapi/ | wc -l
> 564
Most of those should be for alignment reasons. In particular, you'll
find them to make sure subsequent u64's are properly aligned, or at
the end to make sure we have names for the padding at the end when the
last member is differently sized than the alignment of the structure.
> Wrong. The padding fields /are/ checked, and the ioctl returns EINVAL
> if they aren't zero:
Ok, I obviously overlooked that. Good.
>> - why is that "FMODE_WRITE" check not in the common code, but duplicated?
>
> The old BLKZEROOUT checked FMODE_WRITE before copying the ioctl data from
> userspace, so the new BLKZEROOUT2 will behave the same way when possible to
> minimize porting hassle for userland.
That's a valid reason in theory, but I don't think anybody actually
cares. An error is an error. If you pass in invalid pointers _and_ a
read-only file descriptor, nobody will actually look at whether you
got EBADF or EFAULT.
> No. This is not about enabling use of "that idiotic discard behavior", for
> that there's BLKDISCARD. This ioctl does NOT use the handwavy old TRIM
> advisory request thing that could return "fuzzy wuzzy" without violating the
> specs.
So you agree that we could just make BLKZEROOUT always use trim?
> BLKZEROOUT is about telling a device "after this call I want subsequent reads
> to return zeroes". The first patch fixes the problem that the pagecache isn't
> invalidated when we tell the device that we want to be able to read zeroes.
> Even if it behaves according to spec (and even if it doesn't) without that
> patch, userland programs can end up reading stale data out of the kernel. That
> first patch is purely a bug fix.
Oh, I'm not at all arguing against the first patch.
It's the second one I think falls under the heading of "seems
over-engineered and unnecessary".
> However, even once the cache coherence problem is fixed, there's still the
> problem that the old BLKZEROOUT didn't maintain page cache coherence and
> userspace has no way to figure out whether BLKZEROOUT on a given system
> actually will. The reason for creating a new ioctl is to establish an ioctl
> that guarantees a page cache flush or returns an error code. While we're
> defining a new number, we might as well allow for userspace to control the
> discard parameter to blkdev_issue_zeroout(), and reserve more space in the
> structure than we think we need.
So quite frankly, this brings up three issues:
- why was this not explained in the commit message
- why would anybody *want* to control that parameter?
- what other parameters _could_ there be?
> Some users want to use the SCSI command (WRITE SAME) that won't return
> GOOD status until blocks full of zeroes have been committed to stable storage
> that can be rewritten, and other users are fine with a zeroing TRIM/UNMAP which
> can release the storage and DMA back pages of zeroes without committing
> any stable storage to future rewrites.
Ehh. I'm not convinced that any disk that completes a TRIM command
without it being "stable" would do the same for WRITE_SAME.
From everything I've ever seen, the difference between "stable on
disk" and "not stable" has almost never been about the particular
command used, and always been about the choice of the particular
target disk.
> No. The point is that mkfs can zero out filesystem metadata blocks with
> confidence that a subsequent re-read will always return zeroes. We tried
> making mke2fs use BLKZEROOUT and the cache coherence problem bit us in the
> arse.
Again, I don't disagree about the cache coherence at all.
> How? If any of the other flag bits are set, it'll return -EINVAL. Or do you
> mean that you think there will never be a need for any other flags? Or are
> you complaining that you think there's too much padding and stuff in the
> structure?
I absolutely *detest* code that tries to be overly forward-thinking by
randomly adding fields without even having an idea of what those
fields could possibly be used for.
I'm perfectly ok with padding that has a *reason*. Not a "we might use
it for something in the future".
> Some people might want real storage blocks full of zeroes backing the LBAs they
> just wrote out, others might not care.
.. but the flag doesn't even set that. Even if you avoid TRIM, there
is absolutely zero guarantees that WRITE_SAME would do "real storage
blocks full of zeroes backing the LBAs they just wrote out".
In fact, even if you do a real write of zeroed buffers, there's no
such guarantee. Doing some compression in the disk controller isn't
exactly unusual, even in enterprise hw (ie de-dup etc).
So I think this boils down to:
- I can see the "guaranteed cache flush". Get an error on old kernels
that might be without the flush.
At the same time, that sounds sad. The "add cache flush" sounds
like something that should be backported to stable, but new ioctl's?
Very questionable.
So now people who care would be forced to effectively do big zero
writes because they can't trust old kernels even if they are correct.
- it would seem that what you really want is a *generic* "invalidate
this page cache range" thing, because you have other users like the
whole "assemble SCSI command by hand" usage cases.
For example, maybe sync_file_range() should just be extended to
have a SYNC_FILE_RANGE_INVALIDATE flag? That sounds like a *much* more
generic thing that could be useful for other things than just the zero
write? (A sketch of that hypothetical flag appears below.)
- I'd be much more ok with flags and extensions if they had
explanations and future cases.
IOW, this thing seems both too specific ("I need guarantees about
cache flushing, but only for this zerowrite thing") and too
"future-proofing" (I don't know what I might want in the future, but
zerowrite is such a generic operation that it might want to have tons
of other arguments too").
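A sketch of how such a hypothetical flag might look from userspace
(SYNC_FILE_RANGE_INVALIDATE does not exist in any kernel; the name and
value here are illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>

/* Illustrative only: this flag is a proposal, not a kernel API. */
#define SYNC_FILE_RANGE_INVALIDATE	(1 << 3)

/* Write back [off, off+len), wait for it, then drop the cached pages. */
int flush_and_invalidate(int fd, off64_t off, off64_t len)
{
	return sync_file_range(fd, off, len,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER |
			       SYNC_FILE_RANGE_INVALIDATE);
}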
It just rubs me the wrong way.
Linus
On Wed, Mar 02, 2016 at 03:49:53PM -0800, Linus Torvalds wrote:
> > No. This is not about enabling use of "that idiotic discard behavior", for
> > that there's BLKDISCARD. This ioctl does NOT use the handwavy old TRIM
> > advisory request thing that could return "fuzzy wuzzy" without violating the
> > specs.
>
> So you agree that we could just make BLKZEROOUT always use trim?
There is a massive bug in the SATA specs about trim, which is that it
is considered advisory. So the storage device can throw it away
whenever it feels like it. (In practice, when it's too busy doing
other things).
The thing is, this fuzzy wuzzy definition of trim in the SATA specs is
more and more a SATA-specific bug. The eMMC 5.1 spec has a reliable
trim, and SCSI has WRITE SAME. So it's just SATA which has this crazy
definition of trim.
(In practice there is no real difference between trim and discard;
it's just a matter of terminology; in order to determine whether or
not trim/discard is reliable you really have to pay close attention to
the specs, or in the case of SATA, use a whitelist of drive models,
which is crazy.)
- Ted
On Thu, Mar 3, 2016 at 9:02 AM, Theodore Ts'o <[email protected]> wrote:
>
> There is a massive bug in the SATA specs about trim, which is that it
> is considered advisory. So the storage device can throw it away
> whenever it feels like it. (In practice, when it's too busy doing
> other things).
Ugh.
But that essentially says that we shouldn't expose this interface at
all (unless we trust our white-lists - I'm sure they are getting
better, but if nobody has ever really _relied_ on the zeroing behavior
of trim, then I guess there could be tons of bugs lurking).
Or maybe we should expose it, but not call it BLKZEROOUT, and make it
*much* more generic.
That might actually put some of my complaints to rest: if this is more
a general "manage this range of blocks" model, then the flags make
more sense to me.
So what are people actually wanting to do?
If they don't care horribly about the zeroing, they might then say
"using trim is ok". But why wouldn't they use the BLKDISCARD ioctl
then?
So just looking at this more, that "trim is ok" flag still doesn't
make much sense to me.
I see two cases: either we guarantee zero-out behavior with discard
set to true (and we trust our whitelists), or we don't. Can anybody
see a third alternative?
And if we don't guarantee zero-out behavior from
blkdev_issue_zeroout() with "discard" set to true, then why would we
expose such a random interface to user space? No sane user space could
*possibly* use it: if they care about zeroing, it's the wrong thing to
do, and if they *don't* care about zeroing it's still the wrong thing
to do.
In other words, I still don't see how that flag can possibly make
sense in any possible scenario.
Put succinctly:
"Either we trust trim and our whitelists (in which case _not_
using trim makes no sense), or we don't (in which case exposing a random
untrustworthy user interface is pointless, since any user would be
fundamentally broken and should just have used BLKDISCARD)"
See where I'm coming from?
Now, the reason I think a more generic model that *isn't* hung up
about zeroing the buffer might be ok is that maybe it would be a good
thing to have a more unified interface for doing all those things
people do want to do:
- flush caches
- discard (our current BLKDISCARD doesn't flush caches either, so
together with flushing caches this is something new)
- zero out
- synchronous/asynchronous
- other things?
So I do see a case for passing in multiple flags, but a lot of that
case ends up depending on the zeroing out *not* being the most central
feature.
I very much could see wanting "discard these blocks and flush caches".
And I could see just "flush caches", with or without zeroing. But I do
*not* see the point of "discard blocks and zero" for the reasons
outlined above.
Linus
On Thu, Mar 03, 2016 at 09:55:38AM -0800, Linus Torvalds wrote:
> Ugh.
>
> But that essentially says that we shouldn't expose this interface at
> all (unless we trust our white-lists - I'm sure they are getting
> better, but if nobody has ever really _relied_ on the zeroing behavior
> of trim, then I guess there could be tons of bugs lurking).
Fortunately what Ted said is close to the truth, but a little off.
The mess about ATA is all real, and we work around it by never claiming
discard zeroes data for any ATA device except those specifically
whitelisted because the vendors gave guarantees for it (see the
ATA_HORKAGE_ZERO_AFTER_TRIM flag and the commit introducing it).
The reason why we don't use it unconditionally is simply because
there are users that want blocks zeroed without deallocating the
physical blocks. Now whether you want to opt into the trim/unmap
behavior is one question, but we clearly need the split.
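As an aside for userspace: kernels of this era export whether a device's
discards are guaranteed to zero data via sysfs, so a caller deciding
whether to request the discard-based path could check that first. A
sketch, assuming the standard queue attribute:

#include <stdio.h>

/* Return 1 if discards on @disk (e.g. "sda") are guaranteed to zero
 * data, 0 otherwise or if the attribute cannot be read. */
static int discard_zeroes_data(const char *disk)
{
	char path[128];
	int val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/discard_zeroes_data", disk);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);
	return val == 1;
}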
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Linus> .. but the flag doesn't even set that. Even if you avoid TRIM,
Linus> there is absolutely zero guarantees that WRITE_SAME would do
Linus> "real storage blocks full of zeroes backing the LBAs they just
Linus> wrote out".
That's not entirely true. Writing the blocks may cause them to be
allocated on the storage device (depending on which flags we feed it in
WRITE SAME).
The filesystem people wanted the following semantics:
- deallocate, don't care about contents for future reads (discard)
- deallocate, guarantee zeroes on future reads (zeroout)
- (re)allocate, guarantee zeroes on future reads (zeroout)
Maybe we just need a better naming scheme...
--
Martin K. Petersen Oracle Linux Engineering
On Thu, Mar 03, 2016 at 01:01:11PM -0500, Martin K. Petersen wrote:
> That's not entirely true. Writing the blocks may cause them to be
> allocated on the storage device (depending on which flags we feed it in
> WRITE SAME).
>
> The filesystem people wanted the following semantics:
>
> - deallocate, don't care about contents for future reads (discard)
> - deallocate, guarantee zeroes on future reads (zeroout)
> - (re)allocate, guarantee zeroes on future reads (zeroout)
>
> Maybe we just need a better naming scheme...
In filesystem terms we have two and three:
- FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but space is
deallocated as much as possible
- FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks are
actually allocated
Returning stale blocks in a file system is a nasty security risk, so
we don't do that, and so shouldn't storage that offers any kind
of multi-tenancy, even if it's just VMs using multiple partitions on it.
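For reference, on regular files those two semantics are already
reachable via fallocate(2); a minimal sketch:

#define _GNU_SOURCE
#include <fcntl.h>

/* Deallocate blocks; reads of the range then return zeroes. */
int punch_hole(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}

/* Keep (or allocate) blocks; reads of the range return zeroes. */
int zero_range(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len);
}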
On Thu, Mar 03, 2016 at 10:09:24AM -0800, Christoph Hellwig wrote:
> On Thu, Mar 03, 2016 at 01:01:11PM -0500, Martin K. Petersen wrote:
> > That's not entirely true. Writing the blocks may cause them to be
> > allocated on the storage device (depending on which flags we feed it in
> > WRITE SAME).
> >
> > The filesystem people wanted the following semantics:
> >
> > - deallocate, don't care about contents for future reads (discard)
> > - deallocate, guarantee zeroes on future reads (zeroout)
> > - (re)allocate, guarantee zeroes on future reads (zeroout)
> >
> > Maybe we just need a better naming scheme...
>
> In filesystem terms we have two and three:
>
> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but space is
> deallocated as much as possible
> - FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks are
> actually allocated
>
> Returning stale blocks in a file system is a nasty security risk, so
> we don't do that, and so shouldn't storage that offers any kind
> of multi-tenancy, even if it's just VMs using multiple partitions on it.
Any particular reason why we can't just implement those two fallocate
flags for block devices?
--D
On Thu, Mar 3, 2016 at 10:01 AM, Martin K. Petersen
<[email protected]> wrote:
>>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
>
> Linus> .. but the flag doesn't even set that. Even if you avoid TRIM,
> Linus> there is absolutely zero guarantees that WRITE_SAME would do
> Linus> "real storage blocks full of zeroes backing the LBAs they just
> Linus> wrote out".
>
> That's not entirely true. Writing the blocks may cause them to be
> allocated on the storage device (depending on which flags we feed it in
> WRITE SAME).
Ok, so now we're getting somewhere, with actual _reasons_ why somebody
would want to use one interface over another.
> The filesystem people wanted the following semantics:
>
> - deallocate, don't care about contents for future reads (discard)
> - deallocate, guarantee zeroes on future reads (zeroout)
> - (re)allocate, guarantee zeroes on future reads (zeroout)
>
> Maybe we just need a better naming scheme...
Yes.
And this does make me think that Christoph is right: this would be so
much better if the block layer just supported fallocate() instead,
which already has those operations.
Right now we have
if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
return -ENODEV;
so right now the vfs_fallocate() code explicitly disallows block
devices, but that would be easy to expand.
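A sketch of that relaxation (blkdev_fallocate() here is hypothetical;
no such handler exists at this point):

	/* Hypothetical: route block devices to their own fallocate
	 * handler instead of rejecting them outright. */
	if (S_ISBLK(inode->i_mode))
		return blkdev_fallocate(file, mode, offset, len);
	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
		return -ENODEV;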
Would people be happy with that kind of patch instead? It would
certainly make all my objections go away..
Linus
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Linus> On Thu, Mar 3, 2016 at 9:02 AM, Theodore Ts'o <[email protected]> wrote:
>>
>> There is a massive bug in the SATA specs about trim, which is that it
>> is considered advisory. So the storage device can throw it away
>> whenever it feels like it. (In practice, when it's too busy doing
>> other things).
Linus> Ugh.
SCSI UNMAP provides similar semantics :(
Linus> But that essentially says that we shouldn't expose this interface
Linus> at all (unless we trust our white-lists - I'm sure they are
Linus> getting better, but if nobody has ever really _relied_ on the
Linus> zeroing behavior of trim, then I guess there could be tons of
Linus> bugs lurking).
We started out with the blacklist approach and it blew up. So now we're
down to a whitelist of drives that have been sanctioned by their
manufacturers to do the right thing.
Occasionally a new SSD model messes things up but we haven't updated it
since last summer. The drive vendors are much better at testing using
Linux than they used to be.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, Mar 03, 2016 at 09:55:38AM -0800, Linus Torvalds wrote:
> But that essentially says that we shouldn't expose this interface at
> all (unless we trust our white-lists - I'm sure they are getting
> better, but if nobody has ever really _relied_ on the zeroing behavior
> of trim, then I guess there could be tons of bugs lurking).
We don't, so this interface won't be useful for SATA disks, where
we'll need to write zeros until the SATA folks get off their duffs and
fix it with a new, reliable trim command.
But it will be useful for other storage systems, such as eMMC devices,
which *do* have a reliable trim command. So it may be that the first
place we'll see widespread usage of this will be in the low-end and
high-end systems (where we can rely on eMMC's reliable trim and SCSI's
WRITE SAME command).
But that's why we want to have a new interface which is distinct from
BLKDISCARD. We want one interface for an advisory hint (we don't care
about the contents, so if it's convenient, feel free to forget about
the contents and replace it by zeros), and something where it truly is
a zeroout command. The intention is that BLKZEROOUT will be the
reliable zeroout command, while BLKDISCARD will be the unreliable
advisory hint.
- Ted
>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
Christoph> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
Christoph> space is deallocated as much as possible -
Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
Christoph> are actually allocated
That works for me. I think it would be great if we could have consistent
interfaces for fs and block. The more commonality the merrier.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, Mar 03, 2016 at 01:54:54PM -0500, Martin K. Petersen wrote:
> >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>
> Christoph> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
> Christoph> space is deallocated as much as possible -
> Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
> Christoph> are actually allocated
>
> That works for me. I think it would be great if we could have consistent
> interfaces for fs and block. The more commonality the merrier.
So a question I have is do we want to add a "discard-as-a-hint" analog
for fallocate? In the past we've said no because we don't trust
userspace. (We seem to have as a core held belief that application
programmers are idiots and are not to be trusted with anything
dangerous, even if it would be highly useful in certain use cases.)
As a result I'm carrying an out-of-tree patch in our Google kernels so
that ext4 will honor BLKDISCARD on files. I can't remember if I
floated it on linux-fsdevel, or I didn't bother because I knew it
would instantly shot down. I believe it was the former, but I can't
be 100% sure.
It would be kind of nice, though, if we had some kind of agreement on
a consistent, unified interface for all three kinds of
"discard-as-a-hint", "zeroout", and "zeroout with deallocation" that
worked on block devices and files. Whether it's via fallocate(2) or
BLK* ioctls, I'm agnostic.
- Ted
P.S. Speaking of things that are powerful and too dangerous for
application programmers, after the Linux FAST workshop, I was having
dinner with the Ceph developers and Ric Wheeler, and we were talking
about things they really needed. Turns out they also could use an
FALLOC_FL_NO_HIDE_STALE functionality. I told them I had an
out-of-tree patch that had that functionality, and even Ric Wheeler
started getting tempted.... :-)
On Thu, Mar 03, 2016 at 01:54:54PM -0500, Martin K. Petersen wrote:
> >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>
> Christoph> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
> Christoph> space is deallocated as much as possible -
> Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
> Christoph> are actually allocated
>
> That works for me. I think it would be great if we could have consistent
> interfaces for fs and block. The more commonality the merrier.
Absolutely in agreement here. It would be much nicer if filesystems
could just call bdev->ops->fallocate(PUNCH_HOLE, off, len) and
bdev->ops->fallocate(ZERO_RANGE, off, len) than all the weird
"technology specific" blkdev_issue_foo() functions we have grown
over time. Let the block device implement them as it sees fit - the
higher levels don't need to care about protocol/technology details.
---
FWIW, this reminds me of a "bigger picture" I think we should
be working towards. Does anyone remember this:
https://lwn.net/Articles/592091/
(Splitting filesytems in two)
i.e. if we add fallocate support to punch holes, zero ranges and
*allocate blocks* to a block device, we're mostly at the point where
we can offload all freespace management that the filesystem
currently does to the underlying block device.
There's really only a small extension we'd need - the block
allocation done by the block device needs to be able to return
the sector and length of the newly allocated extent. Indeed, this is
something we talked about last year at LSFMM as a solution to the
SMR write ordering problem:
https://lwn.net/Articles/637035/
(near the end, paragraph talking about a "new kind of write command")
That "new kind of write command" would enable delayed allocation
algorithms to continue to work at the filesystem level on block
devices that freespace management is completely offloaded to...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Mar 03, 2016 at 05:39:52PM -0500, Theodore Ts'o wrote:
> On Thu, Mar 03, 2016 at 01:54:54PM -0500, Martin K. Petersen wrote:
> > >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
> >
> > Christoph> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
> > Christoph> space is deallocated as much as possible -
> > Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
> > Christoph> are actually allocated
> >
> > That works for me. I think it would be great if we could have consistent
> > interfaces for fs and block. The more commonality the merrier.
>
> So a question I have is do we want to add a "discard-as-a-hint" analog
> for fallocate?
Well defined, reliable behaviour only, please. If the device can't
provide the required hardware offload, then it needs to use the
generic, slow implementation of the functionality or report
EOPNOTSUPP.
> P.S. Speaking of things that are powerful and too dangerous for
> application programmers, after the Linux FAST workshop, I was having
> dinner with the Ceph developers and Ric Wheeler, and we were talking
> about things they really needed. Turns out they also could use an
> FALLOC_FL_NO_HIDE_STALE functionality.
For better or for worse, Ceph is moving away from using filesystems
for its back end object store, so the use of such a hack in Ceph
has a very limited life.
> I told them I had an
> out-of-tree patch that had that functionality, and even Ric Wheeler
> started getting tempted.... :-)
You can tempt all you want, but it does not change the basic fact
that it is dangerous and compromises system security. As such, it
does not belong in upstream kernels. Especially in this day and age
where ensuring the fundamental integrity of our systems is more
important than ever.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Mar 04, 2016 at 10:10:50AM +1100, Dave Chinner wrote:
> You can tempt all you want, but it does not change the basic fact
> that it is dangerous and compromises system security. As such, it
> does not belong in upstream kernels. Especially in this day and age
> where ensuring the fundamental integrity of our systems is more
> important than ever.
The fact of the matter is that as there are more user-space servers for
cluster file systems which are trusted to do the right thing, and are
considered inside the TCB, the demand for this is going to be strong. We
do actually use a group id to provide access control to what is
admittedly a dangerous feature, so it's not available for all
userspace callers, and we didn't want the file server to run as root.
But I'll let the Ceph folks advocate internally for such a feature to
be shipped in RHEL if they feel strongly about it, and if later on we
want to come to an agreement about a better access control mechanism
than membership in a privileged group id passed in as a mount option,
I'm not wedded to that facility. I didn't think it was worth cracking
open a new capability bit, but if we want to go down that route, or
some other route to provide the appropriate security controls, we
should definitely talk about it.
Cheers,
- Ted
On 03/03/2016 11:56 PM, Dave Chinner wrote:
> That "new kind of write command" would enable delayed allocation
> algorithms to continue to work at the filesystem level on block
> devices that freespace management is completely offloaded to...
> Cheers, Dave.
This would advocate a uniform /internal/ interface (family) across both
fs and block layers, similar in spirit to my old Athomux research
prototype long ago (see http://www.athomux.net).
This allows for recursive nesting in complex (distributed) storage/fs
hierarchies.
It would be nice if that internal interface (family) would be (partly /
fully) asynchronous with callbacks. In the ideal case, it should be
compatible with workqueues (no need for blocking threads anymore).
Uniformity is only needed at the concept level. Different subsystems
might retain different flavours of concrete interfaces, provided the
number of subsystems stays as small as possible and interfacing remains
close to trivial.
I would like to support this also in future versions of MARS (see
github.com/schoebel/mars).
Cheers, Thomas
On Thu, Mar 3, 2016 at 3:10 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Mar 03, 2016 at 05:39:52PM -0500, Theodore Ts'o wrote:
>> On Thu, Mar 03, 2016 at 01:54:54PM -0500, Martin K. Petersen wrote:
>> > >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>> >
>> > Christoph> - FALLOC_FL_PUNCH_HOLE assures zeroes are returned, but
>> > Christoph> space is deallocated as much as possible -
>> > Christoph> FALLOC_FL_ZERO_RANGE assures zeroes are returned, AND blocks
>> > Christoph> are actually allocated
>> >
>> > That works for me. I think it would be great if we could have consistent
>> > interfaces for fs and block. The more commonality the merrier.
>>
>> So a question I have is do we want to add a "discard-as-a-hint" analog
>> for fallocate?
>
> Well defined, reliable behaviour only, please. If the device can't
> provide the required hardware offload, then it needs to use the
> generic, slow implementation of the functionality or report
> EOPNOTSUPP.
>
>> P.S. Speaking of things that are powerful and too dangerous for
>> application programmers, after the Linux FAST workshop, I was having
>> dinner with the Ceph developers and Ric Wheeler, and we were talking
>> about things they really needed. Turns out they also could use an
>> FALLOC_FL_NO_HIDE_STALE functionality.
>
> For better or for worse, Ceph is moving away from using filesystems
> for its back end object store, so the use of such a hack in Ceph
> has a very limited life.
Well, let's be clear: the reason Ceph is moving away from using local
filesystems is because we couldn't get the overheads of using them
down to what we considered an acceptable level. There are always going
to be some inefficiencies from it of course (since you have two
metadata streams) but the more issues get addressed, the fewer
userspace filesystems will feel or run up against the need to do their
own block device management. :) If none of them get fixed the same
scenario will just repeat itself — a userspace filesystem rises, it
tries to get features it needs into the kernel, it eventually gives up
and drops the kernel out of the loop, and then the fact that nobody's
using the kernel in this scenario will be considered a reason not to
make it work better.
I really am sensitive to the security concerns, just know that if it's
a permanent blocker you're essentially blocking out a growing category
of disk users (who run on an awfully large number of disks!).
-Greg
>
>> I told them I had an
>> out-of-tree patch that had that functionality, and even Ric Wheeler
>> started getting tempted.... :-)
>
> You can tempt all you want, but it does not change the basic fact
> that it is dangerous and compromises system security. As such, it
> does not belong in upstream kernels. Especially in this day and age
> where ensuring the fundamental integrity of our systems is more
> important than ever.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
On Wed, Mar 09, 2016 at 02:20:31PM -0800, Gregory Farnum wrote:
> I really am sensitive to the security concerns, just know that if it's
> a permanent blocker you're essentially blocking out a growing category
> of disk users (who run on an awfully large number of disks!).
Or they just have to use kernels with out-of-tree patches installed. :-P
If you want to consider how many disks Google has that are using this
patch, I probably could have appealed to Linus and asked him to accept
the patch if I forced the issue. The only reason why I didn't was
that people like Ric Wheeler threatened to have distro-specific
patches to disable the feature, and at the end of the day, I didn't
care that much. After all, if it makes it harder for large scale
cloud companies besides Google to create more efficient userspace
cluster file systems, it's not like I was keeping the patch a secret.
So ultimately, if the Ceph developers want to make a case to Red Hat
management that this is important, great. If not, it's not that hard
for those people who need the patch and who are running large cloud
infrastructures to simply apply the out-of-tree patch if they need it.
Cheers,
- Ted
On 03/10/2016 04:38 AM, Theodore Ts'o wrote:
> On Wed, Mar 09, 2016 at 02:20:31PM -0800, Gregory Farnum wrote:
>> I really am sensitive to the security concerns, just know that if it's
>> a permanent blocker you're essentially blocking out a growing category
>> of disk users (who run on an awfully large number of disks!).
> Or they just have to use kernels with out-of-tree patches installed. :-P
>
> If you want to consider how many disks Google has that are using this
> patch, I probably could have appealed to Linus and asked him to accept
> the patch if I forced the issue. The only reason why I didn't was
> that people like Ric Wheeler threatened to have distro-specific
> patches to disable the feature, and at the end of the day, I didn't
> care that much. After all, if it makes it harder for large scale
> cloud companies besides Google to create more efficient userspace
> cluster file systems, it's not like I was keeping the patch a secret.
>
> So ultimately, if the Ceph developers want to make a case to Red Hat
> management that this is important, great. If not, it's not that hard
> for those people who need the patch and who are running large cloud
> infrastructures to simply apply the out-of-tree patch if they need it.
>
> Cheers,
>
> - Ted
>
What was objectionable at the time this patch was raised years back (not just to
me, but to pretty much every fs developer at LSF/MM that year) centered on the
concern that this would be viewed as a "performance" mode and we get pressure to
support this for non-privileged users. It gives any user effectively the ability
to read the block device content for previously allocated data without restriction.
At the time, I also don't recall seeing the patch posted on upstream lists for
debate or justification.
As we discussed a few weeks back, I don't object to having support for doing
this in carefully controlled ways for things like user space file systems. In
effect, the problem of preventing other people's data being handed over to the
end user is taken on by that layer of code. I suspect that fits the use case at
google and Ceph both.
Regards,
Ric
On Thu, Mar 10, 2016 at 6:58 AM, Ric Wheeler <[email protected]> wrote:
>
> What was objectionable at the time this patch was raised years back (not
> just to me, but to pretty much every fs developer at LSF/MM that year)
> centered on the concern that this would be viewed as a "performance" mode
> and we get pressure to support this for non-privileged users. It gives any
> user effectively the ability to read the block device content for previously
> allocated data without restriction.
The sane way to do it would be to just check permissions of the
underlying block device.
That way, people can just set the permissions for that to whatever
they want. If google right now uses some magical group for this, they
could make the underlying block device be writable for that group.
We can do the security check at the filesystem level, because we have
sb->s_bdev->bd_inode, and if you have read and write permissions to
that inode, you might as well have permission to create an unsafe hole.
That doesn't sound very hacky to me.
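A sketch of what that check could look like (the helper name is made
up; inode_permission() is the existing VFS entry point):

/* Hypothetical: allow punching unsafe holes only when the caller
 * could read and write the underlying block device anyway. */
static bool may_create_unsafe_hole(struct super_block *sb)
{
	struct inode *bd_inode = sb->s_bdev->bd_inode;

	return inode_permission(bd_inode, MAY_READ | MAY_WRITE) == 0;
}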
Linus
On Thu, Mar 10, 2016 at 10:33:49AM -0800, Linus Torvalds wrote:
> On Thu, Mar 10, 2016 at 6:58 AM, Ric Wheeler <[email protected]> wrote:
> >
> > What was objectionable at the time this patch was raised years back (not
> > just to me, but to pretty much every fs developer at LSF/MM that year)
> > centered on the concern that this would be viewed as a "performance" mode
> > and we get pressure to support this for non-privileged users. It gives any
> > user effectively the ability to read the block device content for previously
> > allocated data without restriction.
Sure, but it was never "any user". We always had group-based
permissions from the beginning. Sure, we passed it in via a mount
option which was a bit hacky, but we never got to the point of
discussing the way that we would modulate the access --- the complaint
seemed to be that if it was a non-root user, it was an unacceptable
security hole. And the pushback I got was more in the way of a
religious objection than anything else. Heck, even reserving a
code point for the out-of-tree patch received a huge amount of
pushback.
> The sane way to do it would be to just check permissions of the
> underlying block device.
>
> That way, people can just set the permissions for that to whatever
> they want. If google right now uses some magical group for this, they
> could make the underlying block device be writable for that group.
I'd suggest making it depend on having *read* access to the block device.
After all, the risk that everyone was all excited about was the risk
of being able to read stale (deleted) data from old files. And
there's no point giving the userspace cluster file system daemon the
ability to corrupt the file system or set the setuid bit on some
arbitrary executable.
And if we are going to go this far, then I'd also suggest using this
permission check to the user the ability to issue BLKDISCARD on a
file. Allowing BLKDISCARD on files is one that should have been even
more of a no-brainer, since it could never reveal stale data, but
simply wasn't guaranteed to have reliable results because it was a
hint to the underlying storage device. But this has also received a
huge amount of religious pushback, which is why this is also an
out-of-tree patch in the Google kernel. (If that means that our
competitors have a higher flash TCO than us, again, no skin off my
nose. I tried to get it upstream, and the cost of forward porting the
patch each time we rebase the kernel isn't _that_ annoying.)
- Ted
> We can do the security check at the filesystem level, because we have
> sb->s_bdev->bd_inode, and if you have read and write permissions to
> > that inode, you might as well have permission to create an unsafe hole.
>
> That doesn't sound very hacky to me.
>
> Linus
On 03/11/2016 12:03 AM, Linus Torvalds wrote:
> On Thu, Mar 10, 2016 at 6:58 AM, Ric Wheeler <[email protected]> wrote:
>> What was objectionable at the time this patch was raised years back (not
>> just to me, but to pretty much every fs developer at LSF/MM that year)
>> centered on the concern that this would be viewed as a "performance" mode
>> and we get pressure to support this for non-privileged users. It gives any
>> user effectively the ability to read the block device content for previously
>> allocated data without restriction.
> The sane way to do it would be to just check permissions of the
> underlying block device.
>
> That way, people can just set the permissions for that to whatever
> they want. If google right now uses some magical group for this, they
> could make the underlying block device be writable for that group.
>
> We can do the security check at the filesystem level, because we have
> sb->s_bdev->bd_inode, and if you have read and write permissions to
> that inode, you might as well have permission to create an unsafe hole.
>
> That doesn't sound very hacky to me.
>
> Linus
I agree that this sounds quite reasonable.
Ric
> > We can do the security check at the filesystem level, because we have
> > sb->s_bdev->bd_inode, and if you have read and write permissions to
> > that inode, you might as well have permission to create an unsafe hole.
Not if you don't have access to a block device node to open it, or there
are SELinux rules that control the access. There are cases where it isn't
entirely the same thing as far as I can see. Consider within a container
for example.
The paranoid approach would IMHO be to have a mount option so you can
explicitly declare a file system mount should trust its owner/group and
then that can also be used to wire up any other "unsafe" activities in a
general "mounted for a special use" option.
Alan
On Fri, Mar 11, 2016 at 01:59:52PM +0000, One Thousand Gnomes wrote:
> > > We can do the security check at the filesystem level, because we have
> > > sb->s_bdev->bd_inode, and if you have read and write permissions to
> > > that inode, you might as well have permission to create an unsafe hole.
>
> Not if you don't have access to a block device node to open it, or there
> are SELinux rules that control the access. There are cases where it isn't
> entirely the same thing as far as I can see. Consider within a container
> for example.
In a container this shouldn't be a problem so long as we use uid mapping
when making the group id check.
> The paranoid approach would IMHO be to have a mount option so you can
> explicitly declare a file system mount should trust its owner/group and
> then that can also be used to wire up any other "unsafe" activities in a
> general "mounted for a special use" option.
Indeed, that's what we're currently doing. We've actually been using
different gids for each "privileged" operation, though, since we want
to have fine-grained access controls.
The process that can perform an operation which can result in the
ability to read stale data might not need (and therefore should not be
given) access to be able to issue TCG/Opal management commands to the
HDD, for example.
- Ted
On Fri, Mar 11, 2016 at 5:59 AM, One Thousand Gnomes
<[email protected]> wrote:
>
> > > We can do the security check at the filesystem level, because we have
> > > sb->s_bdev->bd_inode, and if you have read and write permissions to
> > > that inode, you might as well have permission to create an unsafe hole.
>
> Not if you don't have access to a block device node to open it, or there
> are SELinux rules that control the access. There are cases where it isn't
> entirely the same thing as far as I can see. Consider within a container
> for example.
I agree that it's not the same thing, but I don't think it really ends
up mattering.
Either the container is properly separated and set up - in which case
the uid mapping is what protects you - or it isn't - in which case the
container could just mknod whatever hell node it wants anyway.
So we do pretty much have the permission model.
> The paranoid approach would IMHO be to have a mount option so you can
> explicitly declare a file system mount should trust its owner/group and
> then that can also be used to wire up any other "unsafe" activities in a
> general "mounted for a special use" option.
I think that a mount option to enable it isn't a bad idea, but the
mount option should be something generic and not be about "this group
is special" which it sounds like google is currently using.
More like "enable hole punching" - which doesn't enable it
unconditionally, you'd still have the security checks.
Linus
On Fri, Mar 11, 2016 at 9:23 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Mar 11, 2016 at 5:59 AM, One Thousand Gnomes
> <[email protected]> wrote:
>>
>> > > We can do the security check at the filesystem level, because we have
>> > > sb->s_bdev->bd_inode, and if you have read and write permissions to
>> > > that inode, you might as well have permission to create an unsafe hole.
>>
>> Not if you don't have access to a block device node to open it, or there
>> are SELinux rules that control the access. There are cases where it isn't
>> entirely the same thing as far as I can see. Consider within a container
>> for example.
>
> I agree that it's not the same thing, but I don't think it really ends
> up mattering.
>
> Either the container is properly separated and set up - in which case
> the uid mapping is what protects you - or it isn't - in which case the
> container could just mknod whatever hell node it wants anyway.
>
> So we do pretty much have the permission model.
This makes me nervous.
Suppose I unshare my user namespace, set up very restrictive mounts,
drop caps, seccomp the hell out of myself (but allow literally only
read, write, and ioctl and keep only a single fd to a file on an
ordinary filesystem, which should be safe), and run untrusted code.
Now that code can do this unsafe ioctl simply because its uid or gid
happens to have read access to a device node that isn't even present
in the sandbox. Ick.
What if we had an ioctl to do these data-leaking operations that took,
as an extra parameter, an fd to the block device node. They allow
access if the fd points to the right inode and has FMODE_READ (and LSM
checks say it's okay). Sure, it's awkward, but it's much safer.
--Andy
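A rough sketch of how Andy's fd-as-capability idea might look on the
kernel side (all names hypothetical; no such ioctl exists):

#include <linux/file.h>
#include <linux/fs.h>

/* Sketch: the unsafe ioctl takes an extra fd that must reference the
 * filesystem's own backing block device and be open for read. */
static int check_bdev_token(struct super_block *sb, int token_fd)
{
	struct fd f = fdget(token_fd);
	int ret = -EPERM;

	if (!f.file)
		return -EBADF;
	/* the fd must point at our backing device and carry FMODE_READ;
	 * LSM checks already ran when the caller opened it */
	if (file_inode(f.file) == sb->s_bdev->bd_inode &&
	    (f.file->f_mode & FMODE_READ))
		ret = 0;
	fdput(f);
	return ret;
}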
On Fri, Mar 11, 2016 at 9:30 AM, Andy Lutomirski <[email protected]> wrote:
>
> What if we had an ioctl to do these data-leaking operations that took,
> as an extra parameter, an fd to the block device node. They allow
> access if the fd points to the right inode and has FMODE_READ (and LSM
> checks say it's okay). Sure, it's awkward, but it's much safer.
That sounds absolutely horrible.
I'd *much* prefer the suggestion from Alan to simply have a mount-time
option to enable it. That way, you will never get any surprises, and
no "subtle new behavior for somebody who set their system up in a way
that doesn't allow for this".
So you'd have to explicitly say "my setup is ok with hole punching".
Linus
On Fri, Mar 11, 2016 at 10:25:30AM -0800, Linus Torvalds wrote:
> On Fri, Mar 11, 2016 at 9:30 AM, Andy Lutomirski <[email protected]> wrote:
> >
> > What if we had an ioctl to do these data-leaking operations that took,
> > as an extra parameter, an fd to the block device node. They allow
> > access if the fd points to the right inode and has FMODE_READ (and LSM
> > checks say it's okay). Sure, it's awkward, but it's much safer.
>
> That sounds absolutely horrible.
>
> I'd *much* prefer the suggestion from Alan to simply have a mount-time
> option to enable it. That way, you will never get any surprises, and
> no "subtle new behavior for somebody who set their system up in a way
> that doesn't allow for this".
>
> So you'd have to explicitly say "my setup is ok with hole punching".
Except it's not hole punching that is the problem. Hole punching
makes sure that the underlying blocks are removed and so whatever
data is in them cannot be accessed any more. The problem here is
preallocation of unwritten blocks that expose the stale data if the
filesystem skips marking those blocks as unwritten.
And, so, what happens when a file that is preallocated without the
unwritten bit, so it exposes stale data, then has its owner/group
changed? Or is copied by root or a user in the privileged group to a different
location that other users can access? Or any of the other vectors
that can result in the stale data being copied/made available to
unprivileged users?
It's all well and good to restrict access to the fallocate() call to
limit who can expose stale data, but it doesn't remove the fact it
is easy for stale data to unintentionally escape the privileged
group once it has been exposed because there is no record of the
fact the file contains uninitialised blocks....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Mar 11, 2016 at 2:30 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Mar 11, 2016 at 10:25:30AM -0800, Linus Torvalds wrote:
>>
>> So you'd have to explicitly say "my setup is ok with hole punching".
>
> Except it's not hole punching that is the problem. [..]
> The problem here is
> preallocation of unwritten blocks that expose the stale data if the
> filesystem skips marking those blocks as unwritten.
Right you are.
> It's all well and good to restrict access to the fallocate() call to
> limit who can expose stale data, but it doesn't remove the fact it
> is easy for stale data to unintentionally escape the privileged
> group once it has been exposed because there is no record of the
> fact the file contains uninitialised blocks....
Good point.
It's not just that the user in question *could* have exposed it
other ways by reading the raw device and then writing that info into a
file. You're right that the "don't initialize unwritten blocks" thing
has a more insidious problem of making it easy to unintentionally
expose such data just because you missed an error path (or the process
died without ever doing the proper over-write).
It would be much better if we could somehow mitigate just _how_ easy
it is to screw up.
One way to do that would be to not just require that the user that
discards the initializing writes have read access to the underlying
device, but perhaps also have some strict requirement that you can
discard only if the file you are working with is legible only to you?
That would limit the damage, and keep the stale data "private" to the
user who is already able to read the raw data off the device. Sure,
you can then mark the file read-by-world by others later, but at that
point you're kind of *consciously* exposing that stale data (and at
that point, you have hopefully cleaned it all up and replaced the
stale data with real data).
But would that perhaps not be reasonable for the kind of use cases
that google has now?
Linus
On Sat, Mar 12, 2016 at 09:30:47AM +1100, Dave Chinner wrote:
> It's all well and good to restrict access to the fallocate() call to
> limit who can expose stale data, but it doesn't remove the fact it
> is easy for stale data to unintentionally escape the privileged
> group once it has been exposed because there is no record of the
> fact the file contains uninitialised blocks....
Ultimately, it's up to the trusted process to make sure it never
reveals any stale data. For example, if you have the policy that all
data is encrypted at rest, and the trusted process is always going to
be decrypting any blocks it reads from disk before passing them on to its
client (for example), then any stale data is going to be obscured by
the decryption step before it gets passed on.
At the end of the day it's about whether you trust the userspace
program or not. I know there's a long and venerated tradition of
assuming that all application programmers are incompetent, but that
leads to file systems doing more work than what is strictly necessary,
and that has a performance tax.
And if the result is that the cluster file system authors decide to create
a user space file system, and bypass the kernel file system entirely,
then we have to trust them to do a competent job anyway. But if you
believe that there are still ways in which an in-kernel file system can
add value, then it's incumbent on us to be a bit more flexible and
not assume that all userspace programmers are blithering idiots.
Cheers,
- Ted
On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o <[email protected]> wrote:
>
> At the end of the day it's about whether you trust the userspace
> program or not.
There's a big difference between "give the user rope", and "tie the
rope in a noose and put a banana peel so that the user might stumble
into the rope and hang himself", though.
So I do think that Dave is right that we should also strive to make
sure that our interfaces are not just secure in theory, but that they
are also good interfaces to make mistakes less likely.
I think we _should_ give users rope, but maybe we should also make
sure that there isn't some hidden rapidly spinning saw-blade right
next to the rope that the user doesn't even think about.
Linus
On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
> On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o <[email protected]> wrote:
> >
> > At the end of the day it's about whether you trust the userspace
> > program or not.
>
> There's a big difference between "give the user rope", and "tie the
> rope in a noose and put a banana peel so that the user might stumble
> into the rope and hang himself", though.
So let's see. The user application has to explicitly request
NO_HIDE_STALE via an fallocate flag --- so it requires changing the
source code and recompiling the application. And then, the system
administrator has to pass in a mount option specifying a group that
the application has to run under. And then the application has to run
setgid with that group's privileges.
I hardly think that can be considered handing the user a pre-tied
noose.
Sure, the application can do something stupid --- but I'd argue
giving root to some junior sysadmin is far more likely to cause
problems.
Cheers,
- Ted
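From the application side, the pieces Ted lists reduce to a call like
the sketch below. The flag value is an assumption taken from the
out-of-tree patches; it was never reserved in the upstream uapi headers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

#ifndef FALLOC_FL_NO_HIDE_STALE
#define FALLOC_FL_NO_HIDE_STALE 0x04	/* assumed out-of-tree value */
#endif

/* Preallocate without marking the extents unwritten. Only works on a
 * patched kernel, on a filesystem mounted with the special gid option,
 * from a process running with that gid. */
static int prealloc_no_hide_stale(int fd, off_t len)
{
	if (fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, len) < 0) {
		perror("fallocate(NO_HIDE_STALE)");
		return -1;
	}
	return 0;
}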
On 03/12/2016 08:19 AM, Theodore Ts'o wrote:
> On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
>>
>> There's a big difference between "give the user rope", and "tie the
>> rope in a noose and put a banana peel so that the user might stumble
>> into the rope and hang himself", though.
> [...] And then the application has to run
> setgid with that group's privileges.
Your concept of hierarchically nesting containers via filesystem
instances looks nice to me.
A potential concern could be whether gids are the right implementation
for expressing hierarchically nested access permissions in a persistent way.
Your permissions attached to gids are nested (because inside of your
containers you may have another instance of a completely different gid
namespace), they are also persistent when your mount flags etc are
restored properly after a crash (by some scripts), but probably use of
gids for this might look like a kind of "misuse" of the original gid
concept from the 1970s.
Maybe you currently don't have a better /persistent/ concept for
expressing your needs, so maybe your solution could be just fine under
the currently given circumstances.
Introduction of a new concept for overcoming the current limitations
must be done very carefully.
The concerns about information leaks from bad discard semantics could be
/hypothetically/ solved at /concept level/ in the following way. Please
note that by "concept level" I don't want to imply any particular
implementation, this is just a mental experiment for discussion of the
problems, just a "model of thinking":
a) Use a hierarchical namespace for naming subjects, e.g.
hypervisorA.containerB.subcontainerC.user9 instead of gid=9
b) Attach actual permissions to each block of the underlying block
device (fine-grained object model).
c) Correctly maintain access rights at each hierarchical layer, and for
all operations (including discard with whatever semantics). In case some
inner instance is untrusted and may do evil things, this will be
intercepted / corrected at outer layers (which are more trusted). In
essence, the nesting hierarchy is also a hierarchy of trust.
Now information leaks by bad discard semantics etc should be solved at
any level, even regarding completely unrelated containers or users, as
long as no physical access to the disk is possible. In addition,
encryption may be used for even overcoming this.
Of course, a direct implementation of such extremely fine-grained access
permissions would carry way too much overhead. Both the number of
subjects as well as the number of objects must be reduced to some
reasonable order of magnitude, at least at outer levels.
Thus the question is: how can we achieve almost the same effect with
much less overhead?
Hmm, in my old Athomux research prototype, I proposed some solutions for
this, on an academic green meadow. But I am unsure what is transferable
to a standard POSIX semantics system, and what not. Rethinking these
concepts as well as checking them may take some time....
Here is a first alpha-stage attempt:
1) Give up the hierarchical subject namespace a), but maybe not fully.
Access checking will continue /locally/ at each layer, by treating each
subsystem as a (grey) blackbox. This is already the default
implementation strategy. The total system may be less secure than in an
idealized fine-grained system, because outer levels can no longer detect
bad guys inside of their subsystem instances. The question is: how to
get a "more secure" system than currently, with some reasonable effort.
2) Some /coarse/ access permission checks at the block layer b), but
finer than today. Currently there is almost no checking at all (except
when accessing a huge block device as a whole during open() => at 1&1 we
have very large ones, and they may continue running for years). I am
unsure how to achieve this in detail.
An idea for a long-term solution would be offloading of "allocation
groups" to the block layer (if their size is coarsely dynamic in
general, e.g. in steps of gigabytes), and to implement some coarse
permission checks there. These could then be related to "containers" or
"container groups". One of the problems is that some wide-spread network
protocols like iSCSI have no clue about this, so this can only be an
optional new feature.
Further ideas sought.
Cheers, Thomas
P.S. The concept of a "nest" in Athomux was already some kind of
"recursively nested block device".
On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
> On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o <[email protected]> wrote:
> >
> > At the end of the day it's about whether you trust the userspace
> > program or not.
>
> There's a big difference between "give the user rope", and "tie the
> rope in a noose and put a banana peel so that the user might stumble
> into the rope and hang himself", though.
>
> So I do think that Dave is right that we should also strive to make
> sure that our interfaces are not just secure in theory, but that they
> are also good interfaces to make mistakes less likely.
At which point I have to ask: how do we safely allow filesystems to
expose stale data in files? There's a big "we need to trust
userspace" component in ever proposal that has been made so far -
that's the part I have extreme trouble with.
For example, what happens when a backup process running as root backs up a
file that has exposed stale data? Yes, we could set the "NODUMP"
flag on the inode to tell backup programs to skip backing up such
files, but we're now trusting some random userspace application
(e.g. tar, rsync, etc) not to do something we don't want it to do
with the data in that file.
AFAICT, we can't stop root from copying files that have exposed
stale data or changing their ownership without some kind of special
handling of "contains stale data" files within the kernel. At this
point we are back to needing persistent tracking of the "exposed
stale data" state in the inode as the only safe way to allow us to
expose stale data. That's fairly ironic given that the stated
purpose of exposing stale data through fallocate is to avoid the
overhead of the existing mechanisms we use to track extents
containing stale data....
> I think we _should_ give users rope, but maybe we should also make
> sure that there isn't some hidden rapidly spinning saw-blade right
> next to the rope that the user doesn't even think about.
IMO we already have a good, safe interface that provides the rope
without the saw blades. I'm happy to be proven wrong, but IMO I
don't see that we can provide stale data exposure in a safe,
non-saw-bladey way without any kernel/filesystem side overhead.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 03/13/2016 07:30 PM, Dave Chinner wrote:
> On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
>> On Fri, Mar 11, 2016 at 4:35 PM, Theodore Ts'o <[email protected]> wrote:
>>> At the end of the day it's about whether you trust the userspace
>>> program or not.
>> There's a big difference between "give the user rope", and "tie the
>> rope in a noose and put a banana peel so that the user might stumble
>> into the rope and hang himself", though.
>>
>> So I do think that Dave is right that we should also strive to make
>> sure that our interfaces are not just secure in theory, but that they
>> are also good interfaces to make mistakes less likely.
> At which point I have to ask: how do we safely allow filesystems to
> expose stale data in files? There's a big "we need to trust
> userspace" component in ever proposal that has been made so far -
> that's the part I have extreme trouble with.
>
> For example, what happens when a backup process running as root backs up a
> file that has exposed stale data? Yes, we could set the "NODUMP"
> flag on the inode to tell backup programs to skip backing up such
> files, but we're now trusting some random userspace application
> (e.g. tar, rsync, etc) not to do something we don't want it to do
> with the data in that file.
>
> AFAICT, we can't stop root from copying files that have exposed
> stale data or changing their ownership without some kind of special
> handling of "contains stale data" files within the kernel. At this
> point we are back to needing persistent tracking of the "exposed
> stale data" state in the inode as the only safe way to allow us to
> expose stale data. That's fairly ironic given that the stated
> purpose of exposing stale data through fallocate is to avoid the
> overhead of the existing mechanisms we use to track extents
> containing stale data....
I think that once we enter this mode, the local file system has effectively
ceded its role to prevent stale data exposure to the upper layer. In effect,
this ceases to be a normal file system for any enabled process if we control
this through fallocate() or for all processes if we do the brute force mount
option that would be file system wide.
That means we would not need to track this. Extents would be marked as if they
always have had valid data (no more allocated but unwritten state).
In the end, that is the actual goal - move this enforcement up a layer for
overlay/user space file systems that are then responsible for policing this kind
of thing.
Regards,
Ric
>
>> I think we _should_ give users rope, but maybe we should also make
>> sure that there isn't some hidden rapidly spinning saw-blade right
>> next to the rope that the user doesn't even think about.
> IMO we already have a good, safe interface that provides the rope
> without the saw blades. I'm happy to be proven wrong, but IMO I
> don't see that we can provide stale data exposure in a safe,
> non-saw-bladey way without any kernel/filesystem side overhead.....
>
> Cheers,
>
> Dave.
On Mon, Mar 14, 2016 at 06:34:00AM -0400, Ric Wheeler wrote:
> I think that once we enter this mode, the local file system has effectively
> ceded its role to prevent stale data exposure to the upper layer. In effect,
> this ceases to be a normal file system for any enabled process if we
> control this through fallocate() or for all processes if we do the brute
> force mount option that would be file system wide.
Or we do this via group id, such that we are ceding responsibility for
preventing stale data exposure to the processes running under that
group id. That process has the responsibility for making sure that it
doesn't return any data from that file unless it has been written, and
also to make sure the permissions of that file are not readable by
processes that aren't in that group. (For example, owned by user
ceph, group ceph, with permissions 640).
> In the end, that is the actual goal - move this enforcement up a layer for
> overlay/user space file systems that are then responsible for policing this
> kind of thing.
Yes, exactly.
- Ted
On Mon, Mar 14, 2016 at 10:46:03AM -0400, Theodore Ts'o wrote:
> On Mon, Mar 14, 2016 at 06:34:00AM -0400, Ric Wheeler wrote:
> > I think that once we enter this mode, the local file system has effectively
> > ceded its role to prevent stale data exposure to the upper layer. In effect,
> > this ceases to be a normal file system for any enabled process if we
> > control this through fallocate() or for all processes if we do the brute
> > force mount option that would be file system wide.
>
> Or we do this via group id, such that we are ceding responsibility for
> preventing stale data exposure to the processes running under that
> group id. That process has the responsibility for making sure that it
> doesn't return any data from that file unless it has been written, and
> also to make sure the permissions of that file are not readable by
> processes that aren't in that group. (For example, owned by user
> ceph, group ceph, with permissions 640).
Root can still change the group id of a file that has exposed stale
data and hence make it visible outside of the group based
containment wall. i.e. external actors can still unintentionally
expose stale data, even though the application might be correctly
contained and safe.
What we are missing is actual numbers that show that exposing stale
data is a /significant/ win for these applications that are
demanding it. And then we need evidence proving that the problem is
actually systemic and not just a hack around a bad implementation of
a feature...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Mar 15, 2016 at 1:14 PM, Dave Chinner <[email protected]> wrote:
>
> Root can still change the group id of a file that has exposed stale
> data and hence make it visible outside of the group based
> containment wall.
Ok, Dave, now you're just being ridiculous.
The issue has never been - and *should* never be - that stale data
cannot get out.
The only issue is that we shouldn't make it ridiculously easy to make
silly mistakes.
There's no "group based containment wall" that is some kind of
absolute protection border.
Put another way: this is not about theoretical leaks - because those
are totally irrelevant (in theory, the original discard writer had
access to all that stale data anyway). This is about making it a
practical interface that doesn't have serious hidden gotchas.
So stop making silly theoretical arguments that make no sense.
We should make sure that we have _practical_ rules that are sensible,
but also not painful enough for the people who want to use this in
_practice_.
Reality trumps everything else.
If google is already using this kind of interface, then that is
_reality_. Take that into account.
Linus
On Tue, Mar 15, 2016 at 01:43:01PM -0700, Linus Torvalds wrote:
> Put another way: this is not about theoretical leaks - because those
> are totally irrelevant (in theory, the original discard writer had
> access to all that stale data anyway). This is about making it a
> practical interface that doesn't have serious hidden gotchas.
So there have been two interfaces proposed so far:
1) A mount option which specifies the group id under which a program
must be running in order to have the permission to use
FALLOC_FL_NO_HIDE_STALE flag in the fallocate system call.
2) The program must have read access to the underlying block device
inode under which the file system is mounted in order to have the
permission to use the FALLOC_FL_NO_HIDE_STALE flag in the fallocate
system call.
In both cases, the application has to make C source code changes to
use this feature, and the system administrator has to set up the
application so it has the privileges to use it. In the case of #1,
the sysadmin has to specify a mount option as well.
We're doing #1 in production in a very large number of mounted disks
today. Linus has suggested #2, although there was some concern that a
screw-up in the user namespaces configuration could accidentally result
in a security exposure. (To which my response is, as
opposed to the gazillions of other security nightmares which the user
namespace makes us vulnerable to?)
Still, my preference is for #1, since the mount option acts as an
additional control for those really paranoid types that seem convinced
that it can't be used safely, and it's what we're doing in production
already. I'm open to #2 if other people are OK with it, though.
Cheers,
- Ted
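A sketch of what the interface #1 check might look like inside a
filesystem (the struct and field names are hypothetical; in_group_p()
is the standard group-membership test):

#include <linux/cred.h>
#include <linux/uidgid.h>

struct my_sb_info {		/* stand-in for a real fs's sb info */
	kgid_t s_stale_gid;	/* from a hypothetical mount option */
};

/* Interface #1 sketch: FALLOC_FL_NO_HIDE_STALE is permitted only for
 * callers in the gid named at mount time; no option means never. */
static bool may_use_no_hide_stale(const struct my_sb_info *sbi)
{
	if (!gid_valid(sbi->s_stale_gid))	/* option not given */
		return false;
	return in_group_p(sbi->s_stale_gid);	/* caller in that group? */
}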
On Tue, Mar 15, 2016 at 01:43:01PM -0700, Linus Torvalds wrote:
> On Tue, Mar 15, 2016 at 1:14 PM, Dave Chinner <[email protected]> wrote:
> >
> > Root can still change the group id of a file that has exposed stale
> > data and hence make it visible outside of the group based
> > containment wall.
>
> Ok, Dave, now you're just being ridiculous.
>
> The issue has never been - and *should* never be - that stale data
> cannot get out.
Stale data escaping containment is a security issue. Enabling
generic kernel mechanisms to *enable containment escape* is
fundamentally wrong, and relying on userspace to Do The Right Thing
is even more of a gamble, IMO.
> The only issue is that we shouldn't make it ridiculously easy to make
> silly mistakes.
# sudo rsync -a ....
And now the stale data is on another machine without group-based
access containment.
> There's no "group based containment wall" that is some kind of
> absolute protection border.
Precisely my point - it's being pitched as a generic containment
mechanism, but it really isn't.
> Put another way: this is not about theoretical leaks - because those
> are totally irrelevant (in theory, the original discard writer had
> access to all that stale data anyway). This is about making it a
> practical interface that doesn't have serious hidden gotchas.
>
> So stop making silly theoretical arguments that make no sense.
It's a practical concern because if we enable this functionality in
fallocate, it will get used by more than just special storage
apps. i.e. this can't be waved away with "only properly managed
applications will use it" arguments.
> We should make sure that we have _practical_ rules that are sensible,
> but also not painful enough for the people who want to use this in
> _practice_.
Of course. The fact is that most of the people discussing this issue
have very little domain specific expertise.
In _practice_, XFS has *always* been able to turn off unwritten
extents and expose stale data. e.g. see this speed-racer blog from
2003 (first google hit on "xfs bonnie++ optimisation"):
http://everything2.com/title/Filesystem+performance+tweaking+with+XFS+on+Linux
" The first XFS tweak I'll try relates to XFS' practice of
adding a flag to all unwritten extents. This is a safety
feature, and it can be disabled with an option during
filesystem creation time (mkfs.xfs -d unwritten=0). [....]
A few improvements, a few setbacks, they're all at the level
of statistical noise. Disabling unwritten extent flagging
doesn't seem to be terribly useful here."
I also don't make a habit of publicising the fact that, since we
disabled the "-d unwritten=X" mkfs parameter (because of speed racer
blogs such as the above and configuration cargo-culting resulting in
unsuspecting users exposing stale data unintentionally), the
functionality still exists in the kernel code and it only takes
a single xfs_db command to turn off unwritten extents in XFS. i.e.
we can easily make fallocate on XFS expose stale data, filesystem
wide, without requiring mount options, kernel or application
modifications.
And, yes, I do know of proprietary and non-public storage
applications that have used this capability for years, even though
it is unsupported and performance benefits have only ever been
marginal.
> Reality trumps everything else.
Yes, it does. The reality is we've enabled people who know what they
are doing to expose stale data through preallocation interfaces on
XFS since 1998 and we haven't required kernel API hacks to do this.
> If google is already using this kind of interface, then that is
> _reality_. Take that into account.
Making Google's hack more widely available through the fallocate
API is entirely dependent on proving that:
a) the performance problem still exists;
b) the performance problem exists across multiple
filesystems and is not isolated to just ext4 or one specific
set of workloads;
c) the performance problem cannot be fixed;
d) ext4 can't implement a simple feature check to turn off
unwritten extents similar to XFS; and
e) if all else fails, that the API hack does not compromise the
security of general users unaware that applications might
be using this functionality.
a), b), c) and d) have not been demonstrated, discussed or iterated -
we've jumped straight to arguing about e). Before anything else,
we need to work through a)-d) because exposing stale data through a
general purpose API is a *last resort*.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 3/15/16 3:14 PM, Dave Chinner wrote:
> What we are missing is actual numbers that show that exposing stale
> data is a /significant/ win for these applications that are
> demanding it. And then we need evidence proving that the problem is
> actually systemic and not just a hack around a bad implementation of
> a feature...
Thanks Dave; I totally agree on this point. We've spent more than enough
time talking about how and if to implement stale data exposure, but nowhere
in this thread has there been any actual performance data indicating why
we should do it at all.
-Eric
On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
>
> Stale data escaping containment is a security issue. Enabling
> generic kernel mechanisms to *enable containment escape* is
> fundamentally wrong, and relying on userspace to Do The Right Thing
> is even more of a gamble, IMO.
We already have generic kernel mechanisms such as "the block device".
> It's a practical concern because if we enable this functionality in
> fallocate, it will get used by more than just special storage
> apps. i.e. this can't be waved away with "only properly managed
> applications will use it" arguments.
It requires a mount option. How is this going to allow random
applications to use this feature, again?
> I also don't make a habit of publicising the fact that, since we
> disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> blogs such as the above and configuration cargo-culting resulting in
> unsuspecting users exposing stale data unintentionally), the
> functionality still exists in the kernel code and it only takes
> a single xfs_db command to turn off unwritten extents in XFS. i.e.
> we can easily make fallocate on XFS expose stale data, filesystem
> wide, without requiring mount options, kernel or application
> modifications.
So you have something even more dangerous in XFS and it's in the
kernel tree? Has Red Hat threatened to have a distro-specific patch
to comment out this code to make sure irresponsible users can't use
it? What I've been suggesting has even more controls than what you
have. And I've been keeping it as an out-of-tree kernel patch mainly
because you've been arguing that it's such a horrible thing.
> Making Google's hack more widely available through the fallocate
> API is entirely dependent on proving that:
Ceph is about to completely bypass the file system because of your
intransigence, and reimplement a userspace file system. They seem to
believe it's necessary. I'll let them make the case, because they
seem to think it's necessary. And if not, if Linus sides with you,
and doesn't want to take the patch, I'll just keep it as a
Google-specific out-of-tree patch. I don't *need* to have this thing
upstream.
- Ted
On Tue, Mar 15, 2016 at 3:33 PM, Dave Chinner <[email protected]> wrote:
>
>> There's no "group based containment wall" that is some kind of
>> absolute protection border.
>
> Precisely my point - it's being pitched as a generic containment
> mechanism, but it really isn't.
No it hasn't.
It has been pitched as
"Companies are already doing this, and other groups are interested in it too".
What part of that is hard to understand?
At no point was it "generic containment" except in your mind.
Everybody else was very clear about the fact that stale data would be
visible. The only issue was to try to mitigate subtle mistakes.
And no, "Root reads a file that has possibly stale data in it" is not
a "subtle mistake". That's a rather obvious thing.
> Making Google's hack more widely available through the fallocate
> API is entirely dependent on proving that:
>
> a) the performance problem still exists;
> b) the performance problem exists across multiple
> filesystems and is not isolated to just ext4 or one specific
> set of workloads;
> c) the performance problem cannot be fixed;
> d) ext4 can't implement a simple feature check to turn off
> unwritten extents similar to XFS; and
> e) if all else fails, that the API hack does not compromise the
> security of general users unaware that applications might
> be using this functionality.
>
> a), b), c) and d) have not been demonstrated, discussed or iterated -
The thing is, the onus of those shouldn't be on the people who already
have a solution that they are happy with.
The onus of that work should be on the people who are arguing against
it, and have alternate models they want to push.
So you can't ask people who already solved their issue to then try to
argue against their solution.
I do agree that it would be damn useful to have
(a) numbers. No question about that.
(b) we should have a better idea of exactly what the user needs are.
Because if we do expose this kind of functionality, it should be as
limited as possible - but that "as possible" very much depends on what
the usage models are.
And yes, "keep the patch entirely inside google" is obviously one good
way to limit the interface. But if there are really other groups that
want to explore this, then that sounds like a pretty horrible model
too.
Linus
On Tue, Mar 15, 2016 at 4:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> And yes, "keep the patch entirely inside google" is obviously one good
> way to limit the interface. But if there are really other groups that
> want to explore this, then that sounds like a pretty horrible model
> too.
Side note: I really don't see how your argument of "XFS has been able
to do something like this for over a decade, using an even uglier
trick that is hidden and not documented" is at all an argument for
your position.
You're saying "nobody else should be doing what I've been doing for a
long time", and backing that argument up with "but I don't document
it, and it's completely different because it's done at mkfs/debugfs
time rather than mount-time".
But now that people are talking about a filesystem-independent way of
doing the same thing, now it's suddenly poisonous.
Dave, I call BS on your arguments. Or maybe I misunderstood it. But it
does smell very "do what I say, not what I do".
Linus
On Tue, Mar 15, 2016 at 04:06:10PM -0700, Linus Torvalds wrote:
> On Tue, Mar 15, 2016 at 3:33 PM, Dave Chinner <[email protected]> wrote:
> >
> >> There's no "group based containment wall" that is some kind of
> >> absolute protection border.
> >
> > Precisely my point - it's being pitched as a generic containment
> > mechanism, but it really isn't.
>
> No it hasn't.
>
> It has been pitched as
>
> "Companies are already doing this, and other groups are interested in it too".
>
> What part of that is hard to understand?
>
> At no point was it "generic containment" except in your mind.
> Everybody else was very clear about the fact that stale data would be
> visible. The only issue was to try to mitigate subtle mistakes.
>
> And no, "Root reads a file that has possibly stale data in it" is not
> a "subtle mistake". That's a rather obvious thing.
>
> > Making Google's hack more widely available through the fallocate
> > API is entirely dependent on proving that:
> >
> > a) the performance problem still exists;
> > b) the performance problem exists across multiple
> > filesystems and is not isolated to just ext4 or one specific
> > set of workloads;
> > c) the performance problem cannot be fixed;
> > d) ext4 can't implement a simple feature check to turn off
> > unwritten extents similar to XFS; and
> > e) if all else fails, that the API hack does not compromise the
> > security of general users unaware that applications might
> > be using this functionality.
> >
> > a), b), c) and d) have not been demonstrated, discussed or iterated -
>
> The thing is, the onus of those shouldn't be on the people who already
> have a solution that they are happy with.
>
> The onus of that work should be on the people who are arguing against
> it, and have alternate models they want to push.
I hate having to quote the guidelines we are supposed to work by,
but this seems pretty clear cut. Section 2 of
Documentation/SubmittingPatches:
"Describe your problem. Whether your patch is a one-line bug fix or
5000 lines of a new feature, there must be an underlying problem
that motivated you to do this work. Convince the reviewer that
there is a problem worth fixing and that it makes sense for them to
read past the first paragraph.
Describe user-visible impact. Straight up crashes and lockups are
pretty convincing, but not all bugs are that blatant. Even if the
problem was spotted during code review, describe the impact you think
it can have on users. Keep in mind that the majority of Linux
installations run kernels from secondary stable trees or
vendor/product-specific trees that cherry-pick only specific patches
from upstream, so include anything that could help route your change
downstream: provoking circumstances, excerpts from dmesg, crash
descriptions, performance regressions, latency spikes, lockups, etc.
Quantify optimizations and trade-offs. If you claim improvements in
performance, memory consumption, stack footprint, or binary size,
include numbers that back them up. But also describe non-obvious
costs. Optimizations usually aren't free but trade-offs between CPU,
memory, and readability; or, when it comes to heuristics, between
different workloads. Describe the expected downsides of your
optimization so that the reviewer can weigh costs against benefits.
Once the problem is established, describe what you are actually doing
about it in technical detail. It's important to describe the change
in plain English for the reviewer to verify that the code is behaving
as you intend it to."
It is pretty clear that the onus is on the patch submitter to
provide justification for inclusion, not for the reviewer/Maintainer
to have to prove that the solution is unworkable. "Google uses this"
is not sufficient justification.
> So you can't ask people who already solved their issue to then try to
> argue against their solution.
I'm not asking Ted to argue against his solution. I'm the person who
is trying to poke holes in the "solution" being presented...
> I do agree that it would be damn useful to have
>
> (a) numbers. No question about that.
>
> (b) we should have a better idea of exactly what the user needs are.
> Because if we do expose this kind of functionality, it should be as
> limited as possible - but that "as possible" very much depends on what
> the usage models are.
It's not "damn useful" - it's an absolute requirement that numbers
are provided to back up the assertion that there is no other way to
solve this problem. Once it is established that there is no other
way to solve the performance problem, then we can talk whether we
want to trade off data safety and complexity for performance....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <[email protected]> wrote:
>
> It is pretty clear that the onus is on the patch submitter to
> provide justification for inclusion, not for the reviewer/Maintainer
> to have to prove that the solution is unworkable.
I agree, but quite frankly, performance is a good justification.
So if Ted can give performance numbers, that's justification enough.
We've certainly taken changes with less.
And with your "we should _not_ do this" argument, the onus is clearly on you.
> "Google uses this" is not sufficient justification.
Not per se, no, but it's a very traditional and time-honored model for
"should we merge this".
It's traditionally been things like "Redhat merged it in their distro
kernel, because they have customers that want/need it". That is *the*
reason many big projects got merged, ranging from filesystems to
drivers etc. Take reiserfs, for example: it got merged because SuSE
was actively using it.
So "this feature is being used in real life" is a big hint that the
standard upstream kernel may be missing something important. People
arguing against things like that have been a big problem in the past.
It took people _years_ to get over the whole Android thing. We need to
merge stuff that people are using and depend on, because _not_ merging
them just causes more and more distance between people's kernels, and
makes it even harder to merge in the future.
I do agree that we want to have hard numbers.
And I do think that we should strive for the whole "we want to merge"
phase to be a time when we also look at "can we improve the
interfaces". But that can - and often does - go too far. Again, we had
_years_ of pointless masturbation over the whole Android thing, just
because people were making up new interfaces.
Linus
On Tue, Mar 15, 2016 at 04:14:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 15, 2016 at 4:06 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > And yes, "keep the patch entirely inside google" is obviously one good
> > way to limit the interface. But if there are really other groups that
> > want to explore this, then that sounds like a pretty horrible model
> > too.
>
> Side note: I really don't see how your argument of "XFS has been able
> to do something like this for over a decade, using an even uglier
> trick that is hidden and not documented" is at all an argument for
> your position.
>
> You're saying "nobody else should be doing what I've been doing for a
> long time", and backing that argument up with "but I don't document
> it, and it's completely different because it's done at mkfs/debugfs
> time rather than mount-time".
You can read it that way, but that is not the message I was trying
to get across.
The message I was trying to get across is that there are people out
there that have been using hacks like what google uses for as long
as there have been filesystems around that support unwritten
extents. And that they do this even though the benefits of such
hacks are marginal and frequently can't be backed up with numbers.
We don't support filesystems with unwritten extents disabled. I
don't like the fact that there are people who turn them off on XFS
filesystems, but we are stuck with it as it is part of the supported
on-disk format. The best we can do is to try to prevent
unsuspecting users from shooting themselves in the foot with such
features, which is what we did by removing the mkfs option back in
2007.
Like I said, most people don't know or understand the history....
> But now that people are talking about a filesystem-independent way of
> doing the same thing, now it's suddenly poisonous.
It's always been poisonous.
> Dave, I call BS on your arguments. Or maybe I misunderstood it. But it
> does smell very "do what I say, not what I do".
We're stuck with a lot of historical functionality in XFS that we
can't easily remove or disable. Learn from history, don't repeat it.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 3/15/16 7:06 PM, Linus Torvalds wrote:
> On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <[email protected]> wrote:
>> >
>> > It is pretty clear that the onus is on the patch submitter to
>> > provide justification for inclusion, not for the reviewer/Maintainer
>> > to have to prove that the solution is unworkable.
> I agree, but quite frankly, performance is a good justification.
>
> So if Ted can give performance numbers, that's justification enough.
> We've certainly taken changes with less.
I've been away from ext4 for a while, so I'm really not on top of the
mechanics of the underlying problem at the moment.
But I would say that in addition to numbers showing that ext4 has trouble
with unwritten extent conversion, we should have an explanation of
why it can't be solved in a way that doesn't open up these concerns.
XFS certainly has different mechanisms, but is the demonstrated workload
problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the
solutions that make the workload perform better on other filesystems?
Adding the risk and complexity of the stale data exposure should be
considered as a last resort, after the above questions have been answered.
-Eric
On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:
> On 3/15/16 7:06 PM, Linus Torvalds wrote:
> > On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <[email protected]> wrote:
> >> >
> >> > It is pretty clear that the onus is on the patch submitter to
> >> > provide justification for inclusion, not for the reviewer/Maintainer
> >> > to have to prove that the solution is unworkable.
> > I agree, but quite frankly, performance is a good justification.
> >
> > So if Ted can give performance numbers, that's justification enough.
> > We've certainly taken changes with less.
>
> I've been away from ext4 for a while, so I'm really not on top of the
> mechanics of the underlying problem at the moment.
>
> But I would say that in addition to numbers showing that ext4 has trouble
> with unwritten extent conversion, we should have an explanation of
> why it can't be solved in a way that doesn't open up these concerns.
>
> XFS certainly has different mechanisms, but is the demonstrated workload
> problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the
> solutions that make the workload perform better on other filesystems?
When I've benchmarked this in the past, doing small random buffered writes
into a preallocated extent was dramatically (3x or more) slower on xfs
than doing them into a fully written extent. That was two years ago,
but I can redo it.
On a fio card, this gets 16,000 iops on a preallocated extent and 40,000
iops if you run it a second time. It's not random writes, but the fsync
probably means the preallocated conversion is more expensive. That's
on a 4.0 kernel, but I'll rerun it on nvme on newer kernels.
fio --name=fsync --rw=write --fsync=1 --bs=4k --filename=/xfs/fio_4096 --size=4g --overwrite=0
I'm happy to run variations on things, just let me know what workloads
are interesting.
-chris
On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> >
> > Stale data escaping containment is a security issue. Enabling
> > generic kernel mechanisms to *enable containment escape* is
> > fundamentally wrong, and relying on userspace to Do The Right Thing
> > is even more of a gamble, IMO.
>
> > We already have generic kernel mechanisms such as "the block device".
>
> > It's a practical concern because if we enable this functionality in
> > fallocate, it will get used by more than just special storage
> > apps. i.e. this can't be waved away with "only properly managed
> > applications will use it" arguments.
>
> It requires a mount option. How is this going to allow random
> applications to use this feature, again?
>
> > I also don't make a habit of publicising the fact that, since we
> > disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> > blogs such as the above and configuration cargo-culting resulting in
> > unsuspecting users exposing stale data unintentionally), the
> > functionality still exists in the kernel code and it only takes
> > a single xfs_db command to turn off unwritten extents in XFS. i.e.
> > we can easily make fallocate on XFS expose stale data, filesystem
> > wide, without requiring mount options, kernel or application
> > modifications.
>
> So you have something even more dangerous in XFS and it's in the
> kernel tree? Has Red Hat threatened to have a distro-specific patch
xfs_db is the XFS debugger, so you can only enable that bit of functionality
with magical commands, which IMHO isn't much different than people messing
with their ext4 filesystems with debugfs. You had better know what you're
doing and if you break the filesystem you can eat both pieces. :P
> to comment out this code to make sure irresponsible users can't use
> it? What I've been suggesting has even more controls than what you
> have. And I've been keeping it as an out-of-tree kernel patch mainly
> because you've been arguing that it's such a horrible thing.
One could lock it down even more -- hide it behind a Kconfig option that
depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
only allow the file owner to call no-hide-stale and only if the file is 0600
(or the appropriate group equivalents like Ted's existing patch), and upon
adding stale extents, set an inode flag that locks uid/gid/mode/flags.
Obviously root can still get to the file, but at least there's hard evidence
that one is entering the twilight zone.
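Put together, that lockdown might look something like the sketch below
(every identifier is hypothetical, including the Kconfig symbol and the
mount-option test):

#include <linux/cred.h>
#include <linux/fs.h>

static bool my_sb_has_stale_opt(struct super_block *sb); /* hypothetical */

#ifdef CONFIG_FS_NO_HIDE_STALE	/* assumed gate: EXPERT-only, default n */
static int may_add_stale_extents(struct inode *inode)
{
	if (!my_sb_has_stale_opt(inode->i_sb))		/* mount option */
		return -EPERM;
	if (!uid_eq(current_fsuid(), inode->i_uid))	/* owner only */
		return -EPERM;
	if (inode->i_mode & (S_IRWXG | S_IRWXO))	/* must be 0600 */
		return -EPERM;
	/* caller would then set an inode flag pinning uid/gid/mode/flags */
	return 0;
}
#endif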
> > Making Google's hack more widely available through the fallocate
> > API is entirely dependent on proving that:
>
> Ceph is about to completely bypass the file system because of your
> intransigence, and reimplement a userspace file system. They seem to
> believe it's necessary. I'll let them make the case, because they
> seem to think it's necessary. And if not, if Linus sides with you,
> and doesn't want to take the patch, I'll just keep it as a
> Google-specific out-of-tree patch. I don't *need* to have this thing
> upstream.
Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
unwritten extent conversion for these folks?
I ran this crappy looping microbenchmark against a ramdisk:
fallocate 400M
write 400M
fsync
rewrite the 400M
fsync
on kernel 4.5.
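In C, each pass of that loop is essentially the following (a
reconstruction from the description, not the actual benchmark source;
400M total, with the chunk size varied per test):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* One write pass over the file, then flush. */
static void write_pass(int fd, const char *buf, size_t sz, size_t chunk)
{
	off_t off;

	for (off = 0; off < (off_t)sz; off += chunk)
		pwrite(fd, buf, chunk, off);
	fsync(fd);
}

static void one_iteration(int fd, const char *buf, size_t sz, size_t chunk)
{
	fallocate(fd, 0, 0, sz);	/* preallocate: extents unwritten */
	write_pass(fd, buf, sz, chunk);	/* write: converts unwritten extents */
	write_pass(fd, buf, sz, chunk);	/* rewrite: no conversion needed */
}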
For writing 400M through the page cache in 4k chunks,
ext4: ~460MB/s -> ~580MB/s (~20%)
XFS: ~660 -> ~870 (~25%)
btrfs: ~130 -> ~200 (~35%)
For writing 400M in 80M chunks,
ext4: ~480MB/s -> ~720MB/s (~30%)
XFS: ~1GB/s -> ~1.5GB/s (~35%)
btrfs: ~590MB/s -> ~590MB/s (no change)
For directio writing 400MB in 4k chunks,
ext4: 25MB/s -> 26MB/s (~5%)
XFS: 25 -> 27 (~8%)
btrfs: 22 -> 18 (...)
For directio writing 1200MB in 80M chunks,
ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
XFS: 3.2 -> 3.5 (~9%)
btrfs: 2.3 -> 2.2 (...)
Clearly, the performance hit of unwritten extent conversion is large
enough to tempt people to ask for no-hide-stale. But I'd rather hear
that directly from a developer, Ceph or otherwise.
In the meantime, please have a look at the v7 blockdev fallocate patches,
which implement only the "you can read zeroes afterwards" commands.
--D
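For reference, the usage shape of those blockdev fallocate patches is an
ordinary fallocate(2) against a device fd, roughly as below (the device
path and the exact mode flags accepted are assumptions; the series
defines which zeroing modes are supported):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/falloc.h>

/* Sketch: zero a byte range of a block device via fallocate(2). A
 * device's size cannot change, so KEEP_SIZE is the natural mode here. */
static int zero_bdev_range(const char *dev, off_t off, off_t len)
{
	int fd = open(dev, O_WRONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
			off, len);
	if (ret < 0)
		perror("fallocate");
	close(fd);
	return ret;
}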
On Mar 15, 2016, at 7:51 PM, Darrick J. Wong <[email protected]> wrote:
>
> On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
>> On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
>>>
>>> Stale data escaping containment is a security issue. Enabling
>>> generic kernel mechanisms to *enable containment escape* is
>>> fundamentally wrong, and relying on userspace to Do The Right Thing
>>> is even more of a gamble, IMO.
>>
>> We already have generic kernel mechanisms such as "the block device".
>>
>>> It's a practical concern because if we enable this functionality in
>>> fallocate, it will get used by more than just special storage
>>> apps. i.e. this can't be waved away with "only properly managed
>>> applications will use it" arguments.
>>
>> It requires a mount option. How is this going to allow random
>> applications to use this feature, again?
>>
>>> I also don't make a habit of publicising the fact that, since we
>>> disabled the "-d unwritten=X" mkfs parameter (because of speed racer
>>> blogs such as the above and configuration cargo-culting resulting in
>>> unsuspecting users exposing stale data unintentionally), the
>>> functionality still exists in the kernel code and it only takes
>>> a single xfs_db command to turn off unwritten extents in XFS. i.e.
>>> we can easily make fallocate on XFS expose stale data, filesystem
>>> wide, without requiring mount options, kernel or application
>>> modifications.
>>
>> So you have something even more dangerous in XFS and it's in the
>> kernel tree? Has Red Hat threatened to have a distro-specific patch
>
> xfs_db is the XFS debugger, so you can only enable that bit of functionality
> with magical commands, which IMHO isn't much different than people messing
> with their ext4 filesystems with debugfs. You had better know what you're
> doing and if you break the filesystem you can eat both pieces. :P
>
>> to comment out this code to make sure irresponsible users can't use
>> it? What I've been suggesting has even more controls than what you
>> have. And I've been keeping it as an out-of-tree kernel patch mainly
>> because you've been arguing that it's such a horrible thing.
>
> One could lock it down even more -- hide it behind a Kconfig option that
> depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
> only allow the file owner to call no-hide-stale and only if the file is 0600
> (or the appropriate group equivalents like Ted's existing patch), and upon
> adding stale extents, set an inode flag that locks uid/gid/mode/flags.
> Obviously root can still get to the file, but at least there's hard evidence
> that one is entering the twilight zone.
>
>>> Making Google's hack more widely available through the fallocate
>>> API is entirely dependent on proving that:
>>
>> Ceph is about to completely bypass the file system because of your
>> intransigence, and reimplement a userspace file system. They seem to
>> believe it's necessary. I'll let them make the case, because they
>> seem to think it's necessary. And if not, if Linus sides with you,
>> and doesn't want to take the patch, I'll just keep it as a
>> Google-specific out-of-tree patch. I don't *need* to have this thing
>> upstream.
>
> Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> unwritten extent conversion for these folks?
>
> I ran this crappy looping microbenchmark against a ramdisk:
> fallocate 400M
> write 400M
> fsync
> rewrite the 400M
> fsync
> on kernel 4.5.
>
> For writing 400M through the page cache in 4k chunks,
> ext4: ~460MB/s -> ~580MB/s (~20%)
> XFS: ~660 -> ~870 (~25%)
> btrfs: ~130 -> ~200 (~35%)
>
> For writing 400M in 80M chunks,
> ext4: ~480MB/s -> ~720MB/s (~30%)
> XFS: ~1GB/s -> ~1.5GB/s (~35%)
> btrfs: ~590MB/s -> ~590MB/s (no change)
>
> For directio writing 400MB in 4k chunks,
> ext4: 25MB/s -> 26MB/s (~5%)
> XFS: 25 -> 27 (~8%)
> btrfs: 22 -> 18 (...)
>
> For directio writing 1200MB in 80M chunks,
> ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
> XFS: 3.2 -> 3.5 (~9%)
> btrfs: 2.3 -> 2.2 (...)
>
> Clearly, the performance hit of unwritten extent conversion is large
> enough to tempt people to ask for no-hide-stale. But I'd rather hear
> that directly from a developer, Ceph or otherwise.
I suspect that this gets significantly worse if you are running with
random writes instead of sequential overwrites. With sequential overwrites
there is only a single boundary between init and uninit extents, so at
most one extra extent in the tree. The above performance deltas will also
be much larger when real disks are involved and seek latency is a factor.
If you are doing random writes in a large file, then there may be many
thousands or millions of new extents and tree splits because the large
uninit extents from fallocate() need to be converted to init extents in
a piecemeal fashion.
We might consider tricks to optimize this somewhat for ext4, such as
limiting the uninit extent size below the init extent size (e.g. 1/2)
so that if such extent splits happen we can get a bunch of extent splits
in one leaf without having to also reshuffle the extent tree. We already
ensure that there is a minimum uninit->init extent conversion size
(64KB IIRC?) so that we don't get pathological 4KB init/uninit extent
interleaving. As all of the extents become initialized and merge we
can reduce the tree depth, but at least could avoid the worst case
scenarios. However, I still suspect there would be a big overhead
from doing conversion under such a workload.
Cheers, Andreas
> In the meantime, please have a look at the v7 blockdev fallocate patches,
> which implement only the "you can read zeroes afterwards" commands.
>
> --D
>
>>
>> - Ted
On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:
> On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:
> > On 3/15/16 7:06 PM, Linus Torvalds wrote:
> > > On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <[email protected]> wrote:
> > >> >
> > >> > It is pretty clear that the onus is on the patch submitter to
> > >> > provide justification for inclusion, not for the reviewer/Maintainer
> > >> > to have to prove that the solution is unworkable.
> > > I agree, but quite frankly, performance is a good justification.
> > >
> > > So if Ted can give performance numbers, that's justification enough.
> > > We've certainly taken changes with less.
> >
> > I've been away from ext4 for a while, so I'm really not on top of the
> > mechanics of the underlying problem at the moment.
> >
> > But I would say that in addition to numbers showing that ext4 has trouble
> > with unwritten extent conversion, we should have an explanation of
> > why it can't be solved in a way that doesn't open up these concerns.
> >
> > XFS certainly has different mechanisms, but is the demonstrated workload
> > problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the
> > solutions that make the workload perform better on other filesystems?
>
> When I've benchmarked this in the past, doing small random buffered writes
> into an preallocated extent was dramatically (3x or more) slower on xfs
> than doing them into a fully written extent. That was two years ago,
> but I can redo it.
So I re-ran some benchmarks, with 4K O_DIRECT random ios on nvme (4.5
kernel). This is O_DIRECT without O_SYNC. I don't think xfs will do
commits for each IO into the prealloc file? O_SYNC makes it much
slower, so hopefully I've got this right.
The test runs for 60 seconds, and I used an iodepth of 4:
prealloc file: 32,000 iops
overwrite: 121,000 iops
If I bump the iodepth up to 512:
prealloc file: 33,000 iops
overwrite: 279,000 iops
For streaming writes, XFS converts prealloc to written much better when
the IO isn't random. You can start seeing the difference at 16K
sequential O_DIRECT writes, but really it's not a huge impact. The worst
case is 4K:
prealloc file: 227MB/s
overwrite: 340MB/s
I can't think of sequential workloads where this will matter, since they
will either end up with bigger IO or the performance impact won't get
noticed.
-chris
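A job file along these lines should reproduce the random-write half of
that test (a sketch only -- Chris didn't post his exact fio invocation,
and the filename is a placeholder; preallocate the file with fallocate(1)
for the prealloc case, or fully write it first for the overwrite case):

	[randwrite]
	filename=/mnt/test/testfile
	rw=randwrite
	bs=4k
	direct=1
	ioengine=libaio
	iodepth=4
	runtime=60
	time_based=1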
On Wed, Mar 16, 2016 at 03:45:49PM -0600, Andreas Dilger wrote:
> > Clearly, the performance hit of unwritten extent conversion is large
> > enough to tempt people to ask for no-hide-stale. But I'd rather hear
> > that directly from a developer, Ceph or otherwise.
>
> I suspect that this gets significantly worse if you are running with
> random writes instead of sequential overwrites. With sequential overwrites
> there is only a single boundary between init and uninit extents, so at
> most one extra extent in the tree. The above performance deltas will also
> be much larger when real disks are involved and seek latency is a factor.
It will vary a lot depending on your use case. If you are running
with data=ordered and with journalling enabled, then even if it is a
single extent that is modified, the fact that a journal transaction
is involved, with a forced data block flush to avoid revealing stale
data, is certainly going to be measurable.
The other thing is if you are worried about tail latency, which is a
major concern at Google[1], and you are running your disks close to
flat out, the fact that you have to do an extra seek to update the
extent tree is a seek that you can't be using for useful work --- and
worse, could delay a low-latency read from completing within your SLO.
[1] https://research.google.com/pubs/pub44830.html
Part of what's challenging with giving numbers is that it's trivially
easy to give some worst case scenario where the numbers are really
terrible. A random 4K write benchmark into a fallocated file,
even with XFS, would have pretty bad numbers, but of course people
wouldn't say that it's very realistic. But those are the easiest to
get.
The most realistic numbers are going to be a lot harder to get, and
wouldn't necessarily make a lot of sense without revealing a lot
proprietary information. I will say that Google does have a fairly
large number of disks[2] and so even a small fractional percentage
gain multiplied by gazillions of disks starts turning into a dollar
number with enough zeros that people really sit up and take notice.
I'll also note that map reduce can be quite nasty as far as random I/O
is concerned[3], and while map reduce jobs are often not high priority
jobs, they can interfere with low-latency reads from important
applications (e.g., web search, user-visible gmail operations, etc.)
[2] https://what-if.xkcd.com/63/
[3] https://pdfs.semanticscholar.org/6238/e5f0fd807f634f5999701c7aa6a09d88dfc8.pdf
So I'm not sure what numbers I can really give that would satisfy
people. Doing a random write fio job is not hard, and will result in
fairly impressive numbers. If that's enough, then either I can do
this, or Chris Mason can reproduce his experiment using XFS (which
would presumably eliminate the excuse that it's because ext4 sucks at
extent operations). But if that's not going to convince people, then
I'd much rather not waste my time.
Besides, at Google it's easy enough for me to maintain the patch
out-of-tree. It's the Ceph folks who would need to, at the very least,
have such a patch ship in Red Hat Enterprise Linux. So it's probably
better for them to justify it, if numbers are really necessary.
- Ted
On 3/16/16 7:15 PM, Theodore Ts'o wrote:
> On Wed, Mar 16, 2016 at 03:45:49PM -0600, Andreas Dilger wrote:
>>> Clearly, the performance hit of unwritten extent conversion is large
>>> enough to tempt people to ask for no-hide-stale. But I'd rather hear
>>> that directly from a developer, Ceph or otherwise.
>>
>> I suspect that this gets significantly worse if you are running with
>> random writes instead of sequential overwrites. With sequential overwrites
>> there is only a single boundary between init and uninit extents, so at
>> most one extra extent in the tree. The above performance deltas will also
>> be much larger when real disks are involved and seek latency is a factor.
>
> It will vary a lot depending on your use case. If you are running
> with data=ordered and with journalling enabled, then even if it is a
> single extent that is modified, the fact that a journal transaction
> is involved, with a forced data block flush to avoid revealing stale
> data, is certainly going to be measurable.
>
> The other thing is if you are worried about tail latency, which is a
> major concern at Google[1], and you are running your disks close to
> flat out, the fact that you have to do an extra seek to update the
> extent tree is a seek that you can't be using for useful work --- and
> worse, could delay a low-latency read from completing within your SLO.
>
> [1] https://research.google.com/pubs/pub44830.html
>
> Part of what's challenging with giving numbers is that it's trivially
> easy to give some worst case scenario where the numbers are really
> terrible. A random 4K write benchmark into a fallocated file,
> even with XFS, would have pretty bad numbers, but of course people
> wouldn't say that it's very realistic. But those are the easiest to
> get.
>
> The most realistic numbers are going to be a lot harder to get, and
> wouldn't necessarily make a lot of sense without revealing a lot
> proprietary information. I will say that Google does have a fairly
> large number of disks[2] and so even a small fractional percentage
> gain multiplied by gazillions of disks starts turning into a dollar
> number with enough zeros that people really sit up and take notice.
> I'll also note that map reduce can be quite nasty as far as random I/O
> is concerned[3], and while map reduce jobs are often not high priority
> jobs, they can interfere with low-latency reads from important
> applications (e.g., web search, user-visible gmail operations, etc.)
>
> [2] https://what-if.xkcd.com/63/
> [3] https://pdfs.semanticscholar.org/6238/e5f0fd807f634f5999701c7aa6a09d88dfc8.pdf
>
> So I'm not sure what numbers I can really give that would satisfy
> people. Doing a random write fio job is not hard, and will result in
> fairly impressive numbers. If that's enough, then either I can do
> this, or Chris Mason can reproduce his experiment using XFS (which
> would presumably eliminate the excuse that it's because ext4 sucks at
> extent operations). But if that's not going to convince people, then
> I'd much rather not waste my time.
>
> Besides, at Google it's easy enough for me to maintain the patch
> out-of-tree. It's the Ceph folks who would need to, at the very least,
> have such a patch ship in Red Hat Enterprise Linux. So it's probably
> better for them to justify it, if numbers are really necessary.
I may have lost the thread at this point, with poor Darrick's original
patch submission devolving into a long thread about a NO_HIDE_STALE patch
used at Google, but I don't *think* Ceph ever asked for NO_HIDE_STALE.
At least I can't find any indication of that.
Am I missing something? cc'ing Greg on this one in case I am.
-Eric
On Wed, Mar 16, 2016 at 07:33:55PM -0500, Eric Sandeen wrote:
> I may have lost the thread at this point, with poor Darrick's original
> patch submission devolving into a long thread about a NO_HIDE_STALE patch
> used at Google, but I don't *think* Ceph ever asked for NO_HIDE_STALE.
>
> At least I can't find any indication of that.
>
> Am I missing something? cc'ing Greg on this one in case I am.
This came up at dinner after the Linux FAST conference, where Ric
Wheeler, some Ceph folks, and I were discussing their plan to create
a userspace file system because they couldn't get kernel file system
developers to be responsive to their needs. I pointed out what I had
implemented at Google for our cluster file system, which includes
NO_HIDE_STALE_FL and the BLKDISCARD ioctl working on ext4 files, both
of which I had kept out of tree because they were rejected last time
and Ric Wheeler had threatened to patch them out of the RHEL kernel
if they went in upstream.
When Ric indicated that he was now a bit more accepting of these
patches now that he was responsible for managing people who needed the
functionality, I thought it was time to revisit the topic ---
especially since Taobao has been using the same patches for their
hadoopfs servers, and at the very least I should offer the patches to
the Ceph folks to see if they could get them into RHEL, and perhaps
take another run at getting the patches upstream.
- Ted
On Tue, Mar 15, 2016 at 06:51:39PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> > >
> > > Stale data escaping containment is a security issue. Enabling
> > > generic kernel mechanisms to *enable containment escape* is
> > > fundamentally wrong, and relying on userspace to Do The Right Thing
> > > is even more of a gamble, IMO.
> >
> > We already have generic kernel mechanisms such as "the block device".
> >
> > > It's a practical concern because if we enable this functionality in
> > > fallocate it will get used by more than just special storage
> > > apps. i.e. this can't be waved away with "only properly managed
> > > applications will use it" arguments.
> >
> > It requires a mount option. How is this going to allow random
> > applications to use this feature, again?
> >
> > > I also don't make a habit of publicising the fact that since we
> > > disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> > > blogs such as the above and configuration cargo-culting resulting in
> > > unsuspecting users exposing stale data unintentionally) that the
> > > functionality still exists in the kernel code and that it only takes
> > > a single xfs_db command to turn off unwritten extents in XFS. i.e.
> > > we can easily make fallocate on XFS expose stale data, filesystem
> > > wide, without requiring mount options, kernel or application
> > > modifications.
> >
> > So you have something even more dangerous in XFS and it's in the
> > kernel tree? Has Red Hat threatened to have a distro-specific patch
>
> xfs_db is the XFS debugger, so you can only enable that bit of functionality
> with magical commands, which IMHO isn't much different than people messing
> with their ext4 filesystems with debugfs. You had better know what you're
> doing and if you break the filesystem you can eat both pieces. :P
>
> > to comment out this code to make sure irresponsible users can't use
> > it? What I've been suggesting has even more controls than what you
> > have. And I've been keeping it as an out-of-tree kernel patch mainly
> > because you've been arguing that it's such a horrible thing.
>
> One could lock it down even more -- hide it behind a Kconfig option that
> depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
> only allow the file owner to call no-hide-stale and only if the file is 0600
> (or the appropriate group equivalents like Ted's existing patch), and upon
> adding stale extents, set an inode flag that locks uid/gid/mode/flags.
> Obviously root can still get to the file, but at least there's hard evidence
> that one is entering the twilight zone.
>
> > > Making Google's hack more widely available through the fallocate
> > > API is entirely dependent on proving that:
> >
> > Ceph is about to completely bypass the file system because of your
> > intransigence, and reimplement a userspace file system. They seem to
> > believe it's necessary. I'll let them make the case, because they
> > seem to think it's necessary. And if not, if Linus sides with you,
> > and doesn't want to take the patch, I'll just keep it as a
> > Google-specific out-of-tree patch. I don't *need* to have this thing
> > upstream.
>
> Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> unwritten extent conversion for these folks?
>
> I ran this crappy looping microbenchmark against a ramdisk:
> fallocate 400M
> write 400M
> fsync
> rewrite the 400M
> fsync
> on kernel 4.5.
>
> For writing 400M through the page cache in 4k chunks,
> ext4: ~460MB/s -> ~580MB/s (~20%)
> XFS: ~660 -> ~870 (~25%)
> btrfs: ~130 -> ~200 (~35%)
>
> For writing 400M in 80M chunks,
> ext4: ~480MB/s -> ~720MB/s (~30%)
> XFS: ~1GB/s -> ~1.5GB/s (~35%)
> btrfs: ~590MB/s -> ~590MB/s (no change)
>
> For directio writing 400MB in 4k chunks,
> ext4: 25MB/s -> 26MB/s (~5%)
> XFS: 25 -> 27 (~8%)
> btrfs: 22 -> 18 (...)
>
> For directio writing 1200MB in 80M chunks,
> ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
> XFS: 3.2 -> 3.5 (~9%)
> btrfs: 2.3 -> 2.2 (...)
That's not comparing apples to apples. Overwrite does not require
allocation/extent manipulation at all, so it is comparing performance
of completely different file extent operations. The operations we
should be comparing are "first write" operations, namely "write over
hole with allocation" vs "write over preallocation". Overwrite
performance should be the same regardless of the method used for the
initial write/allocation.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Mar 17, 2016 at 12:01:16PM +1100, Dave Chinner wrote:
> On Tue, Mar 15, 2016 at 06:51:39PM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 15, 2016 at 06:52:24PM -0400, Theodore Ts'o wrote:
> > > On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
> > > >
> > > > Stale data escaping containment is a security issue. Enabling
> > > > generic kernel mechanisms to *enable containment escape* is
> > > > fundamentally wrong, and relying on userspace to Do The Right Thing
> > > > is even more of a gamble, IMO.
> > >
> > > We already have generic kernel mechanisms such as "the block device".
> > >
> > > > It's a practical concern because if we enable this functionality in
> > > > fallocate it will get used by more than just special storage
> > > > apps. i.e. this can't be waved away with "only properly managed
> > > > applications will use it" arguments.
> > >
> > > It requires a mount option. How is this going to allow random
> > > applications to use this feature, again?
> > >
> > > > I also don't make a habit of publicising the fact that since we
> > > > disabled the "-d unwritten=X" mkfs parameter (because of speed racer
> > > > blogs such as the above and configuration cargo-culting resulting in
> > > > unsuspecting users exposing stale data unintentionally) that the
> > > > functionality still exists in the kernel code and that it only takes
> > > > a single xfs_db command to turn off unwritten extents in XFS. i.e.
> > > > we can easily make fallocate on XFS expose stale data, filesystem
> > > > wide, without requiring mount options, kernel or application
> > > > modifications.
> > >
> > > So you have something even more dangerous in XFS and it's in the
> > > kernel tree? Has Red Hat threatened to have a distro-specific patch
> >
> > xfs_db is the XFS debugger, so you can only enable that bit of functionality
> > with magical commands, which IMHO isn't much different than people messing
> > with their ext4 filesystems with debugfs. You had better know what you're
> > doing and if you break the filesystem you can eat both pieces. :P
> >
> > > to comment out this code to make sure irresponsible users can't use
> > > it? What I've been suggesting has even more controls than what you
> > > have. And I've been keeping it as an out-of-tree kernel patch mainly
> > > because you've been arguing that it's such a horrible thing.
> >
> > One could lock it down even more -- hide it behind a Kconfig option that
> > depends on CONFIG_EXPERT=y and itself defaults to n, require a mount option,
> > only allow the file owner to call no-hide-stale and only if the file is 0600
> > (or the appropriate group equivalents like Ted's existing patch), and upon
> > adding stale extents, set an inode flag that locks uid/gid/mode/flags.
> > Obviously root can still get to the file, but at least there's hard evidence
> > that one is entering the twilight zone.
> >
> > > > Making Google's hack more widely available through the fallocate
> > > > API is entirely dependent on proving that:
> > >
> > > Ceph is about to completely bypass the file system because of your
> > > intransigence, and reimplement a userspace file system. They seem to
> > > believe it's necessary. I'll let them make the case, because they
> > > seem to think it's necessary. And if not, if Linus sides with you,
> > > and doesn't want to take the patch, I'll just keep it as a
> > > Google-specific out-of-tree patch. I don't *need* to have this thing
> > > upstream.
> >
> > Frankly, I second Eric Sandeen's comments -- just how bad is ext4's
> > unwritten extent conversion for these folks?
> >
> > I ran this crappy looping microbenchmark against a ramdisk:
> > fallocate 400M
> > write 400M
> > fsync
> > rewrite the 400M
> > fsync
> > on kernel 4.5.
> >
> > For writing 400M through the page cache in 4k chunks,
> > ext4: ~460MB/s -> ~580MB/s (~20%)
> > XFS: ~660 -> ~870 (~25%)
> > btrfs: ~130 -> ~200 (~35%)
> >
> > For writing 400M in 80M chunks,
> > ext4: ~480MB/s -> ~720MB/s (~30%)
> > XFS: ~1GB/s -> ~1.5GB/s (~35%)
> > btrfs: ~590MB/s -> ~590MB/s (no change)
> >
> > For directio writing 400MB in 4k chunks,
> > ext4: 25MB/s -> 26MB/s (~5%)
> > XFS: 25 -> 27 (~8%)
> > btrfs: 22 -> 18 (...)
> >
> > For directio writing 1200MB in 80M chunks,
> > ext4: ~2.9GB/s -> ~3.3GB/s (~13%)
> > XFS: 3.2 -> 3.5 (~9%)
> > btrfs: 2.3 -> 2.2 (...)
>
> That's not comparing apples to apples. Overwrite does not require
> allocation/extent manipulation at all, so it is comparing performance
> of completely different file extent operations. The operations we
> should be comparing are "first write" operations, namely "write over
> hole with allocation" vs "write over preallocation". Overwrite
> performance should be the same regardless of the method used for the
> initial write/allocation.
Eh, ok, let's compare writing an fallocated region vs writing an empty file:
(laptop this time, so the numbers aren't the same)
For writing 400M in 4k chunks,
ext4: ~720MB/s -> ~620MB/s (~14%)
XFS: ~560MB/s -> ~540MB/s (~4%)
btrfs: ~260MB/s -> ~580MB/s
For writing 400M in 80M chunks,
ext4: ~960MB/s -> ~730MB/s (~24%)
XFS: ~1000MB/s -> ~980MB/s (~2%)
btrfs: ~950MB/s -> ~930MB/s (~3%)
For directio writing 1200MB in 80M chunks,
ext4: ~2.9GB/s -> ~2.8GB/s (~4%)
XFS: 3.2 -> 3.2
btrfs: 2.3 -> 2.3
--D
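The loop being timed on each side is roughly the following (a
hypothetical reconstruction, since the actual test wasn't posted; keep
the fallocate() call for "write over preallocation", drop it to get the
"write over hole with allocation" case Dave describes):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		const size_t chunk = 4096;	/* or 80M for the large runs */
		const off_t total = 400 << 20;	/* 400M */
		char *buf = malloc(chunk);
		int fd;

		if (argc < 2 || !buf)
			return 1;
		fd = open(argv[1], O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0)
			return 1;
		memset(buf, 0x5a, chunk);
		fallocate(fd, 0, 0, total);	/* "write over preallocation" */
		for (off_t off = 0; off < total; off += chunk)
			pwrite(fd, buf, chunk, off);
		fsync(fd);
		close(fd);
		return 0;
	}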
On Wed, Mar 16, 2016 at 5:33 PM, Eric Sandeen <[email protected]> wrote:
> I may have lost the thread at this point, with poor Darrick's original
> patch submission devolving into a long thread about a NO_HIDE_STALE patch
> used at Google, but I don't *think* Ceph ever asked for NO_HIDE_STALE.
>
> At least I can't find any indication of that.
>
> Am I missing something? cc'ing Greg on this one in case I am.
Brief background:
Ceph currently has two big local storage subsystems: FileStore and
BlueStore. FileStore is the one that's been around for forever and is
currently stable/production-ready/bla bla bla. This one represents
RADOS objects as actual files and while it's *mostly* just converting
object operations into posix FS ones, it does rely on a few pieces of
the fs namespace and posix ops to do its work.
BlueStore is our new, pure userspace solution (Sage started this about
8 months ago, I think?). It started out using xfs basically as a block
allocator, but at this point it's just doing raw block access 100% in
userspace.
So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
it was one of the problems Sage had using xfs in his BlueStore
implementation and was a big part of why it moved to pure userspace.
FileStore might use NO_HIDE_STALE in some places but it would be
pretty limited. When it came up at Linux FAST we were discussing how
it and similar things had been problems for us in the past and it
would've been nice if they were upstream. What *is* a big deal for
FileStore (and would be easy to take advantage of) is the thematically
similar O_NOMTIME flag, which is also about reducing metadata updates
and got blocked on similar stupid-user grounds (although not security
ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
As noted though, we've basically given up and are moving to a
pure-userspace solution as quickly as we can. So no, Ceph isn't likely
to be a big user of these interfaces as it's too late for us. Adding
them would be an investment for future distributed storage systems
more than current ones. Maybe that's not worth it, or maybe there are
better places to keep them in the kernel. (I think I saw a reference
to some hypothetical shared block allocator? That would be *awesome*.)
=========
Separately. In the particular case of the extents and data leaks, a
coworker of mine suggested you could tag any files which *ever* had
unwritten extents with something that prevents them from being read by
a user who doesn't have raw block access (and, even better, let us
apply that flag on file create). That's a weird new security rule for
people to know, and it requires space for tagging (no idea how bad
that is), but it would work in any use cases we have and would not
leak anything the user doesn't already have access to.
-Greg
On Wed, Mar 16, 2016 at 10:18:19PM -0700, Gregory Farnum wrote:
> would've been nice if they were upstream. What *is* a big deal for
> FileStore (and would be easy to take advantage of) is the thematically
> similar O_NOMTIME flag, which is also about reducing metadata updates
> and got blocked on similar stupid-user grounds (although not security
> ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
Try mounting with the lazytime mount option. That should give you
nearly all of the metadata update reduction for free. It is slightly
better optimized in ext4, but it should work for xfs and btrfs as
well. The mtime will still get updated on disk about once a day (or
if memory pressure pushes the inode out of memory or on an unmount),
but the in-memory inode has the correct timestamps, so stat will
return the correct mtime value, and so the result is POSIX compliant
for those people who still care about such things.
(Actually, if you need to pass POSIX conformance testing you should
use strictatime,lazytime so that atime is updated when POSIX mandates
it to be updated.)
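For example (ordinary mount options, no out-of-tree patches; the device
and mount point are placeholders):

	mount -o remount,lazytime /dev/sdb1 /srv/osd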
- Ted
On 03/16/2016 06:23 PM, Chris Mason wrote:
> On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:
>> On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:
>>> On 3/15/16 7:06 PM, Linus Torvalds wrote:
>>>> On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <[email protected]> wrote:
>>>>>> It is pretty clear that the onus is on the patch submitter to
>>>>>> provide justification for inclusion, not for the reviewer/Maintainer
>>>>>> to have to prove that the solution is unworkable.
>>>> I agree, but quite frankly, performance is a good justification.
>>>>
>>>> So if Ted can give performance numbers, that's justification enough.
>>>> We've certainly taken changes with less.
>>> I've been away from ext4 for a while, so I'm really not on top of the
>>> mechanics of the underlying problem at the moment.
>>>
>>> But I would say that in addition to numbers showing that ext4 has trouble
>>> with unwritten extent conversion, we should have an explanation of
>>> why it can't be solved in a way that doesn't open up these concerns.
>>>
>>> XFS certainly has different mechanisms, but is the demonstrated workload
>>> problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the
>>> solutions that make the workload perform better on other filesystems?
>> When I've benchmarked this in the past, doing small random buffered writes
>> into an preallocated extent was dramatically (3x or more) slower on xfs
>> than doing them into a fully written extent. That was two years ago,
>> but I can redo it.
> So I re-ran some benchmarks, with 4K O_DIRECT random ios on nvme (4.5
> kernel). This is O_DIRECT without O_SYNC. I don't think xfs will do
> commits for each IO into the prealloc file? O_SYNC makes it much
> slower, so hopefully I've got this right.
>
> The test runs for 60 seconds, and I used an iodepth of 4:
>
> prealloc file: 32,000 iops
> overwrite: 121,000 iops
>
> If I bump the iodepth up to 512:
>
> prealloc file: 33,000 iops
> overwrite: 279,000 iops
>
> For streaming writes, XFS converts prealloc to written much better when
> the IO isn't random. You can start seeing the difference at 16K
> sequential O_DIRECT writes, but really it's not a huge impact. The worst
> case is 4K:
>
> prealloc file: 227MB/s
> overwrite: 340MB/s
>
> I can't think of sequential workloads where this will matter, since they
> will either end up with bigger IO or the performance impact won't get
> noticed.
>
> -chris
I think that these numbers are the interesting ones, since a 4x
slowdown is certainly significant.
If you rerun the same test after hacking XFS preallocation as Dave
suggested with xfs_db, do we get most of the performance back?
Ric
On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
>
> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
> it was one of the problems Sage had using xfs in his BlueStore
> implementation and was a big part of why it moved to pure userspace.
> FileStore might use NO_HIDE_STALE in some places but it would be
> pretty limited. When it came up at Linux FAST we were discussing how
> it and similar things had been problems for us in the past and it
> would've been nice if they were upstream.
Hmm.
So to me it really sounds like somebody should cook up a patch, but we
shouldn't put it in the upstream kernel until we get numbers and
actual "yes, we'd use this" from outside of google.
I say "outside of google", because inside of google not only do we not
get numbers, but google can maintain their own patch.
But maybe Ted could at least post the patch google uses, and somebody
in the Ceph community might want to at least try it out...
> What *is* a big deal for
> FileStore (and would be easy to take advantage of) is the thematically
> similar O_NOMTIME flag, which is also about reducing metadata updates
> and got blocked on similar stupid-user grounds (although not security
> ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
Hmm. I don't hate that patch, because the NOATIME thing really does
wonders on many loads. NOMTIME makes sense.
It's not like you can't do this with utimes() anyway.
That said, I do wonder if people wouldn't just prefer to expand on and
improve on the lazytime.
Is there some reason you guys didn't use that?
> As noted though, we've basically given up and are moving to a
> pure-userspace solution as quickly as we can.
That argues against worrying about this all in the kernel unless there
are other users.
Linus
On 03/17/2016 01:47 PM, Linus Torvalds wrote:
> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
>> it was one of the problems Sage had using xfs in his BlueStore
>> implementation and was a big part of why it moved to pure userspace.
>> FileStore might use NO_HIDE_STALE in some places but it would be
>> pretty limited. When it came up at Linux FAST we were discussing how
>> it and similar things had been problems for us in the past and it
>> would've been nice if they were upstream.
> Hmm.
>
> So to me it really sounds like somebody should cook up a patch, but we
> shouldn't put it in the upstream kernel until we get numbers and
> actual "yes, we'd use this" from outside of google.
>
> I say "outside of google", because inside of google not only do we not
> get numbers, but google can maintain their own patch.
>
> But maybe Ted could at least post the patch google uses, and somebody
> in the Ceph community might want to at least try it out...
>
>> What *is* a big deal for
>> FileStore (and would be easy to take advantage of) is the thematically
>> similar O_NOMTIME flag, which is also about reducing metadata updates
>> and got blocked on similar stupid-user grounds (although not security
>> ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
> Hmm. I don't hate that patch, because the NOATIME thing really does
> wonders on many loads. NOMTIME makes sense.
>
> It's not like you can't do this with utimes() anyway.
>
> That said, I do wonder if people wouldn't just prefer to expand on and
> improve on the lazytime.
>
> Is there some reason you guys didn't use that?
>
>> As noted though, we've basically given up and are moving to a
>> pure-userspace solution as quickly as we can.
> That argues against worrying about this all in the kernel unless there
> are other users.
>
> Linus
Just a note, when Greg says "user space solution", Ceph is looking at
writing directly to raw block devices, which is kind of a throwback to
early enterprise database trends.
Ric
On Thu, Mar 17, 2016 at 10:50 AM, Ric Wheeler <[email protected]> wrote:
>>
>> That argues against worrying about this all in the kernel unless there
>> are other users.
>
> Just a note, when Greg says "user space solution", Ceph is looking at
> writing directly to raw block devices, which is kind of a throwback to
> early enterprise database trends.
Right, I understand. But that makes it kind of pointless to worry
about NO_HIDE_STALE, since it wouldn't get used anyway. The issues
with filesystem preallocation just don't exist.
Of course, if there are other possible users, we should keep this on
the table, but ...
Linus
On Thu, Mar 17, 2016 at 10:47:29AM -0700, Linus Torvalds wrote:
> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
> >
> > So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
> > it was one of the problems Sage had using xfs in his BlueStore
> > implementation and was a big part of why it moved to pure userspace.
> > FileStore might use NO_HIDE_STALE in some places but it would be
> > pretty limited. When it came up at Linux FAST we were discussing how
> > it and similar things had been problems for us in the past and it
> > would've been nice if they were upstream.
>
> Hmm.
>
> So to me it really sounds like somebody should cook up a patch, but we
> shouldn't put it in the upstream kernel until we get numbers and
> actual "yes, we'd use this" from outside of google.
We haven't had internal tiers yelling at us for fallocate performance,
so I'm unlikely to suggest it, just because it's a potential
privacy leak we'd have to educate people about. What I'd be more likely
to use is code inside the filesystem like this:
somefs_fallocate() {
	if (trim_can_really_zero(my_device)) {
		trim
		allocate a regular extent
		return
	} else {
		do normal fallocate
	}
}
Then the out of tree patch (for google or whoever) becomes a hack to
flip trim_can_really_zero on a given block device. The rest of us can
use explicit interfaces from the hardware when deciding what we want
preallocation to mean.
It gets messy for crcs in btrfs, so we'd need the old fashioned
preallocation anyway. But the database workloads where this matters
aren't our target right now, so it's more an ext4/xfs thing anyway.
-chris
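On a 4.5-era kernel the information Chris's hypothetical
trim_can_really_zero() needs is already tracked in the request queue
limits, so a sketch might look like this (trim_can_really_zero is his
name, not an existing helper; the discard_zeroes_data limit existed at
the time but was removed from the kernel later):

	/* Hypothetical helper in the spirit of the pseudocode above;
	 * needs <linux/blkdev.h>. */
	static bool trim_can_really_zero(struct block_device *bdev)
	{
		struct request_queue *q = bdev_get_queue(bdev);

		return blk_queue_discard(q) && q->limits.discard_zeroes_data;
	}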
On Mar 17, 2016, at 12:35 PM, Chris Mason <[email protected]> wrote:
>
> On Thu, Mar 17, 2016 at 10:47:29AM -0700, Linus Torvalds wrote:
>> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
>>>
>>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
>>> it was one of the problems Sage had using xfs in his BlueStore
>>> implementation and was a big part of why it moved to pure userspace.
>>> FileStore might use NO_HIDE_STALE in some places but it would be
>>> pretty limited. When it came up at Linux FAST we were discussing how
>>> it and similar things had been problems for us in the past and it
>>> would've been nice if they were upstream.
>>
>> Hmm.
>>
>> So to me it really sounds like somebody should cook up a patch, but we
>> shouldn't put it in the upstream kernel until we get numbers and
>> actual "yes, we'd use this" from outside of google.
>
> We haven't had internal tiers yelling at us for fallocate performance,
> so I'm unlikely to suggest it, just because it's a potential
> privacy leak we'd have to educate people about. What I'd be more likely
> to use is code inside the filesystem like this:
>
> somefs_fallocate() {
> 	if (trim_can_really_zero(my_device)) {
> 		trim
> 		allocate a regular extent
> 		return
> 	} else {
> 		do normal fallocate
> 	}
> }
We were discussing almost this very same thing in the ext4 concall today.
Ted initially didn't think it was worthwhile to implement, but after looking
at the whitelist for SATA SSDs it seems that there are enough devices on the
market that support the ATA_HORKAGE_ZERO_AFTER_TRIM to make this approach
worthwhile to implement.
Also, if the ext4 extent size was limited it might even be possible to do
this efficiently enough with write_same on HDD devices.
> Then the out of tree patch (for google or whoever) becomes a hack to
> flip trim_can_really_zero on a given block device. The rest of us can
> use explicit interfaces from the hardware when deciding what we want
> preallocation to mean.
This might be a bit trickier, since this would affect all zero/trim
operations, not just ones for uninitialized data extents.
> It gets messy for crcs in btrfs, so we'd need the old fashioned
> preallocation anyway. But the database workloads where this matters
> aren't our target right now, so it's more an ext4/xfs thing anyway.
Cheers, Andreas
On Thu, Mar 17, 2016 at 02:49:06PM -0600, Andreas Dilger wrote:
> On Mar 17, 2016, at 12:35 PM, Chris Mason <[email protected]> wrote:
> >
> > On Thu, Mar 17, 2016 at 10:47:29AM -0700, Linus Torvalds wrote:
> >> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
> >>>
> >>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
> >>> it was one of the problems Sage had using xfs in his BlueStore
> >>> implementation and was a big part of why it moved to pure userspace.
> >>> FileStore might use NO_HIDE_STALE in some places but it would be
> >>> pretty limited. When it came up at Linux FAST we were discussing how
> >>> it and similar things had been problems for us in the past and it
> >>> would've been nice if they were upstream.
> >>
> >> Hmm.
> >>
> >> So to me it really sounds like somebody should cook up a patch, but we
> >> shouldn't put it in the upstream kernel until we get numbers and
> >> actual "yes, we'd use this" from outside of google.
> >
> > We haven't had internal tiers yelling at us for fallocate performance,
> > so I'm unlikely to suggest it, just because it's a potential
> > privacy leak we'd have to educate people about. What I'd be more likely
> > to use is code inside the filesystem like this:
> >
> > somefs_fallocate() {
> > 	if (trim_can_really_zero(my_device)) {
> > 		trim
> > 		allocate a regular extent
> > 		return
> > 	} else {
> > 		do normal fallocate
> > 	}
> > }
>
> We were discussing almost this very same thing in the ext4 concall today.
>
> Ted initially didn't think it was worthwhile to implement, but after looking
> at the whitelist for SATA SSDs it seems that there are enough devices on the
> market that support the ATA_HORKAGE_ZERO_AFTER_TRIM to make this approach
> worthwhile to implement.
We'll end up with people complaining it makes fallocate slower because
of the trims, so it's not a perfect solution. But I much prefer it to
fallocate-stale.
>
> Also, if the ext4 extent size was limited it might even be possible to do
> this efficiently enough with write_same on HDD devices.
>
> > Then the out of tree patch (for google or whoever) becomes a hack to
> > flip trim_can_really_zero on a given block device. The rest of us can
> > use explicit interfaces from the hardware when deciding what we want
> > preallocation to mean.
>
> This might be a bit trickier, since this would affect all zero/trim
> operations, not just ones for uninitialized data extents.
Thinking more, my guess is that google will just keep doing what they
are already doing ;) But there could be a flag in sysfs dedicated to
trim-for-fallocate so admins can see what their devices are reporting.
Read-only in mainline; if someone wants to patch it in their large data
center it wouldn't be hard.
-chris
On Thu, Mar 17, 2016 at 02:00:18PM -0700, Chris Mason wrote:
>
> Thinking more, my guess is that google will just keep doing what they
> are already doing ;) But there could be a flag in sysfs dedicated to
> trim-for-fallocate so admins can see what their devices are reporting.
> Read-only in mainline; if someone wants to patch it in their large data
> center it wouldn't be hard.
That's true, because one of the major use cases is SATA drives where
trim isn't available. Even for SAS drives where you have WRITE SAME,
you wouldn't want to use it for large fallocate regions.
So I see using reliable trim as a zeroing mechanism as orthogonal
to the question of NO_HIDE_STALE.
I do think that using TRIM in various cases where we are doing an
fallocate does make sense for non-rotational devices. In general TRIM
should be fast enough that I'd be surprised if people complained ---
especially since most of the time, fallocate isn't on the
timing-critical path of most applications.
- Ted
On Thu, Mar 17, 2016 at 10:47 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum <[email protected]> wrote:
>>
>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
>> it was one of the problems Sage had using xfs in his BlueStore
>> implementation and was a big part of why it moved to pure userspace.
>> FileStore might use NO_HIDE_STALE in some places but it would be
>> pretty limited. When it came up at Linux FAST we were discussing how
>> it and similar things had been problems for us in the past and it
>> would've been nice if they were upstream.
>
> Hmm.
>
> So to me it really sounds like somebody should cook up a patch, but we
> shouldn't put it in the upstream kernel until we get numbers and
> actual "yes, we'd use this" from outside of google.
>
> I say "outside of google", because inside of google not only do we not
> get numbers, but google can maintain their own patch.
>
> But maybe Ted could at least post the patch google uses, and somebody
> in the Ceph community might want to at least try it out...
>
>> What *is* a big deal for
>> FileStore (and would be easy to take advantage of) is the thematically
>> similar O_NOMTIME flag, which is also about reducing metadata updates
>> and got blocked on similar stupid-user grounds (although not security
>> ones): http://thread.gmane.org/gmane.linux.kernel.api/10727.
>
> Hmm. I don't hate that patch, because the NOATIME thing really does
> wonders on many loads. NOMTIME makes sense.
>
> It's not like you can't do this with utimes() anyway.
>
> That said, I do wonder if people wouldn't just prefer to expand on and
> improve on the lazytime.
>
> Is there some reason you guys didn't use that?
I wasn't really involved in this stuff but I gather from looking at
http://www.spinics.net/lists/xfs/msg36869.html that any durability
command other than fdatasync is going to write out the mtime updates
to the inodes on disk. Given our durability requirements and the
guarantees offered about when things actually hit disk, that doesn't
work for us. We run an fsync on the order of every 30 seconds, and we
do various combinations of fsync, fdatasync, sync_file_range (and,
well, any command we're provided) to try and bound the amount of dirty
data and prevent fs/disk throughput pauses when we do that full sync.
Anybody trying to do anything similar would want a switch that
prevents operations from updating the mtime on disk no matter what.
-Greg
On Thu, Mar 17, 2016 at 11:52 PM, Gregory Farnum <[email protected]> wrote:
>
> I wasn't really involved in this stuff but I gather from looking at
> http://www.spinics.net/lists/xfs/msg36869.html that any durability
> command other than fdatasync is going to write out the mtime updates
> to the inodes on disk. Given our durability requirements and the
> guarantees offered about when things actually hit disk, that doesn't
> work for us.
Fair enough. Yes, the lazytime thing doesn't help if you actually sync
the file explicitly; then you really do need something like NOMTIME
in order to not dirty the inode itself at all (together with
preallocation - otherwise the inode will be dirty due to the block
allocation updates).
Linus
"Theodore Ts'o" <[email protected]> writes:
> I do think that using TRIM in various causes where we are doing an
> fallocate does make sense for non-rotational devices. In general TRIM
> should be fast enough that that I'd be surprised that people would be
> complaining --- especially since most of the time, fallocate isn't on
> the timing-critical path of most applications.
TRIM/UNMAP isn't just supported on solid state devices, though. I do
recall some enterprise thinly provisioned storage that would take ages
to discard large regions. I think that caused us to change the defaults
for mkfs, right?
Cheers,
Jeff
>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:
Jeff> TRIM/UNMAP isn't just supported on solid state devices, though. I
Jeff> do recall some enterprise thinly provisioned storage that would
Jeff> take ages to discard large regions. I think that caused us to
Jeff> change the defaults for mkfs, right?
I think those have largely been fixed. But, yes.
--
Martin K. Petersen Oracle Linux Engineering
On Wed, Mar 16 2016, Theodore Ts'o wrote:
> On Wed, Mar 16, 2016 at 09:33:13AM +1100, Dave Chinner wrote:
>>
>> Stale data escaping containment is a security issue. Enabling
>> generic kernel mechanisms to *enable containment escape* is
>> fundamentally wrong, and relying on userspace to Do The Right Thing
>> is even more of a gamble, IMO.
>
> We already have generic kernel mechanisms such as "the block device".
This is a bit of an 'off-the-wall' suggestion, but I agree that these
things that might be of value to user-space file servers do seem a lot
like block devices. So why not make them look exactly like block
devices?
i.e. a new open flag O_BLOCKDEV which, when combined with O_CREAT
creates a thing that is managed in the filesystem much like a file, but
that appears to user-space like a block device. The major/minor numbers
would be essentially meaningless - the filesystem wouldn't call
init_special_inode() like it does on normal block devices, it would
retain control itself.
That would make the content invisible to backups and rsync and all the
things that Dave has raised as potential concerns. And it would be no
surprise if the contents included stale data because that is exactly
what you get when you create a new logical volume with LVM2.
The block device would initially be of size zero, but could be resized
using fallocate (which soon will work on block devices), which can
request zeros, leave holes, or with Ted's new FALLOC flag (that would only
be permitted on block devices) could allocate uninitialized space.
Rules for using O_BLOCKDEV would still need to be clarified - mount
option, access to underlying block device, CAP_MKNOD .. whatever.
I think that being able to use a filesystem as a logical volume manager
is an extremely interesting idea.... we might even end up with a
filesystem interface on device-mapper :-)
NeilBrown
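To make the proposal concrete, usage might look something like the
sketch below. This is entirely hypothetical: neither O_BLOCKDEV nor a
no-hide-stale fallocate flag exists, and the flag values here are
illustration-only placeholders.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	#define O_BLOCKDEV              0x40000000	/* hypothetical flag */
	#define FALLOC_FL_NO_HIDE_STALE 0x80		/* hypothetical flag */

	int main(void)
	{
		/* Create a filesystem-managed object that user space would
		 * see as a block device, per the proposal above. */
		int fd = open("/data/volumes/vol0",
			      O_CREAT | O_RDWR | O_BLOCKDEV, 0600);

		if (fd < 0)
			return 1;
		/* Grow it by 1GB of uninitialized space -- permitted only
		 * on such pseudo block devices under the proposed rules. */
		fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, 1ULL << 30);
		close(fd);
		return 0;
	}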