[1/2] ext4: wipe filename upon file deletion:
- removed mount option for filename wipe, now filename wipe is
default behavior
- added wiping of file type at time of filename wipe
[2/2] ext4: add ioctl FS_IOC_CHKPT_JRNL:
- moved ioctl definition to include/uapi/linux/fs.h
- renamed ioctl FS_IOC_CHKPT_JRNL
- updated to require admin privileges to call ioctl
- updated ioctl to take _u64
- updated jbd2_journal_flush to take flags argument to allow
for discard flag
(kernel: JBD2_FLAG_DO_DISCARD / userspace: CHKPT_JRNL_DO_DISCARD)
Leah Rumancik (2):
ext4: wipe filename upon file deletion
ext4: add ioctl FS_IOC_CHKPT_JRNL
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 34 ++++++++++++--
fs/ext4/namei.c | 4 ++
fs/ext4/super.c | 6 +--
fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++-
fs/ocfs2/alloc.c | 2 +-
fs/ocfs2/journal.c | 8 ++--
include/linux/jbd2.h | 5 +-
include/uapi/linux/fs.h | 4 ++
10 files changed, 152 insertions(+), 16 deletions(-)
--
2.31.0.208.g409f899ff0-goog
ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
With the filename wipeout patch, Ext4 guarantees that all data will be
discarded on deletion. This ioctl allows for periodically discarding
journal contents too.
Also, add journal discard (if discard supported) during journal load
after recovery. This provides a potential solution to
https://lore.kernel.org/linux-ext4/[email protected]/ for
disks that support discard. After a successful journal recovery, e2fsck
can call this ioctl to discard the journal as well.
Signed-off-by: Leah Rumancik <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 34 ++++++++++++--
fs/ext4/super.c | 6 +--
fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++-
fs/ocfs2/alloc.c | 2 +-
fs/ocfs2/journal.c | 8 ++--
include/linux/jbd2.h | 5 +-
include/uapi/linux/fs.h | 4 ++
9 files changed, 148 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 826a56e3bbd2..e76e9961e992 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -724,6 +724,7 @@ enum {
#define EXT4_IOC_CLEAR_ES_CACHE _IO('f', 40)
#define EXT4_IOC_GETSTATE _IOW('f', 41, __u32)
#define EXT4_IOC_GET_ES_CACHE _IOWR('f', 42, struct fiemap)
+/* ioctl code 43 reserved for FS_IOC_JRNL_CHKPT */
#define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..715077e30c58 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3225,7 +3225,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
ext4_clear_inode_state(inode, EXT4_STATE_JDATA);
journal = EXT4_JOURNAL(inode);
jbd2_journal_lock_updates(journal);
- err = jbd2_journal_flush(journal);
+ err = jbd2_journal_flush(journal, 0);
jbd2_journal_unlock_updates(journal);
if (err)
@@ -6007,7 +6007,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
if (val)
ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
else {
- err = jbd2_journal_flush(journal);
+ err = jbd2_journal_flush(journal, 0);
if (err < 0) {
jbd2_journal_unlock_updates(journal);
percpu_up_write(&sbi->s_writepages_rwsem);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a2cf35066f46..ca64c680ef6d 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -756,7 +756,7 @@ static long ext4_ioctl_group_add(struct file *file,
err = ext4_group_add(sb, input);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
- err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+ err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
}
if (err == 0)
@@ -939,7 +939,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
- err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+ err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
}
if (err == 0)
@@ -1082,7 +1082,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (EXT4_SB(sb)->s_journal) {
ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
- err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+ err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
}
if (err == 0)
@@ -1318,6 +1318,33 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return -EOPNOTSUPP;
return fsverity_ioctl_read_metadata(filp,
(const void __user *)arg);
+ case FS_IOC_CHKPT_JRNL:
+ {
+ int err = 0;
+ unsigned long long flags = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* file argument is not the mount point */
+ if (file_dentry(filp) != sb->s_root)
+ return -EINVAL;
+
+ /* filesystem is not backed by block device */
+ if (sb->s_bdev == NULL)
+ return -EINVAL;
+
+ if (copy_from_user(&flags, (__u64 __user *)arg,
+ sizeof(__u64)))
+ return -EFAULT;
+
+ if (EXT4_SB(sb)->s_journal) {
+ jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
+ err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, flags);
+ jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ }
+ return err;
+ }
default:
return -ENOTTY;
@@ -1407,6 +1434,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case EXT4_IOC_GET_ES_CACHE:
case FS_IOC_FSGETXATTR:
case FS_IOC_FSSETXATTR:
+ case FS_IOC_CHKPT_JRNL:
break;
default:
return -ENOIOCTLCMD;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..1b3a9eb58b63 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5613,7 +5613,7 @@ static int ext4_mark_recovery_complete(struct super_block *sb,
return 0;
}
jbd2_journal_lock_updates(journal);
- err = jbd2_journal_flush(journal);
+ err = jbd2_journal_flush(journal, 0);
if (err < 0)
goto out;
@@ -5755,7 +5755,7 @@ static int ext4_freeze(struct super_block *sb)
* Don't clear the needs_recovery flag if we failed to
* flush the journal.
*/
- error = jbd2_journal_flush(journal);
+ error = jbd2_journal_flush(journal, 0);
if (error < 0)
goto out;
@@ -6349,7 +6349,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
* otherwise be livelocked...
*/
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
- err = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+ err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
if (err)
return err;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 2dc944442802..6bb5980ac789 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1686,6 +1686,90 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
write_unlock(&journal->j_state_lock);
}
+/* discard journal blocks excluding journal superblock */
+static int __jbd2_journal_issue_discard(journal_t *journal)
+{
+ int err = 0;
+ unsigned long block, log_offset; /* logical */
+ unsigned long long phys_block, block_start, block_stop; /* physical */
+ loff_t byte_start, byte_stop, byte_count;
+ struct request_queue *q = bdev_get_queue(journal->j_dev);
+
+ if (!q)
+ return -ENXIO;
+
+ if (!blk_queue_discard(q))
+ return -EOPNOTSUPP;
+
+ /* lookup block mapping and issue discard for each contiguous region */
+ log_offset = be32_to_cpu(journal->j_superblock->s_first);
+
+ err = jbd2_journal_bmap(journal, log_offset, &block_start);
+ if (err) {
+ printk(KERN_ERR "JBD2: bad block at offset %lu", log_offset);
+ return err;
+ }
+
+ /*
+ * use block_start - 1 to meet check for contiguous with previous region:
+ * phys_block == block_stop + 1
+ */
+ block_stop = block_start - 1;
+
+ for (block = log_offset; block < journal->j_total_len; block++) {
+ err = jbd2_journal_bmap(journal, block, &phys_block);
+ if (err) {
+ printk(KERN_ERR "JBD2: bad block at offset %lu", block);
+ return err;
+ }
+
+ /*
+ * if block is last block, update stopping point
+ * if not last block and
+ * block is contiguous with previous block, continue
+ */
+ if (block == journal->j_total_len - 1)
+ block_stop = phys_block;
+ else if (phys_block == block_stop + 1) {
+ block_stop++;
+ continue;
+ }
+
+ /*
+ * if not contiguous with prior physical block or this is last
+ * block of journal, take care of the region
+ */
+ byte_start = block_start * journal->j_blocksize;
+ byte_stop = block_stop * journal->j_blocksize;
+ byte_count = (block_stop - block_start + 1) *
+ journal->j_blocksize;
+
+ truncate_inode_pages_range(journal->j_dev->bd_inode->i_mapping,
+ byte_start, byte_stop);
+
+ /*
+ * use blkdev_issue_discard instead of sb_issue_discard
+ * because superblock not yet populated when this is
+ * called during journal_load during mount process
+ */
+ err = blkdev_issue_discard(journal->j_dev,
+ byte_start >> SECTOR_SHIFT,
+ byte_count >> SECTOR_SHIFT,
+ GFP_NOFS, 0);
+
+ if (unlikely(err != 0)) {
+ printk(KERN_ERR "JBD2: unable to discard "
+ "journal at physical blocks %llu - %llu",
+ block_start, block_stop);
+ return err;
+ }
+
+ block_start = phys_block;
+ block_stop = phys_block;
+ }
+
+ return blkdev_issue_flush(journal->j_dev);
+}
/**
* jbd2_journal_update_sb_errno() - Update error in the journal.
@@ -1936,6 +2020,10 @@ int jbd2_journal_load(journal_t *journal)
*/
journal->j_flags &= ~JBD2_ABORT;
+ /* if journal device supports discard, discard journal blocks */
+ if (__jbd2_journal_issue_discard(journal))
+ printk(KERN_WARNING "JBD2: failed to discard journal when loading");
+
/* OK, we've finished with the dynamic journal bits:
* reinitialise the dynamic contents of the superblock in memory
* and reset them on disk. */
@@ -2246,13 +2334,17 @@ EXPORT_SYMBOL(jbd2_journal_clear_features);
/**
* jbd2_journal_flush() - Flush journal
* @journal: Journal to act on.
+ * @flags: options (see below)
*
* Flush all data for a given journal to disk and empty the journal.
* Filesystems can use this when remounting readonly to ensure that
* recovery does not need to happen on remount.
+ *
+ * flags:
+ * JBD2_FLAG_DO_DISCARD: also discard the journal blocks; if discard is not
+ * supported on the device, returns err
*/
-
-int jbd2_journal_flush(journal_t *journal)
+int jbd2_journal_flush(journal_t *journal, unsigned long long flags)
{
int err = 0;
transaction_t *transaction = NULL;
@@ -2306,6 +2398,10 @@ int jbd2_journal_flush(journal_t *journal)
* commits of data to the journal will restore the current
* s_start value. */
jbd2_mark_journal_empty(journal, REQ_SYNC | REQ_FUA);
+
+ if (flags & JBD2_FLAG_DO_DISCARD)
+ err = __jbd2_journal_issue_discard(journal);
+
mutex_unlock(&journal->j_checkpoint_mutex);
write_lock(&journal->j_state_lock);
J_ASSERT(!journal->j_running_transaction);
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 78710788c237..1b41bf9f4a7e 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6020,7 +6020,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
* Then truncate log will be replayed resulting in cluster double free.
*/
jbd2_journal_lock_updates(journal->j_journal);
- status = jbd2_journal_flush(journal->j_journal);
+ status = jbd2_journal_flush(journal->j_journal, 0);
jbd2_journal_unlock_updates(journal->j_journal);
if (status < 0) {
mlog_errno(status);
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index db52e843002a..a1438548747e 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -310,7 +310,7 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb)
}
jbd2_journal_lock_updates(journal->j_journal);
- status = jbd2_journal_flush(journal->j_journal);
+ status = jbd2_journal_flush(journal->j_journal, 0);
jbd2_journal_unlock_updates(journal->j_journal);
if (status < 0) {
up_write(&journal->j_trans_barrier);
@@ -1002,7 +1002,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb)
if (ocfs2_mount_local(osb)) {
jbd2_journal_lock_updates(journal->j_journal);
- status = jbd2_journal_flush(journal->j_journal);
+ status = jbd2_journal_flush(journal->j_journal, 0);
jbd2_journal_unlock_updates(journal->j_journal);
if (status < 0)
mlog_errno(status);
@@ -1072,7 +1072,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed)
if (replayed) {
jbd2_journal_lock_updates(journal->j_journal);
- status = jbd2_journal_flush(journal->j_journal);
+ status = jbd2_journal_flush(journal->j_journal, 0);
jbd2_journal_unlock_updates(journal->j_journal);
if (status < 0)
mlog_errno(status);
@@ -1668,7 +1668,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
/* wipe the journal */
jbd2_journal_lock_updates(journal);
- status = jbd2_journal_flush(journal);
+ status = jbd2_journal_flush(journal, 0);
jbd2_journal_unlock_updates(journal);
if (status < 0)
mlog_errno(status);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 99d3cd051ac3..50510473283e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -307,6 +307,9 @@ typedef struct journal_superblock_s
JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
+/* discard journal blocks flag for jbd2_journal_flush */
+#define JBD2_FLAG_DO_DISCARD 1
+
#ifdef __KERNEL__
#include <linux/fs.h>
@@ -1491,7 +1494,7 @@ extern int jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned int, unsigned int);
extern int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page);
extern int jbd2_journal_stop(handle_t *);
-extern int jbd2_journal_flush (journal_t *);
+extern int jbd2_journal_flush(journal_t *journal, unsigned long long flags);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..d66408c38c1d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -214,6 +214,7 @@ struct fsxattr {
#define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr)
#define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
#define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX])
+#define FS_IOC_CHKPT_JRNL _IOW('f', 43, __u64)
/*
* Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
@@ -304,4 +305,7 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)
+/* flag for ioctl FS_IOC_JRNL_CHKPT */
+#define CHKPT_JRNL_DO_DISCARD 1
+
#endif /* _UAPI_LINUX_FS_H */
--
2.31.0.208.g409f899ff0-goog
Zero out filename and file type fields when file is deleted.
Signed-off-by: Leah Rumancik <[email protected]>
---
fs/ext4/namei.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 883e2a7cd4ab..0147c86de99e 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir,
else
de->inode = 0;
inode_inc_iversion(dir);
+
+ memset(de_del->name, 0, de_del->name_len);
+ memset(&de_del->file_type, 0, sizeof(__u8));
+
return 0;
}
i += ext4_rec_len_from_disk(de->rec_len, blocksize);
--
2.31.0.208.g409f899ff0-goog
On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> With the filename wipeout patch, Ext4 guarantees that all data will be
> discarded on deletion. This ioctl allows for periodically discarding
> journal contents too.
This needs a documentation update to cover what this new userspace ABI
does, and probably linux-api and linux-fsdevel should be cc'd.
> Also, add journal discard (if discard supported) during journal load
> after recovery. This provides a potential solution to
> https://lore.kernel.org/linux-ext4/[email protected]/ for
> disks that support discard. After a successful journal recovery, e2fsck
> can call this ioctl to discard the journal as well.
>
> Signed-off-by: Leah Rumancik <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/inode.c | 4 +-
> fs/ext4/ioctl.c | 34 ++++++++++++--
> fs/ext4/super.c | 6 +--
> fs/jbd2/journal.c | 100 +++++++++++++++++++++++++++++++++++++++-
> fs/ocfs2/alloc.c | 2 +-
> fs/ocfs2/journal.c | 8 ++--
> include/linux/jbd2.h | 5 +-
> include/uapi/linux/fs.h | 4 ++
> 9 files changed, 148 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 826a56e3bbd2..e76e9961e992 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -724,6 +724,7 @@ enum {
> #define EXT4_IOC_CLEAR_ES_CACHE _IO('f', 40)
> #define EXT4_IOC_GETSTATE _IOW('f', 41, __u32)
> #define EXT4_IOC_GET_ES_CACHE _IOWR('f', 42, struct fiemap)
> +/* ioctl code 43 reserved for FS_IOC_JRNL_CHKPT */
Pat Sajak hates it when people won't buy vowels. ;)
FS_IOC_CHECKPOINT, maybe?
>
> #define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0948a43f1b3d..715077e30c58 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3225,7 +3225,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
> ext4_clear_inode_state(inode, EXT4_STATE_JDATA);
> journal = EXT4_JOURNAL(inode);
> jbd2_journal_lock_updates(journal);
> - err = jbd2_journal_flush(journal);
> + err = jbd2_journal_flush(journal, 0);
> jbd2_journal_unlock_updates(journal);
>
> if (err)
> @@ -6007,7 +6007,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> if (val)
> ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
> else {
> - err = jbd2_journal_flush(journal);
> + err = jbd2_journal_flush(journal, 0);
> if (err < 0) {
> jbd2_journal_unlock_updates(journal);
> percpu_up_write(&sbi->s_writepages_rwsem);
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index a2cf35066f46..ca64c680ef6d 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -756,7 +756,7 @@ static long ext4_ioctl_group_add(struct file *file,
> err = ext4_group_add(sb, input);
> if (EXT4_SB(sb)->s_journal) {
> jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
> - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
> + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
> jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> }
> if (err == 0)
> @@ -939,7 +939,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
> if (EXT4_SB(sb)->s_journal) {
> jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
> - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
> + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
> jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> }
> if (err == 0)
> @@ -1082,7 +1082,7 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> if (EXT4_SB(sb)->s_journal) {
> ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
> jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
> - err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
> + err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
> jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> }
> if (err == 0)
> @@ -1318,6 +1318,33 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> return -EOPNOTSUPP;
> return fsverity_ioctl_read_metadata(filp,
> (const void __user *)arg);
> + case FS_IOC_CHKPT_JRNL:
> + {
Should this whole block be a separate static function?
> + int err = 0;
> + unsigned long long flags = 0;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /* file argument is not the mount point */
> + if (file_dentry(filp) != sb->s_root)
> + return -EINVAL;
> +
> + /* filesystem is not backed by block device */
> + if (sb->s_bdev == NULL)
> + return -EINVAL;
> +
> + if (copy_from_user(&flags, (__u64 __user *)arg,
> + sizeof(__u64)))
> + return -EFAULT;
Needs a check here to return -EINVAL if any flags bits are set other
than CHKPT_JRNL_DO_DISCARD, particularly since you're feeding flags
straight into jbd2 code.
FWIW I would also separate the changes into two patches -- one to
add the flags paramter to jbd2_journal_flush, and a second patch to
define the new ioctl and wire it up.
> +
> + if (EXT4_SB(sb)->s_journal) {
> + jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
> + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, flags);
> + jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + }
> + return err;
> + }
>
> default:
> return -ENOTTY;
> @@ -1407,6 +1434,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> case EXT4_IOC_GET_ES_CACHE:
> case FS_IOC_FSGETXATTR:
> case FS_IOC_FSSETXATTR:
> + case FS_IOC_CHKPT_JRNL:
> break;
> default:
> return -ENOIOCTLCMD;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b9693680463a..1b3a9eb58b63 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5613,7 +5613,7 @@ static int ext4_mark_recovery_complete(struct super_block *sb,
> return 0;
> }
> jbd2_journal_lock_updates(journal);
> - err = jbd2_journal_flush(journal);
> + err = jbd2_journal_flush(journal, 0);
> if (err < 0)
> goto out;
>
> @@ -5755,7 +5755,7 @@ static int ext4_freeze(struct super_block *sb)
> * Don't clear the needs_recovery flag if we failed to
> * flush the journal.
> */
> - error = jbd2_journal_flush(journal);
> + error = jbd2_journal_flush(journal, 0);
> if (error < 0)
> goto out;
>
> @@ -6349,7 +6349,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
> * otherwise be livelocked...
> */
> jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
> - err = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
> + err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
> jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> if (err)
> return err;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 2dc944442802..6bb5980ac789 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1686,6 +1686,90 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
> write_unlock(&journal->j_state_lock);
> }
>
> +/* discard journal blocks excluding journal superblock */
> +static int __jbd2_journal_issue_discard(journal_t *journal)
> +{
> + int err = 0;
> + unsigned long block, log_offset; /* logical */
> + unsigned long long phys_block, block_start, block_stop; /* physical */
> + loff_t byte_start, byte_stop, byte_count;
> + struct request_queue *q = bdev_get_queue(journal->j_dev);
> +
> + if (!q)
> + return -ENXIO;
> +
> + if (!blk_queue_discard(q))
> + return -EOPNOTSUPP;
> +
> + /* lookup block mapping and issue discard for each contiguous region */
> + log_offset = be32_to_cpu(journal->j_superblock->s_first);
> +
> + err = jbd2_journal_bmap(journal, log_offset, &block_start);
> + if (err) {
> + printk(KERN_ERR "JBD2: bad block at offset %lu", log_offset);
> + return err;
> + }
> +
> + /*
> + * use block_start - 1 to meet check for contiguous with previous region:
> + * phys_block == block_stop + 1
> + */
> + block_stop = block_start - 1;
> +
> + for (block = log_offset; block < journal->j_total_len; block++) {
> + err = jbd2_journal_bmap(journal, block, &phys_block);
> + if (err) {
> + printk(KERN_ERR "JBD2: bad block at offset %lu", block);
> + return err;
> + }
> +
> + /*
> + * if block is last block, update stopping point
> + * if not last block and
> + * block is contiguous with previous block, continue
> + */
> + if (block == journal->j_total_len - 1)
> + block_stop = phys_block;
> + else if (phys_block == block_stop + 1) {
> + block_stop++;
> + continue;
> + }
> +
> + /*
> + * if not contiguous with prior physical block or this is last
> + * block of journal, take care of the region
> + */
> + byte_start = block_start * journal->j_blocksize;
> + byte_stop = block_stop * journal->j_blocksize;
> + byte_count = (block_stop - block_start + 1) *
> + journal->j_blocksize;
> +
> + truncate_inode_pages_range(journal->j_dev->bd_inode->i_mapping,
> + byte_start, byte_stop);
> +
> + /*
> + * use blkdev_issue_discard instead of sb_issue_discard
> + * because superblock not yet populated when this is
> + * called during journal_load during mount process
> + */
> + err = blkdev_issue_discard(journal->j_dev,
> + byte_start >> SECTOR_SHIFT,
> + byte_count >> SECTOR_SHIFT,
> + GFP_NOFS, 0);
> +
> + if (unlikely(err != 0)) {
> + printk(KERN_ERR "JBD2: unable to discard "
> + "journal at physical blocks %llu - %llu",
> + block_start, block_stop);
> + return err;
> + }
> +
> + block_start = phys_block;
> + block_stop = phys_block;
> + }
> +
> + return blkdev_issue_flush(journal->j_dev);
> +}
>
> /**
> * jbd2_journal_update_sb_errno() - Update error in the journal.
> @@ -1936,6 +2020,10 @@ int jbd2_journal_load(journal_t *journal)
> */
> journal->j_flags &= ~JBD2_ABORT;
>
> + /* if journal device supports discard, discard journal blocks */
> + if (__jbd2_journal_issue_discard(journal))
> + printk(KERN_WARNING "JBD2: failed to discard journal when loading");
How badly do you need to log this? The journal still works, right?
This is going to trigger on all the old spinning disks that don't
support discard.
> +
> /* OK, we've finished with the dynamic journal bits:
> * reinitialise the dynamic contents of the superblock in memory
> * and reset them on disk. */
> @@ -2246,13 +2334,17 @@ EXPORT_SYMBOL(jbd2_journal_clear_features);
> /**
> * jbd2_journal_flush() - Flush journal
> * @journal: Journal to act on.
> + * @flags: options (see below)
> *
> * Flush all data for a given journal to disk and empty the journal.
> * Filesystems can use this when remounting readonly to ensure that
> * recovery does not need to happen on remount.
> + *
> + * flags:
> + * JBD2_FLAG_DO_DISCARD: also discard the journal blocks; if discard is not
> + * supported on the device, returns err
> */
> -
> -int jbd2_journal_flush(journal_t *journal)
> +int jbd2_journal_flush(journal_t *journal, unsigned long long flags)
> {
> int err = 0;
> transaction_t *transaction = NULL;
> @@ -2306,6 +2398,10 @@ int jbd2_journal_flush(journal_t *journal)
> * commits of data to the journal will restore the current
> * s_start value. */
> jbd2_mark_journal_empty(journal, REQ_SYNC | REQ_FUA);
> +
> + if (flags & JBD2_FLAG_DO_DISCARD)
> + err = __jbd2_journal_issue_discard(journal);
> +
> mutex_unlock(&journal->j_checkpoint_mutex);
> write_lock(&journal->j_state_lock);
> J_ASSERT(!journal->j_running_transaction);
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index 78710788c237..1b41bf9f4a7e 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -6020,7 +6020,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb)
> * Then truncate log will be replayed resulting in cluster double free.
> */
> jbd2_journal_lock_updates(journal->j_journal);
> - status = jbd2_journal_flush(journal->j_journal);
> + status = jbd2_journal_flush(journal->j_journal, 0);
> jbd2_journal_unlock_updates(journal->j_journal);
> if (status < 0) {
> mlog_errno(status);
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index db52e843002a..a1438548747e 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -310,7 +310,7 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb)
> }
>
> jbd2_journal_lock_updates(journal->j_journal);
> - status = jbd2_journal_flush(journal->j_journal);
> + status = jbd2_journal_flush(journal->j_journal, 0);
> jbd2_journal_unlock_updates(journal->j_journal);
> if (status < 0) {
> up_write(&journal->j_trans_barrier);
> @@ -1002,7 +1002,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb)
>
> if (ocfs2_mount_local(osb)) {
> jbd2_journal_lock_updates(journal->j_journal);
> - status = jbd2_journal_flush(journal->j_journal);
> + status = jbd2_journal_flush(journal->j_journal, 0);
> jbd2_journal_unlock_updates(journal->j_journal);
> if (status < 0)
> mlog_errno(status);
> @@ -1072,7 +1072,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed)
>
> if (replayed) {
> jbd2_journal_lock_updates(journal->j_journal);
> - status = jbd2_journal_flush(journal->j_journal);
> + status = jbd2_journal_flush(journal->j_journal, 0);
> jbd2_journal_unlock_updates(journal->j_journal);
> if (status < 0)
> mlog_errno(status);
> @@ -1668,7 +1668,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
>
> /* wipe the journal */
> jbd2_journal_lock_updates(journal);
> - status = jbd2_journal_flush(journal);
> + status = jbd2_journal_flush(journal, 0);
> jbd2_journal_unlock_updates(journal);
> if (status < 0)
> mlog_errno(status);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 99d3cd051ac3..50510473283e 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -307,6 +307,9 @@ typedef struct journal_superblock_s
> JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
> JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
>
> +/* discard journal blocks flag for jbd2_journal_flush */
> +#define JBD2_FLAG_DO_DISCARD 1
> +
> #ifdef __KERNEL__
>
> #include <linux/fs.h>
> @@ -1491,7 +1494,7 @@ extern int jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned int, unsigned int);
> extern int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page);
> extern int jbd2_journal_stop(handle_t *);
> -extern int jbd2_journal_flush (journal_t *);
> +extern int jbd2_journal_flush(journal_t *journal, unsigned long long flags);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..d66408c38c1d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -214,6 +214,7 @@ struct fsxattr {
> #define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr)
> #define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
> #define FS_IOC_SETFSLABEL _IOW(0x94, 50, char[FSLABEL_MAX])
> +#define FS_IOC_CHKPT_JRNL _IOW('f', 43, __u64)
>
> /*
> * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
> @@ -304,4 +305,7 @@ typedef int __bitwise __kernel_rwf_t;
> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> RWF_APPEND)
>
> +/* flag for ioctl FS_IOC_JRNL_CHKPT */
> +#define CHKPT_JRNL_DO_DISCARD 1
FS_IOC_CHECKPOINT_FLAG_DISCARD, to go with FS_IOC_CHECKPOINT?
--D
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.31.0.208.g409f899ff0-goog
>
On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote:
> Zero out filename and file type fields when file is deleted.
Why?
Also what about when a dir_entry is moved? Wouldn't you want to make sure that
any copies don't get left around?
>
> Signed-off-by: Leah Rumancik <[email protected]>
> ---
> fs/ext4/namei.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 883e2a7cd4ab..0147c86de99e 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir,
> else
> de->inode = 0;
> inode_inc_iversion(dir);
> +
> + memset(de_del->name, 0, de_del->name_len);
> + memset(&de_del->file_type, 0, sizeof(__u8));
The second line could be written as simply 'de_del->file_type = 0'.
Also why zero these two fields in particular and not the whole dir_entry?
- Eric
On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> > With the filename wipeout patch, Ext4 guarantees that all data will be
> > discarded on deletion. This ioctl allows for periodically discarding
> > journal contents too.
>
> This needs a documentation update to cover what this new userspace ABI
> does, and probably linux-api and linux-fsdevel should be cc'd.
You need to describe the semantics that you are exporting to
userspace. Exactly what does a "journal checkpoint" mean from the
point of view of user visible metadata and data integrity? How is it
different to running fsync() on a directory, syncfs() or freeze on a
filesystem, or any of the other methods we already have for
guaranteeing completed changes are committed to stable storage? All
of these methods imply a journal checkpoint of some kind is done in
ext4, so why do we need a specific ioctl to do this?
But, really, if this is for secure deletion, then why isn't
"fsync(dirfd)" sufficient to force the unlinks into the journal and
onto stable storage? Why does this functionality need some special
new CAP_SYS_ADMIN only ioctl to checkpoint the journal when, by
definition, fsync() should already be doing that? Indeed, users can't
actually use this new ioctl for secure deletion because it is root
only, so how do users and their applications actually ensure that
secure deletion of their files has actually occurred?
I'm also curious as to what "journal checkpoint" means for
filesystems that don't have journals? Or that have multiple and/or
partial state journals (e.g. iper-inode journals in NOVA, fsync
journals in brtfs, etc)? What does it mean for filesystems that use
COW instead of journals?
> > Also, add journal discard (if discard supported) during journal load
> > after recovery. This provides a potential solution to
> > https://lore.kernel.org/linux-ext4/[email protected]/ for
> > disks that support discard. After a successful journal recovery, e2fsck
> > can call this ioctl to discard the journal as well.
If you need discard after recovery for secure remove, then you also
really need discard on every extent being freed, too. Which then
implies that the -o discard mount option needs to be used in
conjunction with this functionality. Perhaps, then, journal discard
at mount should be tied in to the -o discard mount option, and then
the need for an ioctl to trigger this goes away completely.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote:
> On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> > > With the filename wipeout patch, Ext4 guarantees that all data will be
> > > discarded on deletion. This ioctl allows for periodically discarding
> > > journal contents too.
> >
> > This needs a documentation update to cover what this new userspace ABI
> > does, and probably linux-api and linux-fsdevel should be cc'd.
>
> You need to describe the semantics that you are exporting to
> userspace. Exactly what does a "journal checkpoint" mean from the
> point of view of user visible metadata and data integrity?
To be clear, my interests are /not/ the same as Leah's here -- I want to
use this "checkpoint" call to provide a way for grub to know that it
will be able to find boot files without needing to recover the log.
For the grub use case, the user-visible behaviors that are required are:
1. All dirty file data in memory are flushed;
2. All committed metadata updates are forced to the ondisk log;
3. All log contents have been written into the filesystem.
(Note confusing terminology here: in my head, #2 means checkpointing the
ondisk log, whereas #3 means checkpointing the filesystem itself; and
"FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just
the log.)
((For Leah's use case, I think you'd add "4. If the DISCARD flag is set,
the journal storage will be zeroed." or something like that...))
> How is it different to running fsync() on a directory, syncfs() or
> freeze on a filesystem, or any of the other methods we already have
> for guaranteeing completed changes are committed to stable storage?
This differs from fsync and syncfs in that it's sufficient to checkpoint
the log (#2), whereas this new call would require also checkpointing the
filesystem (#3).
It's different from FIFREEZE in that it's not necessary to freeze
ongoing modifications from running programs. The guarantees only apply
to changes made before the call.
A second difference is that if we made grub initiate a FIFREEZE/FITHAW
cycle, the FITHAW call will unfreeze the filesystem even if another
racing program had frozen the fs after grub wrote its files.
> All of these methods imply a journal checkpoint of some kind is done
> in ext4, so why do we need a specific ioctl to do this?
For XFS, we don't have any kind of system call that will checkpoint the
fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT
there's no explicit way to force a fs checkpoint in ext4 aside from
contorted insanity with data=journal files and bmap(). Weird things
like NOVA would have to figure out a way to checkpoint the whole fs
(flush all the journals?).
btrfs can probably get away with flushing the disk cache since it has
COW btrees for metadata (fsync log notwithstanding); and I'd imagine
stupid things like FAT would just return EOPNOTSUPP.
To solve my stupid grub problem, this could easily be:
ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS);
Though you'd have to add that new syncfs2 syscall. Probably a good way
to get yourself invited to LSFMM. ;)
Anyway, I'll let Leah lead the secure deletion aspects of this
discussion.
--D
> But, really, if this is for secure deletion, then why isn't
> "fsync(dirfd)" sufficient to force the unlinks into the journal and
> onto stable storage? Why does this functionality need some special
> new CAP_SYS_ADMIN only ioctl to checkpoint the journal when, by
> definition, fsync() should already be doing that? Indeed, users can't
> actually use this new ioctl for secure deletion because it is root
> only, so how do users and their applications actually ensure that
> secure deletion of their files has actually occurred?
>
> I'm also curious as to what "journal checkpoint" means for
> filesystems that don't have journals? Or that have multiple and/or
> partial state journals (e.g. iper-inode journals in NOVA, fsync
> journals in brtfs, etc)? What does it mean for filesystems that use
> COW instead of journals?
>
> > > Also, add journal discard (if discard supported) during journal load
> > > after recovery. This provides a potential solution to
> > > https://lore.kernel.org/linux-ext4/[email protected]/ for
> > > disks that support discard. After a successful journal recovery, e2fsck
> > > can call this ioctl to discard the journal as well.
>
> If you need discard after recovery for secure remove, then you also
> really need discard on every extent being freed, too. Which then
> implies that the -o discard mount option needs to be used in
> conjunction with this functionality. Perhaps, then, journal discard
> at mount should be tied in to the -o discard mount option, and then
> the need for an ioctl to trigger this goes away completely.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
On Wed, Apr 07, 2021 at 02:33:15PM -0700, Eric Biggers wrote:
> On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote:
> > Zero out filename and file type fields when file is deleted.
>
> Why?
Eric is right that we need to have a better explanation in the commit
description.
In answer to Eric's question, the problem that is trying to be solved
here is that if a customer happens to be storing PII in filenames
(e-mail addresses, SSN's, etc.) that they might want to have a
guarantee that if a file is deleted, the filename and the file's
contents can be considered as *gone* after some wipeout time period
has elapsed. So the use case is every N hours, some system daemon
will execute FITRIM and FS_IOC_CHKPT_JRNL with the CHKPT_JRNL_DISCARD
flag set, in order to meet this particular guarantee.
Yes, we can't guarantee that discard will always zap all unused data
blocks in the general case and SSD's may very well have copies made as
a side effect of their GC operations, yadda, yadda. Fortunately, for
this particular use case, the primary concern is for cloud customers
running services on Google Cloud's Persistent Disks.
> Also what about when a dir_entry is moved? Wouldn't you want to make sure that
> any copies don't get left around?
Yes, we need to also make sure that after we do a hash tree leaf node
split in fs/ext4/namei.c:do_split(), that the empty space is also
zero'ed out.
> > @@ -2492,6 +2492,10 @@ int ext4_generic_delete_entry(struct inode *dir,
> > else
> > de->inode = 0;
> > inode_inc_iversion(dir);
> > +
> > + memset(de_del->name, 0, de_del->name_len);
> > + memset(&de_del->file_type, 0, sizeof(__u8));
>
> The second line could be written as simply 'de_del->file_type = 0'.
>
> Also why zero these two fields in particular and not the whole dir_entry?
Agreed, it would be a good diea to clear everything up to rec_len:
memset(de_del->name, 0, de_del->rec_len - 12);
and we should probably zero out de_del->name_len as well. Better to
not leave anything behind.
- Ted
P.S. By the way, this is a guarantee that we're going to eventually
want to care about for XFS as well, since as of COS-85
(Container-Optimized OS), XFS is supported in Preview Mode. This
means that eventually we're going to want submit patches so as to be
able to support the CHKPT_JRNL_DISCARD flag for FS_IOC_CHKPT_JRNL in
XFS as well.
P.P.S. We'll also want to have a mount option which supresses file
names (for example, from ext4_error() messages) from showing up in
kernel logs, to ease potential privacy concerns with respect to serial
console and kernel logs. But that's for another patch set....
On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote:
> > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote:
> > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> > > > With the filename wipeout patch, Ext4 guarantees that all data will be
> > > > discarded on deletion. This ioctl allows for periodically discarding
> > > > journal contents too.
> > >
> > > This needs a documentation update to cover what this new userspace ABI
> > > does, and probably linux-api and linux-fsdevel should be cc'd.
> >
> > You need to describe the semantics that you are exporting to
> > userspace. Exactly what does a "journal checkpoint" mean from the
> > point of view of user visible metadata and data integrity?
>
> To be clear, my interests are /not/ the same as Leah's here -- I want to
> use this "checkpoint" call to provide a way for grub to know that it
> will be able to find boot files without needing to recover the log.
>
> For the grub use case, the user-visible behaviors that are required are:
>
> 1. All dirty file data in memory are flushed;
> 2. All committed metadata updates are forced to the ondisk log;
> 3. All log contents have been written into the filesystem.
>
> (Note confusing terminology here: in my head, #2 means checkpointing the
> ondisk log, whereas #3 means checkpointing the filesystem itself; and
> "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just
> the log.)
So, yeah, you just renamed the ioctl because you are clearly not just
talking about a journal checkpoint. A journal checkpoint is what
XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log
is what we call #3 - basically bringing it to an empty, idle state.
Which makes me even more concerned about defining the behaviour and
semantics needed before we even talk about the API that would be
used.
> > All of these methods imply a journal checkpoint of some kind is done
> > in ext4, so why do we need a specific ioctl to do this?
>
> For XFS, we don't have any kind of system call that will checkpoint the
> fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT
> there's no explicit way to force a fs checkpoint in ext4 aside from
> contorted insanity with data=journal files and bmap(). Weird things
> like NOVA would have to figure out a way to checkpoint the whole fs
> (flush all the journals?).
So, yeah, you're not talking about a journal checkpoint. You're
describing a completely different set of requirements.... :/
> btrfs can probably get away with flushing the disk cache since it has
> COW btrees for metadata (fsync log notwithstanding); and I'd imagine
> stupid things like FAT would just return EOPNOTSUPP.
>
> To solve my stupid grub problem, this could easily be:
>
> ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS);
Sure, but is that the same thing that Leah needs? Checkpoints don't
imply discards (let alone journal discards) in any way, and adding
(optional) discard support for random parts of filesysetms to
syncfs() semantics doesn't seem like a very good fit...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Apr 07, 2021 at 11:48:40PM -0400, Theodore Ts'o wrote:
> On Wed, Apr 07, 2021 at 02:33:15PM -0700, Eric Biggers wrote:
> > On Wed, Apr 07, 2021 at 03:42:01PM +0000, Leah Rumancik wrote:
> > > Zero out filename and file type fields when file is deleted.
> >
> > Why?
>
> Eric is right that we need to have a better explanation in the commit
> description.
>
> In answer to Eric's question, the problem that is trying to be solved
> here is that if a customer happens to be storing PII in filenames
"if"
Is this purely a hypothetical "if", or is it "we have a customer
that actaully does this"? Because if this is just hypothetical, then
future customers should already be advised and know not to store PII
information in clear text *anywhere* in their systems.
> (e-mail addresses, SSN's, etc.) that they might want to have a
> guarantee that if a file is deleted, the filename and the file's
> contents can be considered as *gone* after some wipeout time period
> has elapsed. So the use case is every N hours, some system daemon
> will execute FITRIM and FS_IOC_CHKPT_JRNL with the CHKPT_JRNL_DISCARD
> flag set, in order to meet this particular guarantee.
This seems like a better fit for FITRIM than anything else.
Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field,
so we can't extend that.
But it still makes more sense to me to have something like:
int fstrim(int fd, struct fstrim_range *r, int flags)
syscall where the flags field can indicate that the journal should
be trimmed.
At that point, the "journal checkpoint and flush" is implied by the
fact userspace is asking for the journal to be discarded....
> P.S. By the way, this is a guarantee that we're going to eventually
> want to care about for XFS as well, since as of COS-85
> (Container-Optimized OS), XFS is supported in Preview Mode. This
> means that eventually we're going to want submit patches so as to be
> able to support the CHKPT_JRNL_DISCARD flag for FS_IOC_CHKPT_JRNL in
> XFS as well.
Oh, that won't be fun. XFS places a whiteout over the dirent to
indicate that it has been freed, and it does not actually log
anything other than the 4 byte whiteout at the start of the dirent
and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent.
So zeroing dirents is going to require changing the size and shape
of dirent logging during unlinks...
This will have to be done correclty for all the node merge, split
and compaction cases, too, not just the "remove name" code.
> P.P.S. We'll also want to have a mount option which supresses file
> names (for example, from ext4_error() messages) from showing up in
> kernel logs, to ease potential privacy concerns with respect to serial
> console and kernel logs. But that's for another patch set....
This sounds more and more like "Don't encode PII in clear text
anywhere" is a best practice that should be enforced with a big
hammer. Filenames get everywhere and there's no easy way to prevent
that because path lookups can be done by anyone in the kernel. This
so much sounds like you're starting a game of whack-a-mole that can
never be won.
From a security perspective, this is just bad design. Storing PII in
clear text filenames pretty much guarantees that the PII will leak
because it can't be scrubbed/contained within application controlled
boundaries. Trying to contain the spread of filenames within random
kernel subsystems sounds like a fool's errand to me, especially
given how few kernel developers will even know that filenames are
considered sensitive information from a security perspective...
Fundamentally, applications should *never* place PII in clear text
in potentially leaky environments. The environment for storing PII
should be designed to be secure and free of data leaks from the
ground up. And ext4 has already got this with fscrypt support.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 08, 2021 at 03:21:55PM +1000, Dave Chinner wrote:
> "if"
>
> Is this purely a hypothetical "if", or is it "we have a customer
> that actaully does this"? Because if this is just hypothetical, then
> future customers should already be advised and know not to store PII
> information in clear text *anywhere* in their systems.
Customers might not even know if they are doing this. The concern
they have is that when they are doing "lift and shift", and moving
their workloads from their private data center to the cloud, they
would have no idea what might be lurking in their legacy workloads.
Heck, in a previous job, I visited a major customer who had lost their
sources to a critical bit of their software and they were changing
pathnames by running a hex editor on the binary. Enterprise customers
do the darnedest things... (OTOH, they pay $$$ to our companies :-)
In the ideal world, sure, all or most of them would agree that they
*shouldn't* be storing any kind of PII at rest unencrypted, but they
can't be sure, and so from the perspective of keeping their audit and
I/T compliance committees happy, this requirement is desirable from a
"belt and suspenders" perspective.
> This seems like a better fit for FITRIM than anything else.
>
> Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field,
> so we can't extend that.
I don't have any serious objections to defining FITRIM2; OTOH, for
Darrick's use case of wanting to make XFS work reliably with the GRUB
bootloader (which doesn't replay file system journals) and a certain
Enterprise Linux distribution (cough, cough) can't be bothered to do a
clean shutdown at the end of doing an install, using a hypothetical
FITRIM2 seems... wierd, whereas it might be cleaner to define
semantics in terms of FS_IOC_CHKPT_JRNL. But I don't have a dog in
that particular fight, since I'm not responsible of maintaining either
XFS or RHEL. :-)
Yet another possible solution might be to define a new system call,
syncfs2(), which takes a flag option. That might be a bit more
heavyweight, and we would still have to figure out how to define what
a "journal checkpoint means" from the standpoint of an API definition.
It would presumably be something like "allows a non-kernel
implementation accessing the file system (e.g., from bootloaders like
grub) to be able to access files on the file sytstem as easily as
unmounting the file system", or perhaps defining it in terms of doing
a FIFREEZE/FITHAW, without having to actually freeze the file system.
> Oh, that won't be fun. XFS places a whiteout over the dirent to
> indicate that it has been freed, and it does not actually log
> anything other than the 4 byte whiteout at the start of the dirent
> and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent.
> So zeroing dirents is going to require changing the size and shape
> of dirent logging during unlinks...
So I'm not an expert on XFS, but XFS does logical logging, so what is
in the log is "we're going to white out this dirent", right? So
couldn't the replay code be taught to look at the dirent's reclen, and
zero out the full directory entry at journal replay time? If the
directory entry has already been reused, that's a case which the XFS
replay code has to handle already. Or is there something subtle which
makes this hard to do.
> This will have to be done correclty for all the node merge, split
> and compaction cases, too, not just the "remove name" code.
Agreed this is going to be a lot more complicated for XFS.
> > P.P.S. We'll also want to have a mount option which supresses file
> > names (for example, from ext4_error() messages) from showing up in
> > kernel logs, to ease potential privacy concerns with respect to serial
> > console and kernel logs. But that's for another patch set....
>
> This sounds more and more like "Don't encode PII in clear text
> anywhere" is a best practice that should be enforced with a big
> hammer. Filenames get everywhere and there's no easy way to prevent
> that because path lookups can be done by anyone in the kernel. This
> so much sounds like you're starting a game of whack-a-mole that can
> never be won.
>
> From a security perspective, this is just bad design. Storing PII in
> clear text filenames pretty much guarantees that the PII will leak
> because it can't be scrubbed/contained within application controlled
> boundaries. Trying to contain the spread of filenames within random
> kernel subsystems sounds like a fool's errand to me, especially
> given how few kernel developers will even know that filenames are
> considered sensitive information from a security perspective...
The problem is that the company that the Cloud SRE's work for is
different from the enterprise customer owning the VM. If a customer
stores PII in a filename, and the Cloud SRE places the log in some
place where it could leak out, Google gets blamed, not the enterprise
customer.
So the Cloud SRE's *have* to treat the logs as if they might contain
sensitive information, which means it can't be made available in bug
trackers without a manual (human-drive) scrubbing to make sure the log
doesn't have anything that might appear to contain PII.
By the way, MySQL puts table names into file names, and even though
table names normally aren't PII, it's still considered "customer
data", and we need to treat any kind of customer data super-carefully.
> Fundamentally, applications should *never* place PII in clear text
> in potentially leaky environments. The environment for storing PII
> should be designed to be secure and free of data leaks from the
> ground up. And ext4 has already got this with fscrypt support.....
Cloud disks (no matter which cloud vendor) do tend to be encrypted at
rest, and in the case of Google, we even give customers the option to
Bring Your Own Key, which means when the VM isn't running, no one at
Google has access to the encryption key. But that doesn't change the
fact that if we need to debug a system, the logs might contain file
names, and file names might be customer-owned data. Using encryption
at rest doesn't solve that problem.
- Ted
On Thu, Apr 08, 2021 at 02:43:27PM +1000, Dave Chinner wrote:
> On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> > > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> > > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> > > > > With the filename wipeout patch, Ext4 guarantees that all data will be
> > > > > discarded on deletion. This ioctl allows for periodically discarding
> > > > > journal contents too.
> > > >
> > > > This needs a documentation update to cover what this new userspace ABI
> > > > does, and probably linux-api and linux-fsdevel should be cc'd.
> > >
> > > You need to describe the semantics that you are exporting to
> > > userspace. Exactly what does a "journal checkpoint" mean from the
> > > point of view of user visible metadata and data integrity?
> >
> > To be clear, my interests are /not/ the same as Leah's here -- I want to
> > use this "checkpoint" call to provide a way for grub to know that it
> > will be able to find boot files without needing to recover the log.
> >
> > For the grub use case, the user-visible behaviors that are required are:
> >
> > 1. All dirty file data in memory are flushed;
> > 2. All committed metadata updates are forced to the ondisk log;
> > 3. All log contents have been written into the filesystem.
> >
> > (Note confusing terminology here: in my head, #2 means checkpointing the
> > ondisk log, whereas #3 means checkpointing the filesystem itself; and
> > "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just
> > the log.)
>
> So, yeah, you just renamed the ioctl because you are clearly not just
> talking about a journal checkpoint. A journal checkpoint is what
> XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log
> is what we call #3 - basically bringing it to an empty, idle state.
>
> Which makes me even more concerned about defining the behaviour and
> semantics needed before we even talk about the API that would be
> used.
Ok, let's draft a manpage. Here's my mockup of a manpage for the ioctl,
though again, I don't have a strong preference between this and a
syncfs2 call.
NAME
ioctl_fs_ioc_checkpoint - Commit all filesystem changes to disk
SYNOPSYS
int ioctl(int fd, FS_IOC_CHECKPOINT, __u64 *flags);
DESCRIPTION
Ensure that all previous changes to the filesystem backing the given
file descriptor are persisted to disk in the same form that they would
be if the filesystem were to be unmounted cleanly. Changes made during
or after this call are not required to be persisted.
The motivation of this ioctl are twofold -- first, to provide a way for
application software to prepare a mounted filesystem for future
read-only access by primordial external applications (e.g. bootloaders)
that do not know about crash recovery. The second motivation is to
provide a way to clean out ephemeral areas of the filesystem involved in
crash recovery for security cleaning purposes.
FLAGS
The flags argument should point to a __u64 containing any combination of
the following flags:
FS_IOC_CHECKPOINT_DISCARD_STAGING
Issue a discard command to erase all areas of the filesystem
that are involved in staging and committing updates.
ERRORS
Error codes can be one of, but are not limited to, the following:
EFSCORRUPTED Filesystem corruption was detected.
EINVAL One of the arguments is not valid.
ENOMEM Not enough memory.
ENOSPC Not enough space.
<etc>
>
> > > All of these methods imply a journal checkpoint of some kind is done
> > > in ext4, so why do we need a specific ioctl to do this?
> >
> > For XFS, we don't have any kind of system call that will checkpoint the
> > fs without the unwanted behaviors of FIFREEZE and FITHAW. AFAICT
> > there's no explicit way to force a fs checkpoint in ext4 aside from
> > contorted insanity with data=journal files and bmap(). Weird things
> > like NOVA would have to figure out a way to checkpoint the whole fs
> > (flush all the journals?).
>
> So, yeah, you're not talking about a journal checkpoint. You're
> describing a completely different set of requirements.... :/
Yes, I'm talking about making sure that we've written changes back into
the whole fs, not just the journal.
> > btrfs can probably get away with flushing the disk cache since it has
> > COW btrees for metadata (fsync log notwithstanding); and I'd imagine
> > stupid things like FAT would just return EOPNOTSUPP.
> >
> > To solve my stupid grub problem, this could easily be:
> >
> > ret = syncfs2(fd, SYNCFS_CHECKPOINT_FS);
>
> Sure, but is that the same thing that Leah needs? Checkpoints don't
> imply discards (let alone journal discards) in any way, and adding
> (optional) discard support for random parts of filesysetms to
> syncfs() semantics doesn't seem like a very good fit...
Yes. I think Leah and Ted are more inclined to go with an ioctl
since this is something that's peculiar to journalled filesystems.
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
On Thu, Apr 08, 2021 at 03:25:30PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 08, 2021 at 03:21:55PM +1000, Dave Chinner wrote:
> > "if"
> >
> > Is this purely a hypothetical "if", or is it "we have a customer
> > that actaully does this"? Because if this is just hypothetical, then
> > future customers should already be advised and know not to store PII
> > information in clear text *anywhere* in their systems.
>
> Customers might not even know if they are doing this. The concern
> they have is that when they are doing "lift and shift", and moving
> their workloads from their private data center to the cloud, they
> would have no idea what might be lurking in their legacy workloads.
Customers don't know and don't care so long as they can get away with
it. Two seconds after we grant them write access to the filesystem,
they've written your covid19 vaccination card selfie^W^W^Wcredit card
number to permanent storage.
(Err I mean, they care after the breach goes public and gaslighting the
victims^Wcustomers results in lawsuits that stick, so I guess maybe we
should do the world a solid and try to prevent that...)
> Heck, in a previous job, I visited a major customer who had lost their
> sources to a critical bit of their software and they were changing
> pathnames by running a hex editor on the binary. Enterprise customers
> do the darnedest things... (OTOH, they pay $$$ to our companies :-)
Heh. I never thought of myself as an enterprise customer. Editing
binaries is fun.
> In the ideal world, sure, all or most of them would agree that they
> *shouldn't* be storing any kind of PII at rest unencrypted, but they
> can't be sure, and so from the perspective of keeping their audit and
> I/T compliance committees happy, this requirement is desirable from a
> "belt and suspenders" perspective.
>
> > This seems like a better fit for FITRIM than anything else.
> >
> > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field,
> > so we can't extend that.
>
> I don't have any serious objections to defining FITRIM2; OTOH, for
Er, are we talking about the directory name wiping, or the journal
discarding?
FITRIM is a horrible interface for directory name wiping. I'm assuming
you're talking about the journal discard patch, but it's a little
strange that these are all replies to the patch adding directory name
wipes.
> Darrick's use case of wanting to make XFS work reliably with the GRUB
> bootloader (which doesn't replay file system journals) and a certain
> Enterprise Linux distribution (cough, cough) can't be bothered to do a
To be fair, the certain Enterprise Linux distro actually does
/sbin/reboot properly like you'd expect; it's the clod users that (heh,
I'll leave the typo in) who expect that they can save time by making
their systems unrebootable.
> clean shutdown at the end of doing an install, using a hypothetical
> FITRIM2 seems... wierd, whereas it might be cleaner to define
> semantics in terms of FS_IOC_CHKPT_JRNL. But I don't have a dog in
> that particular fight, since I'm not responsible of maintaining either
> XFS or RHEL. :-)
> Yet another possible solution might be to define a new system call,
> syncfs2(), which takes a flag option. That might be a bit more
> heavyweight, and we would still have to figure out how to define what
> a "journal checkpoint means" from the standpoint of an API definition.
> It would presumably be something like "allows a non-kernel
> implementation accessing the file system (e.g., from bootloaders like
> grub) to be able to access files on the file sytstem as easily as
> unmounting the file system", or perhaps defining it in terms of doing
> a FIFREEZE/FITHAW, without having to actually freeze the file system.
Let's move this part of the discussion over to the other patch, please.
> > Oh, that won't be fun. XFS places a whiteout over the dirent to
> > indicate that it has been freed, and it does not actually log
> > anything other than the 4 byte whiteout at the start of the dirent
> > and the 2 byte XFS_DIR2_DATA_FREE_TAG tag at the end of the dirent.
> > So zeroing dirents is going to require changing the size and shape
> > of dirent logging during unlinks...
>
> So I'm not an expert on XFS, but XFS does logical logging, so what is
> in the log is "we're going to white out this dirent", right? So
> couldn't the replay code be taught to look at the dirent's reclen, and
> zero out the full directory entry at journal replay time? If the
> directory entry has already been reused, that's a case which the XFS
> replay code has to handle already. Or is there something subtle which
> makes this hard to do.
>
> > This will have to be done correclty for all the node merge, split
> > and compaction cases, too, not just the "remove name" code.
>
> Agreed this is going to be a lot more complicated for XFS.
I didn't think it was any more difficult than changing xfs_removename to
zero out the name and ftype fields at the same time it adds the whiteout
to the dirent. But TBH I haven't thought through this too deeply.
I /do/ think that if you ever want to add "secure" deletion to XFS, I'd
want to do it by implementing FS_SECRM_FL for XFS, and not by adding
more mount options.
> > > P.P.S. We'll also want to have a mount option which supresses file
> > > names (for example, from ext4_error() messages) from showing up in
> > > kernel logs, to ease potential privacy concerns with respect to serial
> > > console and kernel logs. But that's for another patch set....
> >
> > This sounds more and more like "Don't encode PII in clear text
> > anywhere" is a best practice that should be enforced with a big
> > hammer.
How? Encrypting the entire block device? Implementing fscrypt for all
filesystems? Sending Daves to companies everywhere to heckle them for
their poor sekuritee practices? ;)
> > Filenames get everywhere and there's no easy way to prevent
> > that because path lookups can be done by anyone in the kernel. This
> > so much sounds like you're starting a game of whack-a-mole that can
> > never be won.
> >
> > From a security perspective, this is just bad design. Storing PII in
> > clear text filenames pretty much guarantees that the PII will leak
> > because it can't be scrubbed/contained within application controlled
> > boundaries. Trying to contain the spread of filenames within random
> > kernel subsystems sounds like a fool's errand to me, especially
> > given how few kernel developers will even know that filenames are
> > considered sensitive information from a security perspective...
>
> The problem is that the company that the Cloud SRE's work for is
> different from the enterprise customer owning the VM. If a customer
> stores PII in a filename, and the Cloud SRE places the log in some
> place where it could leak out, Google gets blamed, not the enterprise
> customer.
>
> So the Cloud SRE's *have* to treat the logs as if they might contain
> sensitive information, which means it can't be made available in bug
> trackers without a manual (human-drive) scrubbing to make sure the log
> doesn't have anything that might appear to contain PII.
Question -- does e2image have the ability to obscure names like
xfs_metadump will do if you don't pass it "-o" ?
> By the way, MySQL puts table names into file names, and even though
> table names normally aren't PII, it's still considered "customer
> data", and we need to treat any kind of customer data super-carefully.
<shrug> I don't have any objection to logging the directory inode number
and the offset of the dirent. Replacing "Error in path /AUTOEXEC.BAT"
with "Error in directory 0x1515 offset 23568" solves the PII-in-logfiles
problem, right? And still leaves us enough of a breadcrumb that we can
figure out where fstests went wrong.
> > Fundamentally, applications should *never* place PII in clear text
> > in potentially leaky environments. The environment for storing PII
> > should be designed to be secure and free of data leaks from the
> > ground up. And ext4 has already got this with fscrypt support.....
>
> Cloud disks (no matter which cloud vendor) do tend to be encrypted at
> rest, and in the case of Google, we even give customers the option to
> Bring Your Own Key, which means when the VM isn't running, no one at
> Google has access to the encryption key. But that doesn't change the
> fact that if we need to debug a system, the logs might contain file
> names, and file names might be customer-owned data. Using encryption
> at rest doesn't solve that problem.
<nod>
--D
>
> - Ted
On Thu, Apr 08, 2021 at 05:02:07PM -0700, Darrick J. Wong wrote:
> > In the ideal world, sure, all or most of them would agree that they
> > *shouldn't* be storing any kind of PII at rest unencrypted, but they
> > can't be sure, and so from the perspective of keeping their audit and
> > I/T compliance committees happy, this requirement is desirable from a
> > "belt and suspenders" perspective.
> >
> > > This seems like a better fit for FITRIM than anything else.
> > >
> > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field,
> > > so we can't extend that.
> >
> > I don't have any serious objections to defining FITRIM2; OTOH, for
>
> Er, are we talking about the directory name wiping, or the journal
> discarding?
Sorry, I was talking about journal wiping. The conflation is because
the reason why we want to wipe the journal is because of the directory
names in the journal, so the two are very much connected for our use
case, but yes, directory names in directories is very from directory
names in the journal.
We don't actually need any kind of interface for wiping names in
directories, since it doesn't cost us anything to unconditionally wipe
the directory entries as opposed to just setting the inode number to
zero.
> I didn't think it was any more difficult than changing xfs_removename to
> zero out the name and ftype fields at the same time it adds the whiteout
> to the dirent. But TBH I haven't thought through this too deeply.
>
> I /do/ think that if you ever want to add "secure" deletion to XFS, I'd
> want to do it by implementing FS_SECRM_FL for XFS, and not by adding
> more mount options.
The original meaning of FS_SECRM_FL was that the data blocks would be
zero'ed --- when the inode was deleted. We don't intend to have a
mount option for ext4 for zero'ing the directory entry, since it
really doesn't cost us anything to memset the directory entry to zero
at unlink time. I guess for a DAX file system, zero'ing the directory
entry might cost a an extra cache line write, but for block-oriented
devices, for us it's essentially cost-free --- so why add an extra
mount option, and instead just zero the directory entry of everything
other than rec_len?
> Question -- does e2image have the ability to obscure names like
> xfs_metadump will do if you don't pass it "-o" ?
Yes, e2image has had the -s option to scramble file names since
E2fsprogs 1.36 (February, 2005).
- Ted
On Thu, Apr 08, 2021 at 04:49:09PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 08, 2021 at 02:43:27PM +1000, Dave Chinner wrote:
> > On Wed, Apr 07, 2021 at 06:26:51PM -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 08, 2021 at 07:15:00AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 07, 2021 at 11:35:47AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Apr 07, 2021 at 03:42:02PM +0000, Leah Rumancik wrote:
> > > > > > ioctl FS_IOC_CHKPT_JRNL checkpoints and flushes the journal. With the
> > > > > > CHKPT_JRNL_DISCARD flag set, the journal blocks are also discarded.
> > > > > > With the filename wipeout patch, Ext4 guarantees that all data will be
> > > > > > discarded on deletion. This ioctl allows for periodically discarding
> > > > > > journal contents too.
> > > > >
> > > > > This needs a documentation update to cover what this new userspace ABI
> > > > > does, and probably linux-api and linux-fsdevel should be cc'd.
> > > >
> > > > You need to describe the semantics that you are exporting to
> > > > userspace. Exactly what does a "journal checkpoint" mean from the
> > > > point of view of user visible metadata and data integrity?
> > >
> > > To be clear, my interests are /not/ the same as Leah's here -- I want to
> > > use this "checkpoint" call to provide a way for grub to know that it
> > > will be able to find boot files without needing to recover the log.
> > >
> > > For the grub use case, the user-visible behaviors that are required are:
> > >
> > > 1. All dirty file data in memory are flushed;
> > > 2. All committed metadata updates are forced to the ondisk log;
> > > 3. All log contents have been written into the filesystem.
> > >
> > > (Note confusing terminology here: in my head, #2 means checkpointing the
> > > ondisk log, whereas #3 means checkpointing the filesystem itself; and
> > > "FS_IOC_CHECKPOINT" means checkpointing the whole filesystem, not just
> > > the log.)
> >
> > So, yeah, you just renamed the ioctl because you are clearly not just
> > talking about a journal checkpoint. A journal checkpoint is what
> > XFS does when it pushes the CIL to disk (i.e. #2). Quiescing the log
> > is what we call #3 - basically bringing it to an empty, idle state.
> >
> > Which makes me even more concerned about defining the behaviour and
> > semantics needed before we even talk about the API that would be
> > used.
>
> Ok, let's draft a manpage. Here's my mockup of a manpage for the ioctl,
> though again, I don't have a strong preference between this and a
> syncfs2 call.
I'd much prefer a syscall as we already have sync() and syncfs()
syscalls to do very similar things.
> NAME
>
> ioctl_fs_ioc_checkpoint - Commit all filesystem changes to disk
>
> SYNOPSYS
>
> int ioctl(int fd, FS_IOC_CHECKPOINT, __u64 *flags);
>
> DESCRIPTION
>
> Ensure that all previous changes to the filesystem backing the given
> file descriptor are persisted to disk in the same form that they would
> be if the filesystem were to be unmounted cleanly. Changes made during
> or after this call are not required to be persisted.
What does "unmounted cleanly" actually mean? An application writer
has no idea what this might mean for their application. Also, what
happens with a filesystem that "cleanly unmounts" but still requires
work to be done on/after the next mount? e.g. per-inode journals
might only get replayed when the inode is next referenced, not when
the filesystem mounts.
IMO, if this is going to force a fully consistent on-disk structure,
it needs to be described as providing similar guarantees as a
fsfreeze(8), but only for modifications completed before the
checkpoint is issued. i.e. a read-only mount on a read-only block
device will see all the changes completed before the checkpoint was
taken.
> The motivation of this ioctl are twofold -- first, to provide a way for
> application software to prepare a mounted filesystem for future
> read-only access by primordial external applications (e.g. bootloaders)
> that do not know about crash recovery.
Yup, moving on-disk state to an external read-only access compatible
state is the requirement, not "cleanly unmounted".
/me wonders how we plan to prevent modifications to files and
metadata writeback during the checkpointing process from resulting
in inconsistent read-only state on disk.
e.g. concurrent directory operations that are partialy written back
while the checkpoint is doing it's magic writing back active
metadata in the journal might result in directories being in
inconsistent state on disk, even though all the modifications that
happened prior to the checkpoint were captured. It's problems like
these that make me ask how the "read-only access" guarantee is going
to work if we haven't actually frozen the filesystem to prevent
concurrent modifications leaking into the on-disk state and making
it inconsistent....
> The second motivation is to
> provide a way to clean out ephemeral areas of the filesystem involved in
> crash recovery for security cleaning purposes.
>
> FLAGS
>
> The flags argument should point to a __u64 containing any combination of
> the following flags:
>
> FS_IOC_CHECKPOINT_DISCARD_STAGING
> Issue a discard command to erase all areas of the filesystem
> that are involved in staging and committing updates.
I don't think this should be conflated with a filesystem checkpoint.
It is, fundamentally, a completely different operation, and one that
might be exceedingly complex and slow to implement. e.g. does this
definition imply deleting and discarding all the previous versions
of now unreferenced metadata in a COW filesytem? iWhat does it imply
for log structured filesystems? What about filesystems that can
discard journal/staging areas without checkpointing?
Sure, discarding the journal might require a full journal checkpoint
for filesystems like ext4, but I can see that we don't actually need
to implement "read-only" guarantees for XFS to implement a journal
discard.
For XFS, we'd just need to force the log, push AIL to the required
target LSN, force the log again to update the on-disk tail, take the
CIL checkpoint write lock to prevent the head moving forward, and
now discard all the unused journal between the head and tail. IOWs,
there is -no requirement- for the filesystem to be in a read-only
access compatible state to discard the unused parts of the
journal - we only need to hold off journal writes for a short period
of time to do the discards....
Hence I think "journal discard" needs to be a separate
syscall/ioctl, not get conflated with a filesystem checkpoint
operation.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 08, 2021 at 10:51:09PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 08, 2021 at 05:02:07PM -0700, Darrick J. Wong wrote:
> > > In the ideal world, sure, all or most of them would agree that they
> > > *shouldn't* be storing any kind of PII at rest unencrypted, but they
> > > can't be sure, and so from the perspective of keeping their audit and
> > > I/T compliance committees happy, this requirement is desirable from a
> > > "belt and suspenders" perspective.
> > >
> > > > This seems like a better fit for FITRIM than anything else.
> > > >
> > > > Ooohh. We sure do suck at APIs, don't we? FITRIM has no flags field,
> > > > so we can't extend that.
> > >
> > > I don't have any serious objections to defining FITRIM2; OTOH, for
> >
> > Er, are we talking about the directory name wiping, or the journal
> > discarding?
>
> Sorry, I was talking about journal wiping. The conflation is because
> the reason why we want to wipe the journal is because of the directory
> names in the journal, so the two are very much connected for our use
> case, but yes, directory names in directories is very from directory
> names in the journal.
>
> We don't actually need any kind of interface for wiping names in
> directories, since it doesn't cost us anything to unconditionally wipe
> the directory entries as opposed to just setting the inode number to
> zero.
>
> > I didn't think it was any more difficult than changing xfs_removename to
> > zero out the name and ftype fields at the same time it adds the whiteout
> > to the dirent. But TBH I haven't thought through this too deeply.
> >
> > I /do/ think that if you ever want to add "secure" deletion to XFS, I'd
> > want to do it by implementing FS_SECRM_FL for XFS, and not by adding
> > more mount options.
>
> The original meaning of FS_SECRM_FL was that the data blocks would be
> zero'ed --- when the inode was deleted.
Sure, if discard is Good Enough(tm) for the journal, then we just
treat this flag like "-o discard" was enabled for this file. Let the
hardware do the "zeroing" in the background once we mark the extent
as free. And if the hardware supports secure erase in place of
discard, then even better.
In the case of XFS, if we are to implement this directory entry
zeroing then we actually need to discard the directory blocks. We
may elide writeback of the directory block altogether if it is
removed from the directory entirely between journal checkpoints. In
that situation, we just write a whiteout for the block to the
journal (we cancel the buffer) and we never actually write that
buffer's contents to disk as it has been marked free by the journal
commit.
And, similarly short form directories aren't in blocks and can't be
discarded, and we can elide inode writeback in the case where the
inode clusters are freed. Hence zeroing dirents held inline in the
inodes are not guaranteed to hit the disk, either. So we'd need to
discard inode clusters as well.
IOWs, we can do "rm -rf" of a directory with tens of thousands of
entries, and the only thing that ends up hitting stable storage
is a few hundred buffer invalidations in the journal. They remain
unmodified in free space after the journal commit.
This is why I said "good luck" to fixing XFS not to leak directory
entries to disk. It's a pretty major undertaking to audit, fix and
verify all the paths that remove directory entries to ensure that we
do not leak dirent names anywhere.
And I haven't even touched on PII in extended attributes :/
Cheers,
Dave.
--
Dave Chinner
[email protected]