Carlo Wood has demonstrated that it's possible to recover deleted
files from the journal. Something that will make this easier is if we
can put the time of the commit into the commit block.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/commit.c | 3 +++
include/linux/jbd2.h | 2 ++
2 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index e013978..7c5fbd5 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -112,6 +112,7 @@ static int journal_submit_commit_record(journal_t *journal,
struct buffer_head *bh;
int ret;
int barrier_done = 0;
+ struct timespec now = current_kernel_time();
if (is_journal_aborted(journal))
return 0;
@@ -126,6 +127,8 @@ static int journal_submit_commit_record(journal_t *journal,
tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+ tmp->h_commit_sec = cpu_to_be32(now->tv_sec);
+ tmp->h_commit_nsec = cpu_to_be32(now->tv_nsec);
if (JBD2_HAS_COMPAT_FEATURE(journal,
JBD2_FEATURE_COMPAT_CHECKSUM)) {
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2cbf6fd..7cbeb2b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -170,6 +170,8 @@ struct commit_header {
unsigned char h_chksum_size;
unsigned char h_padding[2];
__be32 h_chksum[JBD2_CHECKSUM_BYTES];
+ __be32 h_commit_sec;
+ __be32 h_commit_nsec;
};
/*
--
1.5.4.1.144.gdfee-dirty
On Mar 15, 2008 20:59 -0400, Theodore Ts'o wrote:
> Carlo Wood has demonstrated that it's possible to recover deleted
> files from the journal. Something that will make this easier is if we
> can put the time of the commit into the commit block.
>
> @@ -170,6 +170,8 @@ struct commit_header {
> unsigned char h_chksum_size;
> unsigned char h_padding[2];
> __be32 h_chksum[JBD2_CHECKSUM_BYTES];
> + __be32 h_commit_sec;
> + __be32 h_commit_nsec;
> };
We should probably use a 64-bit seconds field, after we just told
someone on #ext4 that it would work until at least 2242 :-).
struct commit_header {
unsigned char h_chksum_size;
unsigned char h_padding[2];
__be32 h_chksum[JBD2_CHECKSUM_BYTES];
__be64 h_commit_sec;
__be32 h_commit_nsec;
};
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Mar 15, 2008 20:59 -0400, Theodore Ts'o wrote:
> Carlo Wood has demonstrated that it's possible to recover deleted
> files from the journal. Something that will make this easier is if we
> can put the time of the commit into the commit block.
Note that we'd still be a lot further ahead undelete- and performance-wise
if we avoided overwriting the indirect blocks in the first place... As
it is, this is only really useful if you pull the plug after the delete.
No harm in doing it, but won't help you recover as much as you could.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sun, Mar 16, 2008 at 09:26:02AM +0800, Andreas Dilger wrote:
> On Mar 15, 2008 20:59 -0400, Theodore Ts'o wrote:
> > Carlo Wood has demonstrated that it's possible to recover deleted
> > files from the journal. Something that will make this easier is if we
> > can put the time of the commit into the commit block.
>
> Note that we'd still be a lot further ahead undelete- and performance-wise
> if we avoided overwriting the indirect blocks in the first place... As
> it is, this is only really useful if you pull the plug after the delete.
> No harm in doing it, but won't help you recover as much as you could.
Yeah, I looked at that at one point, but I never had time to try to
code it up. The concept is that we only need to zero out the
block pointers if we end up dirtying enough bitmap blocks that we've
run out of space in the journal and so we need to close the
transaction. Of course, the problem is that we need to either (a)
figure out in advance exactly how many bitmap blocks we need to dirty
(which means we have to read all the indirect blocks twice to figure
it out for ext3; this is easier for ext4) so we know whether it will
fit in one transaction, or (b) if we try to do it in a single pass, we
need to allow enough safety margin so that when we *do* decide we
can't make it fit, we still do have enough space in the journal to
zero out the blocks in the indirect blocks and in the inode.
I guess the third alternative, (c), is that we don't update *any* of
the superblock or block group descriptors until the very end of the
transaction, and don't update any of the blocks. So we just update
the bitmap blocks first, and then in a second pass update all of the
blockgroup descriptors and superblock. This would require assuring
that the update of all of the block group descriptors, superblock, and
removing the inode from the orphan linked list, can all fit in a
single transaction. If not, this scheme wouldn't work at all.
(a) is probably the simplest, but it's fundamentally a two pass
algorithm.
- Ted
On Mar 15, 2008 23:10 -0400, Theodore Ts'o wrote:
> On Sun, Mar 16, 2008 at 09:26:02AM +0800, Andreas Dilger wrote:
> > Note that we'd still be a lot further ahead undelete- and performance-wise
> > if we avoided overwriting the indirect blocks in the first place... As
> > it is, this is only really useful if you pull the plug after the delete.
> > No harm in doing it, but won't help you recover as much as you could.
>
> Yeah, I looked at that at one point, but I never had time to try to
> code it up. The concept is that we only need to zero out the
> block pointers if we end up dirtying enough bitmap blocks that we've
> run out of space in the journal and so we need to close the
> transaction. Of course, the problem is that we need to either (a)
> figure out in advance exactly how many bitmap blocks we need to dirty
> (which means we have to read all the indirect blocks twice to figure
> it out for ext3; this is easier for ext4) so we know whether it will
> fit in one transaction,
While it's true it is a two-pass algorithm, I think it can actually
improve overall performance. One major win is that we don't have
to write out indirect blocks, saving about 32/33 (IIRC) of the IO
needed for the current truncate. The second win is that we can do
async prefetching of all the (d)indirect blocks from the {d,t}indirect
blocks in forward order instead of the current block-at-a-time reads.
Finally, on the second pass the blocks will normally be in RAM already,
so it is not nearly so slow.
> (b) if we try to do it in a single pass, we
> need to allow enough safety margin so that when we *do* decide we
> can't make it fit, we still do have enough space in the journal to
> zero out the blocks in the indirect blocks and in the inode.
We'd still have to truncate from the end in this case...
> I guess the third alternative, (c), is that we don't update *any* of
> the superblock or block group descriptors until the very end of the
> transaction, and don't update any of the blocks. So we just update
> the bitmap blocks first, and then in a second pass update all of the
> blockgroup descriptors and superblock. This would require assuring
> that the update of all of the block group descriptors, superblock, and
> removing the inode from the orphan linked list, can all fit in a
> single transaction. If not, this scheme wouldn't work at all.
I'm not sure I understand this. Wouldn't this possibly lead to those
blocks being re-allocated after a crash?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sun, Mar 16, 2008 at 11:16:17PM +0800, Andreas Dilger wrote:
> > I guess the third alternative, (c), is that we don't update *any* of
> > the superblock or block group descriptors until the very end of the
> > transaction, and don't update any of the blocks. So we just update
> > the bitmap blocks first, and then in a second pass update all of the
> > blockgroup descriptors and superblock. This would require assuring
> > that the update of all of the block group descriptors, superblock, and
> > removing the inode from the orphan linked list, can all fit in a
> > single transaction. If not, this scheme wouldn't work at all.
>
> I'm not sure I understand this. Wouldn't this possibly lead to those
> blocks being re-allocated after a crash?
No, because the inode is on the orphan/truncate list, which would get
processed as part of mounting the filesystem. So we might end up
replaying some of the updates to the bitmaps, and clearing blocks that
are already cleared; but that's OK, because clearing the bitmap
allocations is an idempotent operation. Incrementing the free blocks
count in the superblock and bitmap allocation blocks is *not*
idempotent, which means that they (along with removing the inode from
the orphaned inode list) all have to be done within a single atomic
transaction.
- Ted
On Sat, 2008-03-15 at 20:59 -0400, Theodore Ts'o wrote:
> Carlo Wood has demonstrated that it's possible to recover deleted
> files from the journal. Something that will make this easier is if we
> can put the time of the commit into the commit block.
>
Sounds good. Added to the patch queue after fixing the compile error.
Shouldn't this be a JBD2 INCOMPAT feature?
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/jbd2/commit.c | 3 +++
> include/linux/jbd2.h | 2 ++
> 2 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index e013978..7c5fbd5 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -112,6 +112,7 @@ static int journal_submit_commit_record(journal_t *journal,
> struct buffer_head *bh;
> int ret;
> int barrier_done = 0;
> + struct timespec now = current_kernel_time();
>
> if (is_journal_aborted(journal))
> return 0;
> @@ -126,6 +127,8 @@ static int journal_submit_commit_record(journal_t *journal,
> tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> + tmp->h_commit_sec = cpu_to_be32(now->tv_sec);
> + tmp->h_commit_nsec = cpu_to_be32(now->tv_nsec);
~~~~~~~~~~~~~~
Should be now.tv_sec.
Attached is the updated patch.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/commit.c | 3 +++
include/linux/jbd2.h | 2 ++
2 files changed, 5 insertions(+)
Index: linux-2.6.25-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.25-rc6.orig/fs/jbd2/commit.c 2008-03-25 11:36:37.000000000 -0700
+++ linux-2.6.25-rc6/fs/jbd2/commit.c 2008-03-25 11:50:08.000000000 -0700
@@ -112,6 +112,7 @@ static int journal_submit_commit_record(
struct buffer_head *bh;
int ret;
int barrier_done = 0;
+ struct timespec now = current_kernel_time();
if (is_journal_aborted(journal))
return 0;
@@ -126,6 +127,8 @@ static int journal_submit_commit_record(
tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+ tmp->h_commit_sec = cpu_to_be32(now.tv_sec);
+ tmp->h_commit_nsec = cpu_to_be32(now.tv_nsec);
if (JBD2_HAS_COMPAT_FEATURE(journal,
JBD2_FEATURE_COMPAT_CHECKSUM)) {
Index: linux-2.6.25-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.25-rc6.orig/include/linux/jbd2.h 2008-03-16 16:32:14.000000000 -0700
+++ linux-2.6.25-rc6/include/linux/jbd2.h 2008-03-25 11:46:46.000000000 -0700
@@ -170,6 +170,8 @@ struct commit_header {
unsigned char h_chksum_size;
unsigned char h_padding[2];
__be32 h_chksum[JBD2_CHECKSUM_BYTES];
+ __be32 h_commit_sec;
+ __be32 h_commit_nsec;
};
/*
On Tue, Mar 25, 2008 at 11:57:49AM -0700, Mingming Cao wrote:
> On Sat, 2008-03-15 at 20:59 -0400, Theodore Ts'o wrote:
> > Carlo Wood has demonstrated that it's possible to recover deleted
> > files from the journal. Something that will make this easier is if we
> > can put the time of the commit into the commit block.
> >
> Sounds good. Added to the patch queue after fix the compile error.
>
> Shouldn't this be a JBD2 INCOMPAT feature?
No, because we're just writing some extra data at the end of the
commit record. There's no incompatibility implied, since both the
kernel and e2fsck will ignore the extra information.
This is the same reasoning used for why we can add new superblock
fields without needing new compatibility flags, as long as a default
value of 0 does something sane.
- Ted