2010-08-18 14:04:30

by Andre Noll

Subject: Memory allocation failed, e2fsck: aborted

Hi

I'm having trouble checking a corrupt ext3 file system resulting
from a 3-disk failure on a 12 disk software raid6 array. The disk
failure was due to an issue with the (3ware) controller and 11 disks
appear to be fine. However, I had to --assemble --force the array
because two of the 11 disks were not up to date after the crash.

e2fsck from today's git master branch aborts after a while with

./e2fsck -f -y -C 0 /dev/euclidean/snap_abt1_kristall_RockMakerStorage
e2fsck 1.41.12 (17-May-2010)
Backing up journal inode block information.

Pass 1: Checking inodes, blocks, and sizes
Error storing inode count information (inode=245859898, count=2): Memory allocation failed
e2fsck: aborted

This is an old 32-bit system with only 1G of RAM and a 2.6.24 distro
kernel. I added _lots_ of swap but this did not help.

Any hints on how to proceed?

Since the file system is corrupt anyway, it may be easiest
to delete inode 245859898 with debugfs, but maybe there is
a better option. Moreover, since this might be some kind of
e2fsck-trusts-corrupt-data issue, you might be interested in looking
at this.

Further info: The ext3 file system lives on a lv within a vg whose
single pv is the 12 disk raid6 array. The file system stores hard
link based backups, so it contains _lots_ of hard links.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-18 20:20:15

by Andreas Dilger

Subject: Re: Memory allocation failed, e2fsck: aborted

On 2010-08-18, at 08:04, Andre Noll wrote:
> I'm having trouble checking a corrupt ext3 file system resulting
> from a 3-disk failure on a 12 disk software raid6 array. The disk
> failure was due to an issue with the (3ware) controller and 11 disks
> appear to be fine. However, I had to --assemble --force the array
> because two of the 11 disks were not up to date after the crash.
>
> e2fsck from today's git master branch aborts after a while with
>
> ./e2fsck -f -y -C 0 /dev/euclidean/snap_abt1_kristall_RockMakerStorage
> e2fsck 1.41.12 (17-May-2010)
> Backing up journal inode block information.
>
> Pass 1: Checking inodes, blocks, and sizes
> Error storing inode count information (inode=245859898, count=2): Memory allocation failed
> e2fsck: aborted
>
> This is an old 32-bit system with only 1G of RAM and a 2.6.24 distro
> kernel. I added _lots_ of swap but this did not help.

Yeah, it is possible to have filesystems that are too large for the node they are running on. There have been discussions about how to reduce the memory usage of e2fsck, but implementing them has never been a priority.

> Since the file system is corrupt anyway, it may be easiest
> to delete inode 245859898 with debugfs, but maybe there is
> a better option. Moreover, since this might be some kind of
> e2fsck-trusts-corrupt-data issue, you might be interested in looking
> at this.

No, I don't think this will help. The problem is not with that inode; it is just that e2fsck needs to allocate a structure because of nlinks=2 (which is normal).

In theory it might be possible to avoid allocating icount structures for every directory inode (which have icount == 2 normally), if we used the "fs->inode_dir_map" bit as "+1" for the inode link count.
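
To make this concrete, here is a minimal sketch of the idea
(hypothetical code, nothing that exists in e2fsprogs today): the stored
count for a directory would be biased down by one, with the missing
reference recovered from fs->inode_dir_map on lookup.

/*
 * Hypothetical sketch only: treat the "." self-reference of a
 * directory as an implicit "+1" taken from fs->inode_dir_map, so a
 * leaf directory with nlink == 2 behaves like a count of 1 and needs
 * no icount entry at all.
 */
static __u16 effective_count(ext2_filsys fs, ext2_ino_t ino, __u16 stored)
{
	if (fs->inode_dir_map &&
	    ext2fs_test_inode_bitmap2(fs->inode_dir_map, ino))
		return stored + 1;	/* implicit "." reference */
	return stored;
}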

In any case, this is a non-trivial fix.

> Further info: The ext3 file system lives on a lv within a vg whose
> single pv is the 12 disk raid6 array. The file system stores hard
> link based backups, so it contains _lots_ of hard links.

Ah, that is also a major user of memory, and not necessarily one that optimizing the internal bitmap will help significantly. It may well be that your swap cannot be used if a single allocation is in the same neighbourhood as the total RAM size.

Every file with nlink > 1 will need an additional 8 bytes of data, and the insert_icount_el() function reallocates this structure every 100 elements, so it can use AT MOST 1/2 of all memory before the new copy and the old one fill all available memory.
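
For illustration, the growth pattern described above amounts to the
following (a sketch, not the actual e2fsprogs code):

#include <stdlib.h>

/*
 * Illustration only: growing a flat array in place forces realloc()
 * to hold the old and the new copy at the same time while it copies,
 * so the peak footprint is almost twice the size of the list itself.
 */
static void *grow_list_by_100(void *list, size_t nelem, size_t elsize)
{
	return realloc(list, (nelem + 100) * elsize);
}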

It would probably make sense to modify the internal icount structure to hold a 2-level tree of arrays of e.g. 8kB chunks, or other advanced data structure so that it doesn't force reallocation and average .51 memory copies of the WHOLE LIST on every insert. This is probably doable with some light understanding of e2fsprogs, since the icount interface is well encapsulated, but it isn't something I have time for now.

If you are interested in hacking on/improving e2fsprogs I'd be willing to guide you, but if not I'd just suggest connecting this array to another node to run e2fsck, and considering spending the $200 needed to get a 64-bit system with a few GB of RAM.

Cheers, Andreas

2010-08-19 00:54:08

by Theodore Ts'o

Subject: Re: Memory allocation failed, e2fsck: aborted

On Wed, Aug 18, 2010 at 02:20:13PM -0600, Andreas Dilger wrote:
>
> Ah, that is also a major user of memory, and not necessarily one
> that optimizing the internal bitmap will help significantly. It may
> well be that your swap cannot be used if a single allocation is in
> the same neighbourhood as the total RAM size.

Something which *might* help (but will take a long time) is to add to
your /etc/e2fsck.conf (if you have one; if not, create one with these
contents):

[scratch_files]
directory = /var/cache/fsck

(And then make sure /var/cache/fsck exists.)

Unfortunately, as it turns out, tdb (from Samba) doesn't scale as well
as I would have liked, so it's on my todo list to replace it with
something else. The problem is that berk_db has non-standard
interfaces which vary from version to version, so I've avoided using
it up until now.

But try it, and see if this will help.... it may still be way too
slow.

A related blog entry:

http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/

- Ted

P.S. I recently was told about a new backup system that meets the
requirements detailed in my post:

http://sites.google.com/site/hashbackup/home/features

I haven't tried it yet, but it looks interesting. Let me know if you
do try it and what you think of it.

2010-08-19 13:01:36

by Andre Noll

Subject: Re: Memory allocation failed, e2fsck: aborted

On Wed, Aug 18, 14:20, Andreas Dilger wrote:
> > This is an old 32 bit system with only 1G of ram and a 2.6.24 distro
> > kernel. I added _lots_ of swap but this did not help.
>
> Yeah, it is possible to have filesystems that are too large for the
> node they are running on. There have been discussions about how to
> reduce the memory usage of e2fsck, but implementing them has never
> been a priority.

But e2fsck runs entirely in user space, so all memory should be
swappable, no? I think the system did not swap at all.

> > Since the file system is corrupt anyway, it may be easiest
> > to delete inode 245859898 with debugfs, but maybe there is
> > a better option. Moreover, since this might be some kind of
> > e2fsck-trusts-corrupt-data issue, you might be interested in looking
> > at this.
>
> No, I don't think this will help. The problem is not with that inode;
> it is just that e2fsck needs to allocate a structure because of
> nlinks=2 (which is normal).
>
> In theory it might be possible to avoid allocating icount structures
> for every directory inode (which have icount == 2 normally), if we
> used the "fs->inode_dir_map" bit as "+1" for the inode link count.
>
> In any case, this is a non-trivial fix.

I'm not sure I follow. Are you saying we currently allocate two
struct ext2_icount for a directory inode even if . and .. are the
only two references? So we could just omit this allocation in the
common icount == 2 case because we know it is a directory inode
(so we have one additional reference) if fs->inode_dir_map is not NULL.

> > Further info: The ext3 file system lives on a lv within a vg whose
> > single pv is the 12 disk raid6 array. The file system stores hard
> > link based backups, so it contains _lots_ of hard links.
>
> Ah, that is also a major user of memory, and not necessarily one that
> optimizing the internal bitmap will help significantly. It may well
> be that your swap cannot be used if a single allocation is in the same
> neighbourhood as the total RAM size.

Is playing with the memory overcommit knobs likely going to help?

> Every file with nlink > 1 will need an additional 8 bytes of data, and
> the insert_icount_el() function reallocates this structure every 100
> elements, so it can use AT MOST 1/2 of all memory before the new copy
> and the old one fill all available memory.
>
> It would probably make sense to modify the internal icount structure
> to hold a 2-level tree of arrays of e.g. 8kB chunks, or other advanced
> data structure so that it doesn't force reallocation and average .51
> memory copies of the WHOLE LIST on every insert. This is probably
> doable with some light understanding of e2fsprogs, since the icount
> interface is well encapsulated, but it isn't something I have time for
> now.

I'm interested in having a look at the icount structure and seeing what
can be done to reduce memory usage. Here's a first question: There is
ext2fs_create_icount() and ext2fs_create_icount_tdb(). Is it correct
that they do the same thing, the only difference being that the tdb
variant uses an on-disk database while ext2fs_create_icount stores
everything in memory?

If so we might want to discuss first whether it is more important
to improve the performance of the on-disk database or to decrease
the memory requirements of the in-memory variant. The answer likely
depends on the amounts of disk space and RAM a typical system will
have 5 or 10 years from now.

> If you are interested in hacking on/improving e2fsprogs I'd be willing
> to guide you, but if not I'd just suggest connecting this array to
> another node to run e2fsck, and considering spending the $200 needed
> to get a 64-bit system with a few GB of RAM.

Yeah right. There is already a 64-bit system waiting to replace the
old one. Moving the old disks would be a PITA though because of the
lack of hot swap bays...

Thanks for your help
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-19 13:10:32

by Andre Noll

Subject: Re: Memory allocation failed, e2fsck: aborted

On Wed, Aug 18, 20:54, Ted Ts'o wrote:
> On Wed, Aug 18, 2010 at 02:20:13PM -0600, Andreas Dilger wrote:
> >
> > Ah, that is also a major user of memory, and not necessarily one
> > that optimizing the internal bitmap will help significantly. It may
> > well be that your swap cannot be used if a single allocation is in
> > the same neighbourhood as the total RAM size.
>
> Something which *might* help (but will take a long time) is to add to
> your /etc/e2fsck.conf (if you have one; if not, create one with these
> contents):
>
> [scratch_files]
> directory = /var/cache/fsck
>
> (And then make sure /var/cache/fsck exists.)

Thanks for the hint. It has been running for an hour now and I will
report back tomorrow. ATM, it's at 1% and the two files in
/var/cache/fsck are ~50M in size.

> Unfortunately, as it turns out, tdb (from Samba) doesn't scale as well
> as I would have liked, so it's on my todo list to replace it with
> something else. The problem is that berk_db has non-standard
> interfaces which vary from version to version, so I've avoided using
> it up until now.

Silly question: Would it be possible to simply mmap a large enough
file for the data and use e.g. rbtrees for the lookups? If yes,
osl [1] could probably be an option. It's very simple but likely too
slow on inserts to be useful for e2fsprogs.
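
For concreteness, a minimal sketch of what "mmap a large enough file"
could look like (a hypothetical helper, not part of e2fsprogs or osl):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: back a lookup structure with a file mapping so
 * the kernel pages it in and out on demand instead of exhausting
 * anonymous memory. On a 32-bit host the mapping is of course still
 * limited by the available address space.
 */
static void *map_scratch_file(const char *path, size_t bytes)
{
	void *p;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, bytes) < 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the file referenced */
	return p == MAP_FAILED ? NULL : p;
}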

> A related blog entry:
>
> http://thunk.org/tytso/blog/2009/01/12/wanted-incremental-backup-solutions-that-use-a-database/

Hey, I read this posting back then, and I agree with what you say.
However, we are quite happy with our hard link based backup and use
it to "snapshot" file systems as large as 16T. Users love that they
can simply copy back an older version of the file they just removed
by accident. Another killer argument for this type of backup is that
you can easily replace a broken system with the machine that stores
the backup. This takes an hour while restoring everything from tapes
takes _weeks_.

But yes, from the file system developer's POV the whole concept of
hard link based backups must be a nightmare ;) And it does not work
well if there are very many files. Unfortunately, this is the case
for the file system in question.

> P.S. I recently was told about a new backup system that meets the
> requirements detailed in my post:
>
> http://sites.google.com/site/hashbackup/home/features
>
> I haven't tried it yet, but it looks interesting. Let me know if you
> do try it and what you think of it.

Looks interesting, but where's the source? I might give it a try for
the problematic file system, but maybe not before next month.

Thanks
Andre

[1] http://systemlinux.org/~maan/osl
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-19 17:16:23

by Theodore Ts'o

Subject: Re: Memory allocation failed, e2fsck: aborted

On Thu, Aug 19, 2010 at 03:10:17PM +0200, Andre Noll wrote:
> Silly question: Would it be possible to simply mmap a large enough
> file for the data and use e.g. rbtrees for the lookups? If yes,
> osl [1] could probably be an option. It's very simple but likely too
> slow on inserts to be useful for e2fsprogs.

As I recall, you're on a 32-bit machine, right? If so, a limitation
you may run into is simply running out of address space. If it's not
an address space issue, we don't need to mmap anything; you could just
try enabling swap, and use the existing e2fsck code.

(I had assumed you had tried that before suggesting you use the
scratch_files tdb approach....)

- Ted

2010-08-19 19:03:07

by Andreas Dilger

Subject: Re: Memory allocation failed, e2fsck: aborted

On 2010-08-19, at 07:01, Andre Noll wrote:
> On Wed, Aug 18, 14:20, Andreas Dilger wrote:
>>> This is an old 32 bit system with only 1G of ram and a 2.6.24 distro
>>> kernel. I added _lots_ of swap but this did not help.
>>
>> Yeah, it is possible to have filesystems that are too large for the
>> node they are running on. There have been discussions about how to
>> reduce the memory usage of e2fsck, but implementing them has never
>> been a priority.
>
> But e2fsck runs entirely in user space, so all memory should be
> swappable, no? I think the system did not swap at all.

I think the problem isn't just the TOTAL amount of RAM being used, but the fact that this piece of code is trying to do a SINGLE allocation that is HUGE. The second problem is that the constant re-allocation of this huge array (every 100 insertions) means that it can never really exceed 1/2 of RAM in size.

>>> Since the file system is corrupt anyway, it may be easiest
>>> to delete inode 245859898 with debugfs, but maybe there is
>>> a better option. Moreover, since this might be some kind of
>>> e2fsck-trusts-corrupt-data issue, you might be interested in looking
>>> at this.
>>
>> No, I don't think this will help. The problem is not with that inode;
>> it is just that e2fsck needs to allocate a structure because of
>> nlinks=2 (which is normal).
>>
>> In theory it might be possible to avoid allocating icount structures
>> for every directory inode (which have icount == 2 normally), if we
>> used the "fs->inode_dir_map" bit as "+1" for the inode link count.
>>
>> In any case, this is a non-trivial fix.
>
> I'm not sure I follow. Are you saying we currently allocate two
> struct ext2_icount for a directory inode even if . and .. are the
> only two references?

Well, one per directory, yes. Whether the core problem is the directory count (which could be fixed by this hack) or the many hard-linked files, I can't really say without more information.

> So we could just omit this allocation in the
> common icount == 2 case because we know it is a directory inode
> (so we have one additional reference) if fs->inode_dir_map is not NULL.

Yes, that was my idea. The main problem is that this would tie together parts of the e2fsck code that are currently independent, and I don't know how cleanly this can be done. It deserves some further investigation, even for normal e2fsck usage, because it would likely eliminate 75% of the extra allocations in icount due to leaf directories (internal directories will have nlink > 2 due to child directories).

>>> Further info: The ext3 file system lives on a lv within a vg whose
>>> single pv is the 12 disk raid6 array. The file system stores hard
>>> link based backups, so it contains _lots_ of hard links.
>>
>> Ah, that is also a major user of memory, and not necessarily one that
>> optimizing the internal bitmap will help significantly. It may well
>> be that your swap cannot be used if a single allocation is in the same
>> neighbourhood as the total RAM size.
>
> Is playing with the memory overcommit knobs likely going to help?

Probably not.

>> Every file with nlink > 1 will need an additional 8 bytes of data, and
>> the insert_icount_el() function reallocates this structure every 100
>> elements, so it can use AT MOST 1/2 of all memory before the new copy
>> and the old one fill all available memory.
>>
>> It would probably make sense to modify the internal icount structure
>> to hold a 2-level tree of arrays of e.g. 8kB chunks, or other advanced
>> data structure so that it doesn't force reallocation and average .51
>> memory copies of the WHOLE LIST on every insert. This is probably
>> doable with some light understanding of e2fsprogs, since the icount
>> interface is well encapsulated, but it isn't something I have time for
>> now.
>
> I'm interested in having a look at the icount structure and seeing what
> can be done to reduce memory usage.

One simple way that this could be fixed fairly easily (which would presumably allow swap to be used) is to have a 2-level (or N-level) tree of allocations for the icount->list array, with the first level being just an array of pointers to individually-allocated arrays of ext2_icount_el. The sub-arrays can be some reasonable size (maybe 64kB), which would give us a fan-out of 64k / 8 = 8k, and if the top-level array is (re-)allocated in chunks of, say 64 pointers, the number of times the top-level array needs to be reallocated would only be every 64 * 8192 insertions, and only the pointer array needs to be reallocated/copied.
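
In code, the index math for that layout would be something like the
following sketch (names invented for illustration; the element type is
the existing 8-byte struct ext2_icount_el):

#define SUB_FANOUT (65536 / sizeof(struct ext2_icount_el))	/* 8192 */

/* element i lives in chunk i / SUB_FANOUT at offset i % SUB_FANOUT */
static struct ext2_icount_el *tree_entry(struct ext2_icount_el **tree,
					 ext2_ino_t num)
{
	return &tree[num / SUB_FANOUT][num % SUB_FANOUT];
}

Only the small top-level array of pointers is ever reallocated; the
64kB element chunks never move once they have been allocated.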

That said, any insert-optimized tree structure with a high fan-out would be suitable. Elements are almost never deleted, and we would never need to compact the tree (it is freed as a whole when it is done).

> Here's a first question: There is ext2fs_create_icount() and
> ext2fs_create_icount_tdb(). Is it correct that they do the same thing,
> the only difference being that the tdb variant uses an on-disk
> database while ext2fs_create_icount stores everything in memory?

Correct.

> If so we might want to discuss first whether it is more important
> to improve the performance of the on-disk database or to decrease
> the memory requirements of the in-memory variant. The answer likely
> depends on the amounts of disk space and RAM a typical system will
> have 5 or 10 years from now.

I suspect that using multi-level arrays + swap is always going to be more efficient than using an on-disk database. This will maximize the RAM usage due to very compact storage of the icount_el data instead of database records with much more (2x at least) overhead, and it will also avoid all disk IO unless the system is actually running out of RAM. Finally, all of the swap IO will be in at least PAGE_SIZE chunks, while tdb and friends often write 512-byte sparse records randomly.

>> If you are interested in hacking on/improving e2fsprogs I'd be willing
>> to guide you, but if not I'd just suggest connecting this array to
>> another node to run e2fsck, and considering spending the $200 needed
>> to get a 64-bit system with a few GB of RAM.
>
> Yeah right. There is already a 64-bit system waiting to replace the
> old one. Moving the old disks would be a PITA though because of the
> lack of hot swap bays...

If you have a new system ready to go, you may be ahead by just doing a raw copy of the filesystem over to the new storage, and running e2fsck on that. This also gives you the security of doing the e2fsck on a copy of the data, in case it does something bad.

You could start the raw copy even if you have your current e2fsck running, since I suspect the current tdb-based e2fsck will be totally IO-bound on the database and not on IO from the filesystem.

Cheers, Andreas

2010-08-20 14:15:12

by Theodore Ts'o

Subject: Re: Memory allocation failed, e2fsck: aborted

On Thu, Aug 19, 2010 at 01:03:06PM -0600, Andreas Dilger wrote:
>
> I think the problem isn't just the TOTAL amount of RAM being used,
> but the fact that this piece of code is trying to do a SINGLE
> allocation that is HUGE. The second problem is that the constant
> re-allocation of this huge array (every 100 insertions) means that
> it can never really exceed 1/2 of RAM in size.

Part of the problem is that the data structures are optimized for
relatively few hard links and a reasonable number of directories
compared to other types of inodes. For people who are using hard-link
backup schemes, these assumptions are violated in some fairly massive
ways.

Something that might be interesting to do is to keep some statistics
in the superblock about the number of hard links. This would allow
e2fsck to allocate the data structures appropriately up front, and
maybe allow it to switch to some other representation if it turns out
the file system is one which is dense with hard links.
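
As a sketch of how e2fsck could use such a statistic (s_nlinks_hint is
a made-up superblock field; nothing like it exists today):

/*
 * Hypothetical: size the icount structure up front from a counter
 * maintained in the superblock, so pass 1 never has to grow it.
 */
static errcode_t create_sized_icount(ext2_filsys fs, ext2_icount_t *ret)
{
	return ext2fs_create_icount2(fs, 0, fs->super->s_nlinks_hint,
				     NULL, ret);
}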

> That said, any insert-optimized tree structure with a high fan-out
> would be suitable. Elements are almost never deleted, and we would
> never need to compact the tree (it is freed as a whole when it is
> done).

We could try to create a tree-optimized data structure, but I wonder
if it's worth it. IIRC berk_db has an in-memory and file-backed
option, and it has a btree option. Using that might be easier than
trying to hand-code something special.

- Ted

2010-08-20 14:36:59

by Andre Noll

Subject: Re: Memory allocation failed, e2fsck: aborted

On Thu, Aug 19, 15:10, Andre Noll wrote:
> > Something which *might* help (but will take a long time) is to add to
> > your /etc/e2fsck.conf (if you have one; if not, create one with these
> > contents):
> >
> > [scratch_files]
> > directory = /var/cache/fsck
> >
> > (And then make sure /var/cache/fsck exists.)
>
> Thanks for the hint. It has been running for an hour now and I will
> report back tomorrow. ATM, it's at 1% and the two files in
> /var/cache/fsck are ~50M in size.

It got much slower: 4.8% after one day. So I cancelled the fsck and
am copying the raw device to a 64-bit machine to run fsck there,
as Andreas suggested.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-20 14:39:56

by Andre Noll

Subject: Re: Memory allocation failed, e2fsck: aborted

On Thu, Aug 19, 13:03, Andreas Dilger wrote:
> On 2010-08-19, at 07:01, Andre Noll wrote:
> > On Wed, Aug 18, 14:20, Andreas Dilger wrote:
> >>> This is an old 32-bit system with only 1G of RAM and a 2.6.24 distro
> >>> kernel. I added _lots_ of swap but this did not help.
> >>
> >> Yeah, it is possible to have filesystems that are too large for the
> >> node they are running on. There have been discussions about how to
> >> reduce the memory usage of e2fsck, but implementing them has never
> >> been a priority.
> >
> > But e2fsck runs entirely in user space, so all memory should be
> > swappable, no? I think the system did not swap at all.
>
> I think the problem isn't just the TOTAL amount of RAM being used, but
> the fact that this piece of code is trying to do a SINGLE allocation
> that is HUGE. The second problem is that the constant re-allocation
> of this huge array (every 100 insertions) means that it can never
> really exceed 1/2 of RAM in size.

I see. So you are proposing to address the second problem by
introducing a more clever algorithm for allocating the icount
structures.

> > So we could just omit this allocation in the
> > common icount == 2 case because we know it is a directory inode
> > (so we have one additional reference) if fs->inode_dir_map is not NULL.
>
> Yes, that was my idea. The main problem is that this would tie
> together parts of the e2fsck code that are currently independent, and
> I don't know how cleanly this can be done. It deserves some further
> investigation, even for normal e2fsck usage, because it would likely
> eliminate 75% of the extra allocations in icount due to leaf
> directories (internal directories will have nlink > 2 due to child
> directories).

Looks like it is worthwhile to keep this on the TODO list :)

> >> Every file with nlink > 1 will need an additional 8 bytes of data, and
> >> the insert_icount_el() function reallocates this structure every 100
> >> elements, so it can use AT MOST 1/2 of all memory before the new copy
> >> and the old one fill all available memory.
> >>
> >> It would probably make sense to modify the internal icount structure
> >> to hold a 2-level tree of arrays of e.g. 8kB chunks, or other advanced
> >> data structure so that it doesn't force reallocation and average .51
> >> memory copies of the WHOLE LIST on every insert. This is probably
> >> doable with some light understanding of e2fsprogs, since the icount
> >> interface is well encapsulated, but it isn't something I have time for
> >> now.
> >
> > I'm interested in having a look at the icount structure and seeing what
> > can be done to reduce memory usage.
>
> One simple way that this could be fixed fairly easily (which would
> presumably allow swap to be used) is to have a 2-level (or N-level)
> tree of allocations for the icount->list array, with the first level
> being just an array of pointers to individually-allocated arrays of
> ext2_icount_el. The sub-arrays can be some reasonable size (maybe
> 64kB), which would give us a fan-out of 64k / 8 = 8k, and if the
> top-level array is (re-)allocated in chunks of, say 64 pointers, the
> number of times the top-level array needs to be reallocated would only
> be every 64 * 8192 insertions, and only the pointer array needs to be
> reallocated/copied.

Seems to be simple enough. If Ted does not object to this approach
in general, I'll try to send some patches next week. Probably I will
need some further advice though.

> That said, any insert-optimized tree structure with a high fan-out
> would be suitable. Elements are almost never deleted, and we would
> never need to compact the tree (it is freed as a whole when it is
> done).

IMHO the benefits of some sophisticated tree structure do not justify
introducing a new dependency on some library that implements the
structure, especially since this would not be an optional add-on. So
I'd say we try the simple 2-level approach you sketched above. The
parameters could be fine-tuned with information from the superblock (if
available), like the fs-wide number of hard links, as Ted suggested.

> I suspect that using multi-level arrays + swap is always going to be
> more efficient than using an on-disk database. This will maximize the
> RAM usage due to very compact storage of the icount_el data instead of
> database records with much more (2x at least) overhead, and it will
> also avoid all disk IO unless the system is actually running out of
> RAM. Finally, all of the swap IO will be in at least PAGE_SIZE
> chunks, while tdb and friends often write 512-byte sparse records
> randomly.

Agreed. Moreover, RAM size tends to grow faster than disk space. That
32-bit machine came with 1G RAM and 80G hard disks some years ago. Today
you can order servers with 512G RAM and 2T disks, so RAM grew by a
factor of 512 while disk space grew "only" by a factor of 25.

> If you have a new system ready to go, you may be ahead by just doing a
> raw copy of the filesystem over to the new storage, and running e2fsck
> on that. This also gives you the security of doing the e2fsck on a
> copy of the data, in case it does something bad.
>
> You could start the raw copy even if you have your current e2fsck
> running, since I suspect the current tdb-based e2fsck will be totally
> IO-bound on the database and not on IO from the filesystem.

Very good idea! I'm already copying the raw device to the new machine.

Thanks a lot
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-20 14:40:18

by Andre Noll

Subject: Re: Memory allocation failed, e2fsck: aborted

On Thu, Aug 19, 13:16, Ted Ts'o wrote:
> On Thu, Aug 19, 2010 at 03:10:17PM +0200, Andre Noll wrote:
> > Silly question: Would it be possible to simply mmap a large enough
> > file for the data and use e.g. rbtrees for the lookups? If yes,
> > osl [1] could probably be an option. It's very simple but likely too
> > slow on inserts to be useful for e2fsprogs.
>
> As I recall, you're on a 32-bit machine, right?

Right.

> If so, a limitation you may run into is simply running out of address
> space. If it's not an address space issue, we don't need to mmap
> anything; you could just try enabling swap, and use the existing
> e2fsck code.
>
> (I had assumed you had tried that before suggesting you use the
> scratch_files tdb approach....)

Yes, I added 50G of swap (which is of course kind of silly) before my
first posting. This did not help, so I guess it is an address space
issue. My question about mmap was about an alternative to tdb btw.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-08-23 15:53:19

by Andre Noll

Subject: [PATCH]: icount: Replace the icount list by a two-level tree

On Fri, Aug 20, 2010 at 04:39:43PM +0200, Andre Noll wrote:
> > One simple way that this could be fixed fairly easily (which would
> > presumably allow swap to be used) is to have a 2-level (or N-level)
> > tree of allocations for the icount->list array, with the first level
> > being just an array of pointers to individually-allocated arrays of
> > ext2_icount_el. The sub-arrays can be some reasonable size (maybe
> > 64kB), which would give us a fan-out of 64k / 8 = 8k, and if the
> > top-level array is (re-)allocated in chunks of, say 64 pointers, the
> > number of times the top-level array needs to be reallocated would only
> > be every 64 * 8192 insertions, and only the pointer array needs to be
> > reallocated/copied.
>
> Seems to be simple enough. If Ted does not object to this approach
> in general, I'll try to send some patches next week. Probably I will
> need some further advice though.

Here is a patch against the next branch that implements your idea. It
compiles without (additional) warnings on Linux/i386 and Linux/x86_64
even with "make gcc-wall", and checkpatch.pl is also happy with
it. Moreover, it passes the test suite, though this probably does not
prove much because the icount tests insert fewer than 8192 inodes, so
only one sub-array is created by these tests. I also ran the patched
e2fsck against a 100M toy file system containing many directories
and hard links and it seems to work fine.

Most importantly, this version no longer exits due to failing memory
allocations while checking my problematic file system on the 32-bit
box. Instead memory and swap usage increase steadily as new icount
sub-arrays are being allocated. Therefore I think the patch has real
benefits and is worth considering for inclusion. Please review.

Note: Although the two-level tree approach avoids large reallocations,
for insertions it is as inefficient as the old sorted list method
because both the old and the new code move large amounts of memory
around in insert_icount_el(). This could be avoided by replacing the
sorted lists with a hash table, using open addressing and maybe double
hashing to deal with collisions. But for this to work out we'd need an
upper bound on the number of inodes to be inserted. So I like Ted's
proposal to store some information in the superblock that allows
allocating a suitably sized structure up front.
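
For reference, the probe sequence I have in mind would look roughly
like this (a sketch only; nothing like it is in the patch):

/*
 * Double hashing sketch: given a capacity bound known up front, each
 * inode gets a probe sequence which, for a table of prime size,
 * eventually visits every slot.
 */
static unsigned int probe_slot(ext2_ino_t ino, unsigned int attempt,
			       unsigned int nslots)
{
	unsigned int h1 = ino % nslots;
	unsigned int h2 = 1 + ino % (nslots - 1);	/* never zero */

	return (h1 + attempt * h2) % nslots;
}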

Thanks
Andre
---
commit 0e115a361efe70b26a4f5daa10a3a7afda145e2c
Author: Andre Noll <[email protected]>
Date: Mon Aug 23 00:12:22 2010 +0200

icount: Replace the icount list by a two-level tree.

Currently, inode counts are stored in an array of struct ext2_icount_el
which is resized on demand. This array may become very large,
eventually causing the reallocation to fail due to memory or address
space limits, especially on 32 bit systems.

This patch changes the way the ext2_icount_el structures are stored to
a two-level tree with the first level being just an array of pointers
to individually-allocated sub-arrays of ext2_icount_el, each of which
stores up to 8192 entries. If all sub-arrays are full, only the small
first-level array has to be resized while a new second-level array
is allocated from scratch. This avoids reallocations of large amounts
of memory and allows swap to be used efficiently.

Several simple accessor functions are introduced to keep the changes
to users of the icount API minimal. However, the fact that new icounts
are now allocated in chunks of 8192 entries breaks the expectations
of the icount test programs. So tests/progs/test_data/expect.icount
had to be patched as well to reflect this change.

Signed-off-by: Andre Noll <[email protected]>

diff --git a/lib/ext2fs/icount.c b/lib/ext2fs/icount.c
index 43cc53e..aad4866 100644
--- a/lib/ext2fs/icount.c
+++ b/lib/ext2fs/icount.c
@@ -31,14 +31,14 @@
* Also, e2fsck tends to load the icount data sequentially.
*
* So, we use an inode bitmap to indicate which inodes have a count of
- * one, and then use a sorted list to store the counts for inodes
+ * one, and then use a sorted tree to store the counts for inodes
* which are greater than one.
*
* We also use an optional bitmap to indicate which inodes are already
- * in the sorted list, to speed up the use of this abstraction by
+ * in the sorted tree, to speed up the use of this abstraction by
* e2fsck's pass 2. Pass 2 increments inode counts as it finds them,
- * so this extra bitmap avoids searching the sorted list to see if a
- * particular inode is on the sorted list already.
+ * so this extra bitmap avoids searching the sorted tree to see if a
+ * particular inode is on the sorted tree already.
*/

struct ext2_icount_el {
@@ -54,7 +54,7 @@ struct ext2_icount {
ext2_ino_t size;
ext2_ino_t num_inodes;
ext2_ino_t cursor;
- struct ext2_icount_el *list;
+ struct ext2_icount_el **tree;
struct ext2_icount_el *last_lookup;
char *tdb_fn;
TDB_CONTEXT *tdb;
@@ -69,14 +69,82 @@ struct ext2_icount {
*/
#define icount_16_xlate(x) (((x) > 65500) ? 65500 : (x))

+/*
+ * Number of elements in one icount group. This should be a power of two
+ * so that modulo operations are fast.
+ */
+#define ICOUNT_FANOUT (65536 / sizeof(struct ext2_icount_el))
+
+static inline ext2_ino_t num_icount_groups(ext2_icount_t icount)
+{
+ return (icount->size + ICOUNT_FANOUT - 1) / ICOUNT_FANOUT;
+}
+
+static inline void get_icount_indices(ext2_ino_t num, ext2_ino_t *group_num,
+ ext2_ino_t *group_index)
+{
+ *group_num = num / ICOUNT_FANOUT;
+ *group_index = num % ICOUNT_FANOUT;
+}
+
+static inline struct ext2_icount_el *get_icount_entry(ext2_icount_t icount,
+ ext2_ino_t num)
+{
+ ext2_ino_t gnum, gidx;
+
+ get_icount_indices(num, &gnum, &gidx);
+ return icount->tree[gnum] + gidx;
+}
+
+static inline ext2_ino_t get_icount_ino(ext2_icount_t icount, ext2_ino_t num)
+{
+ struct ext2_icount_el *el = get_icount_entry(icount, num);
+
+ return el->ino;
+}
+
+static inline void set_icount_ino(ext2_icount_t icount, ext2_ino_t num,
+ ext2_ino_t ino)
+{
+ struct ext2_icount_el *el = get_icount_entry(icount, num);
+
+ el->ino = ino;
+}
+
+static inline struct ext2_icount_el *set_icount_entry(ext2_icount_t icount,
+ ext2_ino_t num,
+ ext2_ino_t ino,
+ __u32 count)
+{
+ struct ext2_icount_el *el = get_icount_entry(icount, num);
+
+ el->ino = ino;
+ el->count = count;
+ return el;
+}
+
+static inline errcode_t alloc_icount_group(struct ext2_icount_el **ptr)
+{
+ errcode_t retval = ext2fs_get_array(ICOUNT_FANOUT,
+ sizeof(struct ext2_icount_el), ptr);
+ if (retval)
+ return retval;
+ memset(*ptr, 0, ICOUNT_FANOUT * sizeof(struct ext2_icount_el));
+ return 0;
+}
+
void ext2fs_free_icount(ext2_icount_t icount)
{
if (!icount)
return;

icount->magic = 0;
- if (icount->list)
- ext2fs_free_mem(&icount->list);
+ if (icount->tree) {
+ ext2_ino_t i, num_groups = num_icount_groups(icount);
+ for (i = 0; i < num_groups; i++)
+ ext2fs_free_mem(&icount->tree[i]);
+ ext2fs_free_mem(&icount->tree);
+ }
if (icount->single)
ext2fs_free_inode_bitmap(icount->single);
if (icount->multiple)
@@ -214,8 +282,7 @@ errcode_t ext2fs_create_icount2(ext2_filsys fs, int flags, unsigned int size,
{
ext2_icount_t icount;
errcode_t retval;
- size_t bytes;
- ext2_ino_t i;
+ ext2_ino_t i, num_groups;

if (hint) {
EXT2_CHECK_MAGIC(hint, EXT2_ET_MAGIC_ICOUNT);
@@ -240,29 +307,34 @@ errcode_t ext2fs_create_icount2(ext2_filsys fs, int flags, unsigned int size,
goto errout;
icount->size += fs->super->s_inodes_count / 50;
}
-
- bytes = (size_t) (icount->size * sizeof(struct ext2_icount_el));
-#if 0
- printf("Icount allocated %u entries, %d bytes.\n",
- icount->size, bytes);
-#endif
- retval = ext2fs_get_array(icount->size, sizeof(struct ext2_icount_el),
- &icount->list);
+ num_groups = num_icount_groups(icount);
+ /* always allocate full icount groups */
+ icount->size = num_groups * ICOUNT_FANOUT;
+ retval = ext2fs_get_array(num_groups, sizeof(struct ext2_icount_el *),
+ &icount->tree);
if (retval)
goto errout;
- memset(icount->list, 0, bytes);
-
+ memset(icount->tree, 0, num_groups * sizeof(struct ext2_icount_el *));
+ for (i = 0; i < num_groups; i++) {
+ retval = alloc_icount_group(icount->tree + i);
+ if (retval)
+ goto errout;
+ }
+#if 0
+ printf("Icount allocated %zu entries, %d groups\n",
+ icount->size, num_groups);
+#endif
icount->count = 0;
icount->cursor = 0;

/*
- * Populate the sorted list with those entries which were
+ * Populate the sorted tree with those entries which were
* found in the hint icount (since those are ones which will
- * likely need to be in the sorted list this time around).
+ * likely need to be in the sorted tree this time around).
*/
if (hint) {
- for (i=0; i < hint->count; i++)
- icount->list[i].ino = hint->list[i].ino;
+ for (i = 0; i < hint->count; i++)
+ set_icount_ino(icount, i, get_icount_ino(hint, i));
icount->count = hint->count;
}

@@ -282,59 +354,65 @@ errcode_t ext2fs_create_icount(ext2_filsys fs, int flags,
}

/*
- * insert_icount_el() --- Insert a new entry into the sorted list at a
+ * insert_icount_el() --- Insert a new entry into the sorted tree at a
* specified position.
*/
static struct ext2_icount_el *insert_icount_el(ext2_icount_t icount,
ext2_ino_t ino, int pos)
{
- struct ext2_icount_el *el;
errcode_t retval;
- ext2_ino_t new_size = 0;
- int num;
+ ext2_ino_t num_groups = num_icount_groups(icount);
+ ext2_ino_t i, gnum, gidx;

if (icount->last_lookup && icount->last_lookup->ino == ino)
return icount->last_lookup;

if (icount->count >= icount->size) {
- if (icount->count) {
- new_size = icount->list[(unsigned)icount->count-1].ino;
- new_size = (ext2_ino_t) (icount->count *
- ((float) icount->num_inodes / new_size));
- }
- if (new_size < (icount->size + 100))
- new_size = icount->size + 100;
-#if 0
- printf("Reallocating icount %u entries...\n", new_size);
-#endif
- retval = ext2fs_resize_mem((size_t) icount->size *
- sizeof(struct ext2_icount_el),
- (size_t) new_size *
- sizeof(struct ext2_icount_el),
- &icount->list);
+ retval = ext2fs_resize_mem(num_groups *
+ sizeof(struct ext2_icount_el *),
+ (num_groups + 1) *
+ sizeof(struct ext2_icount_el *),
+ &icount->tree);
if (retval)
- return 0;
- icount->size = new_size;
+ return NULL;
+ retval = alloc_icount_group(icount->tree + num_groups);
+ if (retval)
+ return NULL;
+ icount->size += ICOUNT_FANOUT;
+ num_groups++;
}
- num = (int) icount->count - pos;
- if (num < 0)
- return 0; /* should never happen */
- if (num) {
- memmove(&icount->list[pos+1], &icount->list[pos],
- sizeof(struct ext2_icount_el) * num);
+ if ((ext2_ino_t)pos == icount->count) /* Just insert the new element */
+ goto success;
+ /* Shift entries by one. */
+ get_icount_indices(pos, &gnum, &gidx);
+ for (i = num_groups - 1; i != (ext2_ino_t)-1 && i >= gnum; i--) {
+ struct ext2_icount_el *g = icount->tree[i];
+ const size_t s = sizeof(struct ext2_icount_el);
+
+ /*
+ * If g is not the last group, move the last element
+ * of this group to the beginning of the next group.
+ */
+ if (i < num_groups - 1)
+ icount->tree[i + 1][0] = g[ICOUNT_FANOUT - 1];
+
+ if (i > gnum) /* shift the whole group */
+ memmove(g + 1, g, (ICOUNT_FANOUT - 1) * s);
+ else if (gidx < ICOUNT_FANOUT - 1)
+ /* shift starts at entry gidx */
+ memmove(g + gidx + 1, g + gidx,
+ (ICOUNT_FANOUT - gidx - 1) * s);
}
+success:
+ icount->last_lookup = set_icount_entry(icount, pos, ino, 0);
icount->count++;
- el = &icount->list[pos];
- el->count = 0;
- el->ino = ino;
- icount->last_lookup = el;
- return el;
+ return icount->last_lookup;
}

/*
* get_icount_el() --- given an inode number, try to find icount
- * information in the sorted list. If the create flag is set,
- * and we can't find an entry, create one in the sorted list.
+ * information in the sorted tree. If the create flag is set,
+ * and we can't find an entry, create one in the sorted tree.
*/
static struct ext2_icount_el *get_icount_el(ext2_icount_t icount,
ext2_ino_t ino, int create)
@@ -343,11 +421,11 @@ static struct ext2_icount_el *get_icount_el(ext2_icount_t icount,
int low, high, mid;
ext2_ino_t lowval, highval;

- if (!icount || !icount->list)
+ if (!icount || !icount->tree)
return 0;

if (create && ((icount->count == 0) ||
- (ino > icount->list[(unsigned)icount->count-1].ino))) {
+ (ino > get_icount_ino(icount, icount->count - 1)))) {
return insert_icount_el(icount, ino, (unsigned) icount->count);
}
if (icount->count == 0)
@@ -355,8 +433,8 @@ static struct ext2_icount_el *get_icount_el(ext2_icount_t icount,

if (icount->cursor >= icount->count)
icount->cursor = 0;
- if (ino == icount->list[icount->cursor].ino)
- return &icount->list[icount->cursor++];
+ if (ino == get_icount_ino(icount, icount->cursor))
+ return get_icount_entry(icount, icount->cursor++);
#if 0
printf("Non-cursor get_icount_el: %u\n", ino);
#endif
@@ -370,8 +448,8 @@ static struct ext2_icount_el *get_icount_el(ext2_icount_t icount,
mid = low;
else {
/* Interpolate for efficiency */
- lowval = icount->list[low].ino;
- highval = icount->list[high].ino;
+ lowval = get_icount_ino(icount, low);
+ highval = get_icount_ino(icount, high);

if (ino < lowval)
range = 0;
@@ -388,11 +466,11 @@ static struct ext2_icount_el *get_icount_el(ext2_icount_t icount,
mid = low + ((int) (range * (high-low)));
}
#endif
- if (ino == icount->list[mid].ino) {
+ if (ino == get_icount_ino(icount, mid)) {
icount->cursor = mid+1;
- return &icount->list[mid];
+ return get_icount_entry(icount, mid);
}
- if (ino < icount->list[mid].ino)
+ if (ino < get_icount_ino(icount, mid))
high = mid-1;
else
low = mid+1;
@@ -480,10 +558,10 @@ errcode_t ext2fs_icount_validate(ext2_icount_t icount, FILE *out)
return EXT2_ET_INVALID_ARGUMENT;
}
for (i=1; i < icount->count; i++) {
- if (icount->list[i-1].ino >= icount->list[i].ino) {
- fprintf(out, "%s: list[%d].ino=%u, list[%d].ino=%u\n",
- bad, i-1, icount->list[i-1].ino,
- i, icount->list[i].ino);
+ if (get_icount_ino(icount, i-1) >= get_icount_ino(icount, i)) {
+ fprintf(out, "%s: tree[%d].ino=%u, tree[%d].ino=%u\n",
+ bad, i-1, get_icount_ino(icount, i - 1),
+ i, get_icount_ino(icount, i));
ret = EXT2_ET_INVALID_ARGUMENT;
}
}
@@ -525,7 +603,7 @@ errcode_t ext2fs_icount_increment(ext2_icount_t icount, ext2_ino_t ino,
if (ext2fs_test_inode_bitmap2(icount->single, ino)) {
/*
* If the existing count is 1, then we know there is
- * no entry in the list.
+ * no entry in the tree.
*/
if (set_inode_count(icount, ino, 2))
return EXT2_ET_NO_MEMORY;
@@ -535,7 +613,7 @@ errcode_t ext2fs_icount_increment(ext2_icount_t icount, ext2_ino_t ino,
/*
* The count is either zero or greater than 1; if the
* inode is set in icount->multiple, then there should
- * be an entry in the list, so we need to fix it.
+ * be an entry in the tree, so we need to fix it.
*/
if (ext2fs_test_inode_bitmap2(icount->multiple, ino)) {
get_inode_count(icount, ino, &curr_value);
@@ -555,7 +633,7 @@ errcode_t ext2fs_icount_increment(ext2_icount_t icount, ext2_ino_t ino,
} else {
/*
* The count is either zero or greater than 1; try to
- * find an entry in the list to determine which.
+ * find an entry in the tree to determine which.
*/
get_inode_count(icount, ino, &curr_value);
curr_value++;
diff --git a/tests/progs/test_data/expect.icount b/tests/progs/test_data/expect.icount
index b58a373..c36f7de 100644
--- a/tests/progs/test_data/expect.icount
+++ b/tests/progs/test_data/expect.icount
@@ -56,7 +56,7 @@ test_icount: store 20000 0
test_icount: fetch 20000
Count is 0
test_icount: get_size
-Size of icount is: 5
+Size of icount is: 8192
test_icount: decrement 2
decrement: Invalid argument passed to ext2 library while calling ext2fs_icount_decrement
test_icount: increment 2
@@ -145,7 +145,7 @@ Count is now 0
test_icount: decrement 5
decrement: Invalid argument passed to ext2 library while calling ext2fs_icount_decrement
test_icount: get_size
-Size of icount is: 105
+Size of icount is: 8192
test_icount: validate
Icount structure successfully validated
test_icount: store 10 10
@@ -188,6 +188,6 @@ test_icount: dump
95: 95
100: 100
test_icount: get_size
-Size of icount is: 105
+Size of icount is: 8192
test_icount: validate
Icount structure successfully validated
--
The only person who always got his work done by Friday was Robinson Crusoe



2010-11-01 22:55:08

by Mala Iyer

Subject: Re: [PATCH]: icount: Replace the icount list by a two-level tree

Andre Noll <maan <at> systemlinux.org> writes:

>
> Here is a patch against the next branch that implements your idea. It
> compiles without (additional) warnings on Linux/i386 and Linux/x86_64
> even with "make gcc-wall", and checkpatch.pl is also happy with
> it. Moreover, it passes the test suite, though this probably does not
> prove much because the icount tests insert fewer than 8192 inodes, so
> only one sub-array is created by these tests. I also ran the patched
> e2fsck against a 100M toy file system containing many directories
> and hard links and it seems to work fine.
>
> Most importantly, this version no longer exits due to failing memory
> allocations while checking my problematic file system on the 32bit
> box. Instead memory and swap usage increase steadily as new icount
> sub-arrays are being allocated. Therefore I think the patch has real
> benefits and is worth considering for inclusion. Please review.
>
> Note: Although the two-level tree approach avoids large reallocations,
> for insertions it is as inefficient as the old sorted list method
> because both the old and the new code move large amounts of memory
> around in insert_icount_el(). This could be avoided by replacing the
> sorted lists with a hash table, using open addressing and maybe double
> hashing to deal with collisions. But for this to work out we'd need an
> upper bound on the number of inodes to be inserted. So I like Ted's
> proposal to store some information in the superblock that allows
> allocating a suitably sized structure up front.
>
> Thanks
> Andre
> ---


Hi Andre

We are trying to use e2fsck on an embedded system with almost 3GB of swap and
200MB of physical memory. e2fsck on a 16TB ext3 filesystem always appears to
fail with the following error:

Pass 1D: Reconciling multiply-claimed blocks
e2fsck: Memory allocation failed while retrying to read bitmaps for /dev/sda1

We have tried the patch you proposed, but we are still seeing the same failure.
How can we get e2fsck to work on our memory-constrained system?

Thanks
Mala.


2010-11-01 23:23:49

by Andreas Dilger

Subject: Re: [PATCH]: icount: Replace the icount list by a two-level tree

On 2010-11-01, at 16:49, Mala Iyer wrote:
> We are trying to use e2fsck on an embedded system with almost 3GB of swap and
> 200MB of physical memory. e2fsck on a 16TB ext3 filesystem always appears to
> fail with the following error:
>
> Pass 1D: Reconciling multiply-claimed blocks
> e2fsck: Memory allocation failed while retrying to read bitmaps for /dev/sda1

As a general rule of thumb, expect to use 1 byte of RAM for every block in the filesystem. For a 16TB filesystem (2^32 blocks) I would expect to need about 4GB of RAM, from past experience.
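
(With the default 4kB block size the arithmetic is: 16TB / 4kB = 2^32
blocks, and 2^32 blocks x 1 byte/block = 4GB of RAM.)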

Considerable RAM savings could be had by implementing more efficient sparse bitmap support internal to libext2fs. The code is structured and ready to do this, but nobody has had a real need to do it before now.

Some proposals were discussed in the past (see for example the brief description in https://bugzilla.lustre.org/show_bug.cgi?id=12202, and you can Google for the referenced thread in the archives) to use run-length-encoding of sparse bitmaps, or possibly rb-tree structures. In most cases, the bitmaps will have long runs of either all-ones or all-zeroes, so I expect even simple RLE encoding could help significantly. Since this is an internal implementation detail, the code can be changed in future releases without impacting interoperability.
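
As a rough illustration of the RLE idea (a hypothetical encoding, not
the actual libext2fs internals):

/*
 * Hypothetical run-length-encoded bitmap: runs of identical bits are
 * stored as (start, len) pairs. Runs alternate between 0 and 1, so a
 * bitmap that is almost all-ones or all-zeroes collapses to a handful
 * of entries instead of one bit per block.
 */
struct bitmap_run {
	__u64 start;	/* first block of the run */
	__u64 len;	/* number of blocks in the run */
};

static int test_rle_bit(const struct bitmap_run *runs, int nruns,
			__u64 blk)
{
	int i;

	for (i = 0; i < nruns; i++)
		if (blk >= runs[i].start &&
		    blk < runs[i].start + runs[i].len)
			return i & 1;	/* run 0 assumed to hold zeroes */
	return 0;
}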

Also, https://bugzilla.lustre.org/attachment.cgi?id=10088 has an out-of-date patch (see the "ebitmap.c" file) that you might consider using for reference, but I also believe the APIs in libext2fs have changed significantly, as has the desired implementation, so I doubt the patch is useful for anything except finding out which code to start changing.

Cheers, Andreas

2010-12-20 19:46:12

by Andreas Dilger

Subject: Re: [PATCH]: icount: Replace the icount list by a two-level tree

On 2010-12-20, at 11:30, Christian Doerr wrote:
> I am writing you with a follow-up question to your response in this thread (http://comments.gmane.org/gmane.comp.file-systems.ext4/21646), where you've given a general rule of thumb on the memory usage of e2fsck.
>
> My problem is similar to the one posted there (my thread is http://serverfault.com/questions/201554/e2fsck-aborts-memory-allocation-failed), except that I am checking a broken 9TB partition with now 12 GB of RAM (+ swap space, upgraded from 4GB), which still doesn't seem to be enough.
>
> In regard to the bugzilla patch you're referring to, would you consider it possible to build an old set of libraries that would compile against this patch? If so, do you have any intuition when the libraries might have changed? After one month of debugging, I am ready to try anything...


Have you actually tested the patch that was posted in that thread?
<http://article.gmane.org/gmane.comp.file-systems.ext4/20731/>

This addresses one aspect of the memory utilization problem, where libext2fs was trying to make a single allocation that was larger than 1/2 of RAM to handle hard-linked files, and preventing it from using the swap. With the patch, the allocations are smaller, and can be swapped out individually instead of simply failing the allocation outright.

If this patch resolves your problem, we should look into getting this patch included in the upstream e2fsprogs. Ted is going to be making a new release this week, so you may still have some time to get it included.


Looking at the error message you provided, "Error while storing directory block information" (PR_1_ADD_DBLOCK), it looks like this is the result of an allocation failure in the ext2fs_add_dir_block() function, which is a similar problem to the icount one fixed above, but in a different part of the code. ext2fs_add_dir_block() is also trying to do a single large allocation to handle all of the directories in the filesystem, but is failing to reallocate the array.

It looks like the ext2fs_dblist_*() functions would also allow changing the internal list implementation into a 2-level tree as was done for the icount structures, to avoid such large memory allocations. It appears that the dblist is generated as an unsorted list initially (unlike icount), and is then later sorted with qsort(). If this were changed to a 2-level tree, the qsort() would be done within individual leaf blocks, and then a merge sort would be needed to combine the blocks afterward. I don't think it makes sense to create this initially as a sorted tree, because several different sort functions are used within e2fsck on this list. This is unfortunately not work that I am able to do right now, but it is something anyone with some C programming skills could tackle.
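
As a sketch, the merge step needed after qsort()ing the individual
leaf blocks could look like this (illustration only; cmp stands for
whichever of e2fsck's comparators was used to sort the leaves):

#include <stddef.h>

/* Merge two already-sorted leaf arrays of dblist entries into dst. */
static void merge_leaves(struct ext2_db_entry *a, size_t na,
			 struct ext2_db_entry *b, size_t nb,
			 struct ext2_db_entry *dst,
			 int (*cmp)(const void *, const void *))
{
	size_t i = 0, j = 0, k = 0;

	while (i < na && j < nb)
		dst[k++] = cmp(&a[i], &b[j]) <= 0 ? a[i++] : b[j++];
	while (i < na)
		dst[k++] = a[i++];
	while (j < nb)
		dst[k++] = b[j++];
}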


Also, Lukas Czerner has been working on an "rbtree" patch for the bitmap memory reduction. While the bitmap memory allocation may not be the problem being seen here, reducing the memory used by the various bitmaps would certainly make more memory available for other parts of e2fsck. Note, however, that Lukas' code is currently in the early stages of development, so I would only suggest running it in read-only mode (-n) until he is further along with development/testing.


Cheers, Andreas