Background: we actually use the badblocks feature of the ext filesystem
group to do a poor man's boot filesystem for parisc: Our system chunks
up the disk searching for an Initial Program Loader (IPL) signature and
then executes it, so we poke a hole in an ext3 filesystem at creation
time and place the IPL into it. Our IPL can read ext3 files and
directories, so it allows us to load the kernel directly from the file.
The problem is that our IPL needs to be aligned at 256k in absolute
terms on the disk, so, in the usual situation of having a 64k partition
label and the boot partition being the first one we usually end up
poking the badblock hole beginning at block 224 (using a 1k block
size).
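To make the alignment arithmetic concrete, here is a minimal sketch (not taken from palo) that computes the first partition-relative block whose absolute position on disk is 256k-aligned. The function name first_aligned_block and the 32 KiB partition start are assumptions for illustration only; that offset is used because it reproduces the block 224 figure with 1k blocks, and a partition starting elsewhere gives a different block.

```shell
#!/bin/sh
# first_aligned_block PART_OFFSET ALIGN BLOCK_SIZE   (all in bytes)
# Prints the first filesystem block, counted from the start of the
# partition, whose absolute position on disk is a multiple of ALIGN.
# Assumes PART_OFFSET is itself a multiple of BLOCK_SIZE.
first_aligned_block() {
    part_offset=$1 align=$2 block_size=$3
    # Bytes from the partition start up to the next ALIGN boundary.
    gap=$(( (align - part_offset % align) % align ))
    echo $(( gap / block_size ))
}

# Assumed layout: partition starts 32 KiB into the disk, 1k blocks.
first_aligned_block 32768 262144 1024    # prints 224
```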
The problem is that this used to work long ago (where the value of long
seems to be some time before 2011) but no longer does. The problem can
be illustrated simply by doing
---
# dd if=/dev/zero of=bbtest.img bs=1M count=100
# losetup /dev/loop0 bbtest.img
# a=237; while [ $a -le 450 ]; do echo $a >> bblist.txt; a=$((a+1)); done
# mke2fs -b 1024 -l /home/jejb/bblist.txt /dev/loop0
---
Now if you try to do an e2fsck on the partition you'll get this
---
# e2fsck -f /dev/loop0
e2fsck 1.45.2 (27-May-2019)
Pass 1: Checking inodes, blocks, and sizes
Programming error? block #237 claimed for no reason in process_bad_block.
Programming error? block #238 claimed for no reason in process_bad_block.
Programming error? block #239 claimed for no reason in process_bad_block.
Programming error? block #240 claimed for no reason in process_bad_block.
Programming error? block #241 claimed for no reason in process_bad_block.
Programming error? block #242 claimed for no reason in process_bad_block.
Programming error? block #243 claimed for no reason in process_bad_block.
Programming error? block #244 claimed for no reason in process_bad_block.
Programming error? block #245 claimed for no reason in process_bad_block.
Programming error? block #246 claimed for no reason in process_bad_block.
Programming error? block #247 claimed for no reason in process_bad_block.
Programming error? block #248 claimed for no reason in process_bad_block.
Programming error? block #249 claimed for no reason in process_bad_block.
Programming error? block #250 claimed for no reason in process_bad_block.
Programming error? block #251 claimed for no reason in process_bad_block.
Programming error? block #252 claimed for no reason in process_bad_block.
Programming error? block #253 claimed for no reason in process_bad_block.
Programming error? block #254 claimed for no reason in process_bad_block.
Programming error? block #255 claimed for no reason in process_bad_block.
Programming error? block #256 claimed for no reason in process_bad_block.
Programming error? block #257 claimed for no reason in process_bad_block.
Programming error? block #258 claimed for no reason in process_bad_block.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (7556, counted=7578).
Fix<y>?
---
So mke2fs has created an ab initio corrupt filesystem. Empirically,
this only seems to happen if there is a block in the bad block list
under 251, but I haven't verified this extensively.
James
On Mon, Jul 01, 2019 at 03:44:30PM -0700, James Bottomley wrote:
> Background: we actually use the badblocks feature of the ext filesystem
> group to do a poor man's boot filesystem for parisc: Our system chunks
> up the disk searching for an Initial Program Loader (IPL) signature and
> then executes it, so we poke a hole in an ext3 filesystem at creation
> time and place the IPL into it. Our IPL can read ext3 files and
> directories, so it allows us to load the kernel directly from the file.
>
> The problem is that our IPL needs to be aligned at 256k in absolute
> terms on the disk, so, in the usual situation of having a 64k partition
> label and the boot partition being the first one we usually end up
> poking the badblock hole beginning at block 224 (using a 1k block
> size).
>
> The problem is that this used to work long ago (where the value of long
> seems to be some time before 2011) but no longer does. The problem can
> be illustrated simply by doing
It broke sometime around 2006. E2fsprogs 1.39 is when we started
creating file systems with the resize inode to support the online
resize feature.
And the problem is with a 100M file system using 1k blocks, when you
reserve blocks 237 -- 258, you're conflicting with the reserved blocks
used for online resizing:
Group 0: (Blocks 1-8192)
Primary superblock at 1, Group descriptors at 2-2
Reserved GDT blocks at 3-258 <========= THIS
Block bitmap at 451 (+450)
Inode bitmap at 452 (+451)
Inode table at 453-699 (+452)
7456 free blocks, 1965 free inodes, 2 directories
Free blocks: 715-8192
Free inodes: 12-1976
It's a bug that mke2fs didn't notice this issue and give an error
message ("HAHAHAHA... NO."). And it's also a bug that e2fsck didn't
correctly diagnose the nature of the corruption. Both of these bugs
are because how the reserved blocks for online resizing are handled is
a bit of a special case.
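Until mke2fs catches this itself, the overlap can be checked by hand before creating the file system. The following is only a sketch: the function name conflicting_blocks is hypothetical, and the range 3-258 is simply copied from the "Reserved GDT blocks" line in the listing above, so it has to be adjusted for any other file system size or block size.

```shell
#!/bin/sh
# conflicting_blocks LIST FIRST LAST
# Prints every block number in LIST (one per line, the format that
# mke2fs -l reads) that falls inside the inclusive range FIRST..LAST.
conflicting_blocks() {
    awk -v lo="$2" -v hi="$3" '$1 + 0 >= lo && $1 + 0 <= hi { print $1 }' "$1"
}

# Example: the list from the reproducer against the range 3-258.
list=$(mktemp)
seq 237 450 > "$list"
conflicting_blocks "$list" 3 258 | wc -l    # 22 blocks: 237 through 258
rm -f "$list"
```

The 22 conflicting blocks are the same 22 that e2fsck complains about in the report.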
In any case, the workaround is to do this:
# mke2fs -b 1024 -O ^resize_inode -l /home/jejb/bblist.txt /dev/loop0
For bonus points, you could even add something like this to
/etc/mke2fs.conf:
[fs_types]
parisc_boot = {
features = ^resize_inode
blocksize = 1024
inode_size = 128
}
Then all you would need to do is something like this:
# mke2fs -T parisc_boot -l bblist.txt /dev/sda1
Also, I guess this answers the other question that had recently
crossed my mind, which is I had been thinking of deprecating and
eventually removing the badblock feature in e2fsprogs altogether,
since no sane user of badblocks should exist in 2019. I guess I stand
corrected. :-)
- Ted
P.S. Does this mean parisc has been using an amazingly obsolete
version of e2fsprogs, which is why no one had noticed? Or was there a
static image file of the 100M boot partition, which you hadn't
regenerated until now.... ?
On Mon, 2019-07-01 at 20:23 -0400, Theodore Ts'o wrote:
> On Mon, Jul 01, 2019 at 03:44:30PM -0700, James Bottomley wrote:
> > Background: we actually use the badblocks feature of the ext
> > filesystem group to do a poor man's boot filesystem for parisc: Our
> > system chunks up the disk searching for an Initial Program Loader
> > (IPL) signature and then executes it, so we poke a hole in an ext3
> > filesystem at creation time and place the IPL into it. Our IPL can
> > read ext3 files and directories, so it allows us to load the kernel
> > directly from the file.
> >
> > The problem is that our IPL needs to be aligned at 256k in absolute
> > terms on the disk, so, in the usual situation of having a 64k
> > partition label and the boot partition being the first one we
> > usually end up poking the badblock hole beginning at block 224
> > (using a 1k block size).
> >
> > The problem is that this used to work long ago (where the value of
> > long seems to be some time before 2011) but no longer does. The
> > problem can be illustrated simply by doing
>
> It broke sometime around 2006. E2fsprogs 1.39 is when we started
> creating file systems with the resize inode to support the online
> resize feature.
>
> And the problem is with a 100M file system using 1k blocks, when you
> reserve blocks 237 -- 258, you're conflicting with the reserved
> blocks used for online resizing:
>
> Group 0: (Blocks 1-8192)
> Primary superblock at 1, Group descriptors at 2-2
> Reserved GDT blocks at 3-258 <========= THIS
> Block bitmap at 451 (+450)
> Inode bitmap at 452 (+451)
> Inode table at 453-699 (+452)
> 7456 free blocks, 1965 free inodes, 2 directories
> Free blocks: 715-8192
> Free inodes: 12-1976
>
> It's a bug that mke2fs didn't notice this issue and give an error
> message ("HAHAHAHA... NO."). And it's also a bug that e2fsck didn't
> correctly diagnose the nature of the corruption. Both of these bugs
> are because how the reserved blocks for online resizing are handled
> is a bit of a special case.
>
> In any case, the workaround is to do this:
>
> # mke2fs -b 1024 -O ^resize_inode -l /home/jejb/bblist.txt /dev/loop0
>
> For bonus points, you could even add something like this to
> /etc/mke2fs.conf:
>
> [fs_types]
> parisc_boot = {
> features = ^resize_inode
> blocksize = 1024
> inode_size = 128
> }
>
> Then all you would need to do is something like this:
>
> # mke2fs -T parisc_boot -l bblist.txt /dev/sda1
Actually, we control the location of the IPL, so as long as mke2fs
errors out if we get it wrong I can add an offset so it begins past
sector 258. Palo actually executes mke2fs when you initialize the
partition so it can add any options it likes. I was also thinking I
should update palo to support ext4 as well.
> Also, I guess this answers the other question that had recently
> crossed my mind, which is I had been thinking of deprecating and
> eventually removing the badblock feature in e2fsprogs altogether,
> since no sane user of badblocks should exist in 2019. I guess I
> stand corrected. :-)
Well, we don't have to use badblocks to achieve this, but we would like
a way to make an inode cover the reserved physical area of the IPL.
Effectively it's a single contiguous area on disk with specific
absolute alignment constraints. It doesn't actually matter if it
appears in the directory tree.
> - Ted
>
> P.S. Does this mean parisc has been using an amazingly obsolete
> version of e2fsprogs, which is why no one had noticed? Or was there
> a static image file of the 100M boot partition, which you hadn't
> regenerated until now.... ?
Yes, since we only do this at boot, and we can actually update the IPL
on the fly when we need to because the space reserved is the maximum,
it only gets invoked on a reinitialization, which I haven't done for a
very long time. The only reason I did it this time is because I have a
spare disk in a pa8800 which I used as an install target.
James
On Mon, Jul 01, 2019 at 05:53:34PM -0700, James Bottomley wrote:
>
> Actually, we control the location of the IPL, so as long as mke2fs
> errors out if we get it wrong I can add an offset so it begins past
> sector 258. Palo actually executes mke2fs when you initialize the
> partition so it can add any options it likes. I was also thinking I
> should update palo to support ext4 as well.
If you're never going to resize the boot partition, because it's fixed
size, you might as well not waste space on reserving blocks for
online resize. So having the palo bootloader be very restrictive
about what features it enables probably makes sense.
> Well, we don't have to use badblocks to achieve this, but we would like
> a way to make an inode cover the reserved physical area of the IPL.
> Effectively it's a single contiguous area on disk with specific
> absolute alignment constraints. It doesn't actually matter if it
> appears in the directory tree.
If you don't mind that it is visible in the namespace, you could take
advantage of the existing mk_hugefile feature[1][2]
[1] http://man7.org/linux/man-pages/man5/mke2fs.conf.5.html
[2] https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/misc/mk_hugefiles.c
# cat >> /etc/mke2fs.conf << EOF
[fs_types]
palo_boot = {
features = ^resize_inode
blocksize = 1024
make_hugefiles = true
num_hugefiles = 1
hugefiles_dir = /palo
hugefiles_name = IPL
hugefiles_size = 214k
hugefiles_align = 256k
hugefiles_align_disk = true
}
EOF
# mke2fs -T palo_boot /dev/sda1
Something like this will create a 1k block file system, containing a
zero-filled /palo/IPL which is 214k long, aligned with respect to the
beginning of the disk at a 256k boundary. (This feature was
sponsored by the letters, 'S', 'M', and 'R'. :-)
If you wanted it to be hidden from the file system you could just drop
the hugefiles_dir line above, and then after mounting the file system
open the /IPL file and then execute the EXT4_IOC_SWAP_BOOT ioctl
on it. This will move those blocks so they are owned by inode #5, an
inode reserved for the boot loader.
Cheers,
- Ted
On Tue, 2019-07-02 at 13:33 -0400, Theodore Ts'o wrote:
> On Mon, Jul 01, 2019 at 05:53:34PM -0700, James Bottomley wrote:
> >
> > Actually, we control the location of the IPL, so as long as mke2fs
> > errors out if we get it wrong I can add an offset so it begins past
> > sector 258. Palo actually executes mke2fs when you initialize the
> > partition so it can add any options it likes. I was also thinking I
> > should update palo to support ext4 as well.
>
> If you're never going to resize the boot partition, because it's fixed
> size, you might as well not waste space on reserving blocks for
> online resize. So having the palo bootloader be very restrictive
> about what features it enables probably makes sense.
Yes, I think given I've only created one partition in the last ten
years, that's reasonable. I was just worrying about eliminating a
feature which could later become mandatory, but if it never will,
that's not a problem. However, moving the bootloader is also very
simple, so if
> > Well, we don't have to use badblocks to achieve this, but we would
> > like a way to make an inode cover the reserved physical area of the
> > IPL. Effectively it's a single contiguous area on disk with
> > specific absolute alignment constraints. It doesn't actually
> > matter if it appears in the directory tree.
>
> If you don't mind that it is visible in the namespace, you could take
> advantage of the existing mk_hugefile feature[1][2]
>
> [1] http://man7.org/linux/man-pages/man5/mke2fs.conf.5.html
> [2] https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/misc/mk_hugefiles.c
>
> # cat >> /etc/mke2fs.conf << EOF
>
> [fs_types]
> palo_boot = {
> features = ^resize_inode
> blocksize = 1024
> make_hugefiles = true
> num_hugefiles = 1
> hugefiles_dir = /palo
> hugefiles_name = IPL
> hugefiles_size = 214k
> hugefiles_align = 256k
> hugefiles_align_disk = true
> }
> EOF
> # mke2fs -T palo_boot /dev/sda1
>
> Something like this will create a 1k block file system, containing a
> zero-filled /palo/IPL which is 214k long, aligned with respect to the
> beginning of the disk at a 256k boundary. (This feature was
> sponsored by the letters, 'S', 'M', and 'R'. :-)
Actually, this is giving me:
mke2fs: Operation not supported for inodes containing extents while
creating huge files
Is that because it's an ext4 only feature?
Having it visible is useful for updating the IPL, which occurs more
often than initializing the partition.
James
> If you wanted it to be hidden from the file system you could just
> drop the hugefiles_dir line above, and then after mounting the file
> system open the /IPL file and then execute the EXT4_IOC_SWAP_BOOT
> ioctl on it. This will move those blocks so they are owned by inode
> #5, an inode reserved for the boot loader.
>
> Cheers,
>
> - Ted
>
On Tue, Jul 02, 2019 at 12:31:34PM -0700, James Bottomley wrote:
> Actually, this is giving me:
>
> mke2fs: Operation not supported for inodes containing extents while
> creating huge files
>
> Is that because it's an ext4 only feature?
That'll teach me not to send out a sequence like that without testing
it myself first. :-)
Yeah, because one of the requirements was to make the file contiguous,
without any intervening indirect block or extent tree blocks, the
creation of the file is done manually, and at the time, I only
implemented it for extents, since the original goal of the feature was
to create really big files (hence the name of the feature "mk_hugefile"),
and using indirect blocks would be a huge waste of disk space.
It wouldn't be that hard for me to add support for indirect block
maps, or if you were going to convert things over so that the pa_risc
2nd stage boot loader can understand how to read from extents, that'll
allow this to work as well.
- Ted
On Tue, 2019-07-02 at 16:39 -0400, Theodore Ts'o wrote:
> On Tue, Jul 02, 2019 at 12:31:34PM -0700, James Bottomley wrote:
> > Actually, this is giving me:
> >
> > mke2fs: Operation not supported for inodes containing extents while
> > creating huge files
> >
> > Is that because it's an ext4 only feature?
>
> That'll teach me not to send out a sequence like that without testing
> it myself first. :-)
Heh, join the club ... it has a very large membership ... I've got a
frequent flier card for it ...
> Yeah, because one of the requirements was to make the file
> contiguous, without any intervening indirect block or extent tree
> blocks, the creation of the file is done manually, and at the time, I
> only implemented it for extents, since the original goal of the
> feature was to create really big files (hence the name of the feature
> "mk_hugefile"), and using indirect blocks would be a huge waste of
> disk space.
I guessed as much.
> It wouldn't be that hard for me to add support for indirect block
> maps, or if you were going to convert things over so that the pa_risc
> 2nd stage boot loader can understand how to read from extents,
> that'll allow this to work as well.
Let me look at it. I think I can just take routines out of lib/ext2fs
and graft them into the IPL, but our own home grown ext2/3 handling
routines are slightly eccentric so it's not as simple as that.
James
I've got preliminary ext4 support for palo completed. My original plan
was to switch our boot loader (iplboot) to using libext2fs, but that
proved to be impossible due to the external dependencies libext2fs
needs which we simply can't provide in a tiny bootloader, so I switched
to simply adding support for variable sized groups and handling extent
based files in our original code. Right at the moment we only support
reading files for the kernel and the initrd, so we have a simple
routine that loads blocks monotonically by mapping from inode relative
to partition absolute. It's fairly simple to cache the extent tree at
all depths and use a similar resolution scheme for extent based
filesystems. I'll add this list on cc to the initial patch so you can
check it.
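As a toy model of the "inode relative to partition absolute" mapping described above, the lookup over a cached, flattened extent list can be sketched as follows. This is not the iplboot code: the function name map_block and the three extents are made up for illustration, and a real loader would fill the list by walking the on-disk extent tree.

```shell
#!/bin/sh
# Hypothetical flattened extent list for one file, one extent per line:
#   <first logical block> <length in blocks> <first physical block>
EXTENTS='0 8 1000
8 4 5000
12 16 2048'

# map_block LOGICAL: print the physical block backing LOGICAL, or fail
# if the block lies in a hole or beyond the last extent.
map_block() {
    echo "$EXTENTS" | awk -v lb="$1" '
        lb >= $1 && lb < $1 + $2 { print $3 + (lb - $1); found = 1; exit }
        END { exit !found }'
}

map_block 9     # second block of the second extent: prints 5001
```

Because the loader reads blocks monotonically, a linear scan like this one is sufficient in the model; it does not try to be fast.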
Now the problem: I'd like to do some testing with high depth extent
trees to make sure I got this right, but the files we load at boot are
~20MB in size and I'm having a hard time fragmenting the filesystem
enough to produce a reasonable extent (I've basically only got to a two
level tree with two entries at the top). Is there an easy way of
producing a high depth extent tree for a 20MB file?
Thanks,
James
On Fri, Jul 05, 2019 at 09:25:48AM -0700, James Bottomley wrote:
> Now the problem: I'd like to do some testing with high depth extent
> trees to make sure I got this right, but the files we load at boot are
> ~20MB in size and I'm having a hard time fragmenting the filesystem
> enough to produce a reasonable extent (I've basically only got to a two
> level tree with two entries at the top). Is there an easy way of
> producing a high depth extent tree for a 20MB file?
Create a series of 4kB files numbered sequentially, each 4kB in size
until you fill the partition. Delete the even numbered ones. Create a
20MB file.
On Fri, 2019-07-05 at 10:39 -0700, Matthew Wilcox wrote:
> On Fri, Jul 05, 2019 at 09:25:48AM -0700, James Bottomley wrote:
> > Now the problem: I'd like to do some testing with high depth extent
> > trees to make sure I got this right, but the files we load at boot
> > are ~20MB in size and I'm having a hard time fragmenting the
> > filesystem enough to produce a reasonable extent (I've basically
> > only got to a two level tree with two entries at the top). Is
> > there an easy way of producing a high depth extent tree for a 20MB
> > file?
>
> Create a series of 4kB files numbered sequentially, each 4kB in size
> until you fill the partition. Delete the even numbered ones. Create
> a 20MB file.
Well, I know *how* to do it ... I was just hoping, in the interests of
creative laziness, that someone else had produced a script for this
before I had to ... particularly one which leaves more randomized gaps.
James
On Fri, Jul 05, 2019 at 11:49:02AM -0700, James Bottomley wrote:
> On Fri, 2019-07-05 at 10:39 -0700, Matthew Wilcox wrote:
> > On Fri, Jul 05, 2019 at 09:25:48AM -0700, James Bottomley wrote:
> > > Now the problem: I'd like to do some testing with high depth extent
> > > trees to make sure I got this right, but the files we load at boot
> > > are ~20MB in size and I'm having a hard time fragmenting the
> > > filesystem enough to produce a reasonable extent (I've basically
> > > only got to a two level tree with two entries at the top). Is
> > > there an easy way of producing a high depth extent tree for a 20MB
> > > file?
> >
> > Create a series of 4kB files numbered sequentially, each 4kB in size
> > until you fill the partition. Delete the even numbered ones. Create
> > a 20MB file.
>
> Well, I know *how* to do it ... I was just hoping, in the interests of
> creative laziness, that someone else had produced a script for this
> before I had to ... particularly one which leaves more randomized gaps.
If you don't care about the contents of the file you could just build
src/punch-alternating.c from xfstests and use it to turn your 20M file
into holy cheese.
(Granted if you actually need 5,120 extents then you probably ought to
make it a 40M file and /then/ run it through the cheese grater....)
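The extent arithmetic can be checked with a small sketch. This is not a reimplementation of punch-alternating.c; the function name punch_offsets is hypothetical, and the choice to start at block 0 is an assumption. It only lists the byte offsets of every second block, so counting the lines gives the number of holes and therefore the number of data runs left behind.

```shell
#!/bin/sh
# punch_offsets FILE_SIZE BLOCK_SIZE   (both in bytes)
# Prints the byte offset of every second block, starting with block 0.
punch_offsets() {
    size=$1 bs=$2
    off=0
    while [ "$off" -lt "$size" ]; do
        echo "$off"
        off=$(( off + 2 * bs ))
    done
}

# 20M file, 4k blocks: 2560 holes, so 2560 surviving data runs.
punch_offsets $((20 * 1024 * 1024)) 4096 | wc -l
```

With a 40M file the same count comes out at 5120. On a file system that supports hole punching, each offset could be passed to util-linux fallocate, for example "fallocate --punch-hole --offset OFF --length 4096 FILE".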
--D
> James
>
On Fri, Jul 05, 2019 at 11:49:02AM -0700, James Bottomley wrote:
> > Create a series of 4kB files numbered sequentially, each 4kB in size
> > until you fill the partition. Delete the even numbered ones. Create
> > a 20MB file.
>
> Well, I know *how* to do it ... I was just hoping, in the interests of
> creative laziness, that someone else had produced a script for this
> before I had to ... particularly one which leaves more randomized gaps.
You mean something like this? It doesn't do randomized gaps, since
usually I'm trying to stress test block allocations.
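#!/bin/bash
# Fill a scratch ext4 file system with 4k files, then delete every other
# one so that the free space is left as single-block holes.
DEV=/dev/lambda/scratch
SIZE=10M
mke2fs -Fq -t ext4 -i 4096 -b 4096 $DEV $SIZE
# Free block count, used as the number of 4k files to try to create.
max=$(dumpe2fs -h $DEV 2>/dev/null | awk -F: '/^Free blocks:/{print $2}')
mount $DEV /mnt
cd /mnt
# Spread the files over 100 directories, d0/0 .. d9/9.
mkdir -p d{0,1,2,3,4,5,6,7,8,9}/{0,1,2,3,4,5,6,7,8,9}
# Turn e.g. 1234 into d1/2/34 (numbers below 100 stay in the top directory).
seq 1 $max | sed -E -e 's;^([[:digit:]])([[:digit:]])([[:digit:]]);d\1/\2/\3;' > /tmp/files$$
xargs -n 1 fallocate -l 4096 < /tmp/files$$ 2>/dev/null
# Remove every other file from the list.
sed -ne 'p;n' < /tmp/files$$ | xargs rm -f
cd /
umount $DEV
rm /tmp/files$$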
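For randomized gaps, the fixed every-other-file selection could be swapped for a pseudo-random one. The following is only a sketch under that assumption; the function name pick_random and its seed and percentage parameters are hypothetical and not part of the script above.

```shell
#!/bin/sh
# pick_random SEED PERCENT
# Reads names on stdin and prints roughly PERCENT% of them, chosen
# pseudo-randomly; the same SEED gives the same selection.
pick_random() {
    awk -v seed="$1" -v pct="$2" 'BEGIN { srand(seed) } rand() * 100 < pct'
}

# Example: choose about half of 1000 names as removal candidates.
seq 1 1000 | pick_random 42 50 | wc -l
```

In the script above, the "sed -ne 'p;n'" stage could then be replaced by something like "pick_random 42 50", which leaves runs of free blocks of varying length instead of uniform single-block holes.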