2019-01-10 01:26:57

by James Harvey

Subject: Interpreting /sys/block/<disk>/{,<partition>}/discard_alignment

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block

Describes discard_alignment as:

"Devices that support discard functionality may internally allocate
space in units that are bigger than the exported logical block size.
The discard_alignment parameter indicates how many bytes the beginning
of the device is offset from the internal allocation unit's natural
alignment."

Q1 - I'm hoping you can clarify how this should be interpreted.

I originally took this to mean the number of bytes into the first
discard_granularity block at which the partition starts. For example,
if discard_granularity is 128MB and partition 1 starts at sector 2048
with 512-byte sectors, it should return 2048*512=1048576 (1MB).

However, LVM thin volumes (using device mapper thin pools) seem to
report the number of bytes left in the first discard_granularity
block at the start of the partition, i.e. the discard_granularity of
128 * 1024 * 1024 minus the partition start of 2048 * 512, or
133169152. (This is with a thin volume created with a chunk size of
128MB.)
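
The two readings above can be compared numerically. This is only a sketch of the arithmetic in this message: the function names are made up for illustration, and it makes no claim about which value the kernel is supposed to report.

```python
MIB = 1024 * 1024

def bytes_into_granule(start_sector, sector_size, granularity):
    """Reading 1: offset of the partition start within its discard granule."""
    return (start_sector * sector_size) % granularity

def bytes_left_in_granule(start_sector, sector_size, granularity):
    """Reading 2: bytes from the partition start to the next granule boundary."""
    into = bytes_into_granule(start_sector, sector_size, granularity)
    return (granularity - into) % granularity

granularity = 128 * MIB  # 134217728 bytes, the thin-pool chunk size above
print(bytes_into_granule(2048, 512, granularity))     # 1048576
print(bytes_left_in_granule(2048, 512, granularity))  # 133169152
```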

Q2 - At https://lkml.org/lkml/2018/12/5/1693 --- I saw you recently
said "... there are not many devices that actually report a non-zero
discard alignment..." Does this mean that every filesystem needs to
look at the partition table to determine its correct value on its own,
rather than using discard_alignment?


2019-01-11 00:42:58

by Martin K. Petersen

Subject: Re: Interpreting /sys/block/<disk>/{,<partition>}/discard_alignment


James,

> Q1 - I'm hoping you can clarify how this should be interpreted.
>
> I originally took this to mean the number of bytes into the first
> discard_granularity block that the partition resides at. i.e. If
> discard_granularity is 128MB, and partition 1 starts at sector
> 2048 with 512 byte sectors, that this should return 2048*512=1048576
> (1MB.)

The alignment offset is the offset for the given block device. It
doesn't matter whether the block device in question is a partition, DM
device or a full device. A block device is a block device.

The common alignment scenario is 3584 on a device with 4K physical
blocks. That's because of the 63-sector legacy FAT partition table
offset, which essentially means that the first LBA is misaligned and the
first aligned LBA is 7.

Many of the first 512e drives shipped with that intentional misalignment
as default. And you could switch it to 0-aligned via a jumper. These
days all drives are 0-aligned.
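
As a rough check of the 3584 figure, the sketch below assumes a 512e drive (512-byte logical, 4096-byte physical blocks) that has been shifted so the legacy partition start at LBA 63 lands on a physical block boundary. The function name and defaults are illustrative only.

```python
def legacy_alignment(start_lba=63, logical=512, physical=4096):
    """Return (first aligned LBA, alignment offset in bytes).

    Assumes the drive is shifted so that start_lba sits on a physical
    block boundary; every LBA congruent to it modulo the number of
    logical sectors per physical block is then aligned as well.
    """
    sectors_per_physical = physical // logical         # 8 for 512e/4K
    first_aligned_lba = start_lba % sectors_per_physical
    return first_aligned_lba, first_aligned_lba * logical

first_lba, offset = legacy_alignment()
print(first_lba, offset)  # 7 3584
```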

> Q2 - At https://lkml.org/lkml/2018/12/5/1693 --- I saw you recently
> said "... there are not many devices that actually report a non-zero
> discard alignment..." Does this mean that every filesystem needs to
> look at the partition table to determine its correct value on its own,
> rather than using discard_alignment?

No, it needs to look at the device topology for the block device it is
on. I don't believe we ever wired up an ioctl for the discard alignment
so you'll have to find your device in sysfs. There's an alignment ioctl
for the "regular" block alignment, though.
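
A minimal sketch of the sysfs lookup, assuming the layout from the subject line (/sys/block/<disk>/{,<partition>}/discard_alignment) and that a partition uses the queue/discard_granularity of its parent disk. The function name and the sysfs_root parameter are hypothetical, added so the lookup can be pointed at a different directory.

```python
import sys
from pathlib import Path

def read_discard_topology(disk, partition=None, sysfs_root="/sys/block"):
    """Return (discard_alignment, discard_granularity) in bytes.

    disk is a name such as "sda" or "dm-3"; partition is optional
    (for example "sda1"). Assumption: the partition directory holds its
    own discard_alignment, while the granularity is read from the
    parent disk's queue directory.
    """
    disk_dir = Path(sysfs_root) / disk
    dev_dir = disk_dir / partition if partition else disk_dir
    alignment = int((dev_dir / "discard_alignment").read_text().strip())
    granularity = int(
        (disk_dir / "queue" / "discard_granularity").read_text().strip()
    )
    return alignment, granularity

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py <disk> [<partition>]
    print(read_discard_topology(*sys.argv[1:3]))
```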

--
Martin K. Petersen Oracle Linux Engineering

2019-01-11 05:13:11

by James Harvey

Subject: Re: Interpreting /sys/block/<disk>/{,<partition>}/discard_alignment

On Thu, Jan 10, 2019 at 7:04 PM Martin K. Petersen wrote:
> James,
>
> > Q1 - I'm hoping you can clarify how this should be interpreted.
> >
> > I originally took this to mean the number of bytes into the first
> > discard_granularity block that the partition resides at. i.e. If
> > discard_granularity is 128MB, and partition 1 starts at sector
> > 2048 with 512 byte sectors, that this should return 2048*512=1048576
> > (1MB.)
>
> The alignment offset is the offset for the given block device. It
> doesn't matter whether the block device in question is a partition, DM
> device or a full device. A block device is a block device.
>
> The common alignment scenario is 3584 on a device with 4K physical
> blocks. That's because of the 63-sector legacy FAT partition table
> offset, which essentially means that the first LBA is misaligned and the
> first aligned LBA is 7.

If I can double check I'm understanding you correctly, if:

* Block device "A" has 512 byte sectors
* A has a partition table with partition A1 starting at sector 2048
(1048576 bytes)
* A and A1 have discard_granularity of 128MB (134217728 bytes)
* A has discard_alignment of 0

Then A1 should have a discard_alignment of 1048576, not 133169152
(128MB - 512 bytes/sector * 2048 sectors)?

> Many of the first 512e drives shipped with that intentional misalignment
> as default. And you could switch it to 0-aligned via a jumper. These
> days all drives are 0-aligned.
>
> > Q2 - At https://lkml.org/lkml/2018/12/5/1693 --- I saw you recently
> > said "... there are not many devices that actually report a non-zero
> > discard alignment..." Does this mean that every filesystem needs to
> > look at the partition table to determine its correct value on its own,
> > rather than using discard_alignment?
>
> No, it needs to look at the device topology for the block device it is
> on. I don't believe we ever wired up an ioctl for the discard alignment
> so you'll have to find your device in sysfs. There's an alignment ioctl
> for the "regular" block alignment, though.

Ahh, good. I took that the wrong way, originally worried you were
saying the value of discard_alignment couldn't be trusted.