2010-12-08 17:04:01

by Christian Brandt

Subject: swap storage alignment and stride size

Preamble:

Hi fellow Linux tamers, the following question has bounced around for
some days in local lists and newsgroups without conclusion and was
escalated upstream several times, so here we are...

We are discussing semi-professional storage systems, e.g. ext4 on luks
on lvm on raid on gpt-partitions on 4k sector hard drives or 512k sector
SSDs. Usually every level profits a lot from aligning the data to the
underlying sector/stride/chunk size, e.g. ext4 with a 128k stripe size
will run a lot better on a well aligned 64k stride raid5.

In other words, partition tables, LVM, RAID, luks and filesystems know
how to handle and profit from aligned larger chunks.

In detail:

As far as we can read mm/swapfile.c linux is only concerned about cpu
page size and does not know anything about underlying
chunk/sector/stride sizes and alignment.

Therefore we think every small 1/2/4/8kiB page-sized write access leads
to a read-modify-write cycle for the whole chunk, taking more than twice
as long as simply writing the whole chunk at once.
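
To make the concern concrete, here is a minimal Python sketch (not taken from the kernel) that classifies a single write against a chunk size. The name `needs_rmw` and the 64 KiB chunk are assumptions for illustration only; whether a given RAID or flash layer really performs a read-modify-write for such a write depends on its implementation.

```python
PAGE_SIZE = 4096         # assumed CPU page size (4 KiB, as on x86)
CHUNK_SIZE = 64 * 1024   # hypothetical chunk size of the underlying device


def needs_rmw(offset: int, length: int, chunk: int = CHUNK_SIZE) -> bool:
    """Return True if the write covers at least one chunk only partially.

    A write touches whole chunks only when both its start and its end
    fall exactly on chunk boundaries.
    """
    if length <= 0 or chunk <= 0:
        raise ValueError("length and chunk must be positive")
    end = offset + length
    return offset % chunk != 0 or end % chunk != 0
```

Under these assumptions, every single-page swap write is a partial-chunk write, while a write of one full, aligned chunk is not.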

Questions:

Is this the right place to ask?

Does or could linux swapping make use of aligning chunks?

And if so, how?

If not, would it be an improvement?

Will this effect be mostly compensated by the block elevator?

Does it make any sense to change the mkswap page size to the chunk size?
We think those are two totally different beasts and should be kept
separate.

Is Linux already aware of chunk sizes within swap?

How is it set up and controlled by the administrator?

--
Christian Brandt

life is short and in most cases it ends with death
but my tombstone will carry the hiscore


2010-12-08 19:56:32

by Ric Wheeler

Subject: Re: swap storage alignment and stride size

On 12/08/2010 12:03 PM, Christian Brandt wrote:
> Preamble:
>
> Hi fellow linux tamers, the following question has bounced around for
> some days in local lists and newsgroups without conclusion and was
> escalated upstream several times, here we are...
>
> We are discussing semi-professional storage systems, e.g. ext4 on luks
> on lvm on raid on gpt-partitions on 4k sector hard drives or 512k sector
> SSDs. Usually every level profits a lot from aligning the data to the
> underlying sector/stride/chunk size, e.g. ext4 with a 128k stripe size
> will run a lot better on a well aligned 64k stride raid5.
>
> In other words, partition tables, LVM, RAID, luks and filesystems know
> how to handle and profit from aligned larger chunks.
>
> In detail:
>
> As far as we can read mm/swapfile.c linux is only concerned about cpu
> page size and does not know anything about underlying
> chunk/sector/stride sizes and alignment.
>
> Therefore we think every small 1/2/4/8kiB page-sized write access leads
> to a read-modify-write cycle for the whole chunk, taking more than twice
> as long as simply writing the whole chunk at once.
>
> Questions:
>
> Is this the right place to ask?
>
> Does or could linux swapping make use of aligning chunks?
>
> And if so, how?
>
> If not, would it be an improvement?
>
> Will this effect be mostly compensated by the block elevator?
>
> Does it make any sense to change the mkswap page size to the chunk size?
> We think those are two totally different beasts and should be kept
> separate.
>
> Is Linux already aware of chunk sizes within swap?
>
> How is it set up and controlled by the administrator?
>

Hi Christian,

There has been a lot of work on alignment; Martin Petersen led most of that and
is probably the best one to ping.

Ric

2010-12-14 20:01:46

by Martin K. Petersen

Subject: Re: swap storage alignment and stride size

>>>>> "Ric" == Ric Wheeler writes:

Sorry, I've been away for a couple of weeks.

Ric> There has been a lot of work on alignment; Martin Petersen led
Ric> most of that and is probably the best one to ping.

With modern tooling we should align the partition or DM device correctly
so the swap starts on a properly aligned boundary. But I don't think
anybody has looked into hooking the swap stuff up with the I/O
topology. I'm also not sure the swap code is flexible enough to deal
with units that are bigger than page size.
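
As a rough illustration of the "properly aligned boundary" check, the following Python sketch tests whether a partition start lands on a multiple of a given I/O size. On Linux the inputs can typically be read from sysfs (`/sys/block/<disk>/<partition>/start`, in 512-byte units, and `/sys/block/<disk>/queue/optimal_io_size`, in bytes); the function name `start_is_aligned` is hypothetical, and the sketch deliberately leaves the sysfs reading out.

```python
SECTOR = 512  # sysfs reports the partition start in 512-byte units


def start_is_aligned(start_sectors: int, io_size_bytes: int) -> bool:
    """Return True if the partition start is a multiple of io_size_bytes.

    io_size_bytes would usually be the device's optimal I/O size (for
    example the RAID stripe width). A value of 0 means the device did
    not report one, so no meaningful check is possible.
    """
    if io_size_bytes <= 0:
        raise ValueError("device did not report a usable I/O size")
    return (start_sectors * SECTOR) % io_size_bytes == 0
```

A partition starting at sector 2048 (1 MiB) passes this check for a 64 KiB chunk, while the old DOS default of sector 63 does not.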

Hugh?

--
Martin K. Petersen Oracle Linux Engineering

2010-12-15 04:57:40

by Hugh Dickins

Subject: Re: swap storage alignment and stride size

On Tue, 14 Dec 2010, Martin K. Petersen wrote:
> >>>>> "Ric" == Ric Wheeler writes:
>
> Ric> There has been a lot of work on alignment; Martin Petersen led
> Ric> most of that and is probably the best one to ping.
>
> With modern tooling we should align the partition or DM device correctly
> so the swap starts on a properly aligned boundary. But I don't think
> anybody has looked into hooking the swap stuff up with the I/O
> topology. I'm also not sure the swap code is flexible enough to deal
> with units that are bigger than page size.

You and Christian are right, mm/swapfile.c is very much oriented to
the small mm page size, 4kB on x86.

Yes, when it's running nicely, the elevator can make a big difference
by merging adjacent writes to swap; but swapping is often by nature
not so nice.

I think it would be a big mistake to try to build the idea of bigger
blocks into mm/swapfile.c: it is so orientated towards the mm concerns
that we'd end up with a mess that way.

Much better to add a dm layer below it, to buffer such alignment and
stride concerns. Perhaps someone has already done that?

(scan_swap_map does try to allocate in 1MB clusters, but they're not
written out that way, and there's no attempt to align: if it worked
out better for the lower level to require that these 1MB clusters
are aligned, we could probably go for that - though the swap header
page might then be a nuisance.)

Hugh

2010-12-15 19:31:20

by Martin K. Petersen

Subject: Re: swap storage alignment and stride size

>>>>> "Hugh" == Hugh Dickins writes:

Hugh> (scan_swap_map does try to allocate in 1MB clusters, but they're
Hugh> not written out that way, and there's no attempt to align: if it
Hugh> worked out better for the lower level to require that these 1MB
Hugh> clusters are aligned, we could probably go for that - though the
Hugh> swap header page might then be a nuisance.)

You called it a "header page". Does that imply that it is page sized?
Or will it cause pages written to a 4k-aligned swap device to be
misaligned?

--
Martin K. Petersen Oracle Linux Engineering

2010-12-16 00:42:45

by Hugh Dickins

Subject: Re: swap storage alignment and stride size

On Wed, 15 Dec 2010, Martin K. Petersen wrote:
> >>>>> "Hugh" == Hugh Dickins writes:
>
> Hugh> (scan_swap_map does try to allocate in 1MB clusters, but they're
> Hugh> not written out that way, and there's no attempt to align: if it
> Hugh> worked out better for the lower level to require that these 1MB
> Hugh> clusters are aligned, we could probably go for that - though the
> Hugh> swap header page might then be a nuisance.)
>
> You called it a "header page". Does that imply that it is page sized?

Yes. (Rather a nuisance on a PowerPC system which sometimes uses
a kernel with 4k pages and sometimes a kernel with 64k pages.)

> Or will it cause pages written to a 4k-aligned swap device to be
> misaligned?

No, the 4k-aligned remains 4k-aligned, of course. But if you aligned
your swap partition on, say, a 1MB boundary, and are thinking of
working in aligned 1MB blocks, then it may be awkward that there's
always this special 4k at the start (it could be written back each
time even though it hasn't changed, but it's still an odd case).
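
The awkwardness of the header page can be sketched as follows. This is an illustrative Python model, not kernel code: it assumes swap slot n sits at byte offset n * PAGE_SIZE from the start of a 1MB-aligned swap partition, with slot 0 holding the header, and the helper names are made up for the example.

```python
PAGE_SIZE = 4096                       # assumed page size (4k)
BLOCK = 1024 * 1024                    # the 1MB block being discussed
SLOTS_PER_BLOCK = BLOCK // PAGE_SIZE   # 256 page slots per 1MB block
HEADER_SLOT = 0                        # slot 0 holds the swap header page


def slot_offset(slot: int) -> int:
    """Byte offset of a swap slot from the start of the swap area."""
    return slot * PAGE_SIZE


def block_of(slot: int) -> int:
    """Index of the aligned 1MB block that contains this slot."""
    return slot // SLOTS_PER_BLOCK


def usable_slots_in_block(block: int) -> list[int]:
    """Slots of an aligned 1MB block that can hold swapped-out pages."""
    first = block * SLOTS_PER_BLOCK
    return [s for s in range(first, first + SLOTS_PER_BLOCK)
            if s != HEADER_SLOT]
```

In this model every 4k slot stays 4k-aligned, but the first aligned 1MB block only offers 255 usable slots, so it can never be filled purely with swap data.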

Hugh

2010-12-16 23:47:07

by Martin K. Petersen

Subject: Re: swap storage alignment and stride size

>>>>> "Hugh" == Hugh Dickins writes:

>> You called it a "header page". Does that imply that it is page sized?

Hugh> Yes. (Rather a nuisance on a PowerPC system which sometimes uses
Hugh> a kernel with 4k pages and sometimes a kernel with 64k pages.)

Ok.


>> Or will it cause pages written to a 4k-aligned swap device to be
>> misaligned?

Hugh> No, the 4k-aligned remains 4k-aligned, of course. But if you
Hugh> aligned your swap partition on, say, a 1MB boundary, and are
Hugh> thinking of working in aligned 1MB blocks, then it may be awkward
Hugh> that there's always this special 4k at the start (it could be
Hugh> written back each time even though it hasn't changed, but it's
Hugh> still an odd case).

Yeah, I got that. I just wanted to make sure that the header was not 32
bytes or something like that because that would be highly painful from
an I/O alignment perspective.

--
Martin K. Petersen Oracle Linux Engineering

2010-12-17 00:15:52

by Christian Brandt

Subject: Re: swap storage alignment and stride size

On 16.12.2010 01:42, Hugh Dickins wrote:

> No, the 4k-aligned remains 4k-aligned, of course. But if you aligned
> your swap partition on, say, a 1MB boundary, and are thinking of
> working in aligned 1MB blocks, then it may be awkward that there's
> always this special 4k at the start (it could be written back each
> time even though it hasn't changed, but it's still an odd case).

Hello Hugh, I stayed a bit quiet after my question to see what people
think. So I wasn't wrong from the start: neither the kernel nor the
userland tools care about anything beyond page size today. For now I'll
be happy enough with the status quo.

Though I have a clear vision of what I would like:

The kernel needs to be prepared to handle larger groups of pages.
In a perfect world it would favor larger operations which are
already aligned to the underlying architecture.
E.g., let's first write that very big 8192kiB chunk which is perfectly
aligned, but ignore the small pieces below the chunk size scattered
around memory until memory gets really low.
Small pieces don't eat up much memory anyway.
Maybe it would even help to swap out only the chunk-aligned parts, even
when there are small pre- and post-data around the big chunk.
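
One possible reading of "swap out only the chunk-aligned parts" is to trim a contiguous range down to its largest chunk-aligned core. The Python sketch below shows just that arithmetic; `aligned_core` is a hypothetical helper, offsets are in bytes, and nothing like it exists in mm/swapfile.c as far as this thread establishes.

```python
def aligned_core(start: int, end: int, chunk: int):
    """Return the largest chunk-aligned sub-range of [start, end).

    The result is a (lo, hi) tuple where both bounds are multiples of
    chunk, or None if the range does not contain one full aligned chunk.
    The unaligned head [start, lo) and tail [hi, end) are the small
    pre- and post-data left behind.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    lo = -(-start // chunk) * chunk   # round start up to a chunk boundary
    hi = (end // chunk) * chunk       # round end down to a chunk boundary
    return (lo, hi) if lo < hi else None
```

For a range that starts one page into a 64k chunk and runs a few pages past an 8192kiB boundary, only the aligned middle part would be written in whole chunks.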

Also I would expect a userspace tool to set up a swap space with
alternate settings (e.g. offset to device start, chunk size, alignment)
- a nice new role for mkswap?

The chunk size shouldn't be any fixed value.
Often I have 64k, sometimes 256k, rarely even 4096kiB.
And you never know what strange layouts are lurking around the next
corner. Maybe in 2015 we will all use SSD drives with a cell size of
12288kByte; already today several budget SSDs have pretty strange cell
sizes:
My cheapish Acer 24giB drive has 384kiB because it has connected three
8giB flash chips to the four-channel controller, with the fourth channel
being broken/disabled...
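
A small Python sketch shows why such odd sizes matter for any fixed scheme. It computes the smallest size that is a multiple of both a chunk size and the 1MB cluster size mentioned earlier in the thread; the helper name `common_alignment` is made up for this example.

```python
from math import gcd

KIB = 1024


def common_alignment(a: int, b: int) -> int:
    """Smallest size that is a whole multiple of both a and b (their LCM)."""
    if a <= 0 or b <= 0:
        raise ValueError("sizes must be positive")
    return a // gcd(a, b) * b


def is_power_of_two(n: int) -> bool:
    """True if n is a power of two, so bitmask alignment tricks apply."""
    return n > 0 and n & (n - 1) == 0
```

With a 64k chunk, every 1MB boundary is also a chunk boundary. With a 384kiB chunk the two only coincide every 3072kiB, and because 384kiB is not a power of two, alignment checks need a modulo rather than a bitmask.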

--
Christian Brandt