2024-05-22 07:53:58

by [email protected]

[permalink] [raw]
Subject: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

While trying to use a swapfile for hibernation, I noticed that the suspend
process was failing when it tried to search for the swap to use for snapshot.
I had created the swapfile on ext4 and got the starting physical block offset
using the filefrag command.

Upon looking at the swap activation code, I realized that the iomap part is
doing some rounding up for the physical offset of the swapfile.
Then I checked the block size of the filesystem, which was actually set to
1KB by default in my environment.
(This was in buildroot, using the genimage utility to create the VM disk
partitions, filesystems etc.)

The block offset is rounded-up and stored in the swap extents metadata by
iomap code, but as the exact value is lost for 1KB-block filesystem, hibernate
cannot read back the swap header after it finishes writing data pages to swap.

Note that this is not a bug in my understanding. Both swapfile and hibernate
subsystems have the correct handling of this edge case, individually.

Another observation was that we need to rely on external commands, such as
filefrag for getting the swapfile offset value. This value can be conveniently
printed in dmesg output when doing swapon.

Sukrit Bhatnagar (2):
iomap: swap: print warning for unaligned swapfile
mm: swap: print starting physical block offset in swapon

fs/iomap/swapfile.c | 10 ++++++++++
mm/swapfile.c | 3 ++-
2 files changed, 12 insertions(+), 1 deletion(-)

--
2.34.1



2024-05-22 07:54:17

by [email protected]

[permalink] [raw]
Subject: [PATCH 1/2] iomap: swap: print warning for unaligned swapfile

When creating a swapfile on a filesystem with block size less than the
PAGE_SIZE, there is a possibility that the starting physical block is not
page-aligned, which results in rounding up that value before setting it
in the first swap extent. But now that the value is rounded up, we have
lost the actual offset location of the first physical block.

The starting physical block value is needed in hibernation when using a
swapfile, i.e., the resume_offset. After we have written the snapshot
pages, some values will be set in the swap header which is accessed using
that offset location. However, it will not find the swap header if the
offset value was rounded up and results in an error.

The swapfile offset being unaligned should not fail the swapon activation
as the swap extents will always have the alignment.

Therefore, just print a warning if an unaligned swapfile is activated
when hibernation is enabled.

Signed-off-by: Sukrit Bhatnagar <[email protected]>
---
fs/iomap/swapfile.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..1f7b189089dd 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -49,6 +49,16 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
next_ppage = ALIGN_DOWN(iomap->addr + iomap->length, PAGE_SIZE) >>
PAGE_SHIFT;

+#ifdef CONFIG_HIBERNATION
+ /*
+ * Print a warning if the starting physical block is not aligned
+ * to PAGE_SIZE (for filesystems using smaller block sizes).
+ * This will fail the hibernation suspend as we need to read
+ * the swap header later using the starting block offset.
+ */
+ if (!iomap->offset && iomap->addr & PAGE_MASK)
+ pr_warn("swapon: starting physical offset not page-aligned\n");
+#endif
/* Skip too-short physical extents. */
if (first_ppage >= next_ppage)
return 0;
--
2.34.1


2024-05-22 07:54:27

by [email protected]

[permalink] [raw]
Subject: [PATCH 2/2] mm: swap: print starting physical block offset in swapon

When a swapfile is created for hibernation purposes, we always need
the starting physical block offset, which is usually determined using
userspace commands such as filefrag.

It would be good to have that value printed when we do swapon and get
that value directly from dmesg.

Signed-off-by: Sukrit Bhatnagar <[email protected]>
---
mm/swapfile.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6ca215fb92f..53c9187d5fbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3264,8 +3264,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
(swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
enable_swap_info(p, prio, swap_map, cluster_info);

- pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
+ pr_info("Adding %uk swap on %s. Priority:%d extents:%d start:%llu across:%lluk %s%s%s%s\n",
K(p->pages), name->name, p->prio, nr_extents,
+ (unsigned long long)first_se(p)->start_block,
K((unsigned long long)span),
(p->flags & SWP_SOLIDSTATE) ? "SS" : "",
(p->flags & SWP_DISCARDABLE) ? "D" : "",
--
2.34.1


2024-05-22 14:56:50

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: print starting physical block offset in swapon

On Wed, May 22, 2024 at 04:46:58PM +0900, Sukrit Bhatnagar wrote:
> When a swapfile is created for hibernation purposes, we always need
> the starting physical block offset, which is usually determined using
> userspace commands such as filefrag.

If you always need this value, then shouldn't it be exported via sysfs
or somewhere so that you can always get to it? The kernel ringbuffer
can overwrite log messages, swapfiles can get disabled, etc.

> It would be good to have that value printed when we do swapon and get
> that value directly from dmesg.
>
> Signed-off-by: Sukrit Bhatnagar <[email protected]>
> ---
> mm/swapfile.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6ca215fb92f..53c9187d5fbe 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3264,8 +3264,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
> enable_swap_info(p, prio, swap_map, cluster_info);
>
> - pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
> + pr_info("Adding %uk swap on %s. Priority:%d extents:%d start:%llu across:%lluk %s%s%s%s\n",
> K(p->pages), name->name, p->prio, nr_extents,
> + (unsigned long long)first_se(p)->start_block,

Last time I looked, start_block was in units of PAGE_SIZE, despite
add_swap_extent confusingly (ab)using the sector_t type. Wherever you
end up reporting this value, it ought to be converted to something more
common (like byte offset or 512b-block offset).

Also ... if this is a swap *file* then reporting the path and the
physical storage device address is not that helpful. Exposing the block
device major/minor and block device address would be much more useful,
wouldn't it?

(Not that I have any idea what the "suspend process" in the cover letter
refers to -- suspend and hibernate have been broken on xfs forever...)

--D

> K((unsigned long long)span),
> (p->flags & SWP_SOLIDSTATE) ? "SS" : "",
> (p->flags & SWP_DISCARDABLE) ? "D" : "",
> --
> 2.34.1
>
>

2024-05-22 18:03:27

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 1/2] iomap: swap: print warning for unaligned swapfile

Hi Sukrit,

It seems that you need the swap file block start address to read the
swap file headers.
This warning still requires the user to read the dmesg. The kernel
still does not have the swapfile header at resume. In other words, it
does not fix the issue.

I don't know the suspend/resume code enough, will adding recording the
physical start address of the swapfile in swap_info_struct help you
address this problem? The suspend code can write that value to
"somewhere* for resume to pick it up.

Let's find a proper way to fix this issue rather than just warning on it.

Chris

On Wed, May 22, 2024 at 12:42 AM Sukrit Bhatnagar
<[email protected]> wrote:
>
> When creating a swapfile on a filesystem with block size less than the
> PAGE_SIZE, there is a possibility that the starting physical block is not
> page-aligned, which results in rounding up that value before setting it
> in the first swap extent. But now that the value is rounded up, we have
> lost the actual offset location of the first physical block.
>
> The starting physical block value is needed in hibernation when using a
> swapfile, i.e., the resume_offset. After we have written the snapshot
> pages, some values will be set in the swap header which is accessed using
> that offset location. However, it will not find the swap header if the
> offset value was rounded up and results in an error.
>
> The swapfile offset being unaligned should not fail the swapon activation
> as the swap extents will always have the alignment.
>
> Therefore, just print a warning if an unaligned swapfile is activated
> when hibernation is enabled.
>
> Signed-off-by: Sukrit Bhatnagar <[email protected]>
> ---
> fs/iomap/swapfile.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
> index 5fc0ac36dee3..1f7b189089dd 100644
> --- a/fs/iomap/swapfile.c
> +++ b/fs/iomap/swapfile.c
> @@ -49,6 +49,16 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
> next_ppage = ALIGN_DOWN(iomap->addr + iomap->length, PAGE_SIZE) >>
> PAGE_SHIFT;
>
> +#ifdef CONFIG_HIBERNATION
> + /*
> + * Print a warning if the starting physical block is not aligned
> + * to PAGE_SIZE (for filesystems using smaller block sizes).
> + * This will fail the hibernation suspend as we need to read
> + * the swap header later using the starting block offset.
> + */
> + if (!iomap->offset && iomap->addr & PAGE_MASK)
> + pr_warn("swapon: starting physical offset not page-aligned\n");
> +#endif
> /* Skip too-short physical extents. */
> if (first_ppage >= next_ppage)
> return 0;
> --
> 2.34.1
>
>

2024-05-22 20:06:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: print starting physical block offset in swapon

On Wed, May 22, 2024 at 07:56:37AM -0700, Darrick J. Wong wrote:
> On Wed, May 22, 2024 at 04:46:58PM +0900, Sukrit Bhatnagar wrote:
> > When a swapfile is created for hibernation purposes, we always need
> > the starting physical block offset, which is usually determined using
> > userspace commands such as filefrag.
>
> If you always need this value, then shouldn't it be exported via sysfs
> or somewhere so that you can always get to it? The kernel ringbuffer
> can overwrite log messages, swapfiles can get disabled, etc.

Scraping a block address from anything is just broken.

Wher is the code using this? It needs a proper kernel interface.

Same about the warning crap in patch 1.

2024-05-23 19:46:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

Hi!

> While trying to use a swapfile for hibernation, I noticed that the suspend
> process was failing when it tried to search for the swap to use for snapshot.
> I had created the swapfile on ext4 and got the starting physical block offset
> using the filefrag command.

How is swapfile for hibernation supposed to work? I'm afraid that
can't work, and we should just not allow hibernation if there's
anything else than just one swap partition.

Best regards,
Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.


Attachments:
(No filename) (561.00 B)
signature.asc (201.00 B)
Download all attachments

2024-05-27 11:02:45

by [email protected]

[permalink] [raw]
Subject: RE: [PATCH 2/2] mm: swap: print starting physical block offset in swapon

Hi Darrick,

On 2024-05-22 23:56, Darrick Wong wrote:
> On Wed, May 22, 2024 at 04:46:58PM +0900, Sukrit Bhatnagar wrote:
>> When a swapfile is created for hibernation purposes, we always need
>> the starting physical block offset, which is usually determined using
>> userspace commands such as filefrag.
>
> If you always need this value, then shouldn't it be exported via sysfs
> or somewhere so that you can always get to it? The kernel ringbuffer
> can overwrite log messages, swapfiles can get disabled, etc.

I agree on using appropriate kernel interfaces instead of kernel log.

>> It would be good to have that value printed when we do swapon and get
>> that value directly from dmesg.
>>
>> Signed-off-by: Sukrit Bhatnagar <[email protected]>
>> ---
>> mm/swapfile.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f6ca215fb92f..53c9187d5fbe 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3264,8 +3264,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
>> enable_swap_info(p, prio, swap_map, cluster_info);
>> - pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
>> + pr_info("Adding %uk swap on %s. Priority:%d extents:%d start:%llu across:%lluk %s%s%s%s\n",
>> K(p->pages), name->name, p->prio, nr_extents,
>> + (unsigned long long)first_se(p)->start_block,
>
> Last time I looked, start_block was in units of PAGE_SIZE, despite
> add_swap_extent confusingly (ab)using the sector_t type. Wherever you
> end up reporting this value, it ought to be converted to something more
> common (like byte offset or 512b-block offset).

I could not find any swap-related entries in the sysfs, but there is
/proc/swaps which shows the enabled swaps in a table.
A column for this start offset could be added there, which as you have
mentioned, should be in a unit such as bytes instead of PAGE_SIZE
blocks.

> Also ... if this is a swap *file* then reporting the path and the
> physical storage device address is not that helpful. Exposing the block
> device major/minor and block device address would be much more useful,
> wouldn't it?

For exposing information about swap file path, I think it wouldn't make
much difference (at least for the hibernate case) as we can always do
the file-path -> bdev-path -> major:minor conversion in userspace.

> (Not that I have any idea what the "suspend process" in the cover letter
> refers to -- suspend and hibernate have been broken on xfs forever...)

By suspend process, I meant the series of steps taken when we trigger
hibernate's suspend-to-disk.
Not the task that started it. (Wrong choice of words, my bad).

--
Sukrit

2024-05-27 11:06:44

by [email protected]

[permalink] [raw]
Subject: RE: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

Hi Pavel!

On 2024-05-24 04:45, Pavel Machek wrote:
> Hi!
>
>> While trying to use a swapfile for hibernation, I noticed that the suspend
>> process was failing when it tried to search for the swap to use for snapshot.
>> I had created the swapfile on ext4 and got the starting physical block offset
>> using the filefrag command.
>
> How is swapfile for hibernation supposed to work? I'm afraid that
> can't work, and we should just not allow hibernation if there's
> anything else than just one swap partition.

I am not sure what you mean.

We can pass the starting physical block offset of a swapfile into
/sys/power/resume_offset, and hibernate can directly read/write
into it using the swap extents information created by iomap during
swapon. On resume, the kernel would read this offset value from
the commandline parameters, and then access the swapfile.

I find having a swapfile option for hibernate useful, in the scenarios
where it is hard to modify the partitioning scheme, or to have a
dedicated swap partition.

Are there any plans to remove swapfile support from hibernation?

--
Sukrit

2024-05-27 11:32:31

by [email protected]

[permalink] [raw]
Subject: RE: [PATCH 1/2] iomap: swap: print warning for unaligned swapfile

Hi Chris,

On 2024-05-23 03:02, Chris Li wrote:
> Hi Sukrit,
>
> It seems that you need the swap file block start address to read the
> swap file headers.
> This warning still requires the user to read the dmesg. The kernel
> still does not have the swapfile header at resume. In other words, it
> does not fix the issue.

This was not intended to be a fix.

I had created this patch thinking that adding an actual fix for this might
be non-trivial and may not be desirable to everyone, especially when
swapfile+hibernation is not commonly used.

> I don't know the suspend/resume code enough, will adding recording the
> physical start address of the swapfile in swap_info_struct help you
> address this problem? The suspend code can write that value to
> "somewhere* for resume to pick it up.
>
> Let's find a proper way to fix this issue rather than just warning on it.

Adding a new member in swap_info_struct for the physical block offset
will help in the fix, I think. It can even be enclosed in #ifdef HIBERNATION
if nothing else needs that value.
This value can be checked by hibernate[1] as:
sis->start_offset == first_se(sis)->start_block

Another possible solution might be to add a new swap flag like SWP_SUSP
or SWP_HIBERNATE and set it only when we don't have this rounding up
issue.

Since we will boot the kernel normally first and later switch to our
pre-hibernate kernel after image load from swap, the value cannot be
set inside the kernel memory.
We pass the start block address in kernel commandline parameter
(resume_offset) for next boot.

[1]: The hibernate swapfile checking code is the function
swap_type_of() in mm/swapfile.c

--
Sukrit

2024-05-27 12:51:54

by [email protected]

[permalink] [raw]
Subject: RE: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

Hi Christoph,

On 2024-05-27 20:20, Christoph Hellwig wrote:
> On Mon, May 27, 2024 at 11:06:11AM +0000, [email protected] wrote:
>> We can pass the starting physical block offset of a swapfile into
>> /sys/power/resume_offset, and hibernate can directly read/write
>> into it using the swap extents information created by iomap during
>> swapon. On resume, the kernel would read this offset value from
>> the commandline parameters, and then access the swapfile.
>
> Reading a physical address from userspace is not a proper interface.
> What is this code even trying to do with it?

I understand your point. Ideally, the low-level stuff such as finding
the physical block offset should not be handled in the userspace.

In my understanding, the resume offset in hibernate is used as follows.

Suspend
- Hibernate looks up the swap/swapfile using the details we pass in the
sysfs entries, in the function swsusp_swap_check():
* /sys/power/resume - path/uuid/major:minor of the swap partition (or
non-swap partition for swapfile)
* /sys/power/resume_offset - physical offset of the swapfile in that
partition
* If no resume device is specified, it just uses the first available swap!
- It then proceeds to write the image to the specified swap.
(The allocation of swap pages is done by the swapfile code internally.)
- When writing is finished, the swap header needs to be updated with some
metadata, in the function mark_swapfiles().
* Hibernate creates bio requests to read/write the header (which is the
first page of swap) using that physical block offset.

Resume
- Hibernate gets the partition and offset values from kernel command-line
parameters "resume" and "resume_offset" (which must be set from
userspace, not ideal).
- It checks for valid hibernate swap signature by reading the swap header.
* Hibernate creates bio requests again, using the physical block offset,
but the one from kernel command-line this time.
- Then it restores image and resumes into the previously saved kernel.

--
Sukrit

2024-05-27 14:57:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

On Mon, May 27, 2024 at 11:06:11AM +0000, [email protected] wrote:
> We can pass the starting physical block offset of a swapfile into
> /sys/power/resume_offset, and hibernate can directly read/write
> into it using the swap extents information created by iomap during
> swapon. On resume, the kernel would read this offset value from
> the commandline parameters, and then access the swapfile.

Reading a physical address from userspace is not a proper interface.
What is this code even trying to do with it?


2024-05-27 15:52:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

On Mon, May 27, 2024 at 12:51:07PM +0000, [email protected] wrote:
> In my understanding, the resume offset in hibernate is used as follows.
>
> Suspend
> - Hibernate looks up the swap/swapfile using the details we pass in the
> sysfs entries, in the function swsusp_swap_check():
> * /sys/power/resume - path/uuid/major:minor of the swap partition (or
> non-swap partition for swapfile)
> * /sys/power/resume_offset - physical offset of the swapfile in that
> partition
> * If no resume device is specified, it just uses the first available swap!
> - It then proceeds to write the image to the specified swap.
> (The allocation of swap pages is done by the swapfile code internally.)

Where "it" is userspace code? If so, that already seems unsafe for
a swap device, but definitely is a no-go for a swapfile.

> - Hibernate gets the partition and offset values from kernel command-line
> parameters "resume" and "resume_offset" (which must be set from
> userspace, not ideal).

Or is it just for these parameters? In which case we "only" need to
specify the swap file, which would then need code in the file system
driver to resolve the logical to physical mapping as swap files don't
need to be contiguous.


2024-05-31 14:17:06

by [email protected]

[permalink] [raw]
Subject: RE: [PATCH 0/2] Improve dmesg output for swapfile+hibernation

On 2024-05-27 21:58, Christoph Hellwig wrote:
> On Mon, May 27, 2024 at 12:51:07PM +0000, [email protected] wrote:
>> In my understanding, the resume offset in hibernate is used as follows.
>>
>> Suspend
>> - Hibernate looks up the swap/swapfile using the details we pass in the
>> sysfs entries, in the function swsusp_swap_check():
>> * /sys/power/resume - path/uuid/major:minor of the swap partition (or
>> non-swap partition for swapfile)
>> * /sys/power/resume_offset - physical offset of the swapfile in that
>> partition
>> * If no resume device is specified, it just uses the first available
>> swap! - It then proceeds to write the image to the specified swap.
>> (The allocation of swap pages is done by the swapfile code
>> internally.)
>
> Where "it" is userspace code? If so, that already seems unsafe for
> a swap device, but definitely is a no-go for a swapfile.

By "it", I meant the hibernate code running in kernel space.
Once userspace triggers hibernation by `echo disk > /sys/power/state`
or a systemd wrapper program etc., and userspace tasks are frozen,
everything happens within kernel context.

>> - Hibernate gets the partition and offset values from kernel command-line
>> parameters "resume" and "resume_offset" (which must be set from
>> userspace, not ideal).
>
> Or is it just for these parameters? In which case we "only" need to
> specify the swap file, which would then need code in the file system
> driver to resolve the logical to physical mapping as swap files don't
> need to be contiguous.

Yes, it is just for setting these parameters in sysfs entries and in kernel
commandline.
I think specifying the swapfile path *may* not work because when we resume
from hibernation, the filesystems are not yet mounted (except for the case
when someone is resuming from initramfs stage).
Using the block device + physical offset, this procedure becomes Independent
of the filesystem and the mounted status.
And since the system swap information is lost on reboot/shutdown, the kernel
which loads the hibernation image will not know what swaps were enabled
when the image was created.

--
Sukrit