2022-05-18 07:49:18

by Vivek Kumar

Subject: [RFC 0/6] Bootloader based hibernation

Kernel Hibernation

The Linux kernel already supports hibernation: all userspace tasks are
frozen, all kernel device drivers are quiesced, and a DDR snapshot is
taken and saved to a swap partition on disk. After the save, the system
can either shut down or continue running. On the next power cycle, the
kernel boots and probes almost all of its drivers; then, in the
late_init() stage, it checks whether a hibernation image is present in
the specified swap slot. If a valid image is found, the currently
executing kernel is replaced with the older kernel from the snapshot,
the drivers' restore callbacks are called, and the userspace tasks are
unfrozen. CONFIG_HIBERNATION and a designated swap partition must be
present to enable hibernation.

Bootloader Based Hibernation:

Automotive use cases require better boot KPIs, hence we are proposing a
bootloader based hibernation restore. Its purpose is to improve the
overall boot time, measured from the power-on reset key press until the
first display frame is seen on the screen or a camera application can be
launched from userspace. This RFC patchset implements a slightly tweaked
version of hibernation in which the older snapshot is restored into DDR
by the bootloader (ABL) itself. This saves some time (1 second measured
on the msm-4.14 kernel) by not running a temporary kernel that looks for
the hibernation image at late_init().
To achieve this, the bootloader checks the swap partition for a
hibernation image at a very early stage, parses the image, and loads it
into DDR instead of loading the boot image from the boot partition.
Since the temporary kernel, which would have done basic ARM setup such
as MMU enablement, EL2 setup, and CPU setup, is not run, the entry point
into the hibernation snapshot image from the bootloader is different.
Along similar lines, all device drivers now re-program their IO-mapped
registers as part of the restore callback (triggered from the
hibernation framework) to bring HW and SW back in sync.

Other factors, such as the read speed of the secondary storage device
and the organization of the hibernation image in the swap partition,
affect the total image restore time and the overall boot time. In our
current implementation we have serialized the allocation of swap
partition slots in the kernel, so that when the hibernation image is
saved to disk its pages are not scattered across various swap-slot
offsets but are laid out serially. For example, if the DDR page at page
frame number (PFN) 0x8005 is located at swap-slot offset 50, the next
valid DDR page, at PFN 0x8006, will be present at swap-slot offset 51.
With this optimization in place, the bootloader can issue disk reads at
maximum capacity, reading a bigger chunk (~50 MB at once) from the swap
slot, and parsing the image also becomes simpler because it is available
contiguously.
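
To illustrate why a serial layout helps a simple reader, here is a
minimal user-space sketch. It is not part of this patchset and not the
ABL code; count_reads() and its parameters are hypothetical names. It
counts how many read requests are needed when adjacent swap slots are
merged into one request of bounded size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the read requests needed to fetch n image pages whose swap-slot
 * offsets are listed in slots[], in image order. Consecutive slots are
 * merged into a single request of at most max_pages pages.
 * (Hypothetical helper, for illustration only.)
 */
static size_t count_reads(const unsigned long *slots, size_t n,
			  size_t max_pages)
{
	size_t reads = 0;	/* requests issued so far */
	size_t run = 0;		/* pages in the current request */
	size_t i;

	assert(max_pages > 0);

	for (i = 0; i < n; i++) {
		/*
		 * Start a new request for the first page, when this slot
		 * does not directly follow the previous one, or when the
		 * current request is already full.
		 */
		if (run == 0 || slots[i] != slots[i - 1] + 1 ||
		    run == max_pages) {
			reads++;
			run = 0;
		}
		run++;
	}
	return reads;
}
```

With eight pages in slots 50..57 and a limit of four pages per request,
this counts two requests; with the same eight pages scattered across
non-adjacent slots it counts eight.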



Vivek Kumar (6):
arm64: hibernate: Introduce new entry point to kernel
PM: Hibernate: Add option to disable disk offset randomization
block: gendisk: Add a new genhd capability flag
mm: swap: Add randomization check for swapon/off calls
Hibernate: Add check for pte_valid in saveable page
irqchip/gic-v3: Re-init GIC hardware upon hibernation restore

Documentation/admin-guide/kernel-parameters.txt | 11 ++
arch/arm64/kernel/hibernate.c | 9 ++
drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++-
include/linux/blkdev.h | 1 +
kernel/power/snapshot.c | 43 ++++++++
kernel/power/swap.c | 12 +++
mm/swapfile.c | 6 +-
7 files changed, 216 insertions(+), 4 deletions(-)

--
2.7.4



2022-05-18 07:49:39

by Vivek Kumar

Subject: [RFC 1/6] arm64: hibernate: Introduce new entry point to kernel

Introduce a new entry point to the hibernated kernel image.
This is needed when the bootloader restores the hibernated
image from disk to DDR and passes control to it with the MMU
turned off. Initialize this new entry point with cpu_resume,
which turns on the MMU and then proceeds with the restore
routines.

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
arch/arm64/kernel/hibernate.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 6328308..4e294b3 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -74,6 +74,14 @@ static struct arch_hibernate_hdr {
void (*reenter_kernel)(void);

/*
+ * Another entry point if jump to kernel happens with mmu disabled,
+ * generally done when restoring hibernation image from bootloader
+ * context
+ */
+
+ phys_addr_t phys_reenter_kernel;
+
+ /*
* We need to know where the __hyp_stub_vectors are after restore to
* re-configure el2.
*/
@@ -116,6 +124,7 @@ int arch_hibernation_header_save(void *addr, unsigned int max_size)
arch_hdr_invariants(&hdr->invariants);
hdr->ttbr1_el1 = __pa_symbol(swapper_pg_dir);
hdr->reenter_kernel = _cpu_resume;
+ hdr->phys_reenter_kernel = __pa(cpu_resume);

/* We can't use __hyp_get_vectors() because kvm may still be loaded */
if (el2_reset_needed())
--
2.7.4


2022-05-18 07:49:44

by Vivek Kumar

Subject: [RFC 2/6] PM: Hibernate: Add option to disable disk offset randomization

Add a kernel parameter to disable disk offset randomization
for SSD devices on which this feature is provided at the
firmware level. This helps improve hibernation resume time.

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++
kernel/power/swap.c | 9 +++++++++
2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 666ade9..06b4f10 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5192,6 +5192,17 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).

+ noswap_randomize
+ Kernel uses random disk offsets to help with wear-levelling
+ of SSD devices, while saving the hibernation snapshot image to
+ disk. Use this parameter to disable this feature for SSD
+ devices in scenarios where such randomization is handled at
+ the firmware level and the hibernation image is not re-generated
+ frequently.
+ (Useful for improving hibernation resume time as snapshot pages
+ are available in disk serially and can be read in bigger chunks
+ without seeking)
+
retain_initrd [RAM] Keep initrd memory after extraction

rfkill.default_state=
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 91fffdd..8d5c811 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -44,6 +44,7 @@ u32 swsusp_hardware_signature;
*/
static bool clean_pages_on_read;
static bool clean_pages_on_decompress;
+static bool noswap_randomize;

/*
* The swap map is a data structure used for keeping track of each page
@@ -1616,3 +1617,11 @@ static int __init swsusp_header_init(void)
}

core_initcall(swsusp_header_init);
+
+static int __init noswap_randomize_setup(char *str)
+{
+ noswap_randomize = true;
+ return 1;
+}
+
+__setup("noswap_randomize", noswap_randomize_setup);
--
2.7.4


2022-05-18 07:50:01

by Vivek Kumar

Subject: [RFC 4/6] mm: swap: Add randomization check for swapon/off calls

Add an additional check in the swapon/swapoff syscalls to disable
randomization of swap offsets if the GENHD_FL_NO_RANDOMIZE
flag is set.

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
mm/swapfile.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1c3d5b9..a3eeab6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2474,7 +2474,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
if (p->flags & SWP_CONTINUED)
free_swap_count_continuations(p);

- if (!p->bdev || !bdev_nonrot(p->bdev))
+ if (!p->bdev || (p->bdev->bd_disk->flags & GENHD_FL_NO_RANDOMIZE)
+ || !bdev_nonrot(p->bdev))
atomic_dec(&nr_rotate_swap);

mutex_lock(&swapon_mutex);
@@ -3065,7 +3066,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (p->bdev && p->bdev->bd_disk->fops->rw_page)
p->flags |= SWP_SYNCHRONOUS_IO;

- if (p->bdev && bdev_nonrot(p->bdev)) {
+ if (p->bdev && !(p->bdev->bd_disk->flags & GENHD_FL_NO_RANDOMIZE) &&
+ bdev_nonrot(p->bdev)) {
int cpu;
unsigned long ci, nr_cluster;

--
2.7.4


2022-05-18 07:50:03

by Vivek Kumar

Subject: [RFC 5/6] Hibernate: Add check for pte_valid in saveable page

Add a pte_valid check to saveable_page(), after the existing checks.
This is required because the PTE is removed for pages allocated
with dma_alloc_coherent with the DMA_ATTR_NO_KERNEL_MAPPING flag set.
This patch makes sure that such pages are not considered for the
snapshot.

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
kernel/power/snapshot.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 2a40675..a6ad2a5 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1308,6 +1308,41 @@ static inline void *saveable_highmem_page(struct zone *z, unsigned long p)
}
#endif /* CONFIG_HIGHMEM */

+static bool kernel_pte_present(struct page *page)
+{
+ pgd_t *pgdp;
+ p4d_t *p4dp;
+ pud_t *pudp, pud;
+ pmd_t *pmdp, pmd;
+ pte_t *ptep;
+ unsigned long addr = (unsigned long)page_address(page);
+
+ pgdp = pgd_offset_k(addr);
+ if (pgd_none(READ_ONCE(*pgdp)))
+ return false;
+
+ p4dp = p4d_offset(pgdp, addr);
+ if (p4d_none(READ_ONCE(*p4dp)))
+ return false;
+
+ pudp = pud_offset(p4dp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ return false;
+ if (pud_sect(pud))
+ return true;
+
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ return false;
+ if (pmd_sect(pmd))
+ return true;
+
+ ptep = pte_offset_kernel(pmdp, addr);
+ return pte_valid(READ_ONCE(*ptep));
+}
+
/**
* saveable_page - Check if the given page is saveable.
*
@@ -1341,6 +1376,14 @@ static struct page *saveable_page(struct zone *zone, unsigned long pfn)
&& (!kernel_page_present(page) || pfn_is_nosave(pfn)))
return NULL;

+ /*
+ * Even if page is not reserved and if it's not present in kernel PTE;
+ * don't snapshot it ! This happens to the pages allocated using
+ * __dma_alloc_coherent with DMA_ATTR_NO_KERNEL_MAPPING flag set.
+ */
+ if (!kernel_pte_present(page))
+ return NULL;
+
if (page_is_guard(page))
return NULL;

--
2.7.4


2022-05-18 07:50:08

by Vivek Kumar

Subject: [RFC 3/6] block: gendisk: Add a new genhd capability flag

Add a new genhd capability flag to serialize offsets
for swap partition. This flag is enabled for the gendisk of
the block device which will be used for saving the snapshot
of the hibernation image, based on a kernel parameter
"noswap_randomize". Serializing offset in swap partition
helps in improving hibernation resume time from bootloader.

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
include/linux/blkdev.h | 1 +
kernel/power/swap.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b24c1f..be094e7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -92,6 +92,7 @@ enum {
GENHD_FL_REMOVABLE = 1 << 0,
GENHD_FL_HIDDEN = 1 << 1,
GENHD_FL_NO_PART = 1 << 2,
+ GENHD_FL_NO_RANDOMIZE = 1 << 3,
};

enum {
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 8d5c811..0a40eda 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -1526,6 +1526,9 @@ int swsusp_check(void)
FMODE_READ | FMODE_EXCL, &holder);
if (!IS_ERR(hib_resume_bdev)) {
set_blocksize(hib_resume_bdev, PAGE_SIZE);
+ if (noswap_randomize)
+ hib_resume_bdev->bd_disk->flags |=
+ GENHD_FL_NO_RANDOMIZE;
clear_page(swsusp_header);
error = hib_submit_io(REQ_OP_READ, 0,
swsusp_resume_block,
--
2.7.4


2022-05-18 07:50:21

by Vivek Kumar

Subject: [RFC 6/6] irqchip/gic-v3: Re-init GIC hardware upon hibernation restore

This patch backs up several sets of GIC registers during hibernation
suspend. On receiving the hibernation restore callback, it restores
the register values from the backup. This ensures that the hardware
state after restore is the same as it was just before hibernation.
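
The save/restore pattern described above can be sketched in user space
as follows. This is only an illustration under stated assumptions:
struct irq_regs, its field sizes and the helper names are hypothetical,
and plain memory stands in for the MMIO registers that the real driver
accesses with readl_relaxed()/writel_relaxed().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_ENABLE_WORDS 4			/* hypothetical: 4 x 32 = 128 lines */
#define NR_CONFIG_WORDS (NR_ENABLE_WORDS * 2)	/* 2 config bits per line */

/* Plain memory standing in for registers that lose state on power-off. */
struct irq_regs {
	uint32_t enable[NR_ENABLE_WORDS];
	uint32_t pending[NR_ENABLE_WORDS];
	uint32_t config[NR_CONFIG_WORDS];
};

/* Suspend side: copy the live register values into the backup. */
static void irq_regs_save(const struct irq_regs *hw, struct irq_regs *backup)
{
	memcpy(backup, hw, sizeof(*backup));
}

/* Models the power cycle: the hardware comes back with everything cleared. */
static void irq_regs_power_cycle(struct irq_regs *hw)
{
	memset(hw, 0, sizeof(*hw));
}

/* Restore side: trigger configuration first, then pending and enable bits. */
static void irq_regs_restore(struct irq_regs *hw,
			     const struct irq_regs *backup)
{
	size_t i;

	for (i = 0; i < NR_CONFIG_WORDS; i++)
		hw->config[i] = backup->config[i];

	for (i = 0; i < NR_ENABLE_WORDS; i++) {
		hw->pending[i] = backup->pending[i];
		hw->enable[i] = backup->enable[i];
	}
}
```

The restore order in the sketch (configuration, then pending, then
enable) mirrors the order used in gic_resume() in the diff below.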

Signed-off-by: Vivek Kumar <[email protected]>
Signed-off-by: Prasanna Kumar <[email protected]>
---
drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 136 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 2be8dea..442d32f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -29,6 +29,10 @@
#include <asm/smp_plat.h>
#include <asm/virt.h>

+#include <linux/syscore_ops.h>
+#include <linux/suspend.h>
+#include <linux/notifier.h>
+
#include "irq-gic-common.h"

#define GICD_INT_NMI_PRI (GICD_INT_DEF_PRI & ~0x80)
@@ -56,6 +60,14 @@ struct gic_chip_data {
bool has_rss;
unsigned int ppi_nr;
struct partition_desc **ppi_descs;
+#ifdef CONFIG_HIBERNATION
+ unsigned int enabled_irqs[32];
+ unsigned int active_irqs[32];
+ unsigned int irq_edg_lvl[64];
+ unsigned int ppi_edg_lvl;
+ unsigned int enabled_sgis;
+ unsigned int pending_sgis;
+#endif
};

static struct gic_chip_data gic_data __read_mostly;
@@ -170,6 +182,9 @@ static enum gic_intid_range get_intid_range(struct irq_data *d)
return __get_intid_range(d->hwirq);
}

+static void gic_dist_init(void);
+static void gic_cpu_init(void);
+
static inline unsigned int gic_irq(struct irq_data *d)
{
return d->hwirq;
@@ -828,7 +843,7 @@ static bool gic_has_group0(void)
return val != 0;
}

-static void __init gic_dist_init(void)
+static void gic_dist_init(void)
{
unsigned int i;
u64 affinity;
@@ -1399,6 +1414,120 @@ static void gic_cpu_pm_init(void)
static inline void gic_cpu_pm_init(void) { }
#endif /* CONFIG_CPU_PM */

+#ifdef CONFIG_PM
+#ifdef CONFIG_HIBERNATION
+extern int in_suspend;
+static bool hibernation;
+
+static int gic_suspend_notifier(struct notifier_block *nb,
+ unsigned long event,
+ void *dummy)
+{
+ if (event == PM_HIBERNATION_PREPARE)
+ hibernation = true;
+ else if (event == PM_POST_HIBERNATION)
+ hibernation = false;
+ return NOTIFY_OK;
+}
+
+static struct notifier_block gic_notif_block = {
+ .notifier_call = gic_suspend_notifier,
+};
+
+static void gic_hibernation_suspend(void)
+{
+ int i;
+ void __iomem *base = gic_data.dist_base;
+ void __iomem *rdist_base = gic_data_rdist_sgi_base();
+
+ gic_data.enabled_sgis = readl_relaxed(rdist_base + GICD_ISENABLER);
+ gic_data.pending_sgis = readl_relaxed(rdist_base + GICD_ISPENDR);
+ /* Store edge level for PPIs by reading GICR_ICFGR1 */
+ gic_data.ppi_edg_lvl = readl_relaxed(rdist_base + GICR_ICFGR0 + 4);
+
+ for (i = 0; i * 32 < GIC_LINE_NR; i++) {
+ gic_data.enabled_irqs[i] = readl_relaxed(base +
+ GICD_ISENABLER + i * 4);
+ gic_data.active_irqs[i] = readl_relaxed(base +
+ GICD_ISPENDR + i * 4);
+ }
+
+ for (i = 2; i < GIC_LINE_NR / 16; i++)
+ gic_data.irq_edg_lvl[i] = readl_relaxed(base +
+ GICD_ICFGR + i * 4);
+}
+#endif
+
+static int gic_suspend(void)
+{
+#ifdef CONFIG_HIBERNATION
+ if (unlikely(hibernation))
+ gic_hibernation_suspend();
+#endif
+ return 0;
+}
+
+void gic_resume(void)
+{
+#ifdef CONFIG_HIBERNATION
+ int i;
+ void __iomem *base = gic_data.dist_base;
+ void __iomem *rdist_base = gic_data_rdist_sgi_base();
+
+ /*
+ * in_suspend is defined in hibernate.c and will be 0 during
+ * hibernation restore case. Also it will be 0 for suspend to ram case
+ * and similar cases. Underlying code will not get executed in regular
+ * cases and will be executed only for hibernation restore.
+ */
+ if (unlikely((in_suspend == 0 && hibernation))) {
+ pr_info("Re-initializing gic in hibernation restore\n");
+ gic_dist_init();
+ gic_cpu_init();
+ /* Activate and enable SGIs and PPIs */
+ writel_relaxed(gic_data.enabled_sgis,
+ rdist_base + GICD_ISENABLER);
+ writel_relaxed(gic_data.pending_sgis,
+ rdist_base + GICD_ISPENDR);
+ /* Restore edge and level triggers for PPIs from GICR_ICFGR1 */
+ writel_relaxed(gic_data.ppi_edg_lvl,
+ rdist_base + GICR_ICFGR0 + 4);
+
+ /* Restore edge and level triggers */
+ for (i = 2; i < GIC_LINE_NR / 16; i++)
+ writel_relaxed(gic_data.irq_edg_lvl[i],
+ base + GICD_ICFGR + i * 4);
+ gic_dist_wait_for_rwp();
+
+ /* Activate and enable interrupts from backup */
+ for (i = 0; i * 32 < GIC_LINE_NR; i++) {
+ writel_relaxed(gic_data.active_irqs[i],
+ base + GICD_ISPENDR + i * 4);
+
+ writel_relaxed(gic_data.enabled_irqs[i],
+ base + GICD_ISENABLER + i * 4);
+ }
+ gic_dist_wait_for_rwp();
+ }
+#endif
+}
+EXPORT_SYMBOL_GPL(gic_resume);
+
+static struct syscore_ops gic_syscore_ops = {
+ .suspend = gic_suspend,
+ .resume = gic_resume,
+};
+
+static void gic_syscore_init(void)
+{
+ register_syscore_ops(&gic_syscore_ops);
+}
+
+#else
+static inline void gic_syscore_init(void) { }
+void gic_resume(void) { }
+#endif
+
static struct irq_chip gic_chip = {
.name = "GICv3",
.irq_mask = gic_mask_irq,
@@ -1887,6 +2016,7 @@ static int __init gic_init_bases(void __iomem *dist_base,
gic_cpu_init();
gic_smp_init();
gic_cpu_pm_init();
+ gic_syscore_init();

if (gic_dist_supports_lpis()) {
its_init(handle, &gic_data.rdists, gic_data.domain);
@@ -2092,7 +2222,11 @@ static int __init gic_of_init(struct device_node *node, struct device_node *pare
redist_stride, &node->fwnode);
if (err)
goto out_unmap_rdist;
-
+#ifdef CONFIG_HIBERNATION
+ err = register_pm_notifier(&gic_notif_block);
+ if (err)
+ goto out_unmap_rdist;
+#endif
gic_populate_ppi_partitions(node);

if (static_branch_likely(&supports_deactivate_key))
--
2.7.4


2022-05-18 08:10:12

by Christoph Hellwig

Subject: Re: [RFC 3/6] block: gendisk: Add a new genhd capability flag

This has absolutely nothing to do with the block layer, and should not
abuse the gendisk.

2022-05-18 12:24:51

by Andrew Lunn

Subject: Re: [RFC 2/6] PM: Hibernate: Add option to disable disk offset randomization

On Wed, May 18, 2022 at 01:18:37PM +0530, Vivek Kumar wrote:
> Add a kernel parameter to disable the disk offset randomization
> for SSD devices in which such feature is available at the
> firmware level. This is helpful in improving hibernation
> resume time.
>
> Signed-off-by: Vivek Kumar <[email protected]>
> Signed-off-by: Prasanna Kumar <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++
> kernel/power/swap.c | 9 +++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 666ade9..06b4f10 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5192,6 +5192,17 @@
> Useful for devices that are detected asynchronously
> (e.g. USB and MMC devices).
>
> + noswap_randomize
> + Kernel uses random disk offsets to help with wear-levelling
> + of SSD devices, while saving the hibernation snapshot image to
> + disk. Use this parameter to disable this feature for SSD
> + devices in scenarios when, such randomization is addressed at
> + the firmware level and hibenration image is not re-generated
> + frequently.
> + (Useful for improving hibernation resume time as snapshot pages
> + are available in disk serially and can be read in bigger chunks
> + without seeking)

Seeking is a NOP for SSD, so it seems odd you mentioned that. Is the
real problem here that the bootloader driver is very simple, it does
not queue multiple reads to the hardware, but does it one block at a
time?

Do you have performance numbers for both the bootloader and Linux?
Does Linux performance reading the snapshot increase as much as for
the bootloader?

Andrew

2022-05-18 12:29:49

by Andrew Lunn

Subject: Re: [RFC 4/6] mm: swap: Add randomization check for swapon/off calls

On Wed, May 18, 2022 at 01:18:39PM +0530, Vivek Kumar wrote:
> Add addtional check on swapon/swapoff sycalls to disable
> randomization of swap offsets if GENHD_FL_NO_RANDOMIZE
> flag is passed.

Is there already a flag in the image header which tells you if the
image is randomized or not? I assume the bootloader needs to know,
doing a linear read of a randomized image is not going to end well.

Andrew

2022-05-18 13:05:48

by Christoph Hellwig

Subject: Re: [RFC 2/6] PM: Hibernate: Add option to disable disk offset randomization

On Wed, May 18, 2022 at 01:18:37PM +0530, Vivek Kumar wrote:
> Add a kernel parameter to disable the disk offset randomization
> for SSD devices in which such feature is available at the
> firmware level. This is helpful in improving hibernation
> resume time.

This patch just adds a global variable which is then entirely
ignored.

But the idea of "randomizing" offsets on SSDs sounds like complete BS to
start with. The whole job of the SSD is to remap from a random writable
block device to different physical blocks to deal with erases and wear
leveling. In other words it really doesn't matter what offset you
write to. That being said I could not actually find any code that does
this randomization to start with, but that might just be my lack of grep
skills.

2022-05-18 13:09:58

by Christoph Hellwig

Subject: Re: [RFC 5/6] Hibernate: Add check for pte_valid in saveable page

On Wed, May 18, 2022 at 01:18:40PM +0530, Vivek Kumar wrote:
> Add check for pte_valid in saveable page after being checked for
> the rest. This is required as PTE is removed for pages allocated
> with dma_alloc_coherent with DMA_ATTR_NO_KERNEL_MAPPING flag set.
> This patch makes sure that these pages are not considered for
> snapshot.

I don't think we ever remove kernel PTEs for DMA_ATTR_NO_KERNEL_MAPPING.
If the allocation did come from highmem they never had one to start
with. The logic here looks a bit fishy to me.

2022-05-18 17:00:41

by Pasha Tatashin

Subject: Re: [RFC 0/6] Bootloader based hibernation

Hi Vivek,

Interesting stuff.

On Wed, May 18, 2022 at 3:49 AM Vivek Kumar <[email protected]> wrote:
>
> Kernel Hibernation
>
> Linux Kernel has been already supporting hibernation, a process which
> involves freezing of all userspace tasks, followed by quiescing of all
> kernel device drivers and then a DDR snapshot is taken which is saved
> to disc-swap partition, after the save, the system can either shutdown
> or continue further. Generally during the next power cycle when kernel
> boots and after probing almost all of the drivers, in the late_init()
> part, it checks if a hibernation image is present in the specified swap
> slot, if a valid hibernation image is found, it superimposes the currently
> executing Kernel with an older kernel from the snapshot, moving further,
> it calls the restore of the drivers and unfreezes the userspace tasks.
> CONFIG_HIBERNATION and a designated swap partition needs to be present
> for to enable Hibernation.
>
> Bootloader Based Hibernation:
>
> Automotive usecases require better boot KPIs, Hence we are proposing a
> bootloader based hibernation restore. Purpose of bootloader based
> hibernation is to improve the overall boot time till the first display
> frame is seen on the screen or a camera application can be launched from
> userspace after the power on reset key is pressed. This RFC patchset
> implements a slightly tweaked version of hibernation in which the
> restoration of an older snapshot into DDR is being carried out from the
> bootloader (ABL) itself, by doing this we are saving some time
> (1 second measured on msm-4.14 Kernel) by not running a
> temporary kernel and figuring out the hibernation image at late_init().

I wonder where most of the time is spent? Is it initializing struct
pages? Potentially we could enlighten bootloader to determine whether
hibernation image is stored or not on the swap device, and change boot
parameter for the kernel accordingly. The booting kernel would know
from the very beginning of boot that it will eventually resume a
hibernated image, and therefore skip some of the initialization parts, and
perhaps limit amount of memory that it initializes.
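
As a rough sketch of the bootloader-side check, assuming the on-disk
layout in kernel/power/swap.c where the "S1SUSPEND" signature occupies
the last 10 bytes of the first page of the swap area. The helper names
and the "resume_expected=1" boot parameter are hypothetical; no such
parameter exists today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIG_LEN 10	/* "S1SUSPEND" plus its terminating NUL */

static const char hibernate_sig[SIG_LEN] = "S1SUSPEND";

/*
 * Return true if the first page of the swap area ends with the
 * hibernation signature, i.e. a hibernation image header is present.
 */
static bool has_hibernation_image(const unsigned char *first_page,
				  size_t page_size)
{
	if (page_size < SIG_LEN)
		return false;

	return memcmp(first_page + page_size - SIG_LEN,
		      hibernate_sig, SIG_LEN) == 0;
}

/*
 * Pick the extra command-line token to hand to the booting kernel.
 * "resume_expected=1" is a made-up name used only for illustration.
 */
static const char *extra_bootarg(bool image_present)
{
	return image_present ? "resume_expected=1" : "";
}
```

The bootloader would read one page from the resume device, call
has_hibernation_image() on it, and append the result of extra_bootarg()
to the kernel command line.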

> In order to achieve the same bootloader checks for the hibernation
> image at a very early stage from swap partition, it parses the image and
> loads it in the DDR instead of loading boot image form boot partition.
> Since we are not running the temporary kernel,which would have done some
> basic ARM related setup like, MMU enablement, EL2 setup, CPU setup etc,

What boot loader is used? I suspect bootloader enables MMU to load the
hibernated image into memory, otherwise the performance would be very
poor. After the image is loaded, and prior to jumping into the entry
address of loaded image the MMU is probably disabled.

> entry point into hibernation snapshot image directly from bootloader is
> different, on similar lines, all device drivers are now re-programming
> the IO-mapped registers as part of the restore callback (which is
> triggered from the hibernation framework) to bring back the HW/SW sync.
>
> Other factors like, read-speed of the secondary storage device and
> organization of the hibernation image in the swap partition effects the
> total image restore time and the overall boot time. In our current
> implementation we have serialized the allocation of swap-partition's slots
> in kernel, so when hibernation image is being saved to disc, each page is
> not scattered across various swap-slot offsets, rather it in a serial
> manner. For example, if a DDR page at Page frame number 0x8005 is
> located at a swap-slot offset 50, the next valid DDR page at PFN 0x8005
> will be preset at the swap-slot offset 51. With this optimization in
> place, bootloader can utilize the max capacity of issuing a disc-read
> for reading a bigger chunk (~50 MBs at once) from the swap slot,
> and also parsing of the image becomes simpler as it is available
> contiguously.

This optimization seems generic enough that it would benefit both
types of resume: from bootloader and from kernel.

Thanks,
Pasha

>
>
>
> Vivek Kumar (6):
> arm64: hibernate: Introduce new entry point to kernel
> PM: Hibernate: Add option to disable disk offset randomization
> block: gendisk: Add a new genhd capability flag
> mm: swap: Add randomization check for swapon/off calls
> Hibernate: Add check for pte_valid in saveable page
> irqchip/gic-v3: Re-init GIC hardware upon hibernation restore
>
> Documentation/admin-guide/kernel-parameters.txt | 11 ++
> arch/arm64/kernel/hibernate.c | 9 ++
> drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++-
> include/linux/blkdev.h | 1 +
> kernel/power/snapshot.c | 43 ++++++++
> kernel/power/swap.c | 12 +++
> mm/swapfile.c | 6 +-
> 7 files changed, 216 insertions(+), 4 deletions(-)
>
> --
> 2.7.4
>

2022-05-19 12:58:04

by Marc Zyngier

Subject: Re: [RFC 6/6] irqchip/gic-v3: Re-init GIC hardware upon hibernation restore

Hi Vivek,

On Wed, 18 May 2022 08:48:41 +0100,
Vivek Kumar <[email protected]> wrote:
>
> Code added in this patch takes backup of different set of
> registers during hibernation suspend. On receiving hibernation
> restore callback, it restores register values from backup. This
> ensures state of hardware to be same just before hibernation and
> after restore.
>
> Signed-off-by: Vivek Kumar <[email protected]>
> Signed-off-by: Prasanna Kumar <[email protected]>
> ---
> drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 136 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 2be8dea..442d32f 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -29,6 +29,10 @@
> #include <asm/smp_plat.h>
> #include <asm/virt.h>
>
> +#include <linux/syscore_ops.h>
> +#include <linux/suspend.h>
> +#include <linux/notifier.h>
> +
> #include "irq-gic-common.h"
>
> #define GICD_INT_NMI_PRI (GICD_INT_DEF_PRI & ~0x80)
> @@ -56,6 +60,14 @@ struct gic_chip_data {
> bool has_rss;
> unsigned int ppi_nr;
> struct partition_desc **ppi_descs;
> +#ifdef CONFIG_HIBERNATION
> + unsigned int enabled_irqs[32];
> + unsigned int active_irqs[32];
> + unsigned int irq_edg_lvl[64];
> + unsigned int ppi_edg_lvl;
> + unsigned int enabled_sgis;
> + unsigned int pending_sgis;
> +#endif

This is either way too much, or way too little. Just restoring these
registers is nowhere near enough, as you are completely ignoring the
ITS, so this will leave the machine broken for anything that requires
LPIs.

But if the bootloader is supposed to replace the kernel to put the HW
in a state where the GIC is usable again, why do we need any of this?

Hibernation relies on a basic promise: the secondary kernel is entered
with the HW in a reasonable state, and the basic infrastructure
(specially for stuff that can be only programmed once per boot such as
the RD tables) is already available. If the bootloader is going to do
the work of the initial kernel, then it must do it fully, and not
require this sort of random sprinkling all over the shop.

Effectively, there is an ABI between the primary kernel and the
secondary, and I don't see why this interface should change to paper
over what I see as the deficiencies of the bootloader.

Am I missing anything?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-05-20 01:02:42

by Marc Zyngier

Subject: Re: [RFC 1/6] arm64: hibernate: Introduce new entry point to kernel

On 2022-05-18 08:48, Vivek Kumar wrote:
> Introduce a new entry point to hibernated kernel image.
> This is generally needed when bootloader restores the
> hibernated image from disc to ddr and passes control
> to it by turning off the mmu, also initialize this new
> entry point with cpu_resume which turns on the mmu and
> then proceeds with restore routines.
>
> Signed-off-by: Vivek Kumar <[email protected]>
> Signed-off-by: Prasanna Kumar <[email protected]>
> ---
> arch/arm64/kernel/hibernate.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/arm64/kernel/hibernate.c
> b/arch/arm64/kernel/hibernate.c
> index 6328308..4e294b3 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -74,6 +74,14 @@ static struct arch_hibernate_hdr {
> void (*reenter_kernel)(void);
>
> /*
> + * Another entry point if jump to kernel happens with mmu disabled,
> + * generally done when restoring hibernation image from bootloader
> + * context
> + */
> +
> + phys_addr_t phys_reenter_kernel;
> +
> + /*
> * We need to know where the __hyp_stub_vectors are after restore to
> * re-configure el2.
> */
> @@ -116,6 +124,7 @@ int arch_hibernation_header_save(void *addr,
> unsigned int max_size)
> arch_hdr_invariants(&hdr->invariants);
> hdr->ttbr1_el1 = __pa_symbol(swapper_pg_dir);
> hdr->reenter_kernel = _cpu_resume;
> + hdr->phys_reenter_kernel = __pa(cpu_resume);
>
> /* We can't use __hyp_get_vectors() because kvm may still be loaded
> */
> if (el2_reset_needed())

So here, you are creating a new ABI with the bootloader, based on
a data structure that isn't meant to be ABI. It means that we
wouldn't be allowed to ever change this data structure, as this
would mean having to update the bootloader in sync.

Clearly, this isn't acceptable.

M.
--
Jazz is not dead. It just smells funny...

2022-05-23 06:02:27

by Mark Rutland

Subject: Re: [RFC 0/6] Bootloader based hibernation

Hi,

On Wed, May 18, 2022 at 01:18:35PM +0530, Vivek Kumar wrote:
> Kernel Hibernation
>
> Linux Kernel has been already supporting hibernation, a process which
> involves freezing of all userspace tasks, followed by quiescing of all
> kernel device drivers and then a DDR snapshot is taken which is saved
> to disc-swap partition, after the save, the system can either shutdown
> or continue further. Generally during the next power cycle when kernel
> boots and after probing almost all of the drivers, in the late_init()
> part, it checks if a hibernation image is present in the specified swap
> slot, if a valid hibernation image is found, it superimposes the currently
> executing Kernel with an older kernel from the snapshot, moving further,
> it calls the restore of the drivers and unfreezes the userspace tasks.
> CONFIG_HIBERNATION and a designated swap partition needs to be present
> for to enable Hibernation.
>
> Bootloader Based Hibernation:
>
> Automotive usecases require better boot KPIs, Hence we are proposing a
> bootloader based hibernation restore.

At a high-level, I'm not a fan of adding new ways to enter the kernel, and for
the same reasons that the existing hibernate handover is deliberately *not* a
stable ABI, I don't think we should add an ABI for this. This is not going to
remain maintainable or compatible over time as the kernel evolves.

> Purpose of bootloader based hibernation is to improve the overall boot time
> till the first display frame is seen on the screen or a camera application
> can be launched from userspace after the power on reset key is pressed.

Can you break down the time taken for that today?

What does a cold boot look like?

What *exactly* are you trying to skip by using hibernation?

Thanks,
Mark.

> This RFC patchset
> implements a slightly tweaked version of hibernation in which the
> restoration of an older snapshot into DDR is being carried out from the
> bootloader (ABL) itself, by doing this we are saving some time
> (1 second measured on msm-4.14 Kernel) by not running a
> temporary kernel and figuring out the hibernation image at late_init().
> In order to achieve the same bootloader checks for the hibernation
> image at a very early stage from swap partition, it parses the image and
> loads it in the DDR instead of loading boot image form boot partition.
> Since we are not running the temporary kernel,which would have done some
> basic ARM related setup like, MMU enablement, EL2 setup, CPU setup etc,
> entry point into hibernation snapshot image directly from bootloader is
> different, on similar lines, all device drivers are now re-programming
> the IO-mapped registers as part of the restore callback (which is
> triggered from the hibernation framework) to bring back the HW/SW sync.
>
> Other factors like, read-speed of the secondary storage device and
> organization of the hibernation image in the swap partition effects the
> total image restore time and the overall boot time. In our current
> implementation we have serialized the allocation of swap-partition's slots
> in kernel, so when hibernation image is being saved to disc, each page is
> not scattered across various swap-slot offsets, rather it is written in a
> serial manner. For example, if a DDR page at Page frame number 0x8005 is
> located at a swap-slot offset 50, the next valid DDR page at PFN 0x8006
> will be present at the swap-slot offset 51. With this optimization in
> place, bootloader can utilize the max capacity of issuing a disc-read
> for reading a bigger chunk (~50 MBs at once) from the swap slot,
> and also parsing of the image becomes simpler as it is available
> contiguously.
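
The effect of serial slot allocation on the number of disk reads can be
sketched as follows. This is a userspace illustration only, not code from
the patchset or the bootloader: count_read_requests() and its parameters
are hypothetical names, and the model simply merges runs of consecutive
swap-slot offsets into one request of at most max_pages pages.

```c
#include <stddef.h>

/*
 * Count the read requests needed to fetch n pages whose swap-slot
 * offsets are listed in slots[] (in image order). Consecutive slots
 * are merged into one request, capped at max_pages pages per request.
 */
static size_t count_read_requests(const unsigned long *slots, size_t n,
				  size_t max_pages)
{
	size_t requests = 0;
	size_t run = 0;		/* pages in the request being built */

	if (max_pages == 0)
		return 0;

	for (size_t i = 0; i < n; i++) {
		int contiguous = i > 0 && slots[i] == slots[i - 1] + 1;

		/* A gap or a full request forces a new read. */
		if (!contiguous || run == max_pages) {
			requests++;
			run = 0;
		}
		run++;
	}
	return requests;
}
```

With slots {50, 51, 52, 53} and max_pages of 4 this yields a single
request, while a scattered layout such as {50, 7, 91, 12} needs one
request per page. Assuming 4 KiB pages, the ~50 MB chunk mentioned above
would correspond to a max_pages of roughly 12800.
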
>
>
>
> Vivek Kumar (6):
> arm64: hibernate: Introduce new entry point to kernel
> PM: Hibernate: Add option to disable disk offset randomization
> block: gendisk: Add a new genhd capability flag
> mm: swap: Add randomization check for swapon/off calls
> Hibernate: Add check for pte_valid in saveable page
> irqchip/gic-v3: Re-init GIC hardware upon hibernation restore
>
> Documentation/admin-guide/kernel-parameters.txt | 11 ++
> arch/arm64/kernel/hibernate.c | 9 ++
> drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++-
> include/linux/blkdev.h | 1 +
> kernel/power/snapshot.c | 43 ++++++++
> kernel/power/swap.c | 12 +++
> mm/swapfile.c | 6 +-
> 7 files changed, 216 insertions(+), 4 deletions(-)
>
> --
> 2.7.4
>

2022-05-23 21:17:21

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC 2/6] PM: Hibernate: Add option to disable disk offset randomization



On 5/18/22 00:48, Vivek Kumar wrote:
> Add a kernel parameter to disable the disk offset randomization
> for SSD devices in which such feature is available at the
> firmware level. This is helpful in improving hibernation
> resume time.
>
> Signed-off-by: Vivek Kumar <[email protected]>
> Signed-off-by: Prasanna Kumar <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++
> kernel/power/swap.c | 9 +++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 666ade9..06b4f10 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5192,6 +5192,17 @@
> Useful for devices that are detected asynchronously
> (e.g. USB and MMC devices).
>
> + noswap_randomize
> + Kernel uses random disk offsets to help with wear-levelling

wear-leveling

> + of SSD devices, while saving the hibernation snapshot image to
> + disk. Use this parameter to disable this feature for SSD
> + devices in scenarios when, such randomization is addressed at

no comma ^

> + the firmware level and hibenration image is not re-generated

hibernation

> + frequently.
> + (Useful for improving hibernation resume time as snapshot pages
> + are available in disk serially and can be read in bigger chunks
> + without seeking)
> +
> retain_initrd [RAM] Keep initrd memory after extraction
>
> rfkill.default_state=
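
For anyone testing this, a small userspace sketch for checking whether
the bare flag was passed on the kernel command line (for example, on a
string read from /proc/cmdline). cmdline_has_flag() is a hypothetical
helper written for this illustration; it is not part of the patch and
does not show how the kernel itself parses the parameter.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the bare flag `name` appears as a whole, space-separated
 * token in `cmdline`, 0 otherwise. "name=value" forms do not match.
 */
static int cmdline_has_flag(const char *cmdline, const char *name)
{
	size_t len = strlen(name);
	const char *p = cmdline;

	if (len == 0)
		return 0;

	while ((p = strstr(p, name)) != NULL) {
		/* Token must start at the beginning or after a space... */
		int starts = (p == cmdline) || p[-1] == ' ';
		/* ...and end at a space, newline, or end of string. */
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;
	}
	return 0;
}
```

Calling cmdline_has_flag(buf, "noswap_randomize") on the command-line
string then tells a test script whether the option is in effect.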


--
~Randy