2005-01-20 19:33:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi,

The following patch speeds up the restoring of swsusp images on x86-64
and makes the assembly code more readable (tested and works on AMD64). It's
against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.

Signed-off-by: Rafael J. Wysocki <[email protected]>

--- linux-2.6.11-rc1-mm1/arch/x86_64/kernel/suspend_asm.S 2004-12-24 22:35:28.000000000 +0100
+++ linux-2.6.11-rc1-mm1-rjw/arch/x86_64/kernel/suspend_asm.S 2005-01-20 17:28:30.000000000 +0100
@@ -49,43 +49,28 @@
movq %rcx, %cr3;
movq %rax, %cr4; # turn PGE back on

+ movq pagedir_nosave(%rip), %rdx
+ /* compute the limit */
movl nr_copy_pages(%rip), %eax
- xorl %ecx, %ecx
- movq $0, %r10
testl %eax, %eax
jz done
-.L105:
- xorl %esi, %esi
- movq $0, %r11
- jmp .L104
- .p2align 4,,7
-copy_one_page:
- movq %r10, %rcx
-.L104:
- movq pagedir_nosave(%rip), %rdx
- movq %rcx, %rax
- salq $5, %rax
- movq 8(%rdx,%rax), %rcx
- movq (%rdx,%rax), %rax
- movzbl (%rsi,%rax), %eax
- movb %al, (%rsi,%rcx)
+ shlq $5, %rax; # multiply by sizeof(struct pbe)
+ addq %rdx, %rax
+loop:
+ /* get addresses from the pbe and copy the page */
+ movq (%rdx), %rsi
+ movq 8(%rdx), %rdi
+ movq $512, %rcx
+ rep
+ movsq

- movq %cr3, %rax; # flush TLB
- movq %rax, %cr3;
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;

- movq %r11, %rax
- incq %rax
- cmpq $4095, %rax
- movq %rax, %rsi
- movq %rax, %r11
- jbe copy_one_page
- movq %r10, %rax
- incq %rax
- movq %rax, %rcx
- movq %rax, %r10
- mov nr_copy_pages(%rip), %eax
- cmpq %rax, %rcx
- jb .L105
+ /* progress to the next pbe */
+ addq $32, %rdx; # add sizeof(struct pbe)
+ cmpq %rax, %rdx
+ jb loop
done:
movl $24, %eax
movl %eax, %ds



Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"


2005-01-20 21:04:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Thursday, 20 of January 2005 20:32, Rafael J. Wysocki wrote:
> Hi,
>
> The following patch speeds up the restoring of swsusp images on x86-64
> and makes the assembly code more readable (tested and works on AMD64). It's
> against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.

s/consifer/consider/

Sorry for the typo,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-20 21:11:30

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> The following patch speeds up the restoring of swsusp images on x86-64
> and makes the assembly code more readable (tested and works on AMD64). It's
> against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.

Can you really measure the speedup? If you want cheap way to speed it up,
kill cr3 manipulation.

Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.

Pavel

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-01-20 21:46:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
> Hi!
>
> > The following patch speeds up the restoring of swsusp images on x86-64
> > and makes the assembly code more readable (tested and works on AMD64). It's
> > against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.
>
> Can you really measure the speedup?

In terms of time? Probably I can, but I prefer to measure it in terms of the numbers of
operations to be performed.

With this patch, at least 8 times less memory accesses are required to restore an image
than without it, and in the original code cr3 is reloaded after copying each _byte_,
let alone the SIB arithmetics. I'd expect it to be 10 times faster or so.

The readability of code is also important, IMHO.

> If you want cheap way to speed it up, kill cr3 manipulation.

Sure, but I think it's there for a reason.

> Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.

I am aware of that, but you are not going to merge the hugang's patches soon, are you?
If necessary, I can change the patch to work with his code (hugang, what do you think?).

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-20 22:09:44

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > > The following patch speeds up the restoring of swsusp images on x86-64
> > > and makes the assembly code more readable (tested and works on AMD64). It's
> > > against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.
> >
> > Can you really measure the speedup?
>
> In terms of time? Probably I can, but I prefer to measure it in terms of the numbers of
> operations to be performed.
>
> With this patch, at least 8 times less memory accesses are required to restore an image
> than without it, and in the original code cr3 is reloaded after copying each _byte_,
> let alone the SIB arithmetics. I'd expect it to be 10 times faster
> or so.

Well, 8 times less cr3 reloads may be significant... for the copy
loop. Speeding up copy loop that takes ... 100msec?... of whole
resume (30 seconds) does not seem too important to me.

> The readability of code is also important, IMHO.

It did not seem too much better to me.

> > If you want cheap way to speed it up, kill cr3 manipulation.
>
> Sure, but I think it's there for a reason.

Reason is "to crash it early if we have wrong pagetables".

> > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
>
> I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> If necessary, I can change the patch to work with his code (hugang, what do you think?).

I think it is just not worth the effort.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-20 22:58:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi,

On Thursday, 20 of January 2005 23:06, Pavel Machek wrote:
> Hi!
>
> > > > The following patch speeds up the restoring of swsusp images on x86-64
> > > > and makes the assembly code more readable (tested and works on AMD64). It's
> > > > against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.
> > >
> > > Can you really measure the speedup?
> >
> > In terms of time? Probably I can, but I prefer to measure it in terms of the numbers of
> > operations to be performed.
> >
> > With this patch, at least 8 times less memory accesses are required to restore an image
> > than without it, and in the original code cr3 is reloaded after copying each _byte_,
> > let alone the SIB arithmetics. I'd expect it to be 10 times faster
> > or so.
>
> Well, 8 times less cr3 reloads may be significant... for the copy
> loop. Speeding up copy loop that takes ... 100msec?... of whole
> resume (30 seconds) does not seem too important to me.
>
> > The readability of code is also important, IMHO.
>
> It did not seem too much better to me.

Well, the beauty is in the eye of the beholder. :-)

Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs 7), it uses less
SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying data pages.
As far as the result is concerned, it is equivalent to the existing code, but it's simpler
(and faster). IMO, simpler code is always easier to understand.


> > > If you want cheap way to speed it up, kill cr3 manipulation.
> >
> > Sure, but I think it's there for a reason.
>
> Reason is "to crash it early if we have wrong pagetables".
>
> > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> >
> > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > If necessary, I can change the patch to work with his code (hugang, what do you think?).
>
> I think it is just not worth the effort.

Why? It won't take much time. I've spent more time for writing the messages
in this thread ... ;-)

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-20 23:07:09

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > > The readability of code is also important, IMHO.
> >
> > It did not seem too much better to me.
>
> Well, the beauty is in the eye of the beholder. :-)
>
> Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs 7), it uses less
> SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying data pages.
> As far as the result is concerned, it is equivalent to the existing code, but it's simpler
> (and faster). IMO, simpler code is always easier to understand.
>
>
> > > > If you want cheap way to speed it up, kill cr3 manipulation.
> > >
> > > Sure, but I think it's there for a reason.
> >
> > Reason is "to crash it early if we have wrong pagetables".
> >
> > > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> > >
> > > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > > If necessary, I can change the patch to work with his code (hugang, what do you think?).
> >
> > I think it is just not worth the effort.
>
> Why? It won't take much time. I've spent more time for writing the messages
> in this thread ... ;-)

Well, I know that current code works. It was produced by C compiler,
btw. Now, new code works for you, but it was not in kernel for 4
releases, and... this code is pretty subtle. And it is hand-made, not
C produced.

So... your code may be better but I do not think it is so much better
that I'd like to risk it.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 00:16:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi,

On Friday, 21 of January 2005 00:06, Pavel Machek wrote:
> Hi!
>
> > > > The readability of code is also important, IMHO.
> > >
> > > It did not seem too much better to me.
> >
> > Well, the beauty is in the eye of the beholder. :-)
> >
> > Still, it shrinks the code (22 lines vs 37 lines), it uses less GPRs (5 vs 7), it uses less
> > SIB arithmetics (0 vs 4 times), it uses a well known scheme for copying data pages.
> > As far as the result is concerned, it is equivalent to the existing code, but it's simpler
> > (and faster). IMO, simpler code is always easier to understand.
> >
> >
> > > > > If you want cheap way to speed it up, kill cr3 manipulation.
> > > >
> > > > Sure, but I think it's there for a reason.
> > >
> > > Reason is "to crash it early if we have wrong pagetables".
> > >
> > > > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> > > >
> > > > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > > > If necessary, I can change the patch to work with his code (hugang, what do you think?).
> > >
> > > I think it is just not worth the effort.
> >
> > Why? It won't take much time. I've spent more time for writing the messages
> > in this thread ... ;-)
>
> Well, I know that current code works. It was produced by C compiler,
> btw. Now, new code works for you, but it was not in kernel for 4
> releases, and... this code is pretty subtle.

Now, I'm confused. :-) It's roughly this:

struct pbe *pbe = pagedir_nosave, *end;
unsigned n = nr_copy_pages;
if (n) {
end = pbe + n;
do {
memcpy((void *)pbe->orig_address, (void *)pbe->address, PAGE_SIZE);
pbe++;
} while (pbe < end);
}

where memcpy() is of course a hand-written inline that includes the cr3 manipulation,
and pbe, end, n are registers.

> And it is hand-made, not C produced.

Yes, it is.

> So... your code may be better but I do not think it is so much better
> that I'd like to risk it.

Now, that's clear. :-)

Anyway, if anyone could test it or look at it and say a word, please do so.

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-21 02:26:12

by Hu Gang

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Thu, Jan 20, 2005 at 10:46:37PM +0100, Rafael J. Wysocki wrote:
> On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
>
> Sure, but I think it's there for a reason.
>
> > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
>
> I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> If necessary, I can change the patch to work with his code (hugang, what do you think?).
>
I like this patch, And I change my code with this, Please have a look,
It pass in qemu X86_64. :)

Full patch still can get from
http://soulinfo.com/~hugang/swsusp/2005-1-21/

here is only x86_64 part.

--- 2.6.11-rc1-mm1/arch/x86_64/kernel/suspend_asm.S 2004-12-30 14:56:35.000000000 +0800
+++ 2.6.11-rc1-mm1-swsusp-x86_64/arch/x86_64/kernel/suspend_asm.S 2005-01-21 10:13:15.000000000 +0800
@@ -35,6 +35,7 @@ ENTRY(swsusp_arch_suspend)
call swsusp_save
ret

+ .section .data.nosave
ENTRY(swsusp_arch_resume)
/* set up cr3 */
leaq init_level4_pgt(%rip),%rax
@@ -49,43 +50,32 @@ ENTRY(swsusp_arch_resume)
movq %rcx, %cr3;
movq %rax, %cr4; # turn PGE back on

- movl nr_copy_pages(%rip), %eax
- xorl %ecx, %ecx
- movq $0, %r10
- testl %eax, %eax
- jz done
-.L105:
- xorl %esi, %esi
- movq $0, %r11
- jmp .L104
- .p2align 4,,7
-copy_one_page:
- movq %r10, %rcx
-.L104:
- movq pagedir_nosave(%rip), %rdx
- movq %rcx, %rax
- salq $5, %rax
- movq 8(%rdx,%rax), %rcx
- movq (%rdx,%rax), %rax
- movzbl (%rsi,%rax), %eax
- movb %al, (%rsi,%rcx)
-
- movq %cr3, %rax; # flush TLB
- movq %rax, %cr3;
-
- movq %r11, %rax
- incq %rax
- cmpq $4095, %rax
- movq %rax, %rsi
- movq %rax, %r11
- jbe copy_one_page
- movq %r10, %rax
- incq %rax
- movq %rax, %rcx
- movq %rax, %r10
- mov nr_copy_pages(%rip), %eax
- cmpq %rax, %rcx
- jb .L105
+ movq pagedir_nosave(%rip), %rax
+ testq %rax, %rax
+ je done
+
+copyback_page:
+ movq 24(%rax), %r9
+ xorl %r8d, %r8d
+
+copy_one_pgdir:
+ movq 8(%rax), %rdi
+ testq %rdi, %rdi
+ je done
+ movq (%rax), %rsi
+ movq $512, %rcx
+ rep
+ movsq
+
+ incq %r8
+ addq $32, %rax
+ cmpq $127, %r8
+ jbe copy_one_pgdir; # copy one pgdir
+
+ testq %r9, %r9
+ movq %r9, %rax
+ jne copyback_page
+
done:
movl $24, %eax
movl %eax, %ds

--
Hu Gang .-.
/v\
// \\
Linux User /( )\ [204016]
GPG Key ID ^^-^^ http://soulinfo.com/~hugang/hugang.asc

2005-01-21 10:05:17

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > Sure, but I think it's there for a reason.
> >
> > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> >
> > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > If necessary, I can change the patch to work with his code (hugang, what do you think?).
> >
> I like this patch, And I change my code with this, Please have a look,
> It pass in qemu X86_64. :)
>
> Full patch still can get from
> http://soulinfo.com/~hugang/swsusp/2005-1-21/
>
> here is only x86_64 part.

Okay, why not, if you are changing it anyway...


> --- 2.6.11-rc1-mm1/arch/x86_64/kernel/suspend_asm.S 2004-12-30 14:56:35.000000000 +0800
> +++ 2.6.11-rc1-mm1-swsusp-x86_64/arch/x86_64/kernel/suspend_asm.S 2005-01-21 10:13:15.000000000 +0800
> @@ -35,6 +35,7 @@ ENTRY(swsusp_arch_suspend)
> call swsusp_save
> ret
>
> + .section .data.nosave
> ENTRY(swsusp_arch_resume)
> /* set up cr3 */
> leaq init_level4_pgt(%rip),%rax

But why does it go into data section?

> @@ -49,43 +50,32 @@ ENTRY(swsusp_arch_resume)
> movq %rcx, %cr3;
> movq %rax, %cr4; # turn PGE back on
>
> - movl nr_copy_pages(%rip), %eax
> - xorl %ecx, %ecx
> - movq $0, %r10
> - testl %eax, %eax
> - jz done
> -.L105:
> - xorl %esi, %esi
> - movq $0, %r11
> - jmp .L104
> - .p2align 4,,7
> -copy_one_page:
> - movq %r10, %rcx
> -.L104:
> - movq pagedir_nosave(%rip), %rdx
> - movq %rcx, %rax
> - salq $5, %rax
> - movq 8(%rdx,%rax), %rcx
> - movq (%rdx,%rax), %rax
> - movzbl (%rsi,%rax), %eax
> - movb %al, (%rsi,%rcx)
> -
> - movq %cr3, %rax; # flush TLB
> - movq %rax, %cr3;
> -
> - movq %r11, %rax
> - incq %rax
> - cmpq $4095, %rax
> - movq %rax, %rsi
> - movq %rax, %r11
> - jbe copy_one_page
> - movq %r10, %rax
> - incq %rax
> - movq %rax, %rcx
> - movq %rax, %r10
> - mov nr_copy_pages(%rip), %eax
> - cmpq %rax, %rcx
> - jb .L105
> + movq pagedir_nosave(%rip), %rax
> + testq %rax, %rax
> + je done
> +
> +copyback_page:
> + movq 24(%rax), %r9
> + xorl %r8d, %r8d
> +

Are you sure %r8 and %r9 are caller-saved? I'd use low registers if I
were you, they look nincer and generate shorter opcodes ;-).

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 10:22:46

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > Well, I know that current code works. It was produced by C compiler,
> > btw. Now, new code works for you, but it was not in kernel for 4
> > releases, and... this code is pretty subtle.
>
> Now, I'm confused. :-) It's roughly this:
>
> struct pbe *pbe = pagedir_nosave, *end;
> unsigned n = nr_copy_pages;
> if (n) {
> end = pbe + n;
> do {
> memcpy((void *)pbe->orig_address, (void *)pbe->address, PAGE_SIZE);
> pbe++;
> } while (pbe < end);
> }
>
> where memcpy() is of course a hand-written inline that includes the cr3 manipulation,
> and pbe, end, n are registers.

For example it may not use any variable in memory, and may not use
stack, as memory changes under its hands. Plus assembly is always
subtle ;-).

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 10:26:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

> > +copyback_page:
> > + movq 24(%rax), %r9
> > + xorl %r8d, %r8d
> > +
>
> Are you sure %r8 and %r9 are caller-saved? I'd use low registers if I
> were you, they look nincer and generate shorter opcodes ;-).

They are.

-Andi

2005-01-21 10:31:28

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> Full patch still can get from
> http://soulinfo.com/~hugang/swsusp/2005-1-21/

>From a short look:

core.eatmem.diff of course helps, but is wrong. You should talk to
akpm to find out why shrink_all_memory is not doing its job.

i386: + repz movsl %ds:(%esi),%es:(%edi)
I do not think movsl has any parameters. What is repz? Repeat as long
as it is non-zero? I think this should be "rep movsl".


core:
@@ -576,92 +989,31 @@ static void copy_data_pages(void)
for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
if (saveable(zone, &zone_pfn)) {
struct page * page;
+ pbe = find_pbe_by_index(pagedir_nosave, nr_copy_pages-to_copy);
+ BUG_ON(pbe == NULL);
page = pfn_to_page(zone_pfn + zone->zone_start_pfn);

Don't you introduce O(n^2) behaviour here? Should not it be something
like pbe_next? And it is the only user of find_pbe_by_index().

I think that read_one_pbe() is too short to be uninlined... Same for
read_one_pagedir and write_one_pbe().

alloc_one_pagedir: why not just alloc page as zeroed?

Okay, it is still too big to merge directly. Would it be possible to
get mod_printk_progress(), introduce *_for_each (but leave there old
implementations), introduce pagedir_free() (but leave old
implementation). Better collision code should already be there, that
should make patch smaller, too. Try not to move code around.

That may be mergeable before 2.6.11...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 12:32:56

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Friday, 21 of January 2005 03:23, [email protected] wrote:
> On Thu, Jan 20, 2005 at 10:46:37PM +0100, Rafael J. Wysocki wrote:
> > On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
> >
> > Sure, but I think it's there for a reason.
> >
> > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> >
> > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > If necessary, I can change the patch to work with his code (hugang, what do you think?).
> >
> I like this patch, And I change my code with this, Please have a look,
> It pass in qemu X86_64. :)

Looks good. I'll test it later today.

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-21 12:44:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Friday, 21 of January 2005 11:06, Pavel Machek wrote:
> Hi!
>
> > > Well, I know that current code works. It was produced by C compiler,
> > > btw. Now, new code works for you, but it was not in kernel for 4
> > > releases, and... this code is pretty subtle.
> >
> > Now, I'm confused. :-) It's roughly this:
> >
> > struct pbe *pbe = pagedir_nosave, *end;
> > unsigned n = nr_copy_pages;
> > if (n) {
> > end = pbe + n;
> > do {
> > memcpy((void *)pbe->orig_address, (void *)pbe->address, PAGE_SIZE);
> > pbe++;
> > } while (pbe < end);
> > }
> >
> > where memcpy() is of course a hand-written inline that includes the cr3 manipulation,
> > and pbe, end, n are registers.
>
> For example it may not use any variable in memory, and may not use
> stack, as memory changes under its hands.

Which is not a big problem on x86-64. :-)

> Plus assembly is always subtle ;-).

And that's what makes it interesting.

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-21 13:49:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi,

On Friday, 21 of January 2005 11:30, Pavel Machek wrote:
> Hi!
>
> > Full patch still can get from
> > http://soulinfo.com/~hugang/swsusp/2005-1-21/
>
> From a short look:
>
> core.eatmem.diff of course helps, but is wrong. You should talk to
> akpm to find out why shrink_all_memory is not doing its job.
>
> i386: + repz movsl %ds:(%esi),%es:(%edi)
> I do not think movsl has any parameters. What is repz? Repeat as long
> as it is non-zero?

No, it's "repeat until %ecx is zero or ZF is cleared", but the latter never happens
with movsl. It's intended for cmpsl, scasl and friends (the assembler should
complain about using it here).

> I think this should be "rep movsl".

Yes, it should.


> core:
> @@ -576,92 +989,31 @@ static void copy_data_pages(void)
> for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
> if (saveable(zone, &zone_pfn)) {
> struct page * page;
> + pbe = find_pbe_by_index(pagedir_nosave, nr_copy_pages-to_copy);
> + BUG_ON(pbe == NULL);
> page = pfn_to_page(zone_pfn + zone->zone_start_pfn);
>
> Don't you introduce O(n^2) behaviour here? Should not it be something
> like pbe_next? And it is the only user of find_pbe_by_index().
>
> I think that read_one_pbe() is too short to be uninlined... Same for
> read_one_pagedir and write_one_pbe().
>
> alloc_one_pagedir: why not just alloc page as zeroed?
>
> Okay, it is still too big to merge directly. Would it be possible to
> get mod_printk_progress(), introduce *_for_each (but leave there old
> implementations), introduce pagedir_free() (but leave old
> implementation). Better collision code should already be there, that
> should make patch smaller, too. Try not to move code around.

I have a suggestion.

hugang, you are currently replacing an array of pbes with a list of arrays
of pbes contained within individual pages.

I would go further and replace it with a single one-directional list
of pbes. Namely, I would modify "struct pbe" in the following way:

struct pbe {
unsigned long address;
unsigned long orig_address;
swp_entry_t swap_address;
struct pbe *next;
};

(AFAICT, the "dummy" field is only used by hugang - as a pointer)
and I would define "for_each_pbe()" as:

#define for_each_pbe(pbe, pblist) \
for (pbe = pblist; pbe; pbe = pbe->next)

Then, the only non-trivial changes would be in alloc_pagedir() and
in swsusp_pagedir_relocate(), where I would need to link pbes to
each other.

This also would make the assembly parts independent of the
sizeof(struct pbe), which is currently hardcoded there.

What do you think?

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-01-21 14:34:05

by Hu Gang

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Fri, Jan 21, 2005 at 02:42:54PM +0100, Rafael J. Wysocki wrote:
> Hi,
>
>
> No, it's "repeat until %ecx is zero or ZF is cleared", but the latter never happens
> with movsl. It's intended for cmpsl, scasl and friends (the assembler should
> complain about using it here).
>
> > I think this should be "rep movsl".
>
> Yes, it should.
>
I'll change my code.

>
> I have a suggestion.
>
> hugang, you are currently replacing an array of pbes with a list of arrays
> of pbes contained within individual pages.
>
> I would go further and replace it with a single one-directional list
> of pbes. Namely, I would modify "struct pbe" in the following way:
>
> struct pbe {
> unsigned long address;
> unsigned long orig_address;
> swp_entry_t swap_address;
> struct pbe *next;
> };
>
> (AFAICT, the "dummy" field is only used by hugang - as a pointer)
> and I would define "for_each_pbe()" as:
>
> #define for_each_pbe(pbe, pblist) \
> for (pbe = pblist; pbe; pbe = pbe->next)
>
> Then, the only non-trivial changes would be in alloc_pagedir() and
> in swsusp_pagedir_relocate(), where I would need to link pbes to
> each other.
>
> This also would make the assembly parts independent of the
> sizeof(struct pbe), which is currently hardcoded there.
>
> What do you think?
Thanks for point that, That's better solution than current, I'll change
current code to this.

I'm think about, how can I make chang smaller.

--
Hu Gang .-.
/v\
// \\
Linux User /( )\ [204016]
GPG Key ID ^^-^^ http://soulinfo.com/~hugang/hugang.asc

2005-01-21 17:48:40

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > Okay, it is still too big to merge directly. Would it be possible to
> > get mod_printk_progress(), introduce *_for_each (but leave there old
> > implementations), introduce pagedir_free() (but leave old
> > implementation). Better collision code should already be there, that
> > should make patch smaller, too. Try not to move code around.
>
> I have a suggestion.
>
> hugang, you are currently replacing an array of pbes with a list of arrays
> of pbes contained within individual pages.

Looks good to me. You still want to be able to loop over pages (for
relocation etc), but code is going to get a lot cleaner.

Oh be warned that last pointer on the page is used to linking pages
together on-disk.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-21 19:15:00

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Friday, 21 of January 2005 13:32, Rafael J. Wysocki wrote:
> On Friday, 21 of January 2005 03:23, [email protected] wrote:
> > On Thu, Jan 20, 2005 at 10:46:37PM +0100, Rafael J. Wysocki wrote:
> > > On Thursday, 20 of January 2005 21:59, Pavel Machek wrote:
> > >
> > > Sure, but I think it's there for a reason.
> > >
> > > > Anyway, this is likely to clash with hugang's work; I'd prefer this not to be applied.
> > >
> > > I am aware of that, but you are not going to merge the hugang's patches soon, are you?
> > > If necessary, I can change the patch to work with his code (hugang, what do you think?).
> > >
> > I like this patch, And I change my code with this, Please have a look,
> > It pass in qemu X86_64. :)
>
> Looks good. I'll test it later today.

It works, but I had to change the "core" patch, so that it applied cleanly to
a "fresh" 2.6.11-rc1-mm2 (the patch that I used is attached). I also removed
some "pr_debug()" statemets that didn't help me at all. :-)

I noticed that it was significantly slower at writing to and reading from swap
than the unpatched swsusp.

Greets,
RJW


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"


Attachments:
(No filename) (1.26 kB)
2005-1-18.core-2.6.11-rc1-mm2.patch (21.12 kB)
Download all attachments

2005-01-22 02:03:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

> With this patch, at least 8 times less memory accesses are required to restore an image
> than without it, and in the original code cr3 is reloaded after copying each _byte_,
> let alone the SIB arithmetics. I'd expect it to be 10 times faster or so.

Probably more. CR3 reload is a serializing operation and is really expensive.

-Andi

2005-01-22 02:50:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

On Thu, Jan 20, 2005 at 08:32:31PM +0100, Rafael J. Wysocki wrote:
> Hi,
>
> The following patch speeds up the restoring of swsusp images on x86-64
> and makes the assembly code more readable (tested and works on AMD64). It's
> against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Thanks. I applied it with some small changes to not hardcode any
C fields.

BTW Pavel, while reading the code I noticed some dubious things
in the code:

- The TLB flush doesn't flush global pages (turn of PGE and turn it
on again). Since that handles kernel pages which are marked global
this is surely wrong.

- Also is it really needed to flush the TLB after each page and wouldn't
INVLPG be better here? Or do you want to flush other pages than the
just copied one there too? INVLPG would also take care of the global
pages at least on x86-64 (iirc there are some bugs in this regard on some
older i386 cpus)

- There is a comment that says the code shouldn't use stack, but
it definitely uses the stack for some things. Either the comment
or the code is wrong. Which is?


-Andi


The following patch speeds up the restoring of swsusp images on x86-64
and makes the assembly code more readable (tested and works on AMD64). It's
against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.

Signed-off-by: Rafael J. Wysocki <[email protected]>

Changed by AK to not hardcode any C values and get them from offset.h
instead.

Signed-off-by: Andi Kleen <[email protected]>

Index: linux/arch/x86_64/kernel/suspend_asm.S
===================================================================
--- linux.orig/arch/x86_64/kernel/suspend_asm.S 2004-10-19 01:55:08.%N +0200
+++ linux/arch/x86_64/kernel/suspend_asm.S 2005-01-22 03:20:28.%N +0100
@@ -11,6 +12,7 @@
#include <linux/linkage.h>
#include <asm/segment.h>
#include <asm/page.h>
+#include <asm/offset.h>

ENTRY(swsusp_arch_suspend)

@@ -49,43 +51,31 @@
movq %rcx, %cr3;
movq %rax, %cr4; # turn PGE back on

+ movq pagedir_nosave(%rip), %rdx
+ /* compute the limit */
movl nr_copy_pages(%rip), %eax
- xorl %ecx, %ecx
- movq $0, %r10
testl %eax, %eax
jz done
-.L105:
- xorl %esi, %esi
- movq $0, %r11
- jmp .L104
- .p2align 4,,7
-copy_one_page:
- movq %r10, %rcx
-.L104:
- movq pagedir_nosave(%rip), %rdx
- movq %rcx, %rax
- salq $5, %rax
- movq 8(%rdx,%rax), %rcx
- movq (%rdx,%rax), %rax
- movzbl (%rsi,%rax), %eax
- movb %al, (%rsi,%rcx)
-
- movq %cr3, %rax; # flush TLB
- movq %rax, %cr3;
-
- movq %r11, %rax
- incq %rax
- cmpq $4095, %rax
- movq %rax, %rsi
- movq %rax, %r11
- jbe copy_one_page
- movq %r10, %rax
- incq %rax
- movq %rax, %rcx
- movq %rax, %r10
- mov nr_copy_pages(%rip), %eax
- cmpq %rax, %rcx
- jb .L105
+ movq %rdx,%r8
+ movl $SIZEOF_PBE,%r9d
+ mul %r9 # with rax, clobbers rdx
+ movq %r8, %rdx
+ addq %r8, %rax
+loop:
+ /* get addresses from the pbe and copy the page */
+ movq pbe_address(%rdx), %rsi
+ movq pbe_orig_address(%rdx), %rdi
+ movq $512, %rcx
+ rep
+ movsq
+
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;
+
+ /* progress to the next pbe */
+ addq $SIZEOF_PBE, %rdx
+ cmpq %rax, %rdx
+ jb loop
done:
movl $24, %eax
movl %eax, %ds
Index: linux/arch/x86_64/kernel/asm-offsets.c
===================================================================
--- linux.orig/arch/x86_64/kernel/asm-offsets.c 2004-10-19 01:55:08.%N +0200
+++ linux/arch/x86_64/kernel/asm-offsets.c 2005-01-22 03:09:50.%N +0100
@@ -8,6 +8,7 @@
#include <linux/stddef.h>
#include <linux/errno.h>
#include <linux/hardirq.h>
+#include <linux/suspend.h>
#include <asm/pda.h>
#include <asm/processor.h>
#include <asm/segment.h>
@@ -61,6 +62,8 @@
offsetof (struct rt_sigframe32, uc.uc_mcontext));
BLANK();
#endif
-
+ DEFINE(SIZEOF_PBE, sizeof(struct pbe));
+ DEFINE(pbe_address, offsetof(struct pbe, address));
+ DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address));
return 0;
}

2005-01-22 09:55:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > The following patch speeds up the restoring of swsusp images on x86-64
> > and makes the assembly code more readable (tested and works on AMD64). It's
> > against 2.6.11-rc1-mm1, but applies to 2.6.11-rc1-mm2. Please consifer for applying.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
>
> Thanks. I applied it with some small changes to not hardcode any
> C fields.
>
> BTW Pavel, while reading the code I noticed some dubious things
> in the code:
>
> - The TLB flush doesn't flush global pages (turn of PGE and turn it
> on again). Since that handles kernel pages which are marked global
> this is surely wrong.
>
> - Also is it really needed to flush the TLB after each page and wouldn't
> INVLPG be better here? Or do you want to flush other pages than the
> just copied one there too? INVLPG would also take care of the global
> pages at least on x86-64 (iirc there are some bugs in this regard on some
> older i386 cpus)
>
> - There is a comment that says the code shouldn't use stack, but
> it definitely uses the stack for some things. Either the comment
> or the code is wrong. Which is?

This should address it... I attempted to put answers into the
comments, because probably everyone is interested in those....
Pavel

--- clean/arch/x86_64/kernel/suspend_asm.S 2004-10-19 14:16:28.000000000 +0200
+++ linux/arch/x86_64/kernel/suspend_asm.S 2005-01-22 10:51:28.000000000 +0100
@@ -1,10 +1,16 @@
-/* Originally gcc generated, modified by hand
+/* Copyright 2004,2005 Pavel Machek <[email protected]>, Andi Kleen <[email protected]>, Rafael J. Wysocki <[email protected]>
*
- * This may not use any stack, nor any variable that is not "NoSave":
+ * Distribute under GPLv2.
+ *
+ * swsusp_arch_resume may not use any stack, nor any variable that is
+ * not "NoSave" during copying pages:
*
* Its rewriting one kernel image with another. What is stack in "old"
* image could very well be data page in "new" image, and overwriting
* your own stack under you is bad idea.
+ *
+ * TLB flush is purely and debugging attempt to make it fail fast if we
+ * do something wrong. TLB is properly flushed in swsusp_restore.
*/

.text



--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-22 11:26:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

> + * TLB flush is purely and debugging attempt to make it fail fast if we
> + * do something wrong. TLB is properly flushed in swsusp_restore.

Did you measure it doesn't noticeable slow down suspend? CR3 reload is quite
expensive, and doing it for each page is quite often.

Also if you want to really flush everything you should do a global
flush.

-Andi

2005-01-22 11:31:17

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

Hi!

> > + * TLB flush is purely and debugging attempt to make it fail fast if we
> > + * do something wrong. TLB is properly flushed in swsusp_restore.
>
> Did you measure it doesn't noticeable slow down suspend? CR3 reload is quite
> expensive, and doing it for each page is quite often.

It slows it down horribly on vmware (like adding 2 minutes). On normal
hardware, copying pages does not take long enough for me to notice.

> Also if you want to really flush everything you should do a global
> flush.

That cr3 reload can probably be just removed, because swsusp is now
stable...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-22 11:45:02

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH][RFC] swsusp: speed up image restoring on x86-64

> > Also if you want to really flush everything you should do a global
> > flush.
>
> That cr3 reload can probably be just removed, because swsusp is now
> stable...

I will remove it then.
-Andi