On Fri, 2003-02-28 at 02:20, Pavel Machek wrote:
> > > b) introduces hard limit on how much pages you can save (4GB).
> >
> > Well, I might ask how many people you know with 4GB of swap and 4GB of
> > RAM they want to suspend to disk :> Don't forget we still aren't
> > handling himem anyway (at least not last time I checked). As y
>
> Well, on x86-64 it should be able to suspend 8GB machine just fine --
> being 64bit means you don't have to deal with himem. Plus it would
> only be 2GB limit on x86-64.
>
[deletia]
>
> I don't know. I'd let Linus decide. I don't like hard limit on amount
> of mem, though.
Hi again.
I've thought things through some more. We need to keep in mind that
other patches I intend to submit save the pages that aren't needed for
the suspend process itself separately. Since this includes all the
highmem pages and a reasonable proportion of the normal pages (easily
more than half when we're talking high usage), we don't need to eat
memory and we don't really have a hard limit on the size of the image.
Presumably the same conditions will apply under x86-64.
Thus, I still think we can go with the patch I submitted before. I've
rediffed it against 2.5.63 (less the bits already applied).
Regards,
Nigel
diff -ruN linux-2.5.63/arch/i386/kernel/Makefile linux-2.5.63-01/arch/i386/kernel/Makefile
--- linux-2.5.63/arch/i386/kernel/Makefile 2003-03-01 15:10:16.000000000 +1300
+++ linux-2.5.63-01/arch/i386/kernel/Makefile 2003-03-01 15:14:28.000000000 +1300
@@ -23,7 +23,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
-obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
+obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_EDD) += edd.o
obj-$(CONFIG_MODULES) += module.o
diff -ruN linux-2.5.63/arch/i386/kernel/suspend.c linux-2.5.63-01/arch/i386/kernel/suspend.c
--- linux-2.5.63/arch/i386/kernel/suspend.c 2003-02-20 08:25:26.000000000 +1300
+++ linux-2.5.63-01/arch/i386/kernel/suspend.c 2003-02-20 08:27:36.000000000 +1300
@@ -133,3 +133,84 @@
}
}
+
+/* Local variables for do_magic */
+static int loop __nosavedata = 0;
+static int loop2 __nosavedata = 0;
+
+/*
+ * FIXME: This function should really be written in assembly. Actually
+ * requirement is that it does not touch stack, because %esp will be
+ * wrong during resume before restore_processor_context(). Check
+ * assembly if you modify this.
+ */
+void do_magic(int resume)
+{
+ if (!resume) {
+ do_magic_suspend_1();
+ save_processor_state(); /* We need to capture registers and memory at "same time" */
+ asm ( "movl %esp, saved_context_esp\n\t"
+ "movl %eax, saved_context_eax\n\t"
+ "movl %ebx, saved_context_ebx\n\t"
+ "movl %ecx, saved_context_ecx\n\t"
+ "movl %edx, saved_context_edx\n\t"
+ "movl %ebp, saved_context_ebp\n\t"
+ "movl %esi, saved_context_esi\n\t"
+ "movl %edi, saved_context_edi\n\t"
+ "pushfl ; popl saved_context_eflags\n\t");
+
+ do_magic_suspend_2(); /* If everything goes okay, this function does not return */
+ return;
+ }
+
+ /* We want to run from swapper_pg_dir, since swapper_pg_dir is stored in constant
+ * place in memory
+ */
+
+ __asm__( "movl %%ecx,%%cr3\n" ::"c"(__pa(swapper_pg_dir)));
+
+/*
+ * Final function for resuming: after copying the pages to their original
+ * position, it restores the register state.
+ *
+ * What about page tables? Writing data pages may toggle
+ * accessed/dirty bits in our page tables. That should be no problems
+ * with 4MB page tables. That's why we require have_pse.
+ *
+ * This loops destroys stack from under itself, so it better should
+ * not use any stack space, itself. When this function is entered at
+ * resume time, we move stack to _old_ place. This is means that this
+ * function must use no stack and no local variables in registers,
+ * until calling restore_processor_context();
+ *
+ * Critical section here: noone should touch saved memory after
+ * do_magic_resume_1; copying works, because nr_copy_pages,
+ * pagedir_nosave, loop and loop2 are nosavedata.
+ */
+ do_magic_resume_1();
+
+ for (loop=0; loop < nr_copy_pages; loop++) {
+ /* You may not call something (like copy_page) here: see above */
+ for (loop2=0; loop2 < PAGE_SIZE; loop2++) {
+ *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->orig_address))+loop2) =
+ *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->address))+loop2);
+ __flush_tlb();
+ }
+ }
+
+ asm( "movl saved_context_esp, %esp\n\t"
+ "movl saved_context_ebp, %ebp\n\t"
+ "movl saved_context_eax, %eax\n\t"
+ "movl saved_context_ebx, %ebx\n\t"
+ "movl saved_context_ecx, %ecx\n\t"
+ "movl saved_context_edx, %edx\n\t"
+ "movl saved_context_esi, %esi\n\t"
+ "movl saved_context_edi, %edi\n\t");
+ restore_processor_state();
+ asm("pushl saved_context_eflags ; popfl\n\t");
+
+/* Ahah, we now run with our old stack, and with registers copied from
+ suspend time */
+
+ do_magic_resume_2();
+}
diff -ruN linux-2.5.63/include/linux/page-flags.h linux-2.5.63-01/include/linux/page-flags.h
--- linux-2.5.63/include/linux/page-flags.h 2003-02-20 07:59:33.000000000 +1300
+++ linux-2.5.63-01/include/linux/page-flags.h 2003-02-20 08:28:31.000000000 +1300
@@ -74,6 +74,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_collides 20 /* swsusp - page used in save image */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -256,6 +257,9 @@
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+#define PageCollides(page) test_bit(PG_collides, &(page)->flags)
+#define SetPageCollides(page) set_bit(PG_collides, &(page)->flags)
+#define ClearPageCollides(page) clear_bit(PG_collides, &(page)->flags)
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -ruN linux-2.5.63/include/linux/suspend.h linux-2.5.63-01/include/linux/suspend.h
--- linux-2.5.63/include/linux/suspend.h 2003-01-15 17:00:58.000000000 +1300
+++ linux-2.5.63-01/include/linux/suspend.h 2003-02-20 08:27:36.000000000 +1300
@@ -34,7 +34,7 @@
char version[20];
int num_cpus;
int page_size;
- suspend_pagedir_t *suspend_pagedir;
+ suspend_pagedir_t **suspend_pagedir;
unsigned int num_pbes;
struct swap_location {
char filename[SWAP_FILENAME_MAXLENGTH];
@@ -42,6 +42,8 @@
};
#define SUSPEND_PD_PAGES(x) (((x)*sizeof(struct pbe))/PAGE_SIZE+1)
+#define PAGEDIR_CAPACITY(x) (((x)*PAGE_SIZE/sizeof(struct pbe)))
+#define PAGEDIR_ENTRY(pagedir, i) (pagedir[i/PAGEDIR_CAPACITY(1)] + (i%PAGEDIR_CAPACITY(1)))
/* mm/vmscan.c */
extern int shrink_mem(void);
@@ -61,7 +63,7 @@
extern void thaw_processes(void);
extern unsigned int nr_copy_pages __nosavedata;
-extern suspend_pagedir_t *pagedir_nosave __nosavedata;
+extern suspend_pagedir_t **pagedir_nosave __nosavedata;
/* Communication between kernel/suspend.c and arch/i386/suspend.c */
diff -ruN linux-2.5.63/kernel/suspend.c linux-2.5.63-01/kernel/suspend.c
--- linux-2.5.63/kernel/suspend.c 2003-02-20 07:59:34.000000000 +1300
+++ linux-2.5.63-01/kernel/suspend.c 2003-02-20 10:42:52.000000000 +1300
@@ -96,7 +96,6 @@
static int new_loglevel = 7;
static int orig_loglevel = 0;
static int orig_fgconsole, orig_kmsg;
-static int pagedir_order_check;
static int nr_copy_pages_check;
static int resume_status = 0;
@@ -116,9 +115,9 @@
allocated at time of resume, that travels through memory not to
collide with anything.
*/
-suspend_pagedir_t *pagedir_nosave __nosavedata = NULL;
-static suspend_pagedir_t *pagedir_save;
-static int pagedir_order __nosavedata = 0;
+suspend_pagedir_t **pagedir_nosave __nosavedata = NULL;
+static suspend_pagedir_t **pagedir_save = NULL;
+static int pagedir_size __nosavedata = 0;
struct link {
char dummy[PAGE_SIZE - sizeof(swp_entry_t)];
@@ -395,7 +394,7 @@
{
int i;
swp_entry_t entry, prev = { 0 };
- int nr_pgdir_pages = SUSPEND_PD_PAGES(nr_copy_pages);
+ int pagedir_size = SUSPEND_PD_PAGES(nr_copy_pages);
union diskpage *cur, *buffer = (union diskpage *)get_zeroed_page(GFP_ATOMIC);
unsigned long address;
struct page *page;
@@ -410,16 +409,15 @@
if (swapfile_used[swp_type(entry)] != SWAPFILE_SUSPEND)
panic("\nPage %d: not enough swapspace on suspend device", i );
- address = (pagedir_nosave+i)->address;
+ address = PAGEDIR_ENTRY(pagedir_nosave,i)->address;
page = virt_to_page(address);
rw_swap_page_sync(WRITE, entry, page);
- (pagedir_nosave+i)->swap_address = entry;
+ PAGEDIR_ENTRY(pagedir_nosave,i)->swap_address = entry;
}
printk( "|\n" );
- printk( "Writing pagedir (%d pages): ", nr_pgdir_pages);
- for (i=0; i<nr_pgdir_pages; i++) {
- cur = (union diskpage *)((char *) pagedir_nosave)+i;
- BUG_ON ((char *) cur != (((char *) pagedir_nosave) + i*PAGE_SIZE));
+ printk( "Writing pagedir (%d pages): ", pagedir_size);
+ for (i=0; i<pagedir_size; i++) {
+ cur = (union diskpage *) pagedir_nosave[i];
printk( "." );
if (!(entry = get_swap_page()).val) {
printk(KERN_CRIT "Not enough swapspace when writing pgdir\n" );
@@ -467,7 +465,7 @@
}
/* if pagedir_p != NULL it also copies the counted pages */
-static int count_and_copy_data_pages(struct pbe *pagedir_p)
+static int count_and_copy_data_pages(struct pbe **pagedir_p)
{
int chunk_size;
int nr_copy_pages = 0;
@@ -507,65 +505,88 @@
critical bios data? */
} else BUG();
- nr_copy_pages++;
if (pagedir_p) {
- pagedir_p->orig_address = ADDRESS(pfn);
- copy_page((void *) pagedir_p->address, (void *) pagedir_p->orig_address);
- pagedir_p++;
+ PAGEDIR_ENTRY(pagedir_p, nr_copy_pages)->orig_address = ADDRESS(pfn);
+ copy_page((void *) PAGEDIR_ENTRY(pagedir_p, nr_copy_pages)->address, (void *) PAGEDIR_ENTRY(pagedir_p, nr_copy_pages)->orig_address);
}
+ nr_copy_pages++;
}
return nr_copy_pages;
}
-static void free_suspend_pagedir(unsigned long this_pagedir)
+static void free_suspend_pagedir(struct pbe ** this_pagedir)
{
- struct page *page;
- int pfn;
- unsigned long this_pagedir_end = this_pagedir +
- (PAGE_SIZE << pagedir_order);
+ int i;
+ int rangestart = -1, rangeend = -1;
- for(pfn = 0; pfn < num_physpages; pfn++) {
- page = pfn_to_page(pfn);
- if (!TestClearPageNosave(page))
- continue;
+ if (pagedir_size == 0)
+ return;
- if (ADDRESS(pfn) >= this_pagedir && ADDRESS(pfn) < this_pagedir_end)
- continue; /* old pagedir gets freed in one */
-
- free_page(ADDRESS(pfn));
+ for(i = 0; i < nr_copy_pages; i++) {
+ if (PAGEDIR_ENTRY(this_pagedir,i)->address) {
+ if (rangestart > -1) {
+ printk("Pagedir entry %d-%d address2 not set!\n", rangestart, rangeend);
+ rangestart = -1;
+ }
+ ClearPageNosave(virt_to_page(PAGEDIR_ENTRY(this_pagedir,i)->address));
+ free_page(PAGEDIR_ENTRY(this_pagedir,i)->address);
+ } else {
+ if (rangestart == -1)
+ rangestart = i;
+ rangeend = i;
+ }
}
- free_pages(this_pagedir, pagedir_order);
+
+ if (rangestart > -1)
+ printk("Pagedir entry %d-%d address not set!\n", rangestart, nr_copy_pages - 1);
+
+ for(i = 0; i < pagedir_size; i++)
+ free_page((unsigned long) this_pagedir[i]);
+
+ free_page((unsigned long) this_pagedir);
+ this_pagedir = NULL;
+ nr_copy_pages = 0;
+ pagedir_size = 0;
}
-static suspend_pagedir_t *create_suspend_pagedir(int nr_copy_pages)
+static suspend_pagedir_t **create_suspend_pagedir(int nr_copy_pages)
{
+ suspend_pagedir_t **pagedir;
+ struct pbe **p;
int i;
- suspend_pagedir_t *pagedir;
- struct pbe *p;
- struct page *page;
- pagedir_order = get_bitmask_order(SUSPEND_PD_PAGES(nr_copy_pages));
+ pagedir_size = SUSPEND_PD_PAGES(nr_copy_pages);
- p = pagedir = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC | __GFP_COLD, pagedir_order);
- if(!pagedir)
+ p = pagedir = (suspend_pagedir_t **)__get_free_pages(GFP_ATOMIC | __GFP_COLD, 0);
+ if(!p)
return NULL;
- page = virt_to_page(pagedir);
- for(i=0; i < 1<<pagedir_order; i++)
- SetPageNosave(page++);
-
+ /* We aren't setting the pagedir itself Nosave because we have to be able
+ * to free it during resume, after restoring the image. This means nr_copy_pages
+ * needs to be adjusted */
+
+ for (i = 0; i < pagedir_size; i++) {
+ p[i] = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC, 0);
+ if (!p[i]) {
+ int j;
+ for (j = 0; j < i; j++) {
+ free_page((unsigned long) p[j]);
+ }
+ free_page((unsigned long) p);
+ return NULL;
+ }
+ }
+
while(nr_copy_pages--) {
- p->address = get_zeroed_page(GFP_ATOMIC | __GFP_COLD);
- if(!p->address) {
- free_suspend_pagedir((unsigned long) pagedir);
+ PAGEDIR_ENTRY(p, nr_copy_pages)->address = get_zeroed_page(GFP_ATOMIC | __GFP_COLD);
+ if(!PAGEDIR_ENTRY(p, nr_copy_pages)->address) {
+ free_suspend_pagedir(p);
return NULL;
}
- printk(".");
- SetPageNosave(virt_to_page(p->address));
- p->orig_address = 0;
- p++;
+ SetPageNosave(virt_to_page(PAGEDIR_ENTRY(p, nr_copy_pages)->address));
+ PAGEDIR_ENTRY(p, nr_copy_pages)->orig_address = 0;
}
- return pagedir;
+ return p;
}
static int prepare_suspend_console(void)
@@ -604,12 +625,13 @@
static int prepare_suspend_processes(void)
{
+ PRINTK("Syncing...\n");
+ sys_sync();
if (freeze_processes()) {
printk( KERN_ERR "Suspend failed: Not all processes stopped!\n" );
thaw_processes();
return 1;
}
- sys_sync();
return 0;
}
@@ -684,6 +706,7 @@
pagedir_nosave = NULL;
printk( "/critical section: Counting pages to copy" );
nr_copy_pages = count_and_copy_data_pages(NULL);
+ nr_copy_pages += 1 + SUSPEND_PD_PAGES(nr_copy_pages);
nr_needed_pages = nr_copy_pages + PAGES_FOR_IO;
printk(" (pages needed: %d+%d=%d free: %d)\n",nr_copy_pages,PAGES_FOR_IO,nr_needed_pages,nr_free_pages());
@@ -713,7 +736,6 @@
return 1;
}
nr_copy_pages_check = nr_copy_pages;
- pagedir_order_check = pagedir_order;
drain_local_pages(); /* During allocating of suspend pagedir, new cold pages may appear. Kill them */
if (nr_copy_pages != count_and_copy_data_pages(pagedir_nosave)) /* copy */
@@ -789,12 +811,11 @@
void do_magic_resume_2(void)
{
BUG_ON (nr_copy_pages_check != nr_copy_pages);
- BUG_ON (pagedir_order_check != pagedir_order);
__flush_tlb_global(); /* Even mappings of "global" things (vmalloc) need to be fixed */
PRINTK( "Freeing prev allocated pagedir\n" );
- free_suspend_pagedir((unsigned long) pagedir_save);
+ free_suspend_pagedir(pagedir_save);
spin_unlock_irq(&suspend_pagedir_lock);
drivers_resume(RESUME_ALL_PHASES);
@@ -831,7 +852,7 @@
spin_lock_irq(&suspend_pagedir_lock); /* Done to disable interrupts */
mdelay(1000);
- free_pages((unsigned long) pagedir_nosave, pagedir_order);
+ free_suspend_pagedir(pagedir_nosave);
spin_unlock_irq(&suspend_pagedir_lock);
mark_swapfiles(((swp_entry_t) {0}), MARK_SWAP_RESUME);
PRINTK(KERN_WARNING "%sLeaving do_magic_suspend_2...\n", name_suspend);
@@ -894,37 +915,23 @@
/* More restore stuff */
-/* FIXME: Why not memcpy(to, from, 1<<pagedir_order*PAGE_SIZE)? */
-static void copy_pagedir(suspend_pagedir_t *to, suspend_pagedir_t *from)
-{
- int i;
- char *topointer=(char *)to, *frompointer=(char *)from;
-
- for(i=0; i < 1 << pagedir_order; i++) {
- copy_page(topointer, frompointer);
- topointer += PAGE_SIZE;
- frompointer += PAGE_SIZE;
- }
-}
-
-#define does_collide(addr) does_collide_order(pagedir_nosave, addr, 0)
-
-/*
- * Returns true if given address/order collides with any orig_address
- */
-static int does_collide_order(suspend_pagedir_t *pagedir, unsigned long addr,
- int order)
-{
+static void warmup_collision_cache(suspend_pagedir_t **pagedir) {
int i;
- unsigned long addre = addr + (PAGE_SIZE<<order);
- for(i=0; i < nr_copy_pages; i++)
- if((pagedir+i)->orig_address >= addr &&
- (pagedir+i)->orig_address < addre)
- return 1;
+ PRINTK("Setting up pagedir cache");
+ for (i = 0; i < max_pfn; i++)
+ ClearPageCollides(pfn_to_page(i));
- return 0;
+ for(i=0; i < nr_copy_pages; i++) {
+ SetPageCollides(virt_to_page(PAGEDIR_ENTRY(pagedir, i)->orig_address));
+ if (!(i%800)) {
+ PRINTK(".");
+ }
+ }
+ PRINTK("%d", i);
+ PRINTK("|\n");
}
+#define does_collide(address) (PageCollides(virt_to_page(address)))
/*
* We check here that pagedir & pages it points to won't collide with pages
@@ -932,64 +939,106 @@
*/
static int check_pagedir(void)
{
- int i;
+ int i, nrdone = 0;
+ void **eaten_memory = NULL;
+ void **c = eaten_memory, *f, *addr;
for(i=0; i < nr_copy_pages; i++) {
- unsigned long addr;
-
- do {
- addr = get_zeroed_page(GFP_ATOMIC);
- if(!addr)
- return -ENOMEM;
- } while (does_collide(addr));
-
- (pagedir_nosave+i)->address = addr;
+ while ((addr = (void *) get_zeroed_page(GFP_ATOMIC))) {
+ memset(addr, 0, PAGE_SIZE);
+ if (!does_collide((unsigned long) addr)) {
+ break;
+ }
+ eaten_memory = addr;
+ *eaten_memory = c;
+ c = eaten_memory;
+ }
+ PAGEDIR_ENTRY(pagedir_nosave,i)->address = (unsigned long) addr;
+ nrdone++;
+ }
+
+ // Free unwanted memory
+ c = eaten_memory;
+ while(c) {
+ f = c;
+ c = *c;
+ if (f)
+ free_page((unsigned long) f);
}
+ eaten_memory = NULL;
+
return 0;
}
static int relocate_pagedir(void)
{
+ void **eaten_memory = NULL;
+ void **c = eaten_memory, *m = NULL, *f;
+ int oom = 0, i, numeaten = 0;
+ int pagedir_size = SUSPEND_PD_PAGES(nr_copy_pages);
+
/*
* We have to avoid recursion (not to overflow kernel stack),
* and that's why code looks pretty cryptic
*/
- suspend_pagedir_t *new_pagedir, *old_pagedir = pagedir_nosave;
- void **eaten_memory = NULL;
- void **c = eaten_memory, *m, *f;
-
- printk("Relocating pagedir");
- if(!does_collide_order(old_pagedir, (unsigned long)old_pagedir, pagedir_order)) {
- printk("not neccessary\n");
- return 0;
- }
+ PRINTK("Relocating conflicting parts of pagedir.\n");
- while ((m = (void *) __get_free_pages(GFP_ATOMIC, pagedir_order))) {
- memset(m, 0, PAGE_SIZE);
- if (!does_collide_order(old_pagedir, (unsigned long)m, pagedir_order))
- break;
- eaten_memory = m;
- printk( "." );
- *eaten_memory = c;
- c = eaten_memory;
- }
+ for (i = -1; i < pagedir_size; i++) {
+ int this_collides = 0;
- if (!m)
- return -ENOMEM;
-
- pagedir_nosave = new_pagedir = m;
- copy_pagedir(new_pagedir, old_pagedir);
+ if (i == -1)
+ this_collides = does_collide((unsigned long) pagedir_nosave);
+ else
+ this_collides = does_collide((unsigned long) pagedir_nosave[i]);
+
+ if (this_collides) {
+ while ((m = (void *) __get_free_pages(GFP_ATOMIC, 0))) {
+ memset(m, 0, PAGE_SIZE);
+ if (!does_collide((unsigned long)m)) {
+ if (i == -1) {
+ copy_page(m, pagedir_nosave);
+ free_page((unsigned long) pagedir_nosave);
+ pagedir_nosave = m;
+ }
+ else {
+ copy_page(m, (void *) pagedir_nosave[i]);
+ free_page((unsigned long) pagedir_nosave[i]);
+ pagedir_nosave[i] = m;
+ }
+ break;
+ }
+ numeaten++;
+ eaten_memory = m;
+ PRINTK("Eaten: %d. Still to try:%d\r", numeaten, nr_free_pages());
+ *eaten_memory = c;
+ c = eaten_memory;
+ }
+ if (!m) {
+ printk("\nRan out of memory trying to relocate pagedir (tried %d pages).\n", numeaten);
+ oom = 1;
+ break;
+ }
+ }
+ }
+
+ PRINTK("\nFreeing rejected memory locations...");
c = eaten_memory;
while(c) {
- printk(":");
- f = *c;
+ f = c;
c = *c;
if (f)
- free_pages((unsigned long)f, pagedir_order);
+ free_pages((unsigned long) f, 0);
}
- printk("|\n");
+ eaten_memory = NULL;
+
+ PRINTK("\n");
+
+ if (oom)
+ return -ENOMEM;
+ else
+ return 0;
return 0;
}
@@ -1062,7 +1111,7 @@
static int __read_suspend_image(struct block_device *bdev, union diskpage *cur, int noresume)
{
swp_entry_t next;
- int i, nr_pgdir_pages;
+ int i, pagedir_size;
#define PREPARENEXT \
{ next = cur->link.next; \
@@ -1110,24 +1159,39 @@
pagedir_save = cur->sh.suspend_pagedir;
nr_copy_pages = cur->sh.num_pbes;
- nr_pgdir_pages = SUSPEND_PD_PAGES(nr_copy_pages);
- pagedir_order = get_bitmask_order(nr_pgdir_pages);
+ pagedir_size = SUSPEND_PD_PAGES(nr_copy_pages);
- pagedir_nosave = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC, pagedir_order);
+ pagedir_nosave = (suspend_pagedir_t **)__get_free_pages(GFP_ATOMIC, 0);
if (!pagedir_nosave)
return -ENOMEM;
+ {
+ int i;
+ for (i = 0; i < pagedir_size; i++) {
+ pagedir_nosave[i] = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC, 0);
+ if (!pagedir_nosave[i]) {
+ int j;
+ for (j = 0; j < i; j++)
+ free_page((unsigned long) pagedir_nosave[j]);
+ free_page((unsigned long) pagedir_nosave);
+ spin_unlock_irq(&suspend_pagedir_lock);
+ return -ENOMEM;
+ }
+ }
+ }
PRINTK( "%sReading pagedir, ", name_resume );
/* We get pages in reverse order of saving! */
- for (i=nr_pgdir_pages-1; i>=0; i--) {
+ for (i=pagedir_size-1; i>=0; i--) {
BUG_ON (!next.val);
- cur = (union diskpage *)((char *) pagedir_nosave)+i;
+ cur = (union diskpage *) pagedir_nosave[i];
if (bdev_read_page(bdev, next.val, cur)) return -EIO;
PREPARENEXT;
}
BUG_ON (next.val);
+ warmup_collision_cache(pagedir_nosave);
+
if (relocate_pagedir())
return -ENOMEM;
if (check_pagedir())
@@ -1135,12 +1199,12 @@
printk( "Reading image data (%d pages): ", nr_copy_pages );
for(i=0; i < nr_copy_pages; i++) {
- swp_entry_t swap_address = (pagedir_nosave+i)->swap_address;
+ swp_entry_t swap_address = PAGEDIR_ENTRY(pagedir_nosave,i)->swap_address;
if (!(i%100))
printk( "." );
/* You do not need to check for overlaps...
... check_pagedir already did this work */
- if (bdev_read_page(bdev, swp_offset(swap_address) * PAGE_SIZE, (char *)((pagedir_nosave+i)->address)))
+ if (bdev_read_page(bdev, swp_offset(swap_address) * PAGE_SIZE, (char *)(PAGEDIR_ENTRY(pagedir_nosave,i)->address)))
return -EIO;
}
printk( "|\n" );
Hi there.
> Thus, I still think we can go with the patch I submitted before. I've
> rediffed it against 2.5.63 (less the bits already applied).
I've spent the last week reading, reviewing, and rewriting major portions
of swsusp. I've actually been reasonably impressed, once I was able to get
the code into a much more readable state.
All in all, I think the idea of saving state to swap is dangerous for
various reasons. However, I like some of the other concepts of the code,
and will use them in developing a more palatable mechanism of doing STDs
(hehe, I love saying that). Once I've successfully broken out the pieces I
want to reuse, I'll post the cumulative patch. In the meantime, the
incremental diffs can be viewed here:
http://ldm.bkbits.net:8080/linux-2.5-power
I do have some comments on your patch, though.
> diff -ruN linux-2.5.63/arch/i386/kernel/suspend.c linux-2.5.63-01/arch/i386/kernel/suspend.c
> --- linux-2.5.63/arch/i386/kernel/suspend.c 2003-02-20 08:25:26.000000000 +1300
> +++ linux-2.5.63-01/arch/i386/kernel/suspend.c 2003-02-20 08:27:36.000000000 +1300
Thank you for putting this back in C, it's much appreciated.
> +void do_magic(int resume)
> +{
> + if (!resume) {
> + do_magic_suspend_1();
> + save_processor_state(); /* We need to capture registers and memory at "same time" */
> + asm ( "movl %esp, saved_context_esp\n\t"
> + "movl %eax, saved_context_eax\n\t"
> + "movl %ebx, saved_context_ebx\n\t"
> + "movl %ecx, saved_context_ecx\n\t"
> + "movl %edx, saved_context_edx\n\t"
> + "movl %ebp, saved_context_ebp\n\t"
> + "movl %esi, saved_context_esi\n\t"
> + "movl %edi, saved_context_edi\n\t"
On x86, %eax, %ecx, and %edx are local scratch registers, and don't need
to be saved. Note that gcc may use them, so check the assembly output.
> +/*
> + * Final function for resuming: after copying the pages to their original
> + * position, it restores the register state.
> + *
> + * What about page tables? Writing data pages may toggle
> + * accessed/dirty bits in our page tables. That should be no problems
> + * with 4MB page tables. That's why we require have_pse.
> + *
> + * This loops destroys stack from under itself, so it better should
> + * not use any stack space, itself. When this function is entered at
> + * resume time, we move stack to _old_ place. This is means that this
> + * function must use no stack and no local variables in registers,
> + * until calling restore_processor_context();
> + *
> + * Critical section here: noone should touch saved memory after
> + * do_magic_resume_1; copying works, because nr_copy_pages,
> + * pagedir_nosave, loop and loop2 are nosavedata.
> + */
Do you have something against indenting comments? ;)
> + for (loop=0; loop < nr_copy_pages; loop++) {
> + /* You may not call something (like copy_page) here: see above */
> + for (loop2=0; loop2 < PAGE_SIZE; loop2++) {
> + *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->orig_address))+loop2) =
> + *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->address))+loop2);
> + __flush_tlb();
> + }
> + }
This is better done as
for (loop = 0; loop < nr_copy_pages; loop++) {
memcpy((char *)pagedir_nosave[loop].orig_address,
(char *)pagedir_nosave[loop].address,
PAGE_SIZE);
__flush_tlb();
}
Is __flush_tlb() really necessary?
> diff -ruN linux-2.5.63/include/linux/page-flags.h linux-2.5.63-01/include/linux/page-flags.h
> --- linux-2.5.63/include/linux/page-flags.h 2003-02-20 07:59:33.000000000 +1300
> +++ linux-2.5.63-01/include/linux/page-flags.h 2003-02-20 08:28:31.000000000 +1300
> @@ -74,6 +74,7 @@
> #define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
> #define PG_reclaim 18 /* To be reclaimed asap */
> #define PG_compound 19 /* Part of a compound page */
> +#define PG_collides 20 /* swsusp - page used in save image */
>
> /*
> * Global page accounting. One instance per CPU. Only unsigned longs are
> @@ -256,6 +257,9 @@
> #define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
> #define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
>
> +#define PageCollides(page) test_bit(PG_collides, &(page)->flags)
> +#define SetPageCollides(page) set_bit(PG_collides, &(page)->flags)
> +#define ClearPageCollides(page) clear_bit(PG_collides, &(page)->flags)
> /*
> * The PageSwapCache predicate doesn't use a PG_flag at this time,
> * but it may again do so one day.
> diff -ruN linux-2.5.63/include/linux/suspend.h linux-2.5.63-01/include/linux/suspend.h
> --- linux-2.5.63/include/linux/suspend.h 2003-01-15 17:00:58.000000000 +1300
> +++ linux-2.5.63-01/include/linux/suspend.h 2003-02-20 08:27:36.000000000 +1300
> @@ -34,7 +34,7 @@
> char version[20];
> int num_cpus;
> int page_size;
> - suspend_pagedir_t *suspend_pagedir;
> + suspend_pagedir_t **suspend_pagedir;
> unsigned int num_pbes;
> struct swap_location {
> char filename[SWAP_FILENAME_MAXLENGTH];
> @@ -42,6 +42,8 @@
> };
>
> #define SUSPEND_PD_PAGES(x) (((x)*sizeof(struct pbe))/PAGE_SIZE+1)
> +#define PAGEDIR_CAPACITY(x) (((x)*PAGE_SIZE/sizeof(struct pbe)))
> +#define PAGEDIR_ENTRY(pagedir, i) (pagedir[i/PAGEDIR_CAPACITY(1)] + (i%PAGEDIR_CAPACITY(1)))
>
> /* mm/vmscan.c */
> extern int shrink_mem(void);
> @@ -61,7 +63,7 @@
> extern void thaw_processes(void);
>
> extern unsigned int nr_copy_pages __nosavedata;
> -extern suspend_pagedir_t *pagedir_nosave __nosavedata;
> +extern suspend_pagedir_t **pagedir_nosave __nosavedata;
>
> /* Communication between kernel/suspend.c and arch/i386/suspend.c */
>
This, and the rest of the deleted patch, are dubious. Once you start
adding
- more page flag bits
- functions that use double pointers
big warning alarms start going off. I haven't looked that far into it yet,
but I suspect there are some design issues there that should get resolved.
-pat
Hi.
Thanks for your comments. I'll take a look at your rewrite. I'm
currently working on a port of the 2.4 beta, so I'm hoping we're not
going at cross-purposes here.
> Thank you for putting this back in C, it's much appreciated.
I know nothing about x86 assembly and have just been following the
existing code, so I can't claim any credit for this. do_magic in
assembly was just a cut and paste from suspend_asm.S so that I could get
things going with the PAGEDIR_ENTRY macro.
> Do you have something against indenting comments? ;)
Cut and paste from the original - not my comment :>
> This is better done as
>
> for (loop = 0; loop < nr_copy_pages; loop++) {
> memcpy((char *)pagedir_nosave[loop].orig_address,
> (char *)pagedir_nosave[loop].address,
> PAGE_SIZE);
> __flush_tlb();
> }
>
> Is __flush_tlb() really necessary?
Pass. Once again, I'm blindly following the comment that says you can't
use memcpy. All of my changes are algorithm rewrites, not changes to the
'magic'.
> This, and the rest of the deleted patch, are dubious. Once you start
> adding
>
> - more page flag bits
> - functions that use double pointers
>
> big warning alarms start going off. I haven't looked that far into it yet,
> but I suspect there are some design issues there that should get resolved.
Longer term, I don't want to add page_flags. A page_flag was just here
because it was the simplest way of getting a working implementation. In
the long term I would use a dynamically allocated bitmap instead.
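
A dynamically allocated bitmap of that kind might look roughly like the
sketch below. It is plain user-space C, not kernel code: the names
(collide_map and friends) are hypothetical, and calloc() merely stands
in for whatever allocator the kernel side would end up using. One bit
per page frame replaces the PG_collides page flag.

```c
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for PG_collides: one bit per page frame number. */
struct collide_map {
	unsigned long *bits;
	unsigned long nr_pfns;
};

/* Allocate a zeroed map covering nr_pfns page frames; returns 0 on success. */
int collide_map_alloc(struct collide_map *map, unsigned long nr_pfns)
{
	size_t words = (nr_pfns + BITS_PER_WORD - 1) / BITS_PER_WORD;

	map->bits = calloc(words ? words : 1, sizeof(unsigned long));
	if (!map->bits)
		return -1;
	map->nr_pfns = nr_pfns;
	return 0;
}

/* Mark a page frame as colliding; out-of-range frames are ignored. */
void collide_map_set(struct collide_map *map, unsigned long pfn)
{
	if (pfn < map->nr_pfns)
		map->bits[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
}

/* Returns 1 if the frame was marked, 0 otherwise (including out of range). */
int collide_map_test(const struct collide_map *map, unsigned long pfn)
{
	if (pfn >= map->nr_pfns)
		return 0;
	return (map->bits[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}

void collide_map_free(struct collide_map *map)
{
	free(map->bits);
	map->bits = NULL;
	map->nr_pfns = 0;
}
```

The map costs one bit per page frame and can be thrown away as soon as
the pagedir has been checked, so no page flag stays reserved.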
The double pointers were the true point of the patch. They are
necessary because my aim in future patches is to get 2.5 to the same
point as 2.4. Under 2.4, you can now suspend to disk without needing to
eat any memory (assuming enough swap etc). To achieve this, the pages of
the pagedir must be able to be scattered around memory. With the
existing code, you have to be able to allocate a contiguous set of pages
for the whole pagedir. Take, for example, a suspend cycle I did this
morning:
Mar 3 07:45:07 laptop-linux kernel: Free:1343. Sets:7720(7891),21527. PD:170. Swap:29589/53139. RAM to suspend:29930; resume:17459. Limits:30592,0
[deletia]
Mar 3 07:45:07 laptop-linux kernel: - SWSUSP Version : beta 18
Mar 3 07:45:07 laptop-linux kernel: - Swap available : 53139 (amount unused when preparing image).
Mar 3 07:45:07 laptop-linux kernel: - Pageset sizes : 7891 and 21527. (Pagedir size: 170)
Mar 3 07:45:07 laptop-linux kernel: - Expected sizes : 7891 and 21527.
Mar 3 07:45:07 laptop-linux kernel: - Parameters : 1 0 255 255 0
Mar 3 07:45:07 laptop-linux kernel: - Calculations : Image size: 29760. Ram to suspend: 30101. To resume: 17801.
Mar 3 07:45:07 laptop-linux kernel: - Limits : 30592 pages RAM. Initial boot: 29435. Current boot: 0.
In order to save 29760 pages, I needed a pagedir of 170 pages. Using the
old code, I'd have to allocate 256 contiguous pages. With only 1343
available, what do you think my chances are? With the new functionality,
it's no problem. (I should mention that I've seen a way in which I can
reduce the pagedir size that this code uses - the struct this is using
is different to the one currently in 2.5, and I would keep using the 2.5
version. This might mean we would need 128 pages instead of 256 for the
same image size, but I'm sure you'll appreciate that the argument still
stands).
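
To put rough numbers on that, here is a user-space sketch of the sizing
and of the two-level lookup that PAGEDIR_ENTRY performs. Everything in
it is an assumption made for illustration: the 16-byte struct pbe_sketch
only approximates a 2.5-style entry, the page size is fixed at 4096, and
calloc() stands in for the page allocator. The function names are
hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Assumed 16-byte entry; the real struct pbe layout may differ. */
struct pbe_sketch {
	uint32_t address;
	uint32_t orig_address;
	uint32_t swap_address;
	uint32_t dummy;
};

#define ENTRIES_PER_PAGE (SKETCH_PAGE_SIZE / sizeof(struct pbe_sketch))

/* Same arithmetic as SUSPEND_PD_PAGES(): pages needed to hold the entries. */
unsigned long pd_pages(unsigned long nr_entries)
{
	return nr_entries * sizeof(struct pbe_sketch) / SKETCH_PAGE_SIZE + 1;
}

/* A single order-based allocation has to round up to a power of two. */
unsigned long contiguous_pages_needed(unsigned long nr_pages)
{
	unsigned long n = 1;

	while (n < nr_pages)
		n <<= 1;
	return n;
}

/* Scattered pagedir: an array of pointers to independently allocated pages. */
struct pbe_sketch **pd_alloc(unsigned long nr_entries)
{
	unsigned long i, nr_pages = pd_pages(nr_entries);
	struct pbe_sketch **pd = calloc(nr_pages, sizeof(*pd));

	if (!pd)
		return NULL;
	for (i = 0; i < nr_pages; i++) {
		pd[i] = calloc(1, SKETCH_PAGE_SIZE);
		if (!pd[i]) {
			while (i--)
				free(pd[i]);
			free(pd);
			return NULL;
		}
	}
	return pd;
}

/* Equivalent of PAGEDIR_ENTRY(pagedir, i): pick the page, then the slot. */
struct pbe_sketch *pd_entry(struct pbe_sketch **pd, unsigned long i)
{
	return pd[i / ENTRIES_PER_PAGE] + i % ENTRIES_PER_PAGE;
}

void pd_free(struct pbe_sketch **pd, unsigned long nr_entries)
{
	unsigned long i, nr_pages = pd_pages(nr_entries);

	for (i = 0; i < nr_pages; i++)
		free(pd[i]);
	free(pd);
}
```

With these assumptions, contiguous_pages_needed(170) comes out at 256,
and a 16-byte entry gives pd_pages(29760) == 117, which rounds up to
128. The scattered version only ever asks for one page at a time.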
Regards,
Nigel
On Mon, 2003-03-03 at 12:55, Patrick Mochel wrote:
> http://ldm.bkbits.net:8080/linux-2.5-power
Hi again.
I've taken a look at the comments for your changesets, and our changes
do indeed conflict in a number of places. I'm happy to wait until your
cleanups get included and then merge from there. I've had a brief go at
using BK, but I'm only on a 56K connection, so I'm not sure how
practical it is for me to do pulls etc. I guess for the moment the best
path for me to take might be to continue to port 2.4 and then merge with
you once I'm done.
Regards,
Nigel
Hi!
> > Thus, I still think we can go with the patch I submitted before. I've
> > rediffed it against 2.5.63 (less the bits already applied).
>
> I've spent the last week reading, reviewing, and rewriting major portions
> of swsusp. I've actually been reasonably impressed, once I was able to get
> the code into a much more readable state.
:-).
> > diff -ruN linux-2.5.63/arch/i386/kernel/suspend.c linux-2.5.63-01/arch/i386/kernel/suspend.c
> > --- linux-2.5.63/arch/i386/kernel/suspend.c 2003-02-20 08:25:26.000000000 +1300
> > +++ linux-2.5.63-01/arch/i386/kernel/suspend.c 2003-02-20 08:27:36.000000000 +1300
>
> Thank you for putting this back in C, it's much appreciated.
Actually, it cannot be put back in C. Manipulating the stack pointer from
gcc inline assembly is just undefined. It's back in C so we can edit
it, but it needs to get back to assembly before merging with Linus.
> > + for (loop=0; loop < nr_copy_pages; loop++) {
> > + /* You may not call something (like copy_page) here: see above */
> > + for (loop2=0; loop2 < PAGE_SIZE; loop2++) {
> > + *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->orig_address))+loop2) =
> > + *(((char *)(PAGEDIR_ENTRY(pagedir_nosave,loop)->address))+loop2);
> > + __flush_tlb();
> > + }
> > + }
>
> This is better done as
>
>	for (loop = 0; loop < nr_copy_pages; loop++) {
>		memcpy((char *)pagedir_nosave[loop].orig_address,
>		       (char *)pagedir_nosave[loop].address,
>		       PAGE_SIZE);
>		__flush_tlb();
>	}
Hehe, try it.
You may not make a function call at this point, because you are
overwriting your stack. See the mails with Andi Kleen. This *needs* to be
in assembly.
> Is __flush_tlb() really necessary?
It's there to prevent Heisenbugs.
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
> > > --- linux-2.5.63/arch/i386/kernel/suspend.c 2003-02-20 08:25:26.000000000 +1300
> > > +++ linux-2.5.63-01/arch/i386/kernel/suspend.c 2003-02-20 08:27:36.000000000 +1300
> >
> > Thank you for putting this back in C, it's much appreciated.
>
> Actually, it cannot be put back in C. Manipulating the stack pointer from
> gcc inline assembly is simply undefined. It's back in C so we can edit
> it, but it needs to go back to assembly before merging with Linus.
Noted. I'll convert it back.
> > This is better done as
> >
> >	for (loop = 0; loop < nr_copy_pages; loop++) {
> >		memcpy((char *)pagedir_nosave[loop].orig_address,
> >		       (char *)pagedir_nosave[loop].address,
> >		       PAGE_SIZE);
> >		__flush_tlb();
> >	}
>
> Hehe, try it.
>
> You may not make a function call at this point, because you are
> overwriting your stack. See the mails with Andi Kleen. This *needs* to be
> in assembly.
memcpy() is inlined, at least on x86, and it seems to work fine for me
here. Besides, even if memcpy is not safe, you could at least copy 4 bytes
at a time. ;)
-pat
Hi!
> > > > --- linux-2.5.63/arch/i386/kernel/suspend.c 2003-02-20 08:25:26.000000000 +1300
> > > > +++ linux-2.5.63-01/arch/i386/kernel/suspend.c 2003-02-20 08:27:36.000000000 +1300
> > >
> > > Thank you for putting this back in C, it's much appreciated.
> >
> > Actually, it cannot be put back in C. Manipulating the stack pointer from
> > gcc inline assembly is simply undefined. It's back in C so we can edit
> > it, but it needs to go back to assembly before merging with Linus.
>
> Noted. I'll convert it back.
Okay.
> > > This is better done as
> > >
> > >	for (loop = 0; loop < nr_copy_pages; loop++) {
> > >		memcpy((char *)pagedir_nosave[loop].orig_address,
> > >		       (char *)pagedir_nosave[loop].address,
> > >		       PAGE_SIZE);
> > >		__flush_tlb();
> > >	}
> >
> > Hehe, try it.
> >
> > You may not make a function call at this point, because you are
> > overwriting your stack. See the mails with Andi Kleen. This *needs* to be
> > in assembly.
>
> memcpy() is inlined, at least on x86, and it seems to work fine for me
> here. Besides, even if memcpy is not safe, you could at least copy 4 bytes
> at a time. ;)
Well, this whole thing needs to be in assembly anyway. I decided it is not
performance-critical, and copied it byte-by-byte. That can be
changed...
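For illustration only, a word-at-a-time copy might look roughly like the
sketch below. It assumes a 4096-byte page; SKETCH_PAGE_SIZE and
copy_page_by_words are made-up names, not anything from the patch. Note
that a C compiler is free to turn such a loop into a call to memcpy(), so
this only shows the idea and is not a safe replacement for the assembly
version.

```c
#include <stddef.h>

/* Assumed page size for this sketch (4 KiB, as on i386). */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Copy one page a machine word at a time with an open-coded loop.
 * Hypothetical helper: the real resume path must not rely on the
 * stack or on function calls, which plain C cannot guarantee.
 */
static void copy_page_by_words(unsigned long *dst, const unsigned long *src)
{
    const size_t nwords = SKETCH_PAGE_SIZE / sizeof(unsigned long);
    size_t i;

    for (i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```

On a 32-bit machine this does 1024 word copies per page instead of 4096
byte copies.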
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
We've been seeing a curious phenomenon on some PIII/ServerWorks
CNB30-LE systems.
The systems fail at relatively low temperatures. While the failures
are not specifically memory related (ECC errors are never a factor),
we have a memory test that's pretty good at triggering them. Data is
apparently getting corrupted on the front-side bus.
Here's the curious thing: when we run the same memory test on a
Windows 2000 system (same hardware; we just swap the disk), we can
run the ambient temperature up to 60C with no problem at all; the
test will run for days. (It occurred to us to try Win2K because the
hardware vendor was using it to test systems at temperature without
seeing problems.)
Swap in the Linux disk, and at that temperature it'll barely run at
all. The memory test fails quickly at 40C ambient.
FWIW, CPU cooling is pretty good in this box.
So, the puzzle: what might account for temperature sensitivity, of
all things, under Linux 2.4.9-31 (RH 7.2), but not Win2K?
--
/Jonathan Lundell.
Jonathan Lundell wrote:
> We've been seeing a curious phenomenon on some PIII/ServerWorks CNB30-LE
> systems.
>
> The systems fail at relatively low temperatures. While the failures are
> So, the puzzle: what might account for temperature sensitivity, of all
> things, under Linux 2.4.9-31 (RH 7.2), but not Win2K?
Linux is more 'busy' than windoze and I have heard of boxes frying when
running Linux. The solution is to find a better motherboard
manufacturer...
Cheers,
--
------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: [email protected]
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------
On Thu, 6 Mar 2003 10:11 am, Herman Oosthuysen wrote:
> Jonathan Lundell wrote:
> > We've been seeing a curious phenomenon on some PIII/ServerWorks CNB30-LE
> > systems.
> >
> > The systems fail at relatively low temperatures. While the failures are
> > So, the puzzle: what might account for temperature sensitivity, of all
> > things, under Linux 2.4.9-31 (RH 7.2), but not Win2K?
>
> Linux is more 'busy' than windoze and I have heard of boxes frying when
> running Linux. The solution is to find a better motherboard
> manufacturer...
That doesn't make sense. His post said the temperature was 20 degrees lower
when it failed.
Con
On Thu, Mar 06, 2003 at 10:38:44AM +1100, Con Kolivas wrote:
> On Thu, 6 Mar 2003 10:11 am, Herman Oosthuysen wrote:
> > Linux is more 'busy' than windoze and I have heard of boxes frying when
> > running Linux. The solution is to find a better motherboard
> > manufacturer...
>
> That doesn't make sense. His post said the temperature was 20 degrees lower
> when it failed.
It makes perfect sense. Components drawing power produce heat, which
causes a temperature rise above ambient. Put simply, if a chip
fails at a case temperature of 50C and you have a 10C rise, it'll fail
at 40C ambient. If you have a 20C rise, it'll fail at 30C ambient.
PS, the efficiency of heatsinks is measured in degC/W - how many degrees
Celsius the temperature rises for each watt of power dissipated. Double
the dissipated power, double the temperature rise.
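As a worked version of that arithmetic, here is a minimal sketch. The
function names and the example figures (2 degC/W, 5 W and 10 W) are
assumptions chosen only to reproduce the 10C and 20C rises above; they
are not measurements from any real board.

```c
/* Steady-state temperature of a part: ambient plus power times thermal resistance. */
static double part_temp_c(double ambient_c, double power_w, double theta_c_per_w)
{
    return ambient_c + power_w * theta_c_per_w;
}

/* Highest ambient temperature at which the part stays at or below its limit. */
static double max_ambient_c(double limit_c, double power_w, double theta_c_per_w)
{
    return limit_c - power_w * theta_c_per_w;
}
```

With a 50C limit and 2 degC/W, max_ambient_c(50.0, 5.0, 2.0) gives 40C
and max_ambient_c(50.0, 10.0, 2.0) gives 30C: doubling the power doubles
the rise and costs another 10C of ambient headroom.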
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
Russell King wrote:
> On Thu, Mar 06, 2003 at 10:38:44AM +1100, Con Kolivas wrote:
>
>>On Thu, 6 Mar 2003 10:11 am, Herman Oosthuysen wrote:
>>
>>>Linux is more 'busy' than windoze and I have heard of boxes frying when
>>>running Linux. The solution is to find a better motherboard
>>>manufacturer...
>>
>>That doesn't make sense. His post said the temperature was 20 degrees lower
>>when it failed.
>
>
> It makes perfect sense. Components drawing power produce heat, which
> causes a temperature rise above ambient. Put simply, if a chip
> fails at a case temperature of 50C and you have a 10C rise, it'll fail
> at 40C ambient. If you have a 20C rise, it'll fail at 30C ambient.
>
> PS, the efficiency of heatsinks is measured in degC/W - how many degrees
> Celsius the temperature rises for each watt of power dissipated. Double
> the dissipated power, double the temperature rise.
>
That doesn't make much sense.
A chip at a given power output fails at a certain chip temperature;
this temperature doesn't vary with the case temp. If the case temp
increases then the chip temp will increase as long as the cooling system
on the chip doesn't change. Hence if the case temp increases the chip
temperature will increase and that could put it into the range of
failure. If the case temp decreases then the chip temp decreases.
The behavior you describe is what happens when you increase the power
output of a chip beyond normal specifications (overclocking): the
temperature of failure is lowered. E.g. a chip that would run normally at
50C can now only run stably at 40-45C.
Chip temp sensors are usually located in a relatively cool area of the
chip, hence reported chip failure temps are usually around 60C (max) when
the real figure is around 80-90C. Unfortunately for us, chip temperature
is not uniform across the chip.
Here is a nice little site to get some info on that stuff.
http://users.erols.com/chare/elec.htm
That being said:
I've never heard of running linux frying someone's cpu. I could see
frying a power supply, because cheap power supplies will fail after a
while under the idle/load cycles that linux is good at producing. I really
don't see how else linux could be more "busy" than windows, especially
since windows has 5 or 6 spyware ad programs running behind the scenes all
the time anyway, and the virus scanner having to check every instruction
would definitely lead to a higher cpu average than a linux box doing the
same things minus the spyware and virus scanner. It just doesn't make any
sense. Erroring out more in linux than windows... possibly yes, depending
on which version, but not hardware damage under normal use.
> The behavior you describe is what happens when you increase the power
> output of a chip beyond normal specifications (overclocking): the
> temperature of failure is lowered. E.g. a chip that would run normally at
> 50C can now only run stably at 40-45C.
You are the one mistaken. Most CPUs don't dissipate a constant amount
of power as heat. That depends on what the CPU is doing. For example,
even the Athlon without disconnect will cool some when it is 'halt'ed.
If a CPU is working more, accomplishing more than it was at another
time, it will need to rid itself of more heat. Hence the external
temperature becomes the limiting factor (along with how good the heat
exchange system is, i.e. heat sink/fan).
I do believe the previous poster was incorrect about the mathematical
relationship between case and CPU temperatures. They are NOT 1:1.
However, he is right, they are mathematically related. Just as the heat
dissipated and the work done are related.
You do not need to overclock a CPU to get this kind of a change. The
change in the efficiency (memory management, task switching, etc.) of
how the work is done can cause the CPU to be worked harder... and when
the CPU is worked harder, so is memory and quite often just about
everything else.
Trever
--
One O.S. to rule them all, One O.S. to find them. One O.S. to bring them
all and in the darkness bind them.
At 7:29pm -0500 3/5/03, Ed Sweetman wrote:
>I've never heard of running linux frying someone's cpu. I could see
>frying a power supply, because cheap power supplies will fail after a
>while under the idle/load cycles that linux is good at producing. I
>really don't see how else linux could be more "busy" than windows,
>especially since windows has 5 or 6 spyware ad programs running behind
>the scenes all the time anyway, and the virus scanner having to check
>every instruction would definitely lead to a higher cpu average than a
>linux box doing the same things minus the spyware and virus scanner.
>It just doesn't make any sense. Erroring out more in linux than
>windows... possibly yes, depending on which version, but not hardware
>damage under normal use.
I don't think it's a case of "busy" per se. Both systems are 100%
occupied with a userland memory test (it just mallocs and locks a
biggish buffer, and does reads and writes of various patterns). One
pass of the test takes about 104 seconds on both systems (presumably
it's memory-bound, so compiler differences aren't showing up).
It was suggested off-list that I compare the chipset config registers
to see if anything is different. I've been meaning to do that, but
just looking at the registers, I don't see anything that would affect
FSB timing, or the FSB at all, for that matter. Nonetheless, I'll do
the comparison as soon as I dig up an lspci equivalent for Win2K.
As for temperature differences, the heat sink temperature (at least)
doesn't seem to differ appreciably between the systems, which is what
I'd expect with essentially the same load on each.
I'm wondering, somewhat ignorantly, if there might be some kind of
CPU configuration that Windows is adjusting, as some kind of
workaround or the like. I don't suppose that this is the best place
to ask how to read MSRs from Windows....
--
/Jonathan Lundell.
On Wed, Mar 05, 2003 at 01:52:16PM -0800, Jonathan Lundell wrote:
> We've been seeing a curious phenomenon on some PIII/ServerWorks
> CNB30-LE systems.
>
> The systems fail at relatively low temperatures. While the failures
> are not specifically memory related (ECC errors are never a factor),
> we have a memory test that's pretty good at triggering them. Data is
> apparently getting corrupted on the front-side bus.
>
> Here's the curious thing: when we run the same memory test on a
> Windows 2000 system (same hardware; we just swap the disk), we can
> run the ambient temperature up to 60C with no problem at all; the
> test will run for days. (It occurred to us to try Win2K because the
> hardware vendor was using it to test systems at temperature without
> seeing problems.)
>
> Swap in the Linux disk, and at that temperature it'll barely run at
> all. The memory test fails quickly at 40C ambient.
>
> FWIW, CPU cooling is pretty good in this box.
>
> So, the puzzle: what might account for temperature sensitivity, of
> all things, under Linux 2.4.9-31 (RH 7.2), but not Win2K?
Since it doesn't sound like this is a memory error, but rather a chipset
driver error, it could be a Linux driver bug.
You are running a very old kernel; at the least, upgrade to the latest
errata (which is currently 2.4.18-26.7). You are running the latest
security updates as well, right?
-Dave
On Wed, Mar 05, 2003 at 01:52:16PM -0800, Jonathan Lundell wrote:
> We've been seeing a curious phenomenon on some PIII/ServerWorks
> CNB30-LE systems.
>
> So, the puzzle: what might account for temperature sensitivity, of
> all things, under Linux 2.4.9-31 (RH 7.2), but not Win2K?
Hmmm. Wasn't there something with IDE and the LE chipset?
Maybe you should try a current kernel. I don't know if this old kernel has
the fix.
See you
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
On Thu, 6 Mar 2003 10:38:44 +1100,
Con Kolivas <[email protected]> wrote:
>
> That doesn't make sense. His post said the temperature was 20 degrees lower
> when it failed.
>
> Con
I think it does,
look at this:
RAM
._____________________.
_|| | | | | | | | | | ||_. ._/| ._/|
/ ||___________________|| |~\ ||/| ||/|
| |O _____ O| |~\\ /||/| ||/|
| | .-?| | |?-. | |\\\\ //|| | || |
| | / \ |~| | / \ | |\\\\\ //=|| |=|| |
| | /| |\| |~|/| |\ | |\\\\.________. ///=||/|=||/|
| | * | | \_._/ |~| * | |\===| |==///==||/|=||/|
| | |~|~| /CPU\ ~ | | | |====| north |==///==|| |=|| |
| | | | |~\_ _/ | | | | |====| bridge |=======|| |=|| |
| | * | | / ? \ | |~* | |/===| (MEM ) |=======||/|=||/|
| | \| |/| |~|\|~|/ | |//==| (CTRL) |==\\\==||/|=||/|
| | \ / |~| | \ / | |////?~~~~~~~~?==\\\==|| |=|| |
| | ?-.|_|_|.-? | |///// |||||| \\\=|| |=|| |
| |O O| |//// |||||| \\=||/|=||/|
| |~~~~~~~~~~~~~~~~~~~~~| |_// |||||| \\||/| ||/|
?~|| | | | | | | | | | ||~?_/ |||||| \|| | || |
?~~~~~~~~~~~~~~~~~~~~~? |||||| ||/ ||/
CPU TEMP | |||||| |_| |_|
| | voltage ||||||
| ||| ||||||
| ||| .________.
Mainboard | ||| | |
TEMP .,,,,,,. data | south |
O | |=======| bridge |
\\_____?''''''? | (BUS ) |
?~~~~~~? | (CTRL) |
TEMP & ?~~~~~~~~?
VOLTAGE ctrl ////|||\\\\\
chip PCI & other BUS
The sensor for the system temperature (somewhere on the board) is connected to a driver chip (usually on the i2c bus),
like the w83781d (on my board).
If something now causes the (often badly cooled) bridge to get hot (more load between some peripheral and the RAM, for example),
the system temperature doesn't necessarily have to increase.
If the bridge has only a heatsink, its temperature is something like
(system TEMP) + (heat produced per unit time / heat given to the air by the heatsink per unit time),
where the heatsink's capacity depends on the delta temperature too. Gets complicated ;)
In short, the chip's hotter than the rest of the system, and if it has high load it gets even hotter,
but its temp is still dependent on the main system TEMP. ;)
Blahrgh, forget what I said, look at the ASCII art, and imagine the effect of lots of data running between
BUS and RAM ;-) (or BUS and BUS if north and southbridge are on the same chip)
CvC
This is getting kicked like a dead horse by now, I think, but:
Unlike your cpu, which gets idle commands from the OS and thus has an
idle loop where it turns off certain circuits, and which can get acpi
commands to turn off completely, the other chips in the computer do not
have such a luxury. They are always on, like the cpus of yesteryear used
to be. It doesn't matter if they have data moving in them or not; no
big difference. The reason it seems otherwise, for
you HSF-cooled cpu guys, is that load on the system bus usually means high cpu
load, and that means more heat put into the surrounding air, so the
small, usually passively cooled (but in any case less hot) system bus chip gets
hotter along with the cpu and cooler when the cpu is idle. People
cooled by other methods that do not dump heat into the surrounding air
inside the case will notice that the system bus temp only varies with
ambient air temp changes, not with data transfer going on between ram and cpu.
Corvus Corax wrote:
> On Thu, 6 Mar 2003 10:38:44 +1100,
> Con Kolivas <[email protected]> wrote:
>
>
>>That doesn't make sense. His post said the temperature was 20 degrees lower
>>when it failed.
>>
>>Con
>
>
> I think it does,
>
> look at this:
>
> RAM
> ._____________________.
> _|| | | | | | | | | | ||_. ._/| ._/|
> / ||___________________|| |~\ ||/| ||/|
> | |O _____ O| |~\\ /||/| ||/|
> | | .-?| | |?-. | |\\\\ //|| | || |
> | | / \ |~| | / \ | |\\\\\ //=|| |=|| |
> | | /| |\| |~|/| |\ | |\\\\.________. ///=||/|=||/|
> | | * | | \_._/ |~| * | |\===| |==///==||/|=||/|
> | | |~|~| /CPU\ ~ | | | |====| north |==///==|| |=|| |
> | | | | |~\_ _/ | | | | |====| bridge |=======|| |=|| |
> | | * | | / ? \ | |~* | |/===| (MEM ) |=======||/|=||/|
> | | \| |/| |~|\|~|/ | |//==| (CTRL) |==\\\==||/|=||/|
> | | \ / |~| | \ / | |////?~~~~~~~~?==\\\==|| |=|| |
> | | ?-.|_|_|.-? | |///// |||||| \\\=|| |=|| |
> | |O O| |//// |||||| \\=||/|=||/|
> | |~~~~~~~~~~~~~~~~~~~~~| |_// |||||| \\||/| ||/|
> ?~|| | | | | | | | | | ||~?_/ |||||| \|| | || |
> ?~~~~~~~~~~~~~~~~~~~~~? |||||| ||/ ||/
> CPU TEMP | |||||| |_| |_|
> | | voltage ||||||
> | ||| ||||||
> | ||| .________.
> Mainboard | ||| | |
> TEMP .,,,,,,. data | south |
> O | |=======| bridge |
> \\_____?''''''? | (BUS ) |
> ?~~~~~~? | (CTRL) |
> TEMP & ?~~~~~~~~?
> VOLTAGE ctrl ////|||\\\\\
> chip PCI & other BUS
>
>
> The sensor for the system temperature (somewhere on the board) is connected to a driver chip (usually on the i2c bus),
> like the w83781d (on my board).
>
> If something now causes the (often badly cooled) bridge to get hot (more load between some peripheral and the RAM, for example),
> the system temperature doesn't necessarily have to increase.
>
> If the bridge has only a heatsink, its temperature is something like
> (system TEMP) + (heat produced per unit time / heat given to the air by the heatsink per unit time),
> where the heatsink's capacity depends on the delta temperature too. Gets complicated ;)
>
> In short, the chip's hotter than the rest of the system, and if it has high load it gets even hotter,
> but its temp is still dependent on the main system TEMP. ;)
>
> Blahrgh, forget what I said, look at the ASCII art, and imagine the effect of lots of data running between
> BUS and RAM ;-) (or BUS and BUS if north and southbridge are on the same chip)
>
> CvC
>
On Thu, 06 Mar 2003 02:57:31 -0500,
Ed Sweetman <[email protected]> wrote:
> This is getting kicked like a dead horse by now, I think, but:
>
Even dead horses have the right to be kicked ;) but:
>
> Unlike your cpu, which gets idle commands from the OS and thus has an
> idle loop where it turns off certain circuits, and which can get acpi
> commands to turn off completely, the other chips in the computer do not
> have such a luxury. They are always on, like the cpus of yesteryear used
> to be. It doesn't matter if they have data moving in them or not; no
> big difference.
I don't think this is right, for two reasons:
First, I've burned my finger on my bridge often enough to know that
its temperature varies, on some chips by really huge amounts ;-)
(on my new board they don't even get hand-warm).
Second, the fact that the chip (or circuit) is on doesn't mean that current flows.
All halfway-new microchips (including those bridges, of course) are built in CMOS
or similar technology, meaning that there is no longer static current flowing through
the transistors, only capacitance-dependent current when the transistors change
their state.
If no data flows, few transistors change their state
(only clock signals and some other idle work that is done),
and the output drivers to the bus systems are turned low.
So there is no static current on the bus and hardly any dynamic current in the chip --> low overall current
--> low temperature.
On the other hand, if data flows, it's being processed, and many transistors change their state with the data flow,
as do the output driver blocks --> high static and dynamic current --> high temperature.
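To put a rough number on the switching argument, the usual first-order
estimate for CMOS dynamic power is activity * C * V^2 * f. The sketch
below just evaluates that formula; the function name and any values fed
into it are hypothetical, and it ignores leakage and bus-driver currents
entirely.

```c
/*
 * First-order CMOS dynamic power estimate:
 *   P = activity * C * Vdd^2 * f
 * activity: fraction of the switched capacitance toggling per clock (0..1)
 * cap_f:    total switched capacitance in farads
 * vdd_v:    supply voltage in volts
 * freq_hz:  clock frequency in hertz
 */
static double cmos_dynamic_power_w(double activity, double cap_f,
                                   double vdd_v, double freq_hz)
{
    return activity * cap_f * vdd_v * vdd_v * freq_hz;
}
```

With everything else held fixed, going from an activity of 0.1 (mostly
idle, clocks only) to 0.5 (data flowing) multiplies the estimate by five.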
> The reason it seems otherwise, for
> you HSF-cooled cpu guys, is that load on the system bus usually means high cpu
> load, and that means more heat put into the surrounding air, so the
> small, usually passively cooled (but in any case less hot) system bus chip gets
> hotter along with the cpu and cooler when the cpu is idle. People
> cooled by other methods that do not dump heat into the surrounding air
> inside the case will notice that the system bus temp only varies with
> ambient air temp changes, not with data transfer going on between ram and cpu.
>
Then this would be measurable as a higher mainboard or system temperature,
which is not so in our case, as described in the other mails.
greetings,
Corvus V Corax ;)
Corvus Corax wrote:
> On Thu, 06 Mar 2003 02:57:31 -0500,
> Ed Sweetman <[email protected]> wrote:
>
>
>>This is getting kicked like a dead horse by now, I think, but:
>>
>
> Even dead horses have the right to be kicked ;) but:
>
>
>>Unlike your cpu, which gets idle commands from the OS and thus has an
>>idle loop where it turns off certain circuits, and which can get acpi
>>commands to turn off completely, the other chips in the computer do not
>>have such a luxury. They are always on, like the cpus of yesteryear used
>>to be. It doesn't matter if they have data moving in them or not; no
>>big difference.
>
>
> I don't think this is right, for two reasons:
>
> First, I've burned my finger on my bridge often enough to know that
> its temperature varies, on some chips by really huge amounts ;-)
> (on my new board they don't even get hand-warm).
>
> Second, the fact that the chip (or circuit) is on doesn't mean that current flows.
>
> All halfway-new microchips (including those bridges, of course) are built in CMOS
> or similar technology, meaning that there is no longer static current flowing through
> the transistors, only capacitance-dependent current when the transistors change
> their state.
>
> If no data flows, few transistors change their state
> (only clock signals and some other idle work that is done),
> and the output drivers to the bus systems are turned low.
>
> So there is no static current on the bus and hardly any dynamic current in the chip --> low overall current
> --> low temperature.
>
> On the other hand, if data flows, it's being processed, and many transistors change their state with the data flow,
> as do the output driver blocks --> high static and dynamic current --> high temperature.
>
>
>
>>The reason it seems otherwise, for
>>you HSF-cooled cpu guys, is that load on the system bus usually means high cpu
>>load, and that means more heat put into the surrounding air, so the
>>small, usually passively cooled (but in any case less hot) system bus chip gets
>>hotter along with the cpu and cooler when the cpu is idle. People
>>cooled by other methods that do not dump heat into the surrounding air
>>inside the case will notice that the system bus temp only varies with
>>ambient air temp changes, not with data transfer going on between ram and cpu.
>>
>
>
> Then this would be measurable as a higher mainboard or system temperature,
> which is not so in our case, as described in the other mails.
Higher than what? The bus power output is constant; the internal
temperature of the case is dictated by the heat given off by all the
components, and they're not separable. That's like saying I would be able
to tell if my cpu is putting out constant power by seeing a higher
ambient air temp in the computer case... well no, you wouldn't: it becomes
a constant, and all the other components that do have varying power
outputs dictate the fluctuations of ambient air. How you get a constant
contributing to changes in ambient air is beyond me, and finding a
comparison in order to say it makes the air hotter is further beyond me.
The ambient air temp is the temp it is because of all the components
inside the case, including the system bus. The only way the system bus's
power output could be measured as a higher ambient air temp
is if it worked the way you suggested (which I really don't think most
do). If you mean that the system bus should be hotter if it's always on
the way I suggest... well, you'd be wrong. They don't get hot if the
ambient temperature around the bus's heatsink stays cool. Most people
just run them passive when they have watercooling because, without all
the heat from the cpu's heatsink, the ambient air around the bus is
sufficiently cool. Otherwise a fan is usually needed.
I know for a fact my abit athlon motherboard's bus chip doesn't change
temperature due to load in the system. The only time it fluctuates is
when the temperature of the room changes, and that change is not due to
the chip (unless I've got no air circulation in the room; then the computer
as a whole will heat up all the air and that feeds back on itself).
I believe the originator of the thread went back to check and see if he
can find out exactly what his Windows drivers are enabling. The rest of
the thread has been arguing over whether linux can load hardware more than
windows can, and what puts off heat and what doesn't, which is stupid.
I think the topic of the thread is a bunch of BS because, unless he has a
driver that is for some reason changing the frequency of something or
the voltage, linux is not going to stress the system more than
windows. The whole thing reeks of FUD, whether intentional or not.
> greetings,
>
>
> Corvus V Corax ;)
On Wed, Mar 05, 2003 at 07:47:05PM -0500, Trever L. Adams wrote:
> You are the one mistaken. Most CPUs don't dissipate a constant amount
> of power as heat. That depends on what the CPU is doing.
Correct - each time a gate in the CPU switches state, it produces a
small amount of heat. Have enough gates switching, and you produce
a lot of heat (and your current consumption goes up.) This is basic
CMOS operation.
> I do believe the previous poster was incorrect about the mathematical
> relationship between case and CPU temperatures.
I never said there was a 1:1 relationship here - you misread my mail.
I talked about _heat sinks_, not the relationship between the temperature
on the silicon die and the external case temperature, with or without a
heatsink, with or without a fan. If you want to talk about the silicon
die, then you need to take into account thermal resistance between the
die and the case, the case and the heatsink, the heatsink and the
surrounding air, the fact that the heatsink is attached to one side
only, etc.
However, going into it in minute detail with all the maths is NOT a
subject for this list.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Thursday 06 March 2003 01:18 am, Corvus Corax wrote:
> On Thu, 6 Mar 2003 10:38:44 +1100,
> Con Kolivas <[email protected]> wrote:
> > That doesn't make sense. His post said the temperature was 20 degrees
> > lower when it failed.
> >
> > Con
>
> I think it does,
snip
> If the bridge has only a heatsink, its temperature is something like
> (system TEMP) + (heat produced per unit time / heat given to the air by the
> heatsink per unit time), where the heatsink's capacity depends on the delta
> temperature too. Gets complicated ;)
>
> In short, the chip's hotter than the rest of the system, and if it has high
> load it gets even hotter, but its temp is still dependent on the main
> system TEMP. ;)
It is also referred to as thermal inertia. It takes time for the heat sink to
1. heat up
2. start transferring that heat out
During that time delay the chip may easily overheat in a burst of activity.
Same thing happens to fuses... a "slow blow" fuse will blow faster in higher
ambient temperature, under conditions that are normal because the AC was
turned on...
--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
On Thursday 06 March 2003 02:58 am, Ed Sweetman wrote:
snip
> I know for a fact my abit athlon motherboard's bus chip doesn't change
> temperature due to load in the system. The only time it fluctuates is
> when the temperature of the room changes, and that change is not due to
> the chip (unless I've got no air circulation in the room; then the computer
> as a whole will heat up all the air and that feeds back on itself).
Only because you are removing the heat as fast as it is being generated.
Which speaks for a good motherboard, heat sink, and fan combination,
along with decent AC for the room.
Additional heat generation with the use of Linux has been documented
going back to the 486 days, when problems were traced to an insufficient
heat sink. (system works with windows, crashes with Linux... replaced heat
sink and all is well).
The entire thread has been about a burst of activity that causes a thermal
spike in one or two possible locations not in the CPU. The internal
ambient temperature takes at least 3-5 seconds to change before the
sensor can report it. If the chip is already operating just below its
critical temperature (and that varies among chips, even in the same lot)
then it will work with windows.
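One rough way to picture that lag is a lumped first-order model, where a
node with small heat capacity (standing in for the chip) follows a burst
much faster than a node with large heat capacity (standing in for
whatever the sensor sits on). This is only a sketch: the function names
are made up and every parameter value is an assumption, not a measurement.

```c
/*
 * One explicit-Euler step of a lumped thermal node:
 *   heat_cap * dT/dt = power_in - (T - ambient) / theta
 */
static double thermal_step_c(double temp_c, double ambient_c, double power_w,
                             double theta_c_per_w, double heat_cap_j_per_c,
                             double dt_s)
{
    double flow_out_w = (temp_c - ambient_c) / theta_c_per_w;

    return temp_c + dt_s * (power_w - flow_out_w) / heat_cap_j_per_c;
}

/* Temperature after a constant-power burst lasting `steps` steps of dt_s seconds. */
static double temp_after_burst_c(double start_c, double ambient_c,
                                 double power_w, double theta_c_per_w,
                                 double heat_cap_j_per_c, double dt_s,
                                 int steps)
{
    double t = start_c;
    int i;

    for (i = 0; i < steps; i++)
        t = thermal_step_c(t, ambient_c, power_w, theta_c_per_w,
                           heat_cap_j_per_c, dt_s);
    return t;
}
```

For a one-second, 10 W burst starting at 40C with 2 degC/W, a node with
0.5 J/degC climbs by more than 10C, while a node with 50 J/degC moves by
well under 1C in the same second, so a reading taken at the slow node
would barely register the spike.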
Linux has a much higher demand on the hardware, partially due to the
ability to generate DMA requests faster. This adds extra heat to the
bridges, and COULD push the chip over the critical temperature for
brief times (I would guess it would be in the millisecond range). Sustained
DMA activity would be a suspect in something like this.
It would be an interesting research topic to put high precision sensors
on all of the important chips on a motherboard (say between the chip and
heat sink) and come up with a time sequence and thermal map of a
collection of motherboards....
--
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]
Any opinions expressed are solely my own.
At 7:12am +0100 3/6/03, Matthias Schniedermeyer wrote:
>Hmmm. Wasn't there something with IDE and the LE chipset?
>
>Maybe you should try a current kernel. I don't know if this old kernel has
>the fix.
It involved DMA, I think; I've disabled IDE DMA altogether.
My current plan is to run the tests with a more recent kernel, and to
compare PIII MSRs between a Linux and Windows boot.
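On the Linux side, one way to dump an MSR is through the msr character
device, assuming the kernel has the msr driver available and the reader
runs as root. A minimal sketch (read_msr is a made-up helper name; which
MSR numbers are worth comparing is left to the caller):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one 64-bit MSR through the Linux msr device.
 * dev_path is the per-CPU device node, e.g. "/dev/cpu/0/msr".
 * The MSR number is used as the file offset.
 * Returns 0 on success, -1 on any failure.
 */
static int read_msr(const char *dev_path, uint32_t msr, uint64_t *value)
{
    ssize_t n;
    int fd = open(dev_path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

Calling this for the same list of MSR numbers on each CPU and saving the
output gives something that can be diffed against whatever the Windows
side produces.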
--
/Jonathan Lundell.
On Thu, March 06, 2003 at 12:58 AM, Ed Sweetman wrote:
>
> I believe the originator of the thread went back to check and
> see if he can find out exactly what his Windows drivers are
> enabling. the rest of the thread has been arguing over if
> linux can load hardware more than windows can and what puts
> off heat and what doesn't which is stupid.
>
> i think the topic of the thread is a bunch of BS because
> unless he has a driver that is for some reason changing the
> frequency of something or the voltage then linux is not going
> to stress the system more than windows. The whole thing reeks
> of FUD whether intentional or not.
>
Well, here's _my_ stupid BS and (intentional) FUD question: 8)
Does anybody know if XP actively performs progressive power
management actions as the CPU temperature increases inside the
normal operating range? If this were done somewhat linearly
starting at "medium rare", instead of only at "well done" to
save the hardware, wouldn't it look like the reported anomaly?
Always suspicious ;)
Ed
----------------------------------------------------------------
Ed Vance edv (at) macrolink (dot) com
Macrolink, Inc. 1500 N. Kellogg Dr Anaheim, CA 92807
----------------------------------------------------------------
Jonathan Lundell <[email protected]> said:
> We've been seeing a curious phenomenon on some PIII/ServerWorks
> CNB30-LE systems.
>
> The systems fail at relatively low temperatures. While the failures
> are not specifically memory related (ECC errors are never a factor),
> we have a memory test that's pretty good at triggering them. Data is
> apparently getting corrupted on the front-side bus.
>
> Here's the curious thing: when we run the same memory test on a
> Windows 2000 system (same hardware; we just swap the disk), we can
> run the ambient temperature up to 60C with no problem at all; the
> test will run for days. (It occurred to us to try Win2K because the
> hardware vendor was using it to test systems at temperature without
> seeing problems.)
>
> Swap in the Linux disk, and at that temperature it'll barely run at
> all. The memory test fails quickly at 40C ambient.
Linux gives the hardware a _much_ harder workout than Windows.
My first PC was a P/100, overclocked to /120. WinNT worked fine, Linux
wouldn't even finish booting.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Hi!
> > Thus, I still think we can go with the patch I submitted before. I've
> > rediffed it against 2.5.63 (less the bits already applied).
>
> I've spent the last week reading, reviewing, and rewriting major portions
> of swsusp. I've actually been reasonably impressed, once I was able to get
> the code into a much more readable state.
>
> All in all, I think the idea of saving state to swap is dangerous for
> various reasons. However, I like some of the other concepts of the code,
Can you elaborate? I believe writing
to swap is good for user; and it works.
> and will use them in developing a more palatable mechanism of doing STDs
What is STD?
> http://ldm.bkbits.net:8080/linux-2.5-power
>
Can you post cumulative diff of work-in-progress?
I am not permitted to use bk. Also please
make sure that you post the diff before
you merge it (and please Cc me).
Pavel
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...
> > All in all, I think the idea of saving state to swap is dangerous for
> > various reasons. However, I like some of the other concepts of the code,
>
> Can you elaborate? I believe writing
> to swap is good for user; and it works.
It does work, but there are uncertainties inherently present when using
such a solution. Some of them were just the behavior of the current code,
which I fixed, like:
- Only ever using the first swap partition, regardless of space left.
- Not resetting swap signature if a resume failed.
- Almost complete lack of a recovery path if anything failed (i.e. trying
to back out of what has happened, instead of calling BUG() or panic()).
- Function names like do_magic() and friends.
These types of things don't instill any confidence in a user or other
developer looking at the code. It gives the impression that the code is
the result of blind guess work in the dark. After looking at the code, it
was a shock to me that it worked at all.
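To illustrate the "recovery path" point, here is a minimal sketch in plain
userspace C, assuming a made-up suspend_ctx structure and a made-up
suspend_prepare() step (neither is taken from swsusp or from the patch). It
only shows the general shape of backing out of a partial failure instead of
calling BUG() or panic():

```c
#include <stdlib.h>

/* Hypothetical state for one suspend attempt; not taken from swsusp. */
struct suspend_ctx {
    void *pagedir;      /* stands in for a page directory */
    void *copy_buffer;  /* stands in for an atomic-copy buffer */
};

/*
 * Acquire everything the (made-up) suspend step needs. On failure, release
 * whatever was already acquired and return -1 without touching *ctx.
 * fail_at simulates a failure: 0 = none, 1 = first allocation fails,
 * 2 = second allocation fails.
 */
int suspend_prepare(struct suspend_ctx *ctx, int fail_at)
{
    void *pagedir;
    void *buffer;

    pagedir = (fail_at == 1) ? NULL : malloc(4096);
    if (!pagedir)
        goto err;

    buffer = (fail_at == 2) ? NULL : malloc(4096);
    if (!buffer)
        goto err_free_pagedir;

    /* Commit only once every step has succeeded. */
    ctx->pagedir = pagedir;
    ctx->copy_buffer = buffer;
    return 0;

err_free_pagedir:
    free(pagedir);
err:
    return -1;
}

/* Release the resources held by a successfully prepared context. */
void suspend_cleanup(struct suspend_ctx *ctx)
{
    free(ctx->pagedir);
    free(ctx->copy_buffer);
    ctx->pagedir = NULL;
    ctx->copy_buffer = NULL;
}
```

The caller sees an error code and an untouched context, so it can report the
failure and carry on running rather than halting the machine.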
I understand that getting it to work involves dealing with the
uncertainties. However, there is no reason to pass them on to other users.
There were no comments as to what the do_magic*() functions did, let alone
why they were 'magic', and there were 5 of them.
There are uncertainties still present in the code, like
- #warning about waiting for data to reach the disk.
- "Waiting for DMAs to settle down" delay on resume.
I respect the paranoia. However, it's things like these that should be
dealt with before anything else.
The general problems that I see with the solution are:
- It simply won't work if you're low on swap or memory.
- It won't work if your swap is not persistent across reboots.
- It won't work if you don't use swap.
- It's dependent on the same exact kernel being loaded.
It should only be dependent on the binary format of the written metadata.
It also shouldn't be waiting until all the devices are probed and
initialized, but that problem is out of your hands.
Another problem I see in the future is initramfs, and when things start
executing in there. It's currently unpacked by populate_rootfs() in
init/main.c, long before software_resume() is called. Though it doesn't
cause any explicit problems ATM, it does introduce more uncertainties.
I don't want to cast the entire project in a negative light, though. It
does work, and I'm fairly impressed by it. I do not want to take the
feature away. I see it coexisting and sharing code nicely with any
other solutions.
I've created a registration mechanism for PM 'drivers', and a way for
users to select which driver they want to use for the different PM states.
In the patch, swsusp is just another driver. It can coexist with ACPI or
APM (theoretically) just fine, without requiring a kernel rebuild or
reboot.
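As a rough illustration of what such a registration mechanism might look
like, here is a userspace sketch. All names in it (pm_driver, pm_register,
pm_select, pm_enter, the state names, the two demo drivers) are assumptions
made up for this example; they are not the interfaces in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sleep states; the names are illustrative only. */
enum pm_state { PM_STANDBY, PM_SUSPEND_MEM, PM_SUSPEND_DISK, PM_STATE_MAX };

struct pm_driver {
    const char *name;
    unsigned int states;                /* bitmask of (1u << state) */
    int (*enter)(enum pm_state state);  /* returns 0 on success */
};

#define MAX_PM_DRIVERS 8

static struct pm_driver *drivers[MAX_PM_DRIVERS];
static size_t nr_drivers;
static struct pm_driver *selected[PM_STATE_MAX];

/* Name of the driver whose enter() ran last; for demonstration only. */
const char *pm_last_entered;

/* Add a driver to the table; returns 0 on success, -1 on error. */
int pm_register(struct pm_driver *drv)
{
    if (!drv || !drv->name || !drv->enter || nr_drivers == MAX_PM_DRIVERS)
        return -1;
    drivers[nr_drivers++] = drv;
    return 0;
}

/* Choose, by name, which registered driver handles a given state. */
int pm_select(enum pm_state state, const char *name)
{
    size_t i;

    if ((unsigned int)state >= PM_STATE_MAX || !name)
        return -1;
    for (i = 0; i < nr_drivers; i++) {
        if (strcmp(drivers[i]->name, name) == 0 &&
            (drivers[i]->states & (1u << state))) {
            selected[state] = drivers[i];
            return 0;
        }
    }
    return -1;  /* unknown driver, or it does not support this state */
}

/* Single entry point: delegate to whichever driver the user selected. */
int pm_enter(enum pm_state state)
{
    if ((unsigned int)state >= PM_STATE_MAX || !selected[state])
        return -1;
    return selected[state]->enter(state);
}

/* Two stand-in drivers that only record that they were called. */
static int demo_swsusp_enter(enum pm_state state)
{
    (void)state;
    pm_last_entered = "swsusp";
    return 0;
}

static int demo_acpi_enter(enum pm_state state)
{
    (void)state;
    pm_last_entered = "acpi";
    return 0;
}

struct pm_driver demo_swsusp = {
    "swsusp", 1u << PM_SUSPEND_DISK, demo_swsusp_enter
};
struct pm_driver demo_acpi = {
    "acpi",
    (1u << PM_STANDBY) | (1u << PM_SUSPEND_MEM) | (1u << PM_SUSPEND_DISK),
    demo_acpi_enter
};
```

With both demo drivers registered, selecting "swsusp" for the disk state and
"acpi" for the memory state routes each pm_enter() call to a different
driver without rebuilding anything.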
This also involves a generic framework for doing system-wide power
management. In this, I've begun extracting bits from swsusp that are
useful for any PM sequence. My goal is to reduce swsusp to just a small
layer that writes/reads the saved pages from swap. The rest of the
sequence, including memory and device handling, happens in generic code.
> > and will use them in developing a more palatable mechanism of doing STDs
>
> What is STD?
Suspend-to-disk.
> > http://ldm.bkbits.net:8080/linux-2.5-power
> >
>
> Can you post cumulative diff of work-in-progress?
> I am not permitted to use bk. Also please
> make sure that you post the diff before
> you merge it (and please Cc me).
Sure. From the above link, you can view the individual patches. I would
hope that you could use wget to snarf them, though I don't know if that's
legally ok (nor do I want to know).
The cumulative patch is here:
http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
If I get a chance in the next few days, I'll post incremental diffs.
Without them, the gradual changes are not so obvious.
I understand you may not want a rewrite of swsusp (regardless of how much
cleaner the code is), and I respect that. I'm completely willing to leave
kernel/suspend.c intact, and let you work on the integration into the
generic PM model, and/or simply rename the new code something like
swsusp2, swsusp-XP, or swsusp-pat. ;)
-pat
Hi!
> > > All in all, I think the idea of saving state to swap is dangerous for
> > > various reasons. However, I like some of the other concepts of the code,
> >
> > Can you elaborate? I believe writing
> > to swap is good for user; and it works.
>
> It does work, but there are uncertainties inherently present when using
> such a solution. Some of them were just the behavior of the current code,
> which I fixed, like:
>
> - Only ever using the first swap partition, regardless of space
> left.
But your solution would also only support *one* suspend partition,
right? (And patches for using more than one swap partition are
available for 2.4.X; I don't like them due to added complexity).
> - Not resetting swap signature if a resume failed.
Can be fixed in userland. Add an option -s to mkswap that fixes the
signature only if it was overwritten by suspend, and add mkswap -s
/swap/partition to your init scripts.
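A sketch of the check such a (proposed, not existing) mkswap option could
perform, working on an in-memory copy of the first page of the swap area.
The swap signature lives in the last 10 bytes of that page. The two
"SUSPENDED" marker strings below are placeholders invented for this example;
the markers the kernel really writes should be taken from kernel/suspend.c.

```c
#include <stddef.h>
#include <string.h>

#define SIG_LEN 10

/* Standard swap signatures, stored in the last 10 bytes of the first page. */
static const char SWAP_V1[] = "SWAP-SPACE";
static const char SWAP_V2[] = "SWAPSPACE2";

/* ASSUMPTION: placeholder markers standing in for what suspend writes. */
static const char SUSP_V1[] = "SUSPENDED1";
static const char SUSP_V2[] = "SUSPENDED2";

/*
 * Put the swap signature back only if a suspend marker is found.
 * Returns 1 if the signature was restored, 0 if there was nothing to do,
 * and -1 if the buffer is unusable.
 */
int restore_swap_signature(unsigned char *first_page, size_t page_size)
{
    unsigned char *sig;

    if (!first_page || page_size < SIG_LEN)
        return -1;

    sig = first_page + page_size - SIG_LEN;

    if (memcmp(sig, SUSP_V1, SIG_LEN) == 0) {
        memcpy(sig, SWAP_V1, SIG_LEN);
        return 1;
    }
    if (memcmp(sig, SUSP_V2, SIG_LEN) == 0) {
        memcpy(sig, SWAP_V2, SIG_LEN);
        return 1;
    }
    return 0;  /* ordinary swap, or not swap at all: leave it alone */
}
```

A real tool would read the page from the device, call something like this,
and write the page back only when the function returns 1.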
> - Almost complete lack of a recovery path if anything failed (i.e. trying
> to back out of what has happened, instead of calling BUG() or
> panic()).
Those BUGs / panics should be impossible to trigger. [And this has
nothing to do with fact we suspend-to-swap].
> - Function names like do_magic() and friends.
It is pretty magical operation, so you are at least warned. [And this has
nothing to do with fact we suspend-to-swap].
> These types of things don't instill any confidence in a user or other
> developer looking at the code. It gives the impression that the code is
> the result of blind guess work in the dark. After looking at the code, it
> was a shock to me that it worked at all.
>
> I understand that getting it to work involves dealing with the
> uncertainties. However, there is no reason to pass them on to other users.
> There were no comments as to what the do_magic*() functions did, let alone
> why they were 'magic', and there were 5 of them.
do_magic() replaces one kernel with another. That seems magical enough
to me. [It is in 5 functions so that the real hard part can be in
assembly].
> There are uncertainties still present in the code, like
>
> - #warning about waiting for data to reach the disk.
>
> - "Waiting for DMAs to settle down" delay on resume.
>
> I respect the paranoia. However, it's things like these that should be
> dealt with before anything else.
Feel free to fix them. [I believe both warning and waiting for DMA can
be safely killed, but...]
> The general problems that I see with the solution are:
>
> - It simply won't work if you're low on swap or memory.
Your solution will not work with too small suspend partition
too. Being low on memory... You'd have to have > 50% of your memory
allocated by kernel for swsusp to fail. I do not think it can be
sanely done other way. [Having separate disk drivers just for suspend
is *not* sane.]
> - It won't work if your swap is not persistent across reboots.
Your solution will not work if your suspend partition is not
persistent across reboots. AND WHAT?
> - It won't work if you don't use swap.
Your solution will not work if your suspend partition is not there.
> - It's dependent on the same exact kernel being loaded.
>
> It should only be dependent on the binary format of the written metadata.
...which leads to simpler design and few megabytes less transferred to
/ from disk. I do not think there's easy way to do it with different
kernels. State of devices before switching to new kernel is important...
> It also shouldn't be waiting until all the devices are probed and
> initialized, but that problem is out of your hands.
>
> Another problem I see in the future is initramfs, and when things start
> executing in there. It's currently unpacked by populate_rootfs() in
> init/main.c, long before software_resume() is called. Though it doesn't
> cause any explicit problems ATM, it does introduce more
> uncertainties.
Oops, I have not seen that one. Yep that may turn nasty in
future. software_resume() should really be done before userland starts.
> I don't want to cast the entire project in a negative light, though. It
> does work, and I'm fairly impressed by it.
Thanx.
> I've created a registration mechanism for PM 'drivers', and a way for
> users to select which driver they want to use for the different PM states.
> In the patch, swsusp is just another driver. It can coexist with ACPI or
> APM (theoretically) just fine, without requiring a kernel rebuild or
> reboot.
I believe it can coexist with ACPI and APM already just okay. You can
echo 4b to /proc/acpi/sleep to trigger S4bios.
> This also involves a generic framework for doing system-wide power
> management. In this, I've begun extracting bits from swsusp that are
> useful for any PM sequence. My goal is to reduce swsusp to just a small
> layer that writes/reads the saved pages from swap. The rest of the
> sequence, including memory and device handling, happens in generic
> code.
So you don't really want to create separate "suspend partition"? Good.
More sharing between S3 and S4 is certainly good (but I do not think
much more can be shared).
> > > http://ldm.bkbits.net:8080/linux-2.5-power
> > >
> >
> > Can you post cumulative diff of work-in-progress?
> > I am not permitted to use bk. Also please
> > make sure that you post the diff before
> > you merge it (and please Cc me).
>
> Sure. From the above link, you can view the individual patches. I would
> hope that you could use wget to snarf them, though I don't know if that's
> legally ok (nor do I want to know).
>
> The cumulative patch is here:
>
> http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
Thanks.
> If I get a chance in the next few days, I'll post incremental diffs.
> Without them, the gradual changes are not so obvious.
>
> I understand you may not want a rewrite of swsusp (regardless of how much
> cleaner the code is), and I respect that. I'm completely willing to leave
> kernel/suspend.c intact, and let you work in the integration into the
> generic PM model, and/or simply rename the new code something like
> swsusp2, swsusp-XP, or swsusp-pat. ;)
So you want to develop swsusp-pat that will suspend to partition,
allow another kernel version, and you think you can suspend when 90%
of your memory is kmalloc()-ed? Do you agree that separate disk
drivers for suspend is bad idea?
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Hi!
> http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
+static inline void suspend_restore_mem(void)
This has to be in assembly. You can't trust gcc not to move stack
pointer.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Hi!
> The cumulative patch is here:
>
> http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
Hmm, I am not sure if drivers/power is the right place for stuff like
fridge.c. That might be useful for other stuff, too.
I do not think placing swsusp.h in drivers/power/swsusp is right. It
should be in include/linux or include/linux/power.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
> > - It's dependent on the same exact kernel being loaded.
> >
> > It should only be dependent on the binary format of the written metadata.
>
> ...which leads to simpler design and few megabytes less transferred to
> / from disk. I do not think there's easy way to do it with different
> kernels. State of devices before switching to new kernel is important...
I don't think so.
IMHO, the "old" kernel (used for loading the suspend image) should
quiesce devices in a pretty "normal" way in the exact same way
kexec does (and using the same code path/driver notifiers). I see
no reason why there should be any kind of dependency between the
"loader" kernel and the "loaded" kernel in this regard.
In fact, I'm considering for PPC to just trash the "loader" kernel
when possible and directly load the suspend image from the bootloader
Ben.
Hi!
> > > - It's dependent on the same exact kernel being loaded.
> > >
> > > It should only be dependent on the binary format of the written metadata.
> >
> > ...which leads to simpler design and few megabytes less transferred to
> > / from disk. I do not think there's easy way to do it with different
> > kernels. State of devices before switching to new kernel is important...
>
> I don't think so.
>
> IMHO, the "old" kernel (used for loading the suspend image) should
> quiesce devices in a pretty "normal" way in the exact same way
> kexec does (and using the same code path/driver notifiers). I see
> no reason why there should be any kind of dependency between the
> "loader" kernel and the "loaded" kernel in this regard.
But if you add support for quiescing matrox in 2.6.5, you will not be
able to resume 2.6.5 from 2.6.4 kernel. And as bugs are going to be in
that area for a while I'd prefer people to suspend and resume with
same kernel.
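The two positions in this exchange can be put side by side in a small
sketch. The header layout, the field names and the policy enum are all
assumptions invented for illustration; neither swsusp nor the patch is
claimed to store exactly this.

```c
#include <string.h>

#define IMAGE_FORMAT_VERSION 1u

/* Hypothetical metadata written at the start of a suspend image. */
struct image_header {
    unsigned int format_version;   /* binary format of the metadata */
    char kernel_release[64];       /* release string of the suspending kernel */
    unsigned long nr_pages;        /* number of saved pages */
};

enum resume_policy {
    SAME_KERNEL_ONLY,   /* require the identical kernel release */
    SAME_FORMAT_ONLY    /* depend only on the metadata format */
};

/* Returns 1 if the running kernel may resume this image, 0 otherwise. */
int image_may_resume(const struct image_header *hdr,
                     const char *running_release,
                     enum resume_policy policy)
{
    if (!hdr || !running_release)
        return 0;

    /* Both policies need a metadata format the loader understands. */
    if (hdr->format_version != IMAGE_FORMAT_VERSION)
        return 0;

    if (policy == SAME_FORMAT_ONLY)
        return 1;

    /* Stricter policy: the release strings must match exactly. */
    return strncmp(hdr->kernel_release, running_release,
                   sizeof hdr->kernel_release) == 0;
}
```

Under SAME_KERNEL_ONLY, an image written by 2.6.5 is refused by a 2.6.4
loader, which is the conservative behaviour argued for above. Under
SAME_FORMAT_ONLY the same image is accepted as long as the format matches.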
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
> > The cumulative patch is here:
> >
> > http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
>
> Hmm, I am not sure if drivers/power is the right place for stuff like
> fridge.c. That might be useful for other stuff, too.
That's fine. If it proves useful for other things, we can move it.
> I do not think placing swsusp.h in drivers/power/swsusp is right. It
> should be in include/linux or include/linux/power.
That header is only for the shared functions between
drivers/power/swsusp/*.c. There's no need to export it to everyone.
Under the new model, nothing would call swsusp directly. It would call the
model's functions, which would delegate the call to the user-specified
handler for the action.
-pat
> But your solution would also only support *one* suspend partition,
> right? (And patches for using more than one swap partition are
> available for 2.4.X; I don't like them due to added complexity).
Having a dedicated partition has an advantage in just that - it's
dedicated to saving system state. Users must consciously create it, and
must make it as big as the size of memory they have (or will have). Plus,
it's not tied to the amount of memory being used when you suspend.
Swap space has a specific purpose, I see it as a detriment to overload its
intended usage. Of course, that's just my opinion, and I don't have code to
back it up.
> > - Not resetting swap signature if a resume failed.
>
> Can be fixed in userland. Add an option -s to mkswap that fixes the
> signature only if it was overwritten by suspend, and add mkswap -s
> /swap/partition to your init scripts.
That's wrong, IMO. If the kernel modifies it, it should reset it. You
shouldn't impose extra burden on the users because your code failed.
Besides, it's fixed anyway.
> > - Almost complete lack of a recovery path if anything failed (i.e. trying
> > to back out of what has happened, instead of calling BUG() or
> > panic()).
>
> Those BUGs / panics should be impossible to trigger. [And this has
> nothing to do with fact we suspend-to-swap].
[ I know these are not suspend-to-swap specific; sorry for implying that.]
If they're really impossible to trigger, then they shouldn't be there at
all. If you can recover from them, then you should, instead of giving up.
Besides, a lot of them were completely bogus things like
BUG_ON(sizeof(foo) != sizeof(bar))
Which are known at compile time, but were buried in the code to read/write
the data, and only convoluted the code even more.
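For comparison, a size check of that kind can be left to the compiler. The
sketch below uses C11 static_assert in ordinary userspace C; the two struct
names are made up, and this is not what the patch itself does.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Hypothetical on-disk and in-memory layouts; names are illustrative. */
struct disk_entry { unsigned long address; unsigned long swap_offset; };
struct mem_entry  { unsigned long address; unsigned long swap_offset; };

/* Evaluated at compile time; nothing is added to the read/write path. */
static_assert(sizeof(struct disk_entry) == sizeof(struct mem_entry),
              "disk and memory entries must have the same size");

/* Size of one entry, usable by the code that reads or writes entries. */
size_t entry_size(void)
{
    return sizeof(struct disk_entry);
}
```

If the two layouts ever diverge, the build fails with the given message
instead of the machine hitting a BUG at suspend time.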
> > - Function names like do_magic() and friends.
>
> It is pretty magical operation, so you are at least warned. [And this has
> nothing to do with fact we suspend-to-swap].
IMO, warnings should be conveyed in comments, not in cryptic function
names. Besides, there is nothing magical about it, unless that sequence of
instructions actually does make your computer glow, levitate, or turn into
a mermaid. In which case, I would like to know where I can find one. ;)
Seriously, you described below what it does, which helps a lot more than
anything named 'magic'.
> > The general problems that I see with the solution are:
> >
> > - It simply won't work if you're low on swap or memory.
>
> Your solution will not work with too small suspend partition
> too. Being low on memory... You'd have to have > 50% of your memory
> allocated by kernel for swsusp to fail. I do not think it can be
> sanely done other way. [Having separate disk drivers just for suspend
> is *not* sane.]
>
> > - It won't work if your swap is not persistent across reboots.
>
> Your solution will not work if your suspend partition is not
> persistent across reboots. AND WHAT?
>
> > - It won't work if you don't use swap.
>
> Your solution will not work if your suspend partition is not there.
I didn't mean to sound like a hypocrite, I apologize. The advantage of
using a dedicated partition over swap is that in order to create the
partition, the user must make a conscious decision to do so.
There are parameters that can be enforced when making the partition, like
the size and its existence on a persistent medium. These can be enforced
by a user making a swap partition, but it places extra burden on the user.
> > I've created a registration mechanism for PM 'drivers', and a way for
> > users to select which driver they want to use for the different PM states.
> > In the patch, swsusp is just another driver. It can coexist with ACPI or
> > APM (theoretically) just fine, without requiring a kernel rebuild or
> > reboot.
>
> I believe it can coexist with ACPI and APM already just okay. You can
> echo 4b to /proc/acpi/sleep to trigger S4bios.
>
> > This also involves a generic framework for doing system-wide power
> > management. In this, I've begun extracting bits from swsusp that are
> > useful for any PM sequence. My goal is to reduce swsusp to just a small
> > layer that writes/reads the saved pages from swap. The rest of the
> > sequence, including memory and device handling, happens in generic
> > code.
>
> So you don't really want to create separate "suspend partition"? Good.
Sorry, the patch included a few distinct things, and I should have made it
a bit more clear. It includes:
- A generic PM framework which PM drivers can register with.
Users can specify which handler they wish to use for different states,
based on their preference or the capabilities of their systems.
They can also use one mechanism for entering power states:
/sys/power/power_state, instead of relying on different mechanisms for
different PM drivers (/proc/acpi/sleep vs. apm(1) vs. sys_reboot()).
- Generic sequence for entering sleep states, in drivers/power/main.c
- Clean up of swsusp.
- Conversion of swsusp and ACPI to register with the PM model.
- Extraction of swsusp-specific features into the generic PM framework, so
they can be shared with everyone.
In the long run, I'd like to develop a solution using a dedicated
partition. But, that wouldn't necessarily obviate the use of swsusp. It
would coexist alongside it.
> > I understand you may not want a rewrite of swsusp (regardless of how much
> > cleaner the code is), and I respect that. I'm completely willing to leave
> > kernel/suspend.c intact, and let you work in the integration into the
> > generic PM model, and/or simply rename the new code something like
> > swsusp2, swsusp-XP, or swsusp-pat. ;)
>
> So you want to develop swsusp-pat that will suspend to partition,
> allow another kernel version, and you think you can suspend when 90%
> of your memory is kmalloc()-ed? Do you agree that separate disk
> drivers for suspend is bad idea?
Yes.
-pat
Hi!
> > > The cumulative patch is here:
> > >
> > > http://kernel.org/pub/linux/kernel/people/mochel/power/pm-2.5.64.diff.gz
> >
> > Hmm, I am not sure if drivers/power is the right place for stuff like
> > fridge.c. That might be useful for other stuff, too.
>
> That's fine. If it proves useful for other things, we can move it.
Actually, I'd like driver model to specify that things are
refrigerated when device_suspend() and friends are being run. That
should make drivers a lot simpler. [And as non-bitkeeper-capable user
I fear moves ;-)]
> > I do not think placing swsusp.h in drivers/power/swsusp is right. It
> > should be in include/linux or include/linux/power.
>
> That header is only for the shared functions between
> drivers/power/swsusp/*.c. There's no need to export it to everyone.
Well, last time acpi introduced its private include/ directory, it was
a disaster.
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
> > > Hmm, I am not sure if drivers/power is the right place for stuff like
> > > fridge.c. That might be useful for other stuff, too.
> >
> > That's fine. If it proves useful for other things, we can move it.
>
> Actually, I'd like driver model to specify that things are
> refrigerated when device_suspend() and friends are being run. That
> should make drivers a lot simpler. [And as non-bitkeeper-capable user
> I fear moves ;-)]
That's a policy decision outside of the scope of the driver model. It is,
however, inside the scope of the PM model, and by using the generic
framework, this decision can be guaranteed to be made.
> > > I do not think placing swsusp.h in drivers/power/swsusp is right. It
> > > should be in include/linux or include/linux/power.
> >
> > That header is only for the shared functions between
> > drivers/power/swsusp/*.c. There's no need to export it to everyone.
>
> Well, last time acpi introduced its private include/ directory, it was
> a disaster.
I don't necessarily agree. IMO, putting things in include/whatever/ makes
it easy for other code to directly access those functions, some of which
you never want people calling directly. And, if it's there, it's likely
someone will use it someday.
But, in the end it's your code, so I don't really care.
-pat
Hi!
> > But your solution would also only support *one* suspend partition,
> > right? (And patches for using more than one swap partition are
> > available for 2.4.X; I don't like them due to added complexity).
>
> Having a dedicated partition has an advantage in just that - it's
> dedicated to saving system state. Users must consciously create it, and
> must make it as big as the size of memory they have (or will have). Plus,
> it's not tied to the amount of memory being used when you suspend.
That's a problem. Users do not have suspend partitions, but they do
have swap partition. And repartitioning existing installation is very
painful. OTOH it is true that if we want
"emergency-suspend-to-disk-when-battery-low", dedicated partition
makes some sense.... ... Well. You can always do swapoff, swapon,
swsusp. Maybe some processes will die, but that's life ;-).
[But for that you'd have to guarantee that suspend always works, which
is hard, anyway.]
> Swap space has a specific purpose, I see it as a detriment to overload its
> intended usage. Of course, that's just my opinion, and I don't have code to
> back it up.
Well, I see it as advantage because I have swap space anyway (rarely
really used), so why not reuse it for swsusp?
> > It is pretty magical operation, so you are at least warned. [And this has
> > nothing to do with fact we suspend-to-swap].
>
> IMO, warnings should be conveyed in comments, not in cryptic function
> names. Besides, there is nothing magical about it, unless that sequence of
> instructions actually does make your computer glow, levitate, or turn into
> a mermaid. In which case, I would like to know where I can find one. ;)
:-). Well, comments were getting out of date because code was in
permanent flux. It makes sense to comment it now.
> > Your solution will not work if your suspend partition is not there.
>
> I didn't mean to sound like a hypocrite, I apologize. The advantage of
> using a dedicated partition over swap is that in order to create the
> partition, the user must make a conscious decision to do so.
>
> There are parameters that can be enforced when making the partition, like
> the size and its existence on a persistent medium. These can be enforced
> by a user making a swap partition, but it places extra burden on the user.
Well, IMO checklist like:
if you want to use swsusp you have to
a) check swap is on persistent medium
b) make sure swap is at least as big as memory/2 [not really
necessary, we might be lucky and swsusp with 30MB of swap...]
is easier for the user than repartitioning their harddrives. [I'd like
to see someone running swap on floppy ;-)]
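Item b) of that checklist is simple arithmetic. A minimal sketch, assuming
the caller already knows the RAM and swap sizes in bytes (the function name
is made up, and as noted above the rule is a rule of thumb, not a hard
requirement):

```c
#include <stdint.h>

/*
 * Rule of thumb from the checklist: swap should be at least half of RAM.
 * The half is rounded up so that odd sizes are not underestimated.
 * Returns 1 if the rule is met, 0 otherwise.
 */
int swap_large_enough(uint64_t ram_bytes, uint64_t swap_bytes)
{
    uint64_t needed = ram_bytes - ram_bytes / 2;  /* ceil(ram / 2) */

    return swap_bytes >= needed;
}
```

For a machine with 512MB of RAM this asks for 256MB of swap, so 30MB of swap
fails the check even though a suspend might still succeed if little memory
is in use.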
> > So you don't really want to create separate "suspend partition"? Good.
>
> Sorry, the patch included a few distinct things, and I should have made it
> a bit more clear. It includes:
>
> - A generic PM framework which PM drivers can register with.
>
> Users can specify which handler they wish to use for different states,
> based on their preference or the capabilities of their systems.
>
> They can also use one mechanism for entering power states:
> /sys/power/power_state, instead of relying on different mechanisms for
> different PM drivers (/proc/acpi/sleep vs. apm(1)
> vs. sys_reboot()).
I believe sys_reboot() is the right way to do
that. /sys/power/... needs sysfs mounted etc. /proc/acpi/sleep just
happened to already be there and be very convenient.
> In the long run, I'd like to develop a solution using a dedicated
> partition. But, that wouldn't necessarily obviate the use of swsusp. It
> would coexist alongside it.
Actually "dedicated partition" vs. "swap partition" is quite a small
detail. It only affects disk allocation routines. Basic stuff like
"atomic copy" stays the same...
> > > I understand you may not want a rewrite of swsusp (regardless of how much
> > > cleaner the code is), and I respect that. I'm completely willing to leave
> > > kernel/suspend.c intact, and let you work in the integration into the
> > > generic PM model, and/or simply rename the new code something like
> > > swsusp2, swsusp-XP, or swsusp-pat. ;)
> >
> > So you want to develop swsusp-pat that will suspend to partition,
> > allow another kernel version, and you think you can suspend when 90%
> > of your memory is kmalloc()-ed? Do you agree that separate disk
> > drivers for suspend is bad idea?
>
> Yes.
Do you think you can suspend with 90% memory kmalloc()-ed?
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
> Do you think you can suspend with 90% memory kmalloc()-ed?
Dunno. I need to iron some other details before I get to play with this..
-pat
Hi.
On Tue, 2003-03-11 at 08:23, Pavel Machek wrote:
> Do you think you can suspend with 90% memory kmalloc()-ed?
Is that a fair question? Would 90% of memory ever be kmalloced? If the
question is can you suspend with 90% of memory used, then I can answer
yes. I do it all the time under the code I'm porting to 2.5. (Nearly
there, by the way).
Regards,
Nigel
Hi!
> > Do you think you can suspend with 90% memory kmalloc()-ed?
>
> Is that a fair question? Would 90% of memory ever be kmalloced? If the
> question is can you suspend with 90% of memory used, then I can answer
> yes. I do it all the time under the code I'm porting to 2.5. (Nearly
> there, by the way).
No, it was not a fair question, not at all. If he'd replied with yes,
I'd tell him I don't believe that ;-).
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.