Here is a kernel patch to support multithreaded coredumps being worked on
by Mark Gross (Intel, [email protected]) and
Vamsi Krishna (IBM, [email protected]).
Multi-threaded core dump patch for 2.4.17:
- multithreaded coredump functionality is enabled by a new sysctl
core_dumps_threads. (0 = off, 1 = on).
- Core dump is started by the first thread which gets the signal
- Threads are located by walking the entire task list looking for tasks
with matching mm as that seems to be the only reliable way to locate
other threads of a given task. In fact, IMO this is the only way
until all user space libraries migrate to using thread groups
provided by the Linux kernel (CLONE_THREAD).
- Other threads are prevented from executing while core dump is in
progress to improve the accuracy of the dumps. This is done without
changing the state of the task. We set cpus_allowed in task struct
to be 0 to stop a task from being scheduled and reset it to -1 to
resume execution. This has the advantage of not depending on user
space at all for correct functioning. IMO sending SIGSTOP to stop
other threads does not work if the process is being run under a
debugger. The only possible issue with using cpus_allowed is that
we could lose task affinities once a core dump is taken. However,
this is not a big deal as the task is going to die anyway fairly
soon, which is why the dump was taken in the first place.
- Support of SSE registers in the core dump
- Code cleanups/reorg - breakup into smaller functions. Main function
elf_core_dump() reorganized/cleaned up by moving filling up of elfhdr,
prstatus, psinfo and notes to separate functions to make this very
long function a little bit more readable and to reuse some code with
the function dumping status of other threads.
- Easy to port to other architectures. It just needs
ELF_CORE_COPY_TASK_REGS - to copy task specific registers
ELF_CORE_COPY_FPREGS - to copy floating point registers and
ELF_CORE_COPY_XFPREGS - to copy extended fp registers (SSE) if present
ELF_CORE_SYNC - to sync up fpu status of other processors in SMP
systems if needed by a particular architecture.
Read the patch for more details.
- We started with the tcore patches by John Jones and Jason Villarreal
as base which were then heavily reworked. This patch is entirely
different from theirs in pretty much all aspects.
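The thread lookup and the cpus_allowed trick described above can be sketched in user space as follows. This is only a toy model under stated assumptions, not the patch's code: toy_mm, toy_task and the toy_* functions are hypothetical stand-ins for mm_struct, task_struct, can_schedule(), suspend_other_threads() and resume_other_threads(), and all locking and IPI handling is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct toy_mm {
    int id;
};

struct toy_task {
    struct toy_mm *mm;           /* address space, shared by all threads */
    unsigned long cpus_allowed;  /* 0 means "may not run on any CPU" */
};

/* Loosely mirrors the scheduler check: an empty mask is never picked. */
static int toy_can_schedule(const struct toy_task *t)
{
    return t->cpus_allowed != 0UL;
}

/*
 * Walk the whole task array and park every task that shares self's mm,
 * except self. Returns the number of tasks parked.
 */
static size_t toy_suspend_other_threads(struct toy_task *tasks, size_t n,
                                        const struct toy_task *self)
{
    size_t i, parked = 0;

    assert(tasks != NULL && self != NULL);
    for (i = 0; i < n; i++) {
        if (tasks[i].mm == self->mm && &tasks[i] != self) {
            tasks[i].cpus_allowed = 0UL;
            parked++;
        }
    }
    return parked;
}

/* Give the parked tasks a usable mask again once the dump is done. */
static void toy_resume_other_threads(struct toy_task *tasks, size_t n,
                                     const struct toy_task *self,
                                     unsigned long mask)
{
    size_t i;

    assert(tasks != NULL && self != NULL);
    for (i = 0; i < n; i++) {
        if (tasks[i].mm == self->mm && &tasks[i] != self)
            tasks[i].cpus_allowed = mask;
    }
}
```

Note that the task state is never touched in this model, only the mask, which is the point of the approach: nothing in user space has to cooperate for the other threads to stay off the CPUs.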
Current TODO list:
- Maybe remove reschedule_other_cpus in suspend_other_threads and do
this as part of ELF_CORE_SYNC. Rescheduling the other CPUs the way
the current patch works may be overkill for accurate core files.
Any thoughts?
- Port to 2.5.x, especially the logic to stop other threads from executing
while dumping is in progress.
- Make the loop looking for other threads a little shorter by counting
the number of tasks found and breaking out of for_each_task loop
when it is equal to current->mm->mm_users.
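The last TODO item could look roughly like the sketch below. It is a user-space toy, assuming hypothetical toy_mm and toy_task types and a plain array in place of for_each_task. One caveat that is an assumption on my part: in the real kernel mm_users can also be raised by temporary references that are not threads, in which case the loop would simply run to the end as it does today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct toy_mm {
    int mm_users;   /* number of tasks sharing this mm */
};

struct toy_task {
    struct toy_mm *mm;
};

/*
 * Count the tasks sharing self's mm (self included) and stop walking
 * as soon as mm_users of them have been seen. The number of array
 * entries actually visited is stored in *visited.
 */
static int toy_find_threads(const struct toy_task *tasks, size_t n,
                            const struct toy_task *self, size_t *visited)
{
    int found = 0;
    size_t i;

    assert(tasks != NULL && self != NULL && visited != NULL);
    for (i = 0; i < n; i++) {
        if (tasks[i].mm == self->mm)
            found++;
        if (found == self->mm->mm_users) {
            i++;        /* count the entry we just looked at */
            break;
        }
    }
    *visited = i;
    return found;
}
```

With two threads sharing an mm at positions 0 and 2 of a five-entry list, the walk stops after three entries instead of five.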
Some usage notes on this patch:
GDB 5.1 works with the core files produced, but only for Red Hat 7.2, and
only if the /lib/i686/libpthread.so library is hidden. It turns out that
on IA32 Red Hat there are two libpthread.so files. If
/lib/i686/libpthread.so is loaded then the gdb post mortem debug will not
work. We don't understand what's going on here, but it's real. Hide
/lib/i686/libpthread.so so that /lib/libpthread.so gets loaded at
debug time, and then the debugger will work with the core file. Any insights
into this are very much welcome. This behavior is very mysterious to us.
Thanks to Bharata B Rao (IBM) for helping with capturing FPU registers and
testing and Suparna Bhattacharya (IBM) for design discussions.
Thanks to Tony Luck (Intel) and Jun Nakajima (Intel) for helping with the
review and design of the suspend_other_threads implementation.
This is currently i386 only; it has been unit tested on 1P-P4, 2P-P4,
and 4P-PIII systems. I haven't seen any failures so far, YMMV.
The patch is against kernel version 2.4.17. We will port this to latest
versions of the kernel if there is any interest.
Regards.. Vamsi.
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5044959
Internet: [email protected]
-- patch here--
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/arch/i386/kernel/i387.c 2417-tcore/arch/i386/kernel/i387.c
--- /usr/src/2417-pure/arch/i386/kernel/i387.c Fri Feb 23 23:39:08 2001
+++ 2417-tcore/arch/i386/kernel/i387.c Fri Mar 15 11:52:28 2002
@@ -520,3 +520,42 @@
return fpvalid;
}
+
+int dump_task_fpu( struct task_struct *tsk, struct user_i387_struct *fpu )
+{
+ int fpvalid;
+
+ fpvalid = tsk->used_math;
+ if ( fpvalid ) {
+ if (tsk == current) unlazy_fpu( tsk );
+ if ( cpu_has_fxsr ) {
+ copy_fpu_fxsave( tsk, fpu );
+ } else {
+ copy_fpu_fsave( tsk, fpu );
+ }
+ }
+
+ return fpvalid;
+}
+
+int dump_task_extended_fpu( struct task_struct *tsk, struct user_fxsr_struct *fpu )
+{
+ int fpvalid;
+
+ fpvalid = tsk->used_math && cpu_has_fxsr;
+ if ( fpvalid ) {
+ if (tsk == current) unlazy_fpu( tsk );
+ memcpy( fpu, &tsk->thread.i387.fxsave,
+ sizeof(struct user_fxsr_struct) );
+ }
+
+ return fpvalid;
+}
+
+#ifdef CONFIG_SMP
+void dump_smp_unlazy_fpu(void)
+{
+ unlazy_fpu(current);
+ return;
+}
+#endif
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/arch/i386/kernel/process.c 2417-tcore/arch/i386/kernel/process.c
--- /usr/src/2417-pure/arch/i386/kernel/process.c Fri Oct 5 07:12:54 2001
+++ 2417-tcore/arch/i386/kernel/process.c Fri Mar 15 11:52:28 2002
@@ -642,6 +642,19 @@
dump->u_fpvalid = dump_fpu (regs, &dump->i387);
}
+/*
+ * Capture the user space registers if the task is not running (in user space)
+ */
+int dump_task_regs(struct task_struct *tsk, struct pt_regs *regs)
+{
+ *regs = *(struct pt_regs *)((unsigned long)tsk + THREAD_SIZE - sizeof(struct pt_regs));
+ regs->xcs &= 0xffff;
+ regs->xds &= 0xffff;
+ regs->xes &= 0xffff;
+ regs->xss &= 0xffff;
+ return 1;
+}
+
/*
* This special macro can be used to load a debugging register
*/
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/fs/binfmt_elf.c 2417-tcore/fs/binfmt_elf.c
--- /usr/src/2417-pure/fs/binfmt_elf.c Fri Dec 21 23:11:55 2001
+++ 2417-tcore/fs/binfmt_elf.c Fri Mar 15 11:54:49 2002
@@ -31,6 +31,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
+#include <linux/smp.h>
#include <linux/compiler.h>
#include <linux/highmem.h>
@@ -960,7 +961,7 @@
/* #define DEBUG */
#ifdef DEBUG
-static void dump_regs(const char *str, elf_greg_t *r)
+static void dump_regs(const char *str, elf_gregset_t *r)
{
int i;
static const char *regs[] = { "ebx", "ecx", "edx", "esi", "edi", "ebp",
@@ -1008,6 +1009,255 @@
#define DUMP_SEEK(off) \
if (!dump_seek(file, (off))) \
goto end_coredump;
+
+static inline void fill_elf_header(struct elfhdr *elf, int segs)
+{
+ memcpy(elf->e_ident, ELFMAG, SELFMAG);
+ elf->e_ident[EI_CLASS] = ELF_CLASS;
+ elf->e_ident[EI_DATA] = ELF_DATA;
+ elf->e_ident[EI_VERSION] = EV_CURRENT;
+ memset(elf->e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
+
+ elf->e_type = ET_CORE;
+ elf->e_machine = ELF_ARCH;
+ elf->e_version = EV_CURRENT;
+ elf->e_entry = 0;
+ elf->e_phoff = sizeof(struct elfhdr);
+ elf->e_shoff = 0;
+ elf->e_flags = 0;
+ elf->e_ehsize = sizeof(struct elfhdr);
+ elf->e_phentsize = sizeof(struct elf_phdr);
+ elf->e_phnum = segs;
+ elf->e_shentsize = 0;
+ elf->e_shnum = 0;
+ elf->e_shstrndx = 0;
+ return;
+}
+
+static inline void fill_elf_note_phdr(struct elf_phdr *phdr, int sz, off_t offset)
+{
+ phdr->p_type = PT_NOTE;
+ phdr->p_offset = offset;
+ phdr->p_vaddr = 0;
+ phdr->p_paddr = 0;
+ phdr->p_filesz = sz;
+ phdr->p_memsz = 0;
+ phdr->p_flags = 0;
+ phdr->p_align = 0;
+ return;
+}
+
+static inline void fill_note(struct memelfnote *note, const char *name, int type,
+ unsigned int sz, void *data)
+{
+ note->name = name;
+ note->type = type;
+ note->datasz = sz;
+ note->data = data;
+ return;
+}
+
+/*
+ * fill up all the fields in prstatus from the given task struct, except registers
+ * which need to be filled up separately.
+ */
+static inline void fill_prstatus(struct elf_prstatus *prstatus, struct task_struct *p, long signr)
+{
+ prstatus->pr_info.si_signo = prstatus->pr_cursig = signr;
+ prstatus->pr_sigpend = p->pending.signal.sig[0];
+ prstatus->pr_sighold = p->blocked.sig[0];
+ prstatus->pr_pid = p->pid;
+ prstatus->pr_ppid = p->p_pptr->pid;
+ prstatus->pr_pgrp = p->pgrp;
+ prstatus->pr_sid = p->session;
+ prstatus->pr_utime.tv_sec = CT_TO_SECS(p->times.tms_utime);
+ prstatus->pr_utime.tv_usec = CT_TO_USECS(p->times.tms_utime);
+ prstatus->pr_stime.tv_sec = CT_TO_SECS(p->times.tms_stime);
+ prstatus->pr_stime.tv_usec = CT_TO_USECS(p->times.tms_stime);
+ prstatus->pr_cutime.tv_sec = CT_TO_SECS(p->times.tms_cutime);
+ prstatus->pr_cutime.tv_usec = CT_TO_USECS(p->times.tms_cutime);
+ prstatus->pr_cstime.tv_sec = CT_TO_SECS(p->times.tms_cstime);
+ prstatus->pr_cstime.tv_usec = CT_TO_USECS(p->times.tms_cstime);
+ return;
+}
+
+static inline void fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p)
+{
+ int i;
+
+ psinfo->pr_pid = p->pid;
+ psinfo->pr_ppid = p->p_pptr->pid;
+ psinfo->pr_pgrp = p->pgrp;
+ psinfo->pr_sid = p->session;
+
+ i = p->state ? ffz(~p->state) + 1 : 0;
+ psinfo->pr_state = i;
+ psinfo->pr_sname = (i < 0 || i > 5) ? '.' : "RSDZTD"[i];
+ psinfo->pr_zomb = psinfo->pr_sname == 'Z';
+ psinfo->pr_nice = p->nice;
+ psinfo->pr_flag = p->flags;
+ psinfo->pr_uid = NEW_TO_OLD_UID(p->uid);
+ psinfo->pr_gid = NEW_TO_OLD_GID(p->gid);
+ strncpy(psinfo->pr_fname, p->comm, sizeof(psinfo->pr_fname));
+ return;
+}
+
+/*
+ * This is the variable that can be set in proc to determine if we want to
+ * dump a multithreaded core or not. A value of 1 means yes while any
+ * other value means no.
+ *
+ * It is located at /proc/sys/kernel/core_dumps_threads
+ */
+
+int core_dumps_threads = 0;
+
+/* Here is the structure in which status of each thread is captured. */
+struct elf_thread_status
+{
+ struct list_head list;
+ struct elf_prstatus prstatus; /* NT_PRSTATUS */
+ elf_fpregset_t fpu; /* NT_PRFPREG */
+ elf_fpxregset_t xfpu; /* NT_PRXFPREG */
+ struct memelfnote notes[3];
+ int num_notes;
+};
+
+#ifdef CONFIG_SMP
+/*
+ * trivial function used for SMP CPU synchronization.
+ * It doesn't do anything.
+ */
+void do_nothing(void *var)
+{
+ return;
+}
+#endif
+
+/*
+ * Suspend execution of other threads belonging to the same multithreaded process
+ * of current, ASAP.
+ *
+ * Sets the current->cpu_mask to the current cpu to avoid cpu migration during the dump.
+ * This cpu will also be the only cpu the other threads will be allowed to run after
+ * coredump is completed. This seems to be needed to fix some SMP races. This still
+ * needs some more thought though this solution works.
+ *
+ * TODO: Rethink the logic used to find other threads.
+ */
+static unsigned long suspend_other_threads(void)
+{
+ struct task_struct *p;
+
+ /*
+ * brute force method uses the runqueue_lock contention. Grab this lock, and
+ * force a schedule call on all the other CPU's to get them spinning.
+ */
+ read_lock(&tasklist_lock);
+ spin_lock(&runqueue_lock);
+
+ task_lock(current);
+ current->cpus_allowed = current->cpus_runnable; /* prevent cpu migration */
+ task_unlock(current);
+
+ reschedule_other_cpus();
+
+ for_each_task(p)
+ if (current->mm == p->mm && current != p) {
+ task_lock(p);
+ /*
+ * force yield and keep waking processes from getting scheduled
+ * in. The following will result in these processes getting swapped out and
+ * not swapped in by the scheduler if they have been sleeping.
+ */
+ p->cpus_allowed = 0UL;
+ task_unlock(p);
+ }
+
+ spin_unlock(&runqueue_lock);
+
+ /* let them all run again.. */
+ read_unlock(&tasklist_lock);
+
+ /*
+ * now we synchronize on all the CPU's to make sure
+ * none of the other thread processes are running in
+ * user space before we proceed.
+ *
+ * We have a race from the time the runqueue_lock is released and the
+ * time __switch_to gets called that can result in bogus FPU/XFPU register
+ * data in the core file, so we use ELF_CORE_SYNC with smp_call_function
+ * which on SMP evaluates to a call which grabs the FPU state.
+ */
+ smp_call_function(ELF_CORE_SYNC, NULL, 1,1);
+
+ return current->cpus_allowed;
+}
+
+/*
+ * resume execution of other threads on the cpu given the cpu_mask.
+ */
+static void resume_other_threads(unsigned long current_cpu_mask)
+{
+ struct task_struct *p;
+
+ if(current_cpu_mask != current->cpus_runnable)
+ printk(KERN_WARNING "tcore: multithread core dump CPU affinity assumption violated"); /* BUG would be too harsh */
+
+ read_lock(&tasklist_lock);
+ for_each_task(p)
+ if (current->mm == p->mm && current != p) {
+ task_lock(p);
+ p->cpus_allowed = current_cpu_mask;
+ task_unlock(p);
+ }
+ read_unlock(&tasklist_lock);
+
+ return;
+}
+
+/*
+ * In order to add the specific thread information for the elf file format,
+ * we need to keep a linked list of every thread's pr_status and then
+ * create a single section for them in the final core file.
+ */
+static int elf_dump_thread_status(long signr, struct task_struct * p, struct list_head * thread_list)
+{
+
+ struct elf_thread_status *t;
+ int sz = 0;
+
+ t = kmalloc(sizeof(*t), GFP_KERNEL);
+ if (!t) {
+ printk(KERN_WARNING "Cannot allocate memory for thread status.\n");
+ return 0;
+ }
+
+ INIT_LIST_HEAD(&t->list);
+ t->num_notes = 0;
+
+ fill_prstatus(&t->prstatus, p, signr);
+ elf_core_copy_task_regs(p, &t->prstatus.pr_reg);
+ fill_note(&t->notes[0], "CORE", NT_PRSTATUS, sizeof(t->prstatus), &(t->prstatus));
+ t->num_notes++;
+ sz += notesize(&t->notes[0]);
+
+ if ((t->prstatus.pr_fpvalid = elf_core_copy_task_fpregs(p, &t->fpu))) {
+ fill_note(&t->notes[1], "CORE", NT_PRFPREG, sizeof(t->fpu), &(t->fpu));
+ t->num_notes++;
+ sz += notesize(&t->notes[1]);
+ }
+
+ if (elf_core_copy_task_xfpregs(p, &t->xfpu)) {
+ fill_note(&t->notes[2], "LINUX", NT_PRXFPREG, sizeof(t->xfpu), &(t->xfpu));
+ t->num_notes++;
+ sz += notesize(&t->notes[2]);
+ }
+
+ list_add(&t->list, thread_list);
+ return sz;
+}
+
/*
* Actual dumper
*
@@ -1026,12 +1276,32 @@
struct elfhdr elf;
off_t offset = 0, dataoff;
unsigned long limit = current->rlim[RLIMIT_CORE].rlim_cur;
- int numnote = 4;
- struct memelfnote notes[4];
+ int numnote = 5;
+ struct memelfnote notes[5];
struct elf_prstatus prstatus; /* NT_PRSTATUS */
- elf_fpregset_t fpu; /* NT_PRFPREG */
struct elf_prpsinfo psinfo; /* NT_PRPSINFO */
+ struct task_struct *p;
+ LIST_HEAD(thread_list);
+ struct list_head *t;
+ unsigned long cpu_mask = 0xFFFFFFFF;
+ elf_fpregset_t fpu;
+ elf_fpxregset_t xfpu;
+ int dump_threads = 0;
+ int thread_status_size = 0;
+
+ /* now stop all vm operations */
+ down_write(&current->mm->mmap_sem);
+ segs = current->mm->map_count;
+
+ if (atomic_read(&current->mm->mm_users) != 1) {
+ dump_threads = core_dumps_threads;
+ }
+ /* First pause all related threaded processes */
+ if (dump_threads) {
+ cpu_mask = suspend_other_threads();
+ }
+
/* first copy the parameters from user space */
memset(&psinfo, 0, sizeof(psinfo));
{
@@ -1049,34 +1319,30 @@
}
- /* now stop all vm operations */
- down_write(&current->mm->mmap_sem);
- segs = current->mm->map_count;
+ if (dump_threads) {
+ /* capture the status of all other threads */
+ if (signr) {
+ read_lock(&tasklist_lock);
+ for_each_task(p)
+ if (current->mm == p->mm && current != p) {
+ int sz = elf_dump_thread_status(signr, p, &thread_list);
+ if (!sz) {
+ read_unlock(&tasklist_lock);
+ goto cleanup;
+ }
+ else
+ thread_status_size += sz;
+ }
+ read_unlock(&tasklist_lock);
+ }
+ } /* End if(dump_threads) */
#ifdef DEBUG
printk("elf_core_dump: %d segs %lu limit\n", segs, limit);
#endif
/* Set up header */
- memcpy(elf.e_ident, ELFMAG, SELFMAG);
- elf.e_ident[EI_CLASS] = ELF_CLASS;
- elf.e_ident[EI_DATA] = ELF_DATA;
- elf.e_ident[EI_VERSION] = EV_CURRENT;
- memset(elf.e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
-
- elf.e_type = ET_CORE;
- elf.e_machine = ELF_ARCH;
- elf.e_version = EV_CURRENT;
- elf.e_entry = 0;
- elf.e_phoff = sizeof(elf);
- elf.e_shoff = 0;
- elf.e_flags = 0;
- elf.e_ehsize = sizeof(elf);
- elf.e_phentsize = sizeof(struct elf_phdr);
- elf.e_phnum = segs+1; /* Include notes */
- elf.e_shentsize = 0;
- elf.e_shnum = 0;
- elf.e_shstrndx = 0;
+ fill_elf_header(&elf, segs+1); /* including notes section*/
fs = get_fs();
set_fs(KERNEL_DS);
@@ -1093,79 +1359,35 @@
* with info from their /proc.
*/
memset(&prstatus, 0, sizeof(prstatus));
-
- notes[0].name = "CORE";
- notes[0].type = NT_PRSTATUS;
- notes[0].datasz = sizeof(prstatus);
- notes[0].data = &prstatus;
- prstatus.pr_info.si_signo = prstatus.pr_cursig = signr;
- prstatus.pr_sigpend = current->pending.signal.sig[0];
- prstatus.pr_sighold = current->blocked.sig[0];
- psinfo.pr_pid = prstatus.pr_pid = current->pid;
- psinfo.pr_ppid = prstatus.pr_ppid = current->p_pptr->pid;
- psinfo.pr_pgrp = prstatus.pr_pgrp = current->pgrp;
- psinfo.pr_sid = prstatus.pr_sid = current->session;
- prstatus.pr_utime.tv_sec = CT_TO_SECS(current->times.tms_utime);
- prstatus.pr_utime.tv_usec = CT_TO_USECS(current->times.tms_utime);
- prstatus.pr_stime.tv_sec = CT_TO_SECS(current->times.tms_stime);
- prstatus.pr_stime.tv_usec = CT_TO_USECS(current->times.tms_stime);
- prstatus.pr_cutime.tv_sec = CT_TO_SECS(current->times.tms_cutime);
- prstatus.pr_cutime.tv_usec = CT_TO_USECS(current->times.tms_cutime);
- prstatus.pr_cstime.tv_sec = CT_TO_SECS(current->times.tms_cstime);
- prstatus.pr_cstime.tv_usec = CT_TO_USECS(current->times.tms_cstime);
+ fill_prstatus(&prstatus, current, signr);
+ fill_note(&notes[0], "CORE", NT_PRSTATUS, sizeof(prstatus), &prstatus);
/*
* This transfers the registers from regs into the standard
* coredump arrangement, whatever that is.
*/
-#ifdef ELF_CORE_COPY_REGS
- ELF_CORE_COPY_REGS(prstatus.pr_reg, regs)
-#else
- if (sizeof(elf_gregset_t) != sizeof(struct pt_regs))
- {
- printk("sizeof(elf_gregset_t) (%ld) != sizeof(struct pt_regs) (%ld)\n",
- (long)sizeof(elf_gregset_t), (long)sizeof(struct pt_regs));
- }
- else
- *(struct pt_regs *)&prstatus.pr_reg = *regs;
-#endif
+ elf_core_copy_regs(&prstatus.pr_reg, regs);
#ifdef DEBUG
dump_regs("Passed in regs", (elf_greg_t *)regs);
dump_regs("prstatus regs", (elf_greg_t *)&prstatus.pr_reg);
#endif
- notes[1].name = "CORE";
- notes[1].type = NT_PRPSINFO;
- notes[1].datasz = sizeof(psinfo);
- notes[1].data = &psinfo;
- i = current->state ? ffz(~current->state) + 1 : 0;
- psinfo.pr_state = i;
- psinfo.pr_sname = (i < 0 || i > 5) ? '.' : "RSDZTD"[i];
- psinfo.pr_zomb = psinfo.pr_sname == 'Z';
- psinfo.pr_nice = current->nice;
- psinfo.pr_flag = current->flags;
- psinfo.pr_uid = NEW_TO_OLD_UID(current->uid);
- psinfo.pr_gid = NEW_TO_OLD_GID(current->gid);
- strncpy(psinfo.pr_fname, current->comm, sizeof(psinfo.pr_fname));
-
- notes[2].name = "CORE";
- notes[2].type = NT_TASKSTRUCT;
- notes[2].datasz = sizeof(*current);
- notes[2].data = current;
+ fill_psinfo(&psinfo, current);
+ fill_note(&notes[1], "CORE", NT_PRPSINFO, sizeof(psinfo), &psinfo);
+
+ fill_note(&notes[2], "CORE", NT_TASKSTRUCT, sizeof(*current), current);
/* Try to dump the FPU. */
- prstatus.pr_fpvalid = dump_fpu (regs, &fpu);
- if (!prstatus.pr_fpvalid)
- {
- numnote--;
- }
- else
- {
- notes[3].name = "CORE";
- notes[3].type = NT_PRFPREG;
- notes[3].datasz = sizeof(fpu);
- notes[3].data = &fpu;
+ if ((prstatus.pr_fpvalid = elf_core_copy_task_fpregs(current, &fpu))) {
+ fill_note(&notes[3], "CORE", NT_PRFPREG, sizeof(fpu), &fpu);
+ } else {
+ --numnote;
+ }
+ if (elf_core_copy_task_xfpregs(current, &xfpu)) {
+ fill_note(&notes[4], "LINUX", NT_PRXFPREG, sizeof(xfpu), &xfpu);
+ } else {
+ --numnote;
}
/* Write notes phdr entry */
@@ -1175,17 +1397,12 @@
for(i = 0; i < numnote; i++)
sz += notesize(&notes[i]);
+
+ if (dump_threads)
+ sz += thread_status_size;
- phdr.p_type = PT_NOTE;
- phdr.p_offset = offset;
- phdr.p_vaddr = 0;
- phdr.p_paddr = 0;
- phdr.p_filesz = sz;
- phdr.p_memsz = 0;
- phdr.p_flags = 0;
- phdr.p_align = 0;
-
- offset += phdr.p_filesz;
+ fill_elf_note_phdr(&phdr, sz, offset);
+ offset += sz;
DUMP_WRITE(&phdr, sizeof(phdr));
}
@@ -1214,10 +1431,21 @@
DUMP_WRITE(&phdr, sizeof(phdr));
}
+ /* write out the notes section */
for(i = 0; i < numnote; i++)
if (!writenote(&notes[i], file))
goto end_coredump;
+ /* write out the thread status notes section */
+ if (dump_threads) {
+ list_for_each(t, &thread_list) {
+ struct elf_thread_status *tmp = list_entry(t, struct elf_thread_status, list);
+ for (i = 0; i < tmp->num_notes; i++)
+ if (!writenote(&tmp->notes[i], file))
+ goto end_coredump;
+ }
+ }
+
DUMP_SEEK(dataoff);
for(vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
@@ -1259,8 +1487,20 @@
(off_t) file->f_pos, offset);
}
- end_coredump:
+end_coredump:
set_fs(fs);
+
+cleanup:
+ if (dump_threads) {
+ while(!list_empty(&thread_list)) {
+ struct list_head *tmp = thread_list.next;
+ list_del(tmp);
+ kfree(list_entry(tmp, struct elf_thread_status, list));
+ }
+
+ resume_other_threads(cpu_mask);
+ }
+
up_write(&current->mm->mmap_sem);
return has_dumped;
}
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/include/asm-i386/elf.h 2417-tcore/include/asm-i386/elf.h
--- /usr/src/2417-pure/include/asm-i386/elf.h Fri Nov 23 01:18:29 2001
+++ 2417-tcore/include/asm-i386/elf.h Fri Mar 15 11:52:28 2002
@@ -99,6 +99,18 @@
#ifdef __KERNEL__
#define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)
+
+extern int dump_task_regs (struct task_struct *, struct pt_regs *);
+extern int dump_task_fpu (struct task_struct *, struct user_i387_struct *);
+extern int dump_task_extended_fpu (struct task_struct *, struct user_fxsr_struct *);
+
+#define ELF_CORE_COPY_TASK_REGS(tsk, pt_regs) dump_task_regs(tsk, pt_regs)
+#define ELF_CORE_COPY_FPREGS(tsk, elf_fpregs) dump_task_fpu(tsk, elf_fpregs)
+#define ELF_CORE_COPY_XFPREGS(tsk, elf_xfpregs) dump_task_extended_fpu(tsk, elf_xfpregs)
+#ifdef CONFIG_SMP
+extern void dump_smp_unlazy_fpu(void);
+#define ELF_CORE_SYNC dump_smp_unlazy_fpu
#endif
+#endif /* __KERNEL__ */
#endif
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/include/linux/elf.h 2417-tcore/include/linux/elf.h
--- /usr/src/2417-pure/include/linux/elf.h Fri Nov 23 01:18:29 2001
+++ 2417-tcore/include/linux/elf.h Fri Mar 15 11:52:28 2002
@@ -576,6 +576,8 @@
#define NT_PRPSINFO 3
#define NT_TASKSTRUCT 4
#define NT_PRFPXREG 20
+#define NT_PRXFPREG 0x46e62b7f /* note name must be "LINUX" as per GDB */
+ /* from gdb5.1/include/elf/common.h */
/* Note header in a PT_NOTE section */
typedef struct elf32_note {
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/include/linux/elfcore.h 2417-tcore/include/linux/elfcore.h
--- /usr/src/2417-pure/include/linux/elfcore.h Fri Nov 23 01:19:02 2001
+++ 2417-tcore/include/linux/elfcore.h Fri Mar 15 11:52:28 2002
@@ -86,4 +86,56 @@
#define PRARGSZ ELF_PRARGSZ
#endif
+#ifdef __KERNEL__
+static inline void elf_core_copy_regs(elf_gregset_t *elfregs, struct pt_regs *regs)
+{
+#ifdef ELF_CORE_COPY_REGS
+ ELF_CORE_COPY_REGS((*elfregs), regs)
+#else
+ if (sizeof(elf_gregset_t) != sizeof(struct pt_regs)) {
+ printk("sizeof(elf_gregset_t) (%ld) != sizeof(struct pt_regs) (%ld)\n",
+ (long)sizeof(elf_gregset_t), (long)sizeof(struct pt_regs));
+ } else
+ *(struct pt_regs *)elfregs = *regs;
+#endif
+}
+
+static inline int elf_core_copy_task_regs(struct task_struct *t, elf_gregset_t *elfregs)
+{
+ struct pt_regs regs;
+#ifdef ELF_CORE_COPY_TASK_REGS
+ if (ELF_CORE_COPY_TASK_REGS(t, &regs)) {
+ elf_core_copy_regs(elfregs, &regs);
+ return 1;
+ }
+#endif
+ return 0;
+}
+
+static inline int elf_core_copy_task_fpregs(struct task_struct *t, elf_fpregset_t *fpu)
+{
+#ifdef ELF_CORE_COPY_FPREGS
+ return ELF_CORE_COPY_FPREGS(t, fpu);
+#else
+ return dump_fpu(NULL, fpu);
+#endif
+}
+
+static inline int elf_core_copy_task_xfpregs(struct task_struct *t, elf_fpxregset_t *xfpu)
+{
+#ifdef ELF_CORE_COPY_XFPREGS
+ return ELF_CORE_COPY_XFPREGS(t, xfpu);
+#else
+ return 0;
+#endif
+}
+
+#ifdef CONFIG_SMP
+#ifndef ELF_CORE_SYNC
+#define ELF_CORE_SYNC do_nothing
+#endif
+#endif
+
+#endif /* __KERNEL__ */
+
#endif /* _LINUX_ELFCORE_H */
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/include/linux/sched.h 2417-tcore/include/linux/sched.h
--- /usr/src/2417-pure/include/linux/sched.h Fri Dec 21 23:12:03 2001
+++ 2417-tcore/include/linux/sched.h Fri Mar 15 11:52:28 2002
@@ -160,6 +160,10 @@
extern int start_context_thread(void);
extern int current_is_keventd(void);
+extern void reschedule_other_cpus(void);
+// forces all cpu's other than current to reschedule. Needed for accurate core dumps.
+
+
/*
* The default fd array needs to be at least BITS_PER_LONG,
* as this is the granularity returned by copy_fdset().
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/include/linux/sysctl.h 2417-tcore/include/linux/sysctl.h
--- /usr/src/2417-pure/include/linux/sysctl.h Mon Nov 26 18:59:17 2001
+++ 2417-tcore/include/linux/sysctl.h Fri Mar 15 11:52:28 2002
@@ -87,6 +87,7 @@
KERN_CAP_BSET=14, /* int: capability bounding set */
KERN_PANIC=15, /* int: panic timeout */
KERN_REALROOTDEV=16, /* real root device to mount after initrd */
+ KERN_CORE_DUMPS_THREADS=17, /* int: include status of other threads in dump */
KERN_SPARC_REBOOT=21, /* reboot command on Sparc */
KERN_CTLALTDEL=22, /* int: allow ctl-alt-del to reboot */
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/kernel/sched.c 2417-tcore/kernel/sched.c
--- /usr/src/2417-pure/kernel/sched.c Fri Dec 21 23:12:04 2001
+++ 2417-tcore/kernel/sched.c Fri Mar 15 11:52:28 2002
@@ -121,7 +121,7 @@
#else
#define idle_task(cpu) (&init_task)
-#define can_schedule(p,cpu) (1)
+#define can_schedule(p,cpu) ((p)->cpus_allowed)
#endif
@@ -704,6 +704,28 @@
return;
}
+/*
+ * needed for accurate core dumps of multi-threaded applications.
+ * see binfmt_elf.c for more information.
+ */
+void reschedule_other_cpus(void)
+{
+#ifdef CONFIG_SMP
+ int i, cpu;
+ struct task_struct *p;
+
+ for(i=0; i< smp_num_cpus; i++) {
+ cpu = cpu_logical_map(i);
+ p = cpu_curr(cpu);
+ if (p->processor != smp_processor_id()) {
+ p->need_resched = 1;
+ smp_send_reschedule(p->processor);
+ }
+ }
+#endif
+ return;
+}
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just wake everything
* up. If it's an exclusive wakeup (nr_exclusive == small +ve number) then we wake all the
diff -urN -X /home/vamsi/dontdiff /usr/src/2417-pure/kernel/sysctl.c 2417-tcore/kernel/sysctl.c
--- /usr/src/2417-pure/kernel/sysctl.c Fri Dec 21 23:12:04 2001
+++ 2417-tcore/kernel/sysctl.c Fri Mar 15 11:52:28 2002
@@ -49,6 +49,7 @@
extern int max_queued_signals;
extern int sysrq_enabled;
extern int core_uses_pid;
+extern int core_dumps_threads;
extern int cad_pid;
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
@@ -169,6 +170,8 @@
0644, NULL, &proc_doutsstring, &sysctl_string},
{KERN_PANIC, "panic", &panic_timeout, sizeof(int),
0644, NULL, &proc_dointvec},
+ {KERN_CORE_DUMPS_THREADS, "core_dumps_threads", &core_dumps_threads, sizeof(int),
+ 0644, NULL, &proc_dointvec},
{KERN_CORE_USES_PID, "core_uses_pid", &core_uses_pid, sizeof(int),
0644, NULL, &proc_dointvec},
{KERN_TAINTED, "tainted", &tainted, sizeof(int),
Hi!
> - Other threads are prevented from executing while core dump is in
> progress to improve the accuracy of the dumps. This is done without
> changing the state of the task. We set cpus_allowed in task struct
> to be 0 to stop a task from being scheduled and reset it to -1 for
> resume execution. This has the advantage to not depending on user
> space at all for correct functioning. IMO sending SIGSTOP to stop
> other threads does not work if the process is being run under a
> debugger. The only possible issue with using cpus_allowed is that
In swsusp patch, I had exactly the same problem. I created refrigerator(),
and halfway send a signal....
> +/*
> + * Suspend execution of other threads belonging to the same multithreaded process
> + * of current, ASAP.
> + *
> + * Sets the current->cpu_mask to the current cpu to avoid cpu migration durring the dump.
> + * This cpu will also be the only cpu the other threads will be allowed to run after
> + * coredump is completed. This seems to be needed to fix some SMP races. This still
> + * needs some more thought though this solution works.
What about
app has 5 threads. 1st dumps core, and starts setting cpus_allowed mask to
thread 2. Meanwhile 3rd thread resets the mask back.
Pavel?
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
On Tuesday 19 March 2002 10:29 am, Pavel Machek wrote:
> > + *
> > + * Sets the current->cpu_mask to the current cpu to avoid cpu migration durring the dump.
> > + * This cpu will also be the only cpu the other threads will be allowed to run after
> > + * coredump is completed. This seems to be needed to fix some SMP races. This still
> > + * needs some more thought though this solution works.
>
> What about
>
> app has 5 threads. 1st dumps core, and starts setting cpus_allowed mask to
> thread 2. Meanwhile 3rd thread resets the mask back.
>
This patch was intended to prevent this from happening. I hope I didn't miss
something.
The dumping thread doesn't proceed until the other CPU's have gotten into
kernel mode and done 2 IPI's. One to reschedule the other cpu's and one to
synchronize before exiting suspend_other_threads.
The way the IPI's are sent out by this patch, the other CPUs get 2 IPI's and
execute at least one IRET, and hence at least one call to schedule, before
the dumping process continues. This one call to schedule on each of the
other cpu's is what's needed to get all possible related thread processes
swapped out for the duration of the dump.
Unless the IPI's and associated IRET's get dropped by the system, that 3rd
thread will not get a chance to touch the cpu_masks before the dumping
process is finished taking its dump and resume_other_threads gets called.
Because it will have been scheduled out.
--mgross
There is serialization at a higher level. We take a write lock
on current->mm->mmap_sem at the beginning of elf_core_dump
function which is released just before leaving the function.
So, if one thread enters elf_core_dump and starts dumping core,
no other thread (same mm) of the same process can start
dumping.
static int elf_core_dump(long signr, struct pt_regs * regs, struct file * file)
{
...
...
/* now stop all vm operations */
down_write(&current->mm->mmap_sem);
...
...
...
up_write(&current->mm->mmap_sem);
return has_dumped;
}
Vamsi.
--
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5262355 Extn: 3959
Internet: [email protected]
On Tue, Mar 19, 2002 at 01:49:58PM -0500, Mark Gross wrote:
> On Tuesday 19 March 2002 10:29 am, Pavel Machek wrote:
> > > + *
> > > + * Sets the current->cpu_mask to the current cpu to avoid cpu migration durring the dump.
> > > + * This cpu will also be the only cpu the other threads will be allowed to run after
> > > + * coredump is completed. This seems to be needed to fix some SMP races. This still
> > > + * needs some more thought though this solution works.
> >
> > What about
> >
> > app has 5 threads. 1st dumps core, and starts setting cpus_allowed mask to
> > thread 2. Meanwhile 3rd thread resets the mask back.
> >
> This patch was intended to prevent this from happening. I hope I didn't miss
> something.
>
> The dumping thread doesn't proceed until the other CPU's have gotten into
> kernel mode and done 2 IPI's. One to reschedule the other cpu's and one to
> synchronize before exiting suspend_other_threads.
>
> The way the IPI's are sent out by this patch, the other CPUs get 2 IPI's and
> execute at least one IRET, and hence at least one call to schedule, before
> the dumping process continues. This one call to schedule on each of the
> other cpu's is what's needed to get all possible related thread processes
> swapped out for the duration of the dump.
>
> Unless the IPI's and associated IRET's get dropped by the system, that 3rd
> thread will not get a chance to touch the cpu_masks before the dumping
> process is finished taking its dump and resume_other_threads gets called.
> Because it will have been scheduled out.
>
> --mgross
On Wed, Mar 20, 2002 at 11:36:30AM +0530, Vamsi Krishna S . wrote:
> There is serialization at higher level. We take a write lock
> on current->mm->mmap_sem at the beginning of elf_core_dump
> function which is released just before leaving the function.
> So, if one thread enters elf_core_dump and starts dumping core,
> no other thread (same mm) of the same process can start
> dumping.
>
> static int elf_core_dump(long signr, struct pt_regs * regs, struct file * file)
> {
> ...
> ...
> /* now stop all vm operations */
> down_write(&current->mm->mmap_sem);
> ...
> ...
> ...
> up_write(&current->mm->mmap_sem);
> return has_dumped;
> }
That's not a feature, it's a bug. You can't take the mmap_sem before
collecting thread status; it will cause a deadlock on at least ia64,
where some registers are collected from user memory.
(Thanks to Manfred Spraul for explaining that to me.)
--
Daniel Jacobowitz Carnegie Mellon University
MontaVista Software Debian GNU/Linux Developer
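The ordering problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct rwsem_model and the model_* helpers are hypothetical stand-ins for mmap_sem, and collect_user_regs() assumes that reading register state from user memory may fault and that the fault path needs the semaphore for reading. Trylock-style helpers are used so the bad ordering reports a failure instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer semaphore standing in for mm->mmap_sem (hypothetical). */
struct rwsem_model {
    int  readers;   /* number of current read holders */
    bool writer;    /* true while held for writing */
};

bool model_down_read_trylock(struct rwsem_model *sem)
{
    if (sem->writer)
        return false;           /* a real down_read() would sleep here */
    sem->readers++;
    return true;
}

void model_up_read(struct rwsem_model *sem)
{
    assert(sem->readers > 0);
    sem->readers--;
}

bool model_down_write_trylock(struct rwsem_model *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

void model_up_write(struct rwsem_model *sem)
{
    assert(sem->writer);
    sem->writer = false;
}

/*
 * Models collecting registers that live in user memory: the access may
 * fault, and the (assumed) fault path takes the semaphore for reading.
 */
bool collect_user_regs(struct rwsem_model *sem)
{
    if (!model_down_read_trylock(sem))
        return false;           /* in the kernel this would never return */
    model_up_read(sem);
    return true;
}

/* Ordering in the original patch: take the write lock, then collect. */
bool dump_lock_then_collect(struct rwsem_model *sem)
{
    bool ok;

    if (!model_down_write_trylock(sem))
        return false;
    ok = collect_user_regs(sem);
    model_up_write(sem);
    return ok;
}

/* Ordering suggested in this thread: collect first, then lock. */
bool dump_collect_then_lock(struct rwsem_model *sem)
{
    bool ok = collect_user_regs(sem);

    if (!model_down_write_trylock(sem))
        return false;
    /* walking the vmas and writing the dump would happen here */
    model_up_write(sem);
    return ok;
}
```

In this model dump_lock_then_collect() fails because the read attempt runs into the writer that the same caller already holds, while dump_collect_then_lock() succeeds and leaves the semaphore free.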
I've only JUST started on the Itanium version of this patch. In my initial
testing, after hacking around some of the compilation issues, I do get a
type of process freezing when attempting this. Could be this bug.
Thanks for the tip ;)
--mgross
On Wednesday 20 March 2002 01:37 pm, Daniel Jacobowitz wrote:
> On Wed, Mar 20, 2002 at 11:36:30AM +0530, Vamsi Krishna S . wrote:
> > There is serialization at higher level. We take a write lock
> > on current->mm->mmap_sem at the beginning of elf_core_dump
> > function which is released just before leaving the function.
> > So, if one thread enters elf_core_dump and starts dumping core,
> > no other thread (same mm) of the same process can start
> > dumping.
> >
> > static int elf_core_dump(long signr, struct pt_regs * regs, struct file * file)
> > {
> > ...
> > ...
> > /* now stop all vm operations */
> > down_write(&current->mm->mmap_sem);
> > ...
> > ...
> > ...
> > up_write(&current->mm->mmap_sem);
> > return has_dumped;
> > }
>
> That's not a feature, it's a bug. You can't take the mmap_sem before
> collecting thread status; it will cause a deadlock on at least ia64,
> where some registers are collected from user memory.
>
> (Thanks to Manfred Spraul for explaining that to me.)
Mark,
Does moving the down_write() to be after the registers of all
threads are collected help? (This patch on top of our previous
one)
--
--- 2417-tcore/fs/binfmt_elf.c.ori Thu Mar 21 15:30:08 2002
+++ 2417-tcore/fs/binfmt_elf.c Thu Mar 21 15:27:29 2002
@@ -1289,10 +1289,6 @@
int dump_threads = 0;
int thread_status_size = 0;
- /* now stop all vm operations */
- down_write(&current->mm->mmap_sem);
- segs = current->mm->map_count;
-
if (atomic_read(&current->mm->mm_users) != 1) {
dump_threads = core_dumps_threads;
}
@@ -1337,6 +1333,19 @@
}
} /* End if(dump_threads) */
+ /*
+ * This transfers the registers from regs into the standard
+ * coredump arrangement, whatever that is. We need to do this
+ * before acquiring mmap_sem as on some architectures (IA64)
+ * we may need to access user pages to get register state.
+ */
+ memset(&prstatus, 0, sizeof(prstatus));
+ elf_core_copy_regs(&prstatus.pr_reg, regs);
+
+ /* now stop all vm operations */
+ down_write(&current->mm->mmap_sem);
+ segs = current->mm->map_count;
+
#ifdef DEBUG
printk("elf_core_dump: %d segs %lu limit\n", segs, limit);
#endif
@@ -1358,16 +1367,9 @@
* Set up the notes in similar form to SVR4 core dumps made
* with info from their /proc.
*/
- memset(&prstatus, 0, sizeof(prstatus));
fill_prstatus(&prstatus, current, signr);
fill_note(&notes[0], "CORE", NT_PRSTATUS, sizeof(prstatus), &prstatus);
- /*
- * This transfers the registers from regs into the standard
- * coredump arrangement, whatever that is.
- */
- elf_core_copy_regs(&prstatus.pr_reg, regs);
-
#ifdef DEBUG
dump_regs("Passed in regs", (elf_greg_t *)regs);
dump_regs("prstatus regs", (elf_greg_t *)&prstatus.pr_reg);
--
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5262355 Extn: 3959
Internet: [email protected]
On Wed, Mar 20, 2002 at 11:14:56AM -0500, Mark Gross wrote:
> I've only JUST started on the Itanium version of this patch. In my initial
> testing, after hacking around some of the compilation issues, I do get a
> type of process freezing when attempting this. Could be this bug.
>
> Thanks for the tip ;)
>
> --mgross
>
>
>
> On Wednesday 20 March 2002 01:37 pm, Daniel Jacobowitz wrote:
> > On Wed, Mar 20, 2002 at 11:36:30AM +0530, Vamsi Krishna S . wrote:
> > > > There is serialization at higher level. We take a write lock
> > > > on current->mm->mmap_sem at the beginning of elf_core_dump
> > > > function which is released just before leaving the function.
> > > > So, if one thread enters elf_core_dump and starts dumping core,
> > > > no other thread (same mm) of the same process can start
> > > > dumping.
> > > > <snip>
> >
> > > That's not a feature, it's a bug. You can't take the mmap_sem before
> > collecting thread status; it will cause a deadlock on at least ia64,
> > where some registers are collected from user memory.
> >
> > (Thanks to Manfred Spraul for explaining that to me.)
Dan,
Thanks for pointing this out. I see that this change has now gone into
2.4.18 as well as 2.5.4. We would ensure that the down_write happens
only after the registers of all threads are collected.
Coming back to the original point raised by Pavel, indeed there is
nothing preventing external code (any other kernel modules) modifying
the cpus_allowed field from under us. This could get worse in 2.5.x
where a user could change cpu affinity (through proc or a syscall,
though I don't think the patches providing this are accepted as yet).
Vamsi.
On Wed, Mar 20, 2002 at 01:37:09PM -0500, Daniel Jacobowitz wrote:
> On Wed, Mar 20, 2002 at 11:36:30AM +0530, Vamsi Krishna S . wrote:
> > There is serialization at higher level. We take a write lock
> > on current->mm->mmap_sem at the beginning of elf_core_dump
> > function which is released just before leaving the function.
> > So, if one thread enters elf_core_dump and starts dumping core,
> > no other thread (same mm) of the same process can start
> > dumping.
> > <snip>
>
> That's not a feature, it's a bug. You can't take the mmap_sem before
> collecting thread status; it will cause a deadlock on at least ia64,
> where some registers are collected from user memory.
>
> (Thanks to Manfred Spraul for explaining that to me.)
>
> --
> Daniel Jacobowitz Carnegie Mellon University
> MontaVista Software Debian GNU/Linux Developer
--
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5262355 Extn: 3959
Internet: [email protected]
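The concern about cpus_allowed being changed from under the dumper can be sketched with a toy model. Everything here is an illustrative assumption, not code from the patch: struct affinity_task, can_run_on() and the helper names are hypothetical, and the only behavior taken from the patch description is that a mask of 0 stops a task and a mask of -1 (all bits set) resumes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct affinity_task {
    unsigned long cpus_allowed;   /* bit n set: task may run on cpu n */
};

/* Assumed scheduler check: a task is eligible only if its bit is set. */
bool can_run_on(const struct affinity_task *t, int cpu)
{
    return ((t->cpus_allowed >> cpu) & 1UL) != 0;
}

/* What the dumping thread does to every other thread of the process. */
void suspend_thread(struct affinity_task *t)
{
    t->cpus_allowed = 0;          /* no cpu allowed: never picked to run */
}

/* What the dumping thread does once the dump is complete. */
void resume_thread(struct affinity_task *t)
{
    t->cpus_allowed = ~0UL;       /* "-1": any cpu allowed again */
}

/* Any unrelated code path (module, affinity syscall) writing the mask. */
void set_affinity(struct affinity_task *t, unsigned long mask)
{
    t->cpus_allowed = mask;
}
```

If set_affinity() runs between suspend_thread() and resume_thread(), the task becomes runnable again in the middle of the dump, which is the hole being discussed. A dedicated "do not schedule" state would not share storage with affinity and so would not be exposed to this.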
On Thu, Mar 21, 2002 at 03:46:50PM +0530, Vamsi Krishna S . wrote:
> Dan,
>
> Thanks for pointing this out. I see that this change has now gone into
> 2.4.18 as well as 2.5.4. We would ensure that the down_write happens
> only after the registers of all threads are collected.
Yes, your other patch for this looks OK.
> Coming back to the original point raised by Pavel, indeed there is
> nothing preventing external code (any other kernel modules) modifying
> the cpus_allowed field from under us. This could get worse in 2.5.x
> where a user could change cpu affinity (through proc or a syscall,
> though I don't think the patches providing this are accepted as yet).
We really need a non-signal-based way to tell the scheduler that a task
can not be scheduled. A lot of the machinery is all there, but private to
sched.c; the rest is pretty straightforward.
--
Daniel Jacobowitz Carnegie Mellon University
MontaVista Software Debian GNU/Linux Developer
> We really need a non-signal-based way to tell the scheduler that a task
> can not be scheduled. A lot of the machinery is all there, but private to
> sched.c; the rest is pretty straightforward.
You need interrupts to handle this, even if you don't wrap it in the top
layer of signals it will be able to use much of the code I agree. The nasty
case is the "currently running on another cpu" one. Especially since you
can't just "trap it" - if you IPI that processor it might have moved by the
time the IPI arrives 8)
On Thursday 21 March 2002 11:52 am, Alan Cox wrote:
> You need interrupts to handle this, even if you don't wrap it in the top
> layer of signals it will be able to use much of the code I agree. The nasty
> case is the "currently running on another cpu" one. Especially since you
> can't just "trap it" - if you IPI that processor it might have moved by the
> time the IPI arrives 8)
This is why I grabbed all those locks, and did the two sets of IPI's in the
tcore patch. Once the runqueue lock is grabbed, even if that process on the
other CPU tries to migrate, it won't get swapped in or looked at by the
scheduler until its cpus_allowed member has been marked. After cpus_allowed
has been marked it won't run.
I don't think there is any faster way of getting the other CPU's into
schedule and a specific running process to be swapped out than what was done
here.
The only risk with this type of code is if other code or drivers attempt
similar maneuvers at the same time. Having a standard mechanism or API for
this in the scheduler would be a "good thing".
--mgross
ps.
I've just started considering how to do this with the 2.5 O(1) scheduler, and
I'm not sure yet how I can accomplish this process "pausing" behavior.
> This why I grabbed all those locks, and did the two sets of IPI's in the
> tcore patch. Once the runqueue lock is grabbed, even if that process on the
If you IPI holding a lock whats going to happen if while the IPI is going
across the cpus the other processor tries to grab the runqueue lock and
is spinning on it with interrupts off ?
On Thursday 21 March 2002 12:34 pm, Alan Cox wrote:
> > This why I grabbed all those locks, and did the two sets of IPI's in the
> > tcore patch. Once the runqueue lock is grabbed, even if that process on
> > the
>
> If you IPI holding a lock whats going to happen if while the IPI is going
> across the cpus the other processor tries to grab the runqueue lock and
> is spinning on it with interrupts off ?
Then at least 2 CPUs would quickly become deadlocked on the
synchronization IPI this patch sends at the end of the suspend_other_threads
function call.
Interrupts shouldn't be turned off when grabbing the runqueue lock. It's also
a bad thing if they happen to be off while calling into schedule.
I think schedule was designed to be called only while interrupts are turned
on. It BUGs if "in_interrupt" to enforce this.
--mgross
Hi!
> > You need interrupts to handle this, even if you don't wrap it in the top
> > layer of signals it will be able to use much of the code I agree. The nasty
> > case is the "currently running on another cpu" one. Especially since you
> > can't just "trap it" - if you IPI that processor it might have moved by the
> > time the IPI arrives 8)
>
> This why I grabbed all those locks, and did the two sets of IPI's in the
> tcore patch. Once the runqueue lock is grabbed, even if that process on the
> other CPU tries to migrate, it won't get swapped in or looked at by the
> scheduler until its cpus_allowed member has been marked. After cpus_allowed
> has been marked it won't run.
BTW it would be very nice to put "task freezing" in some generic
place. I have my own version of task freezing (with refrigerator), and
it would be good to be able to share that...
> The only risk with this type of code is if other code or drivers attempt
> similar maneuvers at the same time. Having a standard mechanism or API for
> this in the scheduler would be a "good thing".
Ahha, so you know it, too.
> I've just started considering how to do this with the 2.5 O(1) scheduler, and
> I'm not sure yet how I can accomplish this process "pausing" behavior just
> yet.
I'm doing this in my freezer, and it should be safe even on
2.5.X. Most interesting is the part in suspend.c...
Pavel
--- clean.2.4/arch/i386/kernel/apm.c Thu Feb 28 11:18:05 2002
+++ linux-swsusp.24/arch/i386/kernel/apm.c Fri Mar 1 12:44:18 2002
@@ -1664,6 +1664,7 @@
daemonize();
strcpy(current->comm, "kapmd");
+ current->flags |= PF_IOTHREAD;
sigfillset(&current->blocked);
if (apm_info.connection_version == 0) {
--- clean.2.4/arch/i386/kernel/signal.c Thu Feb 28 11:18:05 2002
+++ linux-swsusp.24/arch/i386/kernel/signal.c Thu Mar 7 23:17:18 2002
@@ -20,6 +20,7 @@
#include <linux/stddef.h>
#include <linux/tty.h>
#include <linux/personality.h>
+#include <linux/suspend.h>
#include <asm/ucontext.h>
#include <asm/uaccess.h>
#include <asm/i387.h>
@@ -595,6 +596,11 @@
if ((regs->xcs & 3) != 3)
return 1;
+ if (current->flags & PF_FREEZE) {
+ refrigerator(0);
+ goto no_signal;
+ }
+
if (!oldset)
oldset = &current->blocked;
@@ -705,6 +711,7 @@
return 1;
}
+ no_signal:
/* Did we come from a system call? */
if (regs->orig_eax >= 0) {
/* Restart the system call - no handlers present */
--- clean.2.4/drivers/usb/storage/usb.c Thu Feb 28 11:18:20 2002
+++ linux-swsusp.24/drivers/usb/storage/usb.c Fri Mar 1 12:43:11 2002
@@ -316,6 +316,7 @@
*/
exit_files(current);
current->files = init_task.files;
+ current->flags |= PF_IOTHREAD;
atomic_inc(&current->files->count);
daemonize();
--- clean.2.4/fs/buffer.c Thu Feb 28 11:18:21 2002
+++ linux-swsusp.24/fs/buffer.c Thu Mar 7 22:51:11 2002
@@ -129,6 +129,8 @@
wake_up(&bh->b_wait);
}
+DECLARE_TASK_QUEUE(tq_bdflush);
+
/*
* Rewrote the wait-routines to use the "new" wait-queue functionality,
* and getting rid of the cli-sti pairs. The wait-queue routines still
@@ -2981,12 +2986,14 @@
spin_unlock_irq(&tsk->sigmask_lock);
complete((struct completion *)startup);
-
+ current->flags |= PF_KERNTHREAD;
for (;;) {
wait_for_some_buffers(NODEV);
/* update interval */
interval = bdf_prm.b_un.interval;
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_IOTHREAD);
if (interval) {
tsk->state = TASK_INTERRUPTIBLE;
schedule_timeout(interval);
--- clean.2.4/fs/jbd/journal.c Thu Feb 28 11:18:22 2002
+++ linux-swsusp.24/fs/jbd/journal.c Thu Mar 7 23:13:25 2002
@@ -34,6 +34,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/slab.h>
+#include <linux/suspend.h>
#include <asm/uaccess.h>
#include <linux/proc_fs.h>
@@ -226,6 +227,7 @@
journal->j_commit_interval / HZ);
list_add(&journal->j_all_journals, &all_journals);
+ current->flags |= PF_KERNTHREAD;
/* And now, wait forever for commit wakeup events. */
while (1) {
if (journal->j_flags & JFS_UNMOUNT)
@@ -246,7 +248,15 @@
}
wake_up(&journal->j_wait_done_commit);
- interruptible_sleep_on(&journal->j_wait_commit);
+ if (current->flags & PF_FREEZE) { /* The simpler the better. Flushing journal isn't a
+ good idea, because that depends on threads that
+ may be already stopped. */
+ jbd_debug(1, "Now suspending kjournald\n");
+ refrigerator(PF_IOTHREAD);
+ jbd_debug(1, "Resuming kjournald\n");
+ } else /* we assume on resume that commits are already there,
+ so we don't sleep */
+ interruptible_sleep_on(&journal->j_wait_commit);
jbd_debug(1, "kjournald wakes\n");
--- clean.2.4/include/linux/sched.h Tue Dec 25 22:39:30 2001
+++ linux-swsusp.24/include/linux/sched.h Thu Mar 7 23:09:25 2002
@@ -427,6 +427,10 @@
#define PF_MEMDIE 0x00001000 /* Killed for out-of-memory */
#define PF_FREE_PAGES 0x00002000 /* per process page freeing */
#define PF_NOIO 0x00004000 /* avoid generating further I/O */
+#define PF_FROZEN 0x00008000 /* frozen for system suspend */
+#define PF_FREEZE 0x00010000 /* this task should be frozen for suspend */
+#define PF_IOTHREAD 0x00020000 /* this thread is needed for doing I/O to swap */
+#define PF_KERNTHREAD 0x00040000 /* this thread is a kernel thread that cannot be sent signals to */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
--- clean.2.4/kernel/context.c Thu Oct 11 20:17:22 2001
+++ linux-swsusp.24/kernel/context.c Tue Feb 19 20:33:23 2002
@@ -72,6 +72,7 @@
daemonize();
strcpy(curtask->comm, "keventd");
+ current->flags |= PF_IOTHREAD;
keventd_running = 1;
keventd_task = curtask;
--- clean.2.4/kernel/signal.c Wed Dec 5 23:46:07 2001
+++ linux-swsusp.24/kernel/signal.c Tue Feb 19 20:33:23 2002
@@ -463,7 +463,7 @@
* No need to set need_resched since signal event passing
* goes through ->blocked
*/
-static inline void signal_wake_up(struct task_struct *t)
+inline void signal_wake_up(struct task_struct *t)
{
t->sigpending = 1;
--- clean.2.4/kernel/softirq.c Wed Oct 31 19:26:02 2001
+++ linux-swsusp.24/kernel/softirq.c Tue Feb 19 20:33:23 2002
@@ -366,6 +366,7 @@
daemonize();
current->nice = 19;
+ current->flags |= PF_IOTHREAD;
sigfillset(&current->blocked);
/* Migrate to the right CPU */
--- clean.2.4/kernel/suspend.c Sun Nov 11 20:26:28 2001
+++ linux-swsusp.24/kernel/suspend.c Tue Mar 19 13:22:14 2002
@@ -0,0 +1,1373 @@
...
+/*
+ * Refrigerator and related stuff
+ */
+
+#define INTERESTING(p) \
+ /* We don't want to touch kernel_threads..*/ \
+ if (p->flags & PF_IOTHREAD) \
+ continue; \
+ if (p == current) \
+ continue; \
+ if (p->state == TASK_ZOMBIE) \
+ continue;
+
+/* Refrigerator is place where frozen processes are stored :-). */
+void refrigerator(unsigned long flag)
+{
+ /* You need correct to work with real-time processes.
+ OTOH, this way one process may see (via /proc/) some other
+ process in stopped state (and thereby discovered we were
+ suspended. We probably do not care.
+ */
+ long save;
+ save = current->state;
+ current->state = TASK_STOPPED;
+// PRINTK("%s entered refrigerator\n", current->comm);
+ printk(":");
+ current->flags &= ~PF_FREEZE;
+ if (flag)
+ flush_signals(current); /* We have signaled a kernel thread, which isn't normal behaviour
+ and that may lead to 100%CPU sucking because those threads
+ just don't manage signals. */
+ current->flags |= PF_FROZEN;
+ while (current->flags & PF_FROZEN)
+ schedule();
+// PRINTK("%s left refrigerator\n", current->comm);
+ printk(":");
+ current->state = save;
+}
+
+/* 0 = success, else # of processes that we failed to stop */
+static int freeze_processes(void)
+{
+ int todo, start_time;
+ struct task_struct *p;
+
+ PRINTS( "Waiting for tasks to stop... " );
+
+ start_time = jiffies;
+ do {
+ todo = 0;
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ unsigned long flags;
+ INTERESTING(p);
+ if (p->flags & PF_FROZEN)
+ continue;
+
+ /* FIXME: smp problem here: we may not access other process' flags
+ without locking */
+ p->flags |= PF_FREEZE;
+ spin_lock_irqsave(&p->sigmask_lock, flags);
+ signal_wake_up(p);
+ spin_unlock_irqrestore(&p->sigmask_lock, flags);
+ todo++;
+ }
+ read_unlock(&tasklist_lock);
+ sys_sched_yield();
+ schedule();
+ if (time_after(jiffies, start_time + TIMEOUT)) {
+ PRINTK( "\n" );
+ printk(KERN_ERR " stopping tasks failed (%d tasks remaining)\n", todo );
+ return todo;
+ }
+ } while(todo);
+
+ PRINTK( " ok\n" );
+ return 0;
+}
+
+static void thaw_processes(void)
+{
+ struct task_struct *p;
+
+ PRINTR( "Restarting tasks..." );
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ INTERESTING(p);
+
+ if (p->flags & PF_FROZEN) p->flags &= ~PF_FROZEN;
+ else
+ printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
+ wake_up_process(p);
+ }
+ read_unlock(&tasklist_lock);
+ PRINTK( " done\n" );
+ MDELAY(500);
+}
--- clean.2.4/mm/vmscan.c Thu Feb 28 11:18:26 2002
+++ linux-swsusp.24/mm/vmscan.c Thu Mar 7 22:55:49 2002
@@ -723,18 +723,22 @@
* us from recursively trying to free more memory as we're
* trying to free the first piece of memory in the first place).
*/
- tsk->flags |= PF_MEMALLOC;
+ tsk->flags |= PF_MEMALLOC | PF_KERNTHREAD;
/*
* Kswapd main loop.
*/
for (;;) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_IOTHREAD);
__set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&kswapd_wait, &wait);
mb();
- if (kswapd_can_sleep())
+ if (kswapd_can_sleep()) {
schedule();
+ }
+
__set_current_state(TASK_RUNNING);
remove_wait_queue(&kswapd_wait, &wait);
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
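The flag protocol used by the refrigerator can be followed in a small single-threaded model. This is a sketch only: struct frozen_task, task_run_once(), freeze_pass() and thaw_all() are hypothetical names, tasks are stepped cooperatively instead of being scheduled, and signals are left out entirely. The flag values are the ones added to sched.h in the patch above.

```c
#include <assert.h>
#include <stddef.h>

#define PF_FROZEN   0x00008000UL  /* frozen for system suspend */
#define PF_FREEZE   0x00010000UL  /* this task should be frozen */
#define PF_IOTHREAD 0x00020000UL  /* needed for I/O, never frozen */

/* Hypothetical stand-in for the parts of task_struct used here. */
struct frozen_task {
    unsigned long flags;
    unsigned long work_done;      /* counts useful iterations */
};

/* One turn of a task's main loop, with the freeze check at a safe point. */
void task_run_once(struct frozen_task *t)
{
    if (t->flags & PF_FROZEN)
        return;                   /* parked in the refrigerator */
    if (t->flags & PF_FREEZE) {
        /* enter the refrigerator: acknowledge the request and park */
        t->flags &= ~PF_FREEZE;
        t->flags |= PF_FROZEN;
        return;
    }
    t->work_done++;
}

/* Ask every eligible task to freeze; returns how many are still running. */
size_t freeze_pass(struct frozen_task *tasks, size_t n)
{
    size_t todo = 0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].flags & (PF_IOTHREAD | PF_FROZEN))
            continue;             /* exempt, or already frozen */
        tasks[i].flags |= PF_FREEZE;
        todo++;
    }
    return todo;
}

/* Let every frozen task leave the refrigerator. */
void thaw_all(struct frozen_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].flags &= ~PF_FROZEN;
}
```

As in freeze_processes(), the caller repeats freeze_pass() until it returns 0, giving the tasks a chance to run in between. A task marked PF_IOTHREAD keeps working throughout.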
IIRC there was an observation that spin_lock_irq seems to first disable
interrupts and then start spinning on the lock, which is why such a
situation could arise (even though the code in schedule doesn't appear to
explicitly disable interrupts).
However, in Mark's implementation, it's only the first IPI that happens
under the runqueue lock, and that actually doesn't wait for the other CPUs
to receive the IPI. (The purpose of the first IPI was more a matter of
trying to improve accuracy by notifying the other threads as soon as
possible). So there shouldn't be a deadlock. The synchronization/wait
happens in the case of the second IPI (i.e. the smp_call_function), and by
that time the runqueue lock has been released, and cpus_allowed has been
updated.
Regards
Suparna
Suparna Bhattacharya
Linux Technology Center
IBM Software Lab, India
E-mail : [email protected]
Phone : 91-80-5044961
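The difference between waiting for an IPI while holding the runqueue lock and waiting after dropping it can be shown with a toy state machine. This is a sketch under assumptions, not the patch's code: struct cpu_model, cpu_step() and ipi_and_wait() are hypothetical, and the target CPU is assumed to spin on the lock with interrupts disabled, as described for spin_lock_irq.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the CPU on the receiving end of the IPI. */
struct cpu_model {
    bool irqs_off;           /* interrupts disabled on this CPU */
    bool spinning_on_lock;   /* busy-waiting for the runqueue lock */
    bool ipi_pending;        /* IPI sent but not yet handled */
};

/* Advance the target CPU by one step. */
void cpu_step(struct cpu_model *cpu, bool lock_held_elsewhere)
{
    if (cpu->spinning_on_lock && !lock_held_elsewhere) {
        /* got the lock, finished, dropped it, re-enabled interrupts */
        cpu->spinning_on_lock = false;
        cpu->irqs_off = false;
    }
    if (cpu->ipi_pending && !cpu->irqs_off)
        cpu->ipi_pending = false;    /* IPI handled */
}

/*
 * Send an IPI and wait for it to be handled. Returns false if the target
 * never handles it within max_steps, which models a deadlock.
 */
bool ipi_and_wait(struct cpu_model *cpu, bool sender_holds_lock, int max_steps)
{
    cpu->ipi_pending = true;
    for (int i = 0; i < max_steps; i++) {
        cpu_step(cpu, sender_holds_lock);
        if (!cpu->ipi_pending)
            return true;
    }
    return false;
}
```

With sender_holds_lock set, the spinning CPU never re-enables interrupts and the wait never completes. With the lock released before the wait, which is how the second IPI is described above, the target gets the lock, re-enables interrupts and handles the IPI.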
On Thursday 21 March 2002 05:03 am, Vamsi Krishna S . wrote:
> Mark,
>
> Does moving the down_write() to be after the registers of all
> threads are collected help? (This patch on top of our previous
> one)
Yes, moving the down_write to after the grabbing of the registers fixes the
semi lock ups.
I need to move my Big Sur to RH7.2 to continue my validation. Its running
the 7.1 libs and gdb / libpthreads.so aren't as happy at debug time as they
are for 7.2 on ia32.
Thanks.
--mgross
So, after all this discussion, is there a set of source that I can use to
build a kernel that will dump ALL threads to a core file?
I recall that Vamsi initially sent out the diffs that were to be used as a
patch. This sparked the issue raised by Daniel.
Vamsi: do you have a set of patches that differ from the original patch you
sent?
Thanks!
-- jrj
Yes.
Patch the 2.4.17 kernel with the patch that Vamsi sent out and you're good to
go on ia32. I've had very good luck with this patch: it was unit tested on
1-, 2-, and 4-way ia32 systems without any failures.
However, I'm currently working on a bug fix to my process pausing
implementation, plus Itanium support for it, as a patch off the 2.4.17 base
kernel. This bug showed up on Itanium, and could bite ia32 users even though
we haven't seen it on ia32 yet.
I should have the bug fix patch and the Itanium patch posted next week.
Take the current 2.4.17 patch and give it a try.
--mgross
On Friday 29 March 2002 12:43 am, Jeff Jenkins wrote:
> So, after all this discussion, is there a set of source that I can use to
> build a kernel that will
> dump ALL threads to a core file?
>
> I recall that Vamsi initially send out the diffs that were to be used as a
> patch. This sparked the issue raised by Daniel.
>
> Vamsi: do you have a set of patches that differ than the original patch
> you sent?
>
> Thanks!
>
> -- jrj