Hello,
This is the implementation of the soft-dirty bit concept, which helps
keep track of changes in user memory and is badly needed by the
checkpoint-restore project (http://criu.org). Let me briefly recap
the issue.
<< EOF
To create a dump of an application (or a group of them) we save all the
information about it to files, and the biggest part of such a dump is the
contents of the tasks' memory. However, there are usage scenarios where
it's not required to get _all_ the task memory while creating a dump. For
example, when doing periodical dumps, a full memory dump is only required
at the first step; after that only the incremental changes of memory need
to be taken. Another example is live migration. We copy all the memory to
the destination node without stopping all tasks, then stop them, check
which pages have changed, dump them and the rest of the state, then copy
it to the destination node. This decreases the freeze time significantly.
That said, some help from the kernel to watch how processes modify the
contents of their memory is required.
EOF
The proposal is to track changes with the help of a new soft-dirty bit this way:
1. First do "echo 4 > /proc/$pid/clear_refs".
At that point the kernel clears the soft-dirty _and_ the writable bits from
all PTEs of process $pid. From then on every write to any page will result
in a #PF and a subsequent call to pte_mkdirty/pmd_mkdirty, which in turn
will set the soft-dirty flag.
2. Then read /proc/$pid/pagemap2 and check the soft-dirty bit reported
there (the 55th one). If set, the respective PTE was written to since the
last call to clear_refs.
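For illustration, a minimal userspace sketch of the above cycle (not part
of the set) could look as follows. It assumes 4K pages, elides most error
handling, and relies only on the clear_refs/pagemap2 interfaces described
above:

	#include <stdio.h>
	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/types.h>

	/* Step 1: clear the soft-dirty and writable bits from $pid's PTEs */
	static void clear_soft_dirty(pid_t pid)
	{
		char path[64];
		int fd;

		snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
		fd = open(path, O_WRONLY);
		write(fd, "4", 1);
		close(fd);
	}

	/* Step 2: test bit 55 of vaddr's /proc/$pid/pagemap2 entry */
	static int page_was_written(pid_t pid, unsigned long vaddr)
	{
		char path[64];
		uint64_t pme = 0;
		int fd;

		snprintf(path, sizeof(path), "/proc/%d/pagemap2", pid);
		fd = open(path, O_RDONLY);
		pread(fd, &pme, sizeof(pme), vaddr / 4096 * sizeof(pme));
		close(fd);

		return (pme >> 55) & 1;
	}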
The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although that bit is also
used by kmemcheck, kmemcheck marks kernel pages with it, while soft-dirty
tracking puts it on user pages, so the two do not conflict with each other.
The set is against v3.9-rc5.
It includes preparations to /proc/pid's clear_refs file, adds the pagemap2
file, and introduces the soft-dirty concept itself, with Andrew's comments
on the previous version (hopefully) addressed.
History of the set:
* Previous version of this patch, commented on by Andrew:
http://lwn.net/Articles/546184/
* Pre-previous ftrace-based approach:
http://permalink.gmane.org/gmane.linux.kernel.mm/91428
This one was not nice, because ftrace could drop events, so we might
miss significant information about page updates.
Another issue with it -- it was impossible to use it to watch an
arbitrary task -- the task had to mark memory areas with madvise itself
to make events occur.
Also, a program that monitored the update events could interfere with
anyone else trying to use ftrace.
Signed-off-by: Pavel Emelyanov <[email protected]>
A new clear-refs type will be added in the next patch, so prepare
code for that.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/proc/task_mmu.c | 17 ++++++++++-------
1 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..67c2586 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -688,6 +688,13 @@ const struct file_operations proc_tid_smaps_operations = {
.release = seq_release_private,
};
+enum clear_refs_types {
+ CLEAR_REFS_ALL = 1,
+ CLEAR_REFS_ANON,
+ CLEAR_REFS_MAPPED,
+ CLEAR_REFS_LAST,
+};
+
static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
@@ -719,10 +726,6 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
-#define CLEAR_REFS_ALL 1
-#define CLEAR_REFS_ANON 2
-#define CLEAR_REFS_MAPPED 3
-
static ssize_t clear_refs_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
char buffer[PROC_NUMBUF];
struct mm_struct *mm;
struct vm_area_struct *vma;
- int type;
+ enum clear_refs_types type;
int rv;
memset(buffer, 0, sizeof(buffer));
@@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
count = sizeof(buffer) - 1;
if (copy_from_user(buffer, buf, count))
return -EFAULT;
- rv = kstrtoint(strstrip(buffer), 10, &type);
+ rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
if (rv < 0)
return rv;
- if (type < CLEAR_REFS_ALL || type > CLEAR_REFS_MAPPED)
+ if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;
task = get_proc_task(file_inode(file));
if (!task)
--
1.7.6.5
In the next patch the clear-refs type will be required in the
clear_refs_pte_range function, so prepare walk->private to carry this info.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/proc/task_mmu.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 67c2586..c59a148 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -695,10 +695,15 @@ enum clear_refs_types {
CLEAR_REFS_LAST,
};
+struct clear_refs_private {
+ struct vm_area_struct *vma;
+};
+
static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->private;
+ struct clear_refs_private *cp = walk->private;
+ struct vm_area_struct *vma = cp->vma;
pte_t *pte, ptent;
spinlock_t *ptl;
struct page *page;
@@ -751,13 +756,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
return -ESRCH;
mm = get_task_mm(task);
if (mm) {
+ struct clear_refs_private cp = {
+ };
struct mm_walk clear_refs_walk = {
.pmd_entry = clear_refs_pte_range,
.mm = mm,
+ .private = &cp,
};
down_read(&mm->mmap_sem);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
- clear_refs_walk.private = vma;
+ cp.vma = vma;
if (is_vm_hugetlb_page(vma))
continue;
/*
--
1.7.6.5
This file is the same as the pagemap one, but shows entries with bits
55-60 being zero (reserved for future use). The next patch will occupy
one of them.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
Documentation/filesystems/proc.txt | 2 ++
Documentation/vm/pagemap.txt | 3 +++
fs/proc/base.c | 2 ++
fs/proc/internal.h | 1 +
fs/proc/task_mmu.c | 11 +++++++++++
5 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fd8d0d5..22c47ec 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -487,6 +487,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
using /proc/kpageflags and number of times a page is mapped using
/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
+(There's also a /proc/pid/pagemap2 file which is the 2nd version of the
+ pagemap one).
1.2 Kernel data
---------------
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 7587493..4350397 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -30,6 +30,9 @@ There are three components to pagemap:
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
+ * /proc/pid/pagemap2. This file provides the same info as the pagemap
+ does, but bits 55-60 are reserved for future use and thus zero
+
* /proc/kpagecount. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..34966ce 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2537,6 +2537,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("pagemap", S_IRUGO, proc_pagemap_operations),
+ REG("pagemap2", S_IRUGO, proc_pagemap2_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -2882,6 +2883,7 @@ static const struct pid_entry tid_base_stuff[] = {
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_tid_smaps_operations),
REG("pagemap", S_IRUGO, proc_pagemap_operations),
+ REG("pagemap2", S_IRUGO, proc_pagemap2_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..cc12bb7 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -67,6 +67,7 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_tid_smaps_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_pagemap2_operations;
extern const struct file_operations proc_net_operations;
extern const struct inode_operations proc_net_inode_operations;
extern const struct inode_operations proc_pid_link_inode_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7f9b66c..3138009 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1135,6 +1135,17 @@ const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
};
+
+static ssize_t pagemap2_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ return do_pagemap_read(file, buf, count, ppos, true);
+}
+
+const struct file_operations proc_pagemap2_operations = {
+ .llseek = mem_lseek, /* borrow this */
+ .read = pagemap2_read,
+};
#endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA
--
1.7.6.5
Soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
2. Wait some time.
3. Read soft-dirty bits (the 55th bit in /proc/PID/pagemap2 entries)
To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a page
at some virtual address, a #PF occurs and the kernel sets the soft-dirty
bit on the respective PTE.
Note that although all the task's address space is marked as r/o after the
soft-dirty bits are cleared, the #PF-s that occur after that are processed
quickly. This is so since the pages are still mapped to physical memory,
and thus all the kernel does is find this fact out and put the writable,
dirty and soft-dirty bits back on the PTE.
Another thing to note is that when mremap moves PTEs, they are marked with
soft-dirty as well, since from the user's perspective mremap modifies the
virtual memory at mremap's new address.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
Documentation/filesystems/proc.txt | 7 +++++-
Documentation/vm/pagemap.txt | 4 ++-
Documentation/vm/soft-dirty.txt | 36 ++++++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable.h | 26 ++++++++++++++++++++++-
arch/x86/include/asm/pgtable_types.h | 6 +++++
fs/proc/task_mmu.c | 36 +++++++++++++++++++++++++++++----
include/asm-generic/pgtable.h | 22 ++++++++++++++++++++
mm/Kconfig | 12 +++++++++++
mm/huge_memory.c | 2 +-
mm/mremap.c | 2 +-
10 files changed, 142 insertions(+), 11 deletions(-)
create mode 100644 Documentation/vm/soft-dirty.txt
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22c47ec..488c094 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -473,7 +473,8 @@ This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
-bits on both physical and virtual pages associated with a process.
+bits on both physical and virtual pages associated with a process, and the
+soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
To clear the bits for all the pages associated with the process
> echo 1 > /proc/PID/clear_refs
@@ -482,6 +483,10 @@ To clear the bits for the anonymous pages associated with the process
To clear the bits for the file mapped pages associated with the process
> echo 3 > /proc/PID/clear_refs
+
+To clear the soft-dirty bit
+ > echo 4 > /proc/PID/clear_refs
+
Any other value written to /proc/PID/clear_refs will have no effect.
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 4350397..394cc03 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -31,7 +31,9 @@ There are three components to pagemap:
skip over unmapped regions.
* /proc/pid/pagemap2. This file provides the same info as the pagemap
- does, but bits 55-60 are reserved for future use and thus zero
+ does, but bits 56-60 are reserved for future use and thus zero
+
+ Bit 55 means pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
* /proc/kpagecount. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt
new file mode 100644
index 0000000..9a12a59
--- /dev/null
+++ b/Documentation/vm/soft-dirty.txt
@@ -0,0 +1,36 @@
+ SOFT-DIRTY PTEs
+
+ Soft-dirty is a bit on a PTE which helps to track which pages a task
+writes to. In order to do this tracking one should
+
+ 1. Clear soft-dirty bits from the task's PTEs.
+
+ This is done by writing "4" into the /proc/PID/clear_refs file of the
+ task in question.
+
+ 2. Wait some time.
+
+ 3. Read soft-dirty bits from the PTEs.
+
+ This is done by reading from the /proc/PID/pagemap2 file. Bit 55 of the
+ 64-bit qword is the soft-dirty one. If set, the respective PTE was
+ written to since step 1.
+
+
+ Internally, to do this tracking, the writable bit is cleared from PTEs
+when the soft-dirty bit is cleared. So, after this, when the task tries to
+modify a page at some virtual address the #PF occurs and the kernel sets
+the soft-dirty bit on the respective PTE.
+
+ Note that although all the task's address space is marked as r/o after the
+soft-dirty bits are cleared, the #PF-s that occur after that are processed
+quickly. This is so since the pages are still mapped to physical memory, and
+thus all the kernel does is find this fact out and put the writable and
+soft-dirty bits back on the PTE.
+
+
+ This feature is actively used by the checkpoint-restore project. You
+can find more details about it on http://criu.org
+
+
+-- Pavel Emelyanov, Apr 9, 2013
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..eb97470 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -207,7 +207,7 @@ static inline pte_t pte_mkexec(pte_t pte)
static inline pte_t pte_mkdirty(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_DIRTY);
+ return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
}
static inline pte_t pte_mkyoung(pte_t pte)
@@ -271,7 +271,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
}
static inline pmd_t pmd_mkhuge(pmd_t pmd)
@@ -294,6 +294,28 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_PRESENT);
}
+#define __HAVE_SOFT_DIRTY
+
+static inline int pte_soft_dirty(pte_t pte)
+{
+ return pte_flags(pte) & _PAGE_SOFT_DIRTY;
+}
+
+static inline int pmd_soft_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SOFT_DIRTY;
+}
+
+static inline pte_t pte_mksoft_dirty(pte_t pte)
+{
+ return pte_set_flags(pte, _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
+}
+
/*
* Mask out unsupported bits in a present pgprot. Non-present pgprots
* can use those bits for other purposes, so leave them be.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 567b5d0..dcf718c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -55,6 +55,18 @@
#define _PAGE_HIDDEN (_AT(pteval_t, 0))
#endif
+/*
+ * The same hidden bit is used by kmemcheck, but since kmemcheck
+ * works on kernel pages while soft-dirty engine on user space,
+ * they do not conflict with each other.
+ */
+
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
+#else
+#define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0))
+#endif
+
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#else
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3138009..aae2474 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -692,13 +692,32 @@ enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
CLEAR_REFS_MAPPED,
+ CLEAR_REFS_SOFT_DIRTY,
CLEAR_REFS_LAST,
};
struct clear_refs_private {
struct vm_area_struct *vma;
+ enum clear_refs_types type;
};
+static inline void clear_soft_dirty(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ /*
+ * The soft-dirty tracker uses #PF-s to catch writes
+ * to pages, so write-protect the pte as well. See the
+ * Documentation/vm/soft-dirty.txt for full description
+ * of how soft-dirty works.
+ */
+ pte_t ptent = *pte;
+ ptent = pte_wrprotect(ptent);
+ ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY);
+ set_pte_at(vma->vm_mm, addr, pte, ptent);
+#endif
+}
+
static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
@@ -718,6 +731,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte_present(ptent))
continue;
+ if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
+ clear_soft_dirty(vma, addr, pte);
+ continue;
+ }
+
page = vm_normal_page(vma, addr, ptent);
if (!page)
continue;
@@ -757,6 +775,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
mm = get_task_mm(task);
if (mm) {
struct clear_refs_private cp = {
+ .type = type,
};
struct mm_walk clear_refs_walk = {
.pmd_entry = clear_refs_pte_range,
@@ -825,6 +844,7 @@ struct pagemapread {
/* in pagemap2 pshift bits are occupied with more status bits */
#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
+#define __PM_SOFT_DIRTY (1LL)
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
#define PM_FILE PM_STATUS(1LL)
@@ -866,6 +886,7 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
{
u64 frame, flags;
struct page *page = NULL;
+ int flags2 = 0;
if (pte_present(pte)) {
frame = pte_pfn(pte);
@@ -886,13 +907,15 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
if (page && !PageAnon(page))
flags |= PM_FILE;
+ if (pte_soft_dirty(pte))
+ flags2 |= __PM_SOFT_DIRTY;
- *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
+ *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset)
+ pmd_t pmd, int offset, int pmd_flags2)
{
/*
* Currently pmd for thp is always present because thp can not be
@@ -901,13 +924,13 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
*/
if (pmd_present(pmd))
*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
- | PM_STATUS2(pm->v2, 0) | PM_PRESENT);
+ | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
else
*pme = make_pme(PM_NOT_PRESENT(pm->v2));
}
#else
static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset)
+ pmd_t pmd, int offset, int pmd_flags2)
{
}
#endif
@@ -924,12 +947,15 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
/* find the first VMA at or above 'addr' */
vma = find_vma(walk->mm, addr);
if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
+ int pmd_flags2;
+
+ pmd_flags2 = (pmd_soft_dirty(*pmd) ? __PM_SOFT_DIRTY : 0);
for (; addr != end; addr += PAGE_SIZE) {
unsigned long offset;
offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
+ thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bfd8768..d74bdd2 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,6 +386,28 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif
+#ifndef __HAVE_SOFT_DIRTY
+static inline int pte_soft_dirty(pte_t pte)
+{
+ return 0;
+}
+
+static inline int pmd_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pte_t pte_mksoft_dirty(pte_t pte)
+{
+ return pte;
+}
+
+static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+
#ifndef __HAVE_PFNMAP_TRACKING
/*
* Interfaces that can be used by architecture code to keep track of
diff --git a/mm/Kconfig b/mm/Kconfig
index 3bea74f..147689e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -471,3 +471,15 @@ config FRONTSWAP
and swap data is stored as normal on the matching swap device.
If unsure, say Y to enable frontswap.
+
+config MEM_SOFT_DIRTY
+ bool "Track memory changes"
+ depends on CHECKPOINT_RESTORE && X86
+ select PROC_PAGE_MONITOR
+ help
+ This option enables memory changes tracking by introducing a
+ soft-dirty bit on pte-s. This bit is set when someone writes
+ into a page, just like the regular dirty bit, but unlike the
+ latter it can be cleared by hand.
+
+ See Documentation/vm/soft-dirty.txt for more details.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..eef1606 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1431,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
if (ret == 1) {
pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));
- set_pmd_at(mm, new_addr, new_pmd, pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
spin_unlock(&mm->page_table_lock);
}
out:
diff --git a/mm/mremap.c b/mm/mremap.c
index 463a257..3708655 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
continue;
pte = ptep_get_and_clear(mm, old_addr, old_pte);
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
- set_pte_at(mm, new_addr, new_pte, pte);
+ set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
}
arch_leave_lazy_mmu_mode();
--
1.7.6.5
These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry. Moreover, in the next patch we will need to report one more bit
in the pagemap, but all its bits are already busy.
That said, describe a pagemap entry format that has 6 more free (zero) bits.
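To make the layout concrete, here is a decoding sketch (not part of the
patch) following the PM_* macros below: the three status bits sit at 63-61,
the six former-pshift bits at 60-55 (always PAGE_SHIFT in the old format,
zero in pagemap2), and the PFN occupies the low 55 bits:

	#include <stdint.h>

	#define PM2_PRESENT	(1ULL << 63)	/* PM_STATUS(4) */
	#define PM2_SWAP	(1ULL << 62)	/* PM_STATUS(2) */
	#define PM2_FILE	(1ULL << 61)	/* PM_STATUS(1) */
	#define PM2_PFN(x)	((x) & ((1ULL << 55) - 1))

	/* In pagemap2 the six bits 60-55 must read back as zero */
	static inline int pme2_reserved_bits_clear(uint64_t pme)
	{
		return ((pme >> 55) & 0x3f) == 0;
	}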
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/proc/task_mmu.c | 50 ++++++++++++++++++++++++++++++--------------------
1 files changed, 30 insertions(+), 20 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c59a148..7f9b66c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -805,6 +805,7 @@ typedef struct {
struct pagemapread {
int pos, len;
pagemap_entry_t *buffer;
+ bool v2;
};
#define PAGEMAP_WALK_SIZE (PMD_SIZE)
@@ -818,14 +819,16 @@ struct pagemapread {
#define PM_PSHIFT_BITS 6
#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
+#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+/* in pagemap2 pshift bits are occupied with more status bits */
+#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
#define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT PM_PSHIFT(PAGE_SHIFT)
+#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
#define PM_END_OF_BUFFER 1
static inline pagemap_entry_t make_pme(u64 val)
@@ -848,7 +851,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
struct pagemapread *pm = walk->private;
unsigned long addr;
int err = 0;
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+ pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
for (addr = start; addr < end; addr += PAGE_SIZE) {
err = add_to_pagemap(addr, &pme, pm);
@@ -858,7 +861,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
return err;
}
-static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
u64 frame, flags;
@@ -877,18 +880,18 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme,
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
} else {
- *pme = make_pme(PM_NOT_PRESENT);
+ *pme = make_pme(PM_NOT_PRESENT(pm->v2));
return;
}
if (page && !PageAnon(page))
flags |= PM_FILE;
- *pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
+ *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, 0) | flags);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pmd_t pmd, int offset)
{
/*
@@ -898,12 +901,12 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
*/
if (pmd_present(pmd))
*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
- | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+ | PM_STATUS2(pm->v2, 0) | PM_PRESENT);
else
- *pme = make_pme(PM_NOT_PRESENT);
+ *pme = make_pme(PM_NOT_PRESENT(pm->v2));
}
#else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme,
+static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pmd_t pmd, int offset)
{
}
@@ -916,7 +919,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct pagemapread *pm = walk->private;
pte_t *pte;
int err = 0;
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT);
+ pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
/* find the first VMA at or above 'addr' */
vma = find_vma(walk->mm, addr);
@@ -926,7 +929,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+ thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -943,7 +946,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
* and need a new, higher one */
if (vma && (addr >= vma->vm_end)) {
vma = find_vma(walk->mm, addr);
- pme = make_pme(PM_NOT_PRESENT);
+ pme = make_pme(PM_NOT_PRESENT(pm->v2));
}
/* check that 'vma' actually covers this address,
@@ -951,7 +954,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (vma && (vma->vm_start <= addr) &&
!is_vm_hugetlb_page(vma)) {
pte = pte_offset_map(pmd, addr);
- pte_to_pagemap_entry(&pme, vma, addr, *pte);
+ pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
/* unmap before userspace copy */
pte_unmap(pte);
}
@@ -966,14 +969,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}
#ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme,
+static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pte_t pte, int offset)
{
if (pte_present(pte))
*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
- | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
+ | PM_STATUS2(pm->v2, 0) | PM_PRESENT);
else
- *pme = make_pme(PM_NOT_PRESENT);
+ *pme = make_pme(PM_NOT_PRESENT(pm->v2));
}
/* This function walks within one hugetlb entry in the single call */
@@ -987,7 +990,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
- huge_pte_to_pagemap_entry(&pme, *pte, offset);
+ huge_pte_to_pagemap_entry(&pme, pm, *pte, offset);
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
@@ -1023,8 +1026,8 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
-static ssize_t pagemap_read(struct file *file, char __user *buf,
- size_t count, loff_t *ppos)
+static ssize_t do_pagemap_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos, bool v2)
{
struct task_struct *task = get_proc_task(file_inode(file));
struct mm_struct *mm;
@@ -1049,6 +1052,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!count)
goto out_task;
+ pm.v2 = v2;
pm.len = PM_ENTRY_BYTES * (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1121,6 +1125,12 @@ out:
return ret;
}
+static ssize_t pagemap_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ return do_pagemap_read(file, buf, count, ppos, false);
+}
+
const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
--
1.7.6.5
On Thu, 11 Apr 2013 15:28:51 +0400 Pavel Emelyanov <[email protected]> wrote:
> A new clear-refs type will be added in the next patch, so prepare
> code for that.
>
> @@ -730,7 +733,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> char buffer[PROC_NUMBUF];
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> - int type;
> + enum clear_refs_types type;
> int rv;
>
> memset(buffer, 0, sizeof(buffer));
> @@ -738,10 +741,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> count = sizeof(buffer) - 1;
> if (copy_from_user(buffer, buf, count))
> return -EFAULT;
> - rv = kstrtoint(strstrip(buffer), 10, &type);
> + rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
This is naughty. The compiler is allowed to put the enum into storage
which is smaller (or, I guess, larger) than sizeof(int). I've seen one
compiler which puts such an enum into a 16-bit word.
--- a/fs/proc/task_mmu.c~clear_refs-sanitize-accepted-commands-declaration-fix
+++ a/fs/proc/task_mmu.c
@@ -734,6 +734,7 @@ static ssize_t clear_refs_write(struct f
struct mm_struct *mm;
struct vm_area_struct *vma;
enum clear_refs_types type;
+ int itype;
int rv;
memset(buffer, 0, sizeof(buffer));
@@ -741,9 +742,10 @@ static ssize_t clear_refs_write(struct f
count = sizeof(buffer) - 1;
if (copy_from_user(buffer, buf, count))
return -EFAULT;
- rv = kstrtoint(strstrip(buffer), 10, (int *)&type);
+ rv = kstrtoint(strstrip(buffer), 10, &itype);
if (rv < 0)
return rv;
+ type = (enum clear_refs_types)itype;
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;
task = get_proc_task(file_inode(file));
_
On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov <[email protected]> wrote:
> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.
I'm not understanding the motivation for this. What does the current
/proc/pid/pagemap have in those bit positions?
On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <[email protected]> wrote:
> The soft-dirty is a bit on a PTE which helps to track which pages a task
> writes to. In order to do this tracking one should
>
> 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
> 2. Wait some time.
> 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
>
> To do this tracking, the writable bit is cleared from PTEs when the
> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> bit on the respective PTE.
>
> Note, that although all the task's address space is marked as r/o after the
> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> This is so, since the pages are still mapped to physical memory, and thus
> all the kernel does is finds this fact out and puts back writable, dirty
> and soft-dirty bits on the PTE.
>
> Another thing to note, is that when mremap moves PTEs they are marked with
> soft-dirty as well, since from the user perspective mremap modifies the
> virtual memory at mremap's new address.
>
> ...
>
> +config MEM_SOFT_DIRTY
> + bool "Track memory changes"
> + depends on CHECKPOINT_RESTORE && X86
I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
a general facility and I expect others will want to get their hands on
it for unrelated things.
From that perspective, the dependency on X86 is awful. What's the
problem here and what do other architectures need to do to be able to
support the feature?
You have a test application, I assume. It would be helpful if we could
get that into tools/testing/selftests.
On 04/12/2013 01:19 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:29:41 +0400 Pavel Emelyanov <[email protected]> wrote:
>
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
>
> I'm not understanding the motivation for this. What does the current
> /proc/pid/pagemap have in those bit positions?
A constant PAGE_SHIFT value.
>
> .
>
Thanks,
Pavel
On 04/12/2013 01:24 AM, Andrew Morton wrote:
> On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <[email protected]> wrote:
>
>> The soft-dirty is a bit on a PTE which helps to track which pages a task
>> writes to. In order to do this tracking one should
>>
>> 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
>> 2. Wait some time.
>> 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
>>
>> To do this tracking, the writable bit is cleared from PTEs when the
>> soft-dirty bit is. Thus, after this, when the task tries to modify a page
>> at some virtual address the #PF occurs and the kernel sets the soft-dirty
>> bit on the respective PTE.
>>
>> Note, that although all the task's address space is marked as r/o after the
>> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
>> This is so, since the pages are still mapped to physical memory, and thus
>> all the kernel does is finds this fact out and puts back writable, dirty
>> and soft-dirty bits on the PTE.
>>
>> Another thing to note, is that when mremap moves PTEs they are marked with
>> soft-dirty as well, since from the user perspective mremap modifies the
>> virtual memory at mremap's new address.
>>
>> ...
>>
>> +config MEM_SOFT_DIRTY
>> + bool "Track memory changes"
>> + depends on CHECKPOINT_RESTORE && X86
>
> I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> a general facility and I expect others will want to get their hands on
> it for unrelated things.
OK. Just tell me when you need the dependency removing patch.
> From that perspective, the dependency on X86 is awful. What's the
> problem here and what do other architectures need to do to be able to
> support the feature?
The problem here is that I don't know what free bits are available on
page table entries on other architectures. I plan to resolve this for
ARM very soon, but for the rest of them I need help from other people.
> You have a test application, I assume. It would be helpful if we could
> get that into tools/testing/selftests.
If a very stupid 10-line test is OK, then I can cook up a patch with it.
Other than that, I test this using the whole CRIU project, which is too
big for inclusion.
Thanks,
Pavel
It creates a mapping of 3 pages and checks that reads, writes and
clear-refs result in the present and soft-dirty bits reported by pagemap2
being set as expected.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 575ef80..827f2c0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += cpu-hotplug
TARGETS += memory-hotplug
TARGETS += efivarfs
TARGETS += ptrace
+TARGETS += soft-dirty
all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/soft-dirty/Makefile b/tools/testing/selftests/soft-dirty/Makefile
new file mode 100644
index 0000000..a9cdc82
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -iquote../../../../include/uapi -Wall
+soft-dirty: soft-dirty.c
+
+all: soft-dirty
+
+clean:
+ rm -f soft-dirty
+
+run_tests: all
+ @./soft-dirty || echo "soft-dirty selftests: [FAIL]"
diff --git a/tools/testing/selftests/soft-dirty/soft-dirty.c b/tools/testing/selftests/soft-dirty/soft-dirty.c
new file mode 100644
index 0000000..aba4f87
--- /dev/null
+++ b/tools/testing/selftests/soft-dirty/soft-dirty.c
@@ -0,0 +1,114 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+
+typedef unsigned long long u64;
+
+#define PME_PRESENT (1ULL << 63)
+#define PME_SOFT_DIRTY (1ULL << 55)
+
+#define PAGES_TO_TEST 3
+#ifndef PAGE_SIZE
+#define PAGE_SIZE 4096
+#endif
+
+static void get_pagemap2(char *mem, u64 *map)
+{
+ int fd;
+
+ fd = open("/proc/self/pagemap2", O_RDONLY);
+ if (fd < 0) {
+ perror("Can't open pagemap2");
+ exit(1);
+ }
+
+ lseek(fd, (unsigned long)mem / PAGE_SIZE * sizeof(u64), SEEK_SET);
+ read(fd, map, sizeof(u64) * PAGES_TO_TEST);
+ close(fd);
+}
+
+static inline char map_p(u64 map)
+{
+ return map & PME_PRESENT ? 'p' : '-';
+}
+
+static inline char map_sd(u64 map)
+{
+ return map & PME_SOFT_DIRTY ? 'd' : '-';
+}
+
+static int check_pte(int step, int page, u64 *map, u64 want)
+{
+ if ((map[page] & want) != want) {
+ printf("Step %d Page %d has %c%c, want %c%c\n",
+ step, page,
+ map_p(map[page]), map_sd(map[page]),
+ map_p(want), map_sd(want));
+ return 1;
+ }
+
+ return 0;
+}
+
+static void clear_refs(void)
+{
+ int fd;
+ char *v = "4";
+
+ fd = open("/proc/self/clear_refs", O_WRONLY);
+ if (write(fd, v, 1) != 1) {
+ perror("Can't clear soft-dirty bit");
+ exit(1);
+ }
+ close(fd);
+}
+
+int main(void)
+{
+ char *mem, x;
+ u64 map[PAGES_TO_TEST];
+
+ mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
+ PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+
+ x = mem[0];
+ mem[2 * PAGE_SIZE] = 'c';
+ get_pagemap2(mem, map);
+
+ if (check_pte(1, 0, map, PME_PRESENT))
+ return 1;
+ if (check_pte(1, 1, map, 0))
+ return 1;
+ if (check_pte(1, 2, map, PME_PRESENT | PME_SOFT_DIRTY))
+ return 1;
+
+ clear_refs();
+ get_pagemap2(mem, map);
+
+ if (check_pte(2, 0, map, PME_PRESENT))
+ return 1;
+ if (check_pte(2, 1, map, 0))
+ return 1;
+ if (check_pte(2, 2, map, PME_PRESENT))
+ return 1;
+
+ mem[0] = 'a';
+ mem[PAGE_SIZE] = 'b';
+ x = mem[2 * PAGE_SIZE];
+ get_pagemap2(mem, map);
+
+ if (check_pte(3, 0, map, PME_PRESENT | PME_SOFT_DIRTY))
+ return 1;
+ if (check_pte(3, 1, map, PME_PRESENT | PME_SOFT_DIRTY))
+ return 1;
+ if (check_pte(3, 2, map, PME_PRESENT))
+ return 1;
+
+ (void)x; /* gcc warn */
+
+ printf("PASS\n");
+ return 0;
+}
On Fri, 12 Apr 2013 17:14:03 +0400 Pavel Emelyanov <[email protected]> wrote:
> On 04/12/2013 01:24 AM, Andrew Morton wrote:
> > On Thu, 11 Apr 2013 15:30:00 +0400 Pavel Emelyanov <[email protected]> wrote:
> >
> >> The soft-dirty is a bit on a PTE which helps to track which pages a task
> >> writes to. In order to do this tracking one should
> >>
> >> 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
> >> 2. Wait some time.
> >> 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
> >>
> >> To do this tracking, the writable bit is cleared from PTEs when the
> >> soft-dirty bit is. Thus, after this, when the task tries to modify a page
> >> at some virtual address the #PF occurs and the kernel sets the soft-dirty
> >> bit on the respective PTE.
> >>
> >> Note, that although all the task's address space is marked as r/o after the
> >> soft-dirty bits clear, the #PF-s that occur after that are processed fast.
> >> This is so, since the pages are still mapped to physical memory, and thus
> >> all the kernel does is finds this fact out and puts back writable, dirty
> >> and soft-dirty bits on the PTE.
> >>
> >> Another thing to note, is that when mremap moves PTEs they are marked with
> >> soft-dirty as well, since from the user perspective mremap modifies the
> >> virtual memory at mremap's new address.
> >>
> >> ...
> >>
> >> +config MEM_SOFT_DIRTY
> >> + bool "Track memory changes"
> >> + depends on CHECKPOINT_RESTORE && X86
> >
> > I guess we can add the CHECKPOINT_RESTORE dependency for now, but it is
> > a general facility and I expect others will want to get their hands on
> > it for unrelated things.
>
> OK. Just tell me when you need the dependency removing patch.
>
> > From that perspective, the dependency on X86 is awful. What's the
> > problem here and what do other architectures need to do to be able to
> > support the feature?
>
> The problem here is that I don't know what free bits are available on
> page table entries on other architectures. I was about to resolve this
> for ARM very soon, but for the rest of them I need help from other people.
Well, this is also a thing arch maintainers can do when they feel a
need to support the feature on their architecture. To support them at
that time we should provide them with a) adequate information in an
easy-to-find place (eg, a nice comment at the site of the reference x86
implementation) and b) a userspace test app.
> > You have a test application, I assume. It would be helpful if we could
> > get that into tools/testing/selftests.
>
> If a very stupid 10-lines test is OK, then I can cook a patch with it.
I think that would be good. As a low-priority thing, please.
On Mon, 15 Apr 2013 14:46:19 -0700 Andrew Morton <[email protected]> wrote:
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture. To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.
and c) a CONFIG symbol (maybe CONFIG_HAVE_MEM_SOFT_DIRTY, maybe in
arch/Kconfig) that they can select to get this feature (so that this
feature can then depend on that CONFIG symbol instead of X86). That way we
don't have to go back and tidy this up when 15 or so architectures
implement it.
--
Cheers,
Stephen Rothwell [email protected]
As Stephen Rothwell pointed out, config options that depend on
architecture support are better wrapped into a select +
depends-on scheme.
Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
on X86.
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Stephen Rothwell <[email protected]>
---
diff --git a/arch/Kconfig b/arch/Kconfig
index 1455579..71c06ab 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_IRQ_TIME_ACCOUNTING
config HAVE_ARCH_TRANSPARENT_HUGEPAGE
bool
+config HAVE_ARCH_SOFT_DIRTY
+ bool
+
config HAVE_MOD_ARCH_SPECIFIC
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 70c0f3d..81c0843 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
select COMPAT_OLD_SIGACTION if IA32_EMULATION
+ select HAVE_ARCH_SOFT_DIRTY
config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index eb97470..ebf9373 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -294,8 +294,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_PRESENT);
}
-#define __HAVE_SOFT_DIRTY
-
static inline int pte_soft_dirty(pte_t pte)
{
return pte_flags(pte) & _PAGE_SOFT_DIRTY;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d74bdd2..a2ca78f 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -386,7 +386,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif
-#ifndef __HAVE_SOFT_DIRTY
+#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline int pte_soft_dirty(pte_t pte)
{
return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 147689e..7deac66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -474,7 +474,7 @@ config FRONTSWAP
config MEM_SOFT_DIRTY
bool "Track memory changes"
- depends on CHECKPOINT_RESTORE && X86
+ depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
select PROC_PAGE_MONITOR
help
This option enables memory changes tracking by introducing a
>>> From that perspective, the dependency on X86 is awful. What's the
>>> problem here and what do other architectures need to do to be able to
>>> support the feature?
>>
>> The problem here is that I don't know what free bits are available on
>> page table entries on other architectures. I was about to resolve this
>> for ARM very soon, but for the rest of them I need help from other people.
>
> Well, this is also a thing arch maintainers can do when they feel a
> need to support the feature on their architecture. To support them at
> that time we should provide them with a) adequate information in an
> easy-to-find place (eg, a nice comment at the site of the reference x86
> implementation) and b) a userspace test app.
Item a) is presumably covered by two things -- the required arch-specific
PTE manipulations are all collected in asm-generic/pgtable.h under
!CONFIG_HAVE_ARCH_SOFT_DIRTY, and Documentation/vm/soft-dirty.txt is
pointed to by the comment on the clear_refs_soft_dirty() API.
Item b) was recently merged.
Item c) from Stephen is already sent.
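For arch maintainers reading along, a sketch of what the enablement boils
down to, mirroring the x86 patches of this series. The arch-specific part
is picking a software-available PTE bit for _PAGE_SOFT_DIRTY;
pte_flags()/pte_set_flags() below stand in for whatever PTE accessors the
architecture provides:

	/* arch/<arch>/Kconfig: select HAVE_ARCH_SOFT_DIRTY */

	/*
	 * arch/<arch>/include/asm/pgtable.h: provide the four helpers
	 * that asm-generic/pgtable.h otherwise stubs out, and make
	 * pte_mkdirty()/pmd_mkdirty() set _PAGE_SOFT_DIRTY as well.
	 */
	static inline int pte_soft_dirty(pte_t pte)
	{
		return pte_flags(pte) & _PAGE_SOFT_DIRTY;
	}

	static inline pte_t pte_mksoft_dirty(pte_t pte)
	{
		return pte_set_flags(pte, _PAGE_SOFT_DIRTY);
	}

	static inline int pmd_soft_dirty(pmd_t pmd)
	{
		return pmd_flags(pmd) & _PAGE_SOFT_DIRTY;
	}

	static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
	{
		return pmd_set_flags(pmd, _PAGE_SOFT_DIRTY);
	}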
Thank you for your time and help,
Pavel
Hi Pavel,
On Tue, 16 Apr 2013 23:51:36 +0400 Pavel Emelyanov <[email protected]> wrote:
>
> As Stephen Rothwell pointed out, config options, that depend on
> architecture support, are better to be wrapped into a select +
> depends on scheme.
>
> Do this for CONFIG_MEM_SOFT_DIRTY, as it currently works only
> for X86.
>
> Signed-off-by: Pavel Emelyanov <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
Acked-by: Stephen Rothwell <[email protected]>
--
Cheers,
Stephen Rothwell [email protected]
On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
> This file is the same as the pagemap one, but shows entries with bits
> 55-60 being zero (reserved for future use). Next patch will occupy one
> of them.
This approach doesn't scale as well as it could. As best I can see
CRIU would do:
for each vma in /proc/<pid>/smaps
for each page in /proc/<pid>/pagemap2
if soft dirty bit
copy page
(possibly with pfn checks to avoid copying the same page mapped in
multiple locations..)
However, if soft dirty bit changes could be queued up (from say the
fault handler and page table ops that map/unmap pages) and accumulated
in something like an interval tree, the scan could become:
for each range of changed pages
for each page in range
copy page
IOW something that scales with the number of changed pages rather
than the number of mapped pages.
So I wonder if CRIU would abandon pagemap2 in the future for something
like this.
Cheers,
-Matt Helsley
On 05/02/2013 09:08 PM, Matt Helsley wrote:
> On Thu, Apr 11, 2013 at 03:29:41PM +0400, Pavel Emelyanov wrote:
>> This file is the same as the pagemap one, but shows entries with bits
>> 55-60 being zero (reserved for future use). Next patch will occupy one
>> of them.
>
> This approach doesn't scale as well as it could. As best I can see
> CRIU would do:
>
> for each vma in /proc/<pid>/smaps
> for each page in /proc/<pid>/pagemap2
> if soft dirty bit
> copy page
>
> (possibly with pfn checks to avoid copying the same page mapped in
> multiple locations..)
Comparing pfns obtained from two subsequent pagemap reads doesn't help at
all. If they are equal, this can mean either that the page is shared or
(less likely, but still) that the page that used to be at the 1st address
was reclaimed and mapped at the 2nd between the two reads. If they differ,
it can again mean either not-shared (most likely) or shared (the pfns were
equal, but the page got reclaimed and swapped back in).
Some better API for detecting page sharing would be nice; such an API
could probably also be re-used for user-space KSM :)
> However, if soft dirty bit changes could be queued up (from say the
> fault handler and page table ops that map/unmap pages) and accumulated
> in something like an interval tree it could be something like:
>
> for each range of changed pages
> for each page in range
> copy page
>
> IOW something that scales with the number of changed pages rather
> than the number of mapped pages.
>
> So I wonder if CRIU would abandon pagemap2 in the future for something
> like this.
We'd surely adopt such an API if one exists. One thing to note is that we'd
also appreciate it if this API were able to batch "present" bits as well
as "swapped" and "page-file" ones. We use these three in CRIU as well, and
scanning those bits can also be optimized.
> Cheers,
> -Matt Helsley
>
Thanks,
Pavel