zsmalloc uses set_pte and __flush_tlb_one for performance, but
many architectures don't support them, so this patch removes
set_pte and __flush_tlb_one, which are an x86 dependency.
Instead, use local_flush_tlb_kernel_range, which is available
on more architectures. That is better than supporting only x86,
and the last patch in the series will enable x86 again by adding
local_flush_tlb_kernel_range support there.
About local_flush_tlb_kernel_range:
If the architecture is very smart, it can flush only the tlb entries related to the vaddr.
If the architecture is smart, it can flush only the tlb entries related to a CPU.
If the architecture is _NOT_ smart, it can flush all entries of all CPUs.
So this approach gives the best of both portability and performance.
Cc: Russell King <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Chen Liqin <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
The ARM, MIPS, and SUPERH maintainers need to double-check that
local_flush_tlb_kernel_range is supported as expected. I am also Cc'ing the
unicore32 and score maintainers because the arch directories of those
architectures have local_flush_tlb_kernel_range too, but I'm very unfamiliar
with those architectures, so I leave it to the maintainers.
I didn't code up a dumb local_flush_tlb_kernel_range which flushes
all CPUs. I expect anyone who needs ZSMALLOC can implement it easily in the future.
Seth might support it on PowerPC. :)
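For reference, a not-smart fallback on such an arch could be as simple as
deferring to the existing global flush; a rough, untested sketch:

	/* not-smart fallback: flushes all entries on all CPUs */
	static inline void local_flush_tlb_kernel_range(unsigned long start,
							unsigned long end)
	{
		flush_tlb_kernel_range(start, end);
	}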
drivers/staging/zsmalloc/Kconfig | 6 ++---
drivers/staging/zsmalloc/zsmalloc-main.c | 36 +++++++++++++++++++++---------
drivers/staging/zsmalloc/zsmalloc_int.h | 1 -
3 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
index a5ab720..def2483 100644
--- a/drivers/staging/zsmalloc/Kconfig
+++ b/drivers/staging/zsmalloc/Kconfig
@@ -1,9 +1,9 @@
config ZSMALLOC
tristate "Memory allocator for compressed pages"
- # X86 dependency is because of the use of __flush_tlb_one and set_pte
+ # arch dependency is because of the use of local_unmap_kernel_range
# in zsmalloc-main.c.
- # TODO: convert these to portable functions
- depends on X86
+ # TODO: implement local_unmap_kernel_range in all architecture.
+ depends on (ARM || MIPS || SUPERH)
default n
help
zsmalloc is a slab-based memory allocator designed to store
diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
index 4496737..8a8b08f 100644
--- a/drivers/staging/zsmalloc/zsmalloc-main.c
+++ b/drivers/staging/zsmalloc/zsmalloc-main.c
@@ -442,7 +442,7 @@ static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
area = &per_cpu(zs_map_area, cpu);
if (area->vm)
break;
- area->vm = alloc_vm_area(2 * PAGE_SIZE, area->vm_ptes);
+ area->vm = alloc_vm_area(2 * PAGE_SIZE, NULL);
if (!area->vm)
return notifier_from_errno(-ENOMEM);
break;
@@ -696,13 +696,22 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
} else {
/* this object spans two pages */
struct page *nextp;
+ struct page *pages[2];
+ struct page **page_array = &pages[0];
+ int err;
nextp = get_next_page(page);
BUG_ON(!nextp);
+ page_array[0] = page;
+ page_array[1] = nextp;
- set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
- set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
+ /*
+ * map_vm_area never fail because we already allocated
+ * pages for page table in alloc_vm_area.
+ */
+ err = map_vm_area(area->vm, PAGE_KERNEL, &page_array);
+ BUG_ON(err);
/* We pre-allocated VM area so mapping can never fail */
area->vm_addr = area->vm->addr;
@@ -712,6 +721,15 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
}
EXPORT_SYMBOL_GPL(zs_map_object);
+static void local_unmap_kernel_range(unsigned long addr, unsigned long size)
+{
+ unsigned long end = addr + size;
+
+ flush_cache_vunmap(addr, end);
+ unmap_kernel_range_noflush(addr, size);
+ local_flush_tlb_kernel_range(addr, end);
+}
+
void zs_unmap_object(struct zs_pool *pool, void *handle)
{
struct page *page;
@@ -730,14 +748,12 @@ void zs_unmap_object(struct zs_pool *pool, void *handle)
off = obj_idx_to_offset(page, obj_idx, class->size);
area = &__get_cpu_var(zs_map_area);
- if (off + class->size <= PAGE_SIZE) {
+ if (off + class->size <= PAGE_SIZE)
kunmap_atomic(area->vm_addr);
- } else {
- set_pte(area->vm_ptes[0], __pte(0));
- set_pte(area->vm_ptes[1], __pte(0));
- __flush_tlb_one((unsigned long)area->vm_addr);
- __flush_tlb_one((unsigned long)area->vm_addr + PAGE_SIZE);
- }
+ else
+ local_unmap_kernel_range((unsigned long)area->vm->addr,
+ PAGE_SIZE * 2);
+
put_cpu_var(zs_map_area);
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
diff --git a/drivers/staging/zsmalloc/zsmalloc_int.h b/drivers/staging/zsmalloc/zsmalloc_int.h
index 6fd32a9..eaec845 100644
--- a/drivers/staging/zsmalloc/zsmalloc_int.h
+++ b/drivers/staging/zsmalloc/zsmalloc_int.h
@@ -111,7 +111,6 @@ static const int fullness_threshold_frac = 4;
struct mapping_area {
struct vm_struct *vm;
- pte_t *vm_ptes[2];
char *vm_addr;
};
--
1.7.9.5
zsmalloc frequently [un]maps non-physically-contiguous pages to contiguous
virtual addresses, so it needs frequent tlb flushes.
x86 currently has no common utility function for flushing just
a few tlb entries, so zsmalloc has been using set_pte and __flush_tlb_one,
which are x86-specific functions. That gives zsmalloc a dependency
on x86.
This patch adds a new function, local_flush_tlb_kernel_range, which
is a good candidate for a common utility function because other
architectures (e.g. MIPS, sh, unicore32, arm, score) already
support it.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
Cc: [email protected]
Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
drivers/staging/zsmalloc/Kconfig | 2 +-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4ece077..6e1253a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
flush_tlb_all();
}
+static inline void local_flush_tlb_kernel_range(unsigned long start,
+ unsigned long end)
+{
+ if (cpu_has_invlpg) {
+ while (start < end) {
+ __flush_tlb_single(start);
+ start += PAGE_SIZE;
+ }
+ } else
+ local_flush_tlb();
+}
+
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
index def2483..29819b8 100644
--- a/drivers/staging/zsmalloc/Kconfig
+++ b/drivers/staging/zsmalloc/Kconfig
@@ -3,7 +3,7 @@ config ZSMALLOC
# arch dependency is because of the use of local_unmap_kernel_range
# in zsmalloc-main.c.
# TODO: implement local_unmap_kernel_range in all architecture.
- depends on (ARM || MIPS || SUPERH)
+ depends on (ARM || MIPS || SUPERH || X86)
default n
help
zsmalloc is a slab-based memory allocator designed to store
--
1.7.9.5
Strictly speaking, [zram|zcache] should have a dependency on
zsmalloc, not x86. So replace the x86 dependency with ZSMALLOC.
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
drivers/staging/zcache/Kconfig | 3 +--
drivers/staging/zram/Kconfig | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
index 7048e01..ceb7f28 100644
--- a/drivers/staging/zcache/Kconfig
+++ b/drivers/staging/zcache/Kconfig
@@ -2,8 +2,7 @@ config ZCACHE
bool "Dynamic compression of swap pages and clean pagecache pages"
# X86 dependency is because zsmalloc uses non-portable pte/tlb
# functions
- depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && X86
- select ZSMALLOC
+ depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC
select CRYPTO_LZO
default n
help
diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig
index 9d11a4c..e3ac62d 100644
--- a/drivers/staging/zram/Kconfig
+++ b/drivers/staging/zram/Kconfig
@@ -2,8 +2,7 @@ config ZRAM
tristate "Compressed RAM block device support"
# X86 dependency is because zsmalloc uses non-portable pte/tlb
# functions
- depends on BLOCK && SYSFS && X86
- select ZSMALLOC
+ depends on BLOCK && SYSFS && ZSMALLOC
select LZO_COMPRESS
select LZO_DECOMPRESS
default n
--
1.7.9.5
On Wed, 2012-05-16 at 11:05 +0900, Minchan Kim wrote:
> zsmalloc uses set_pte and __flush_tlb_one for performance but
> many architecture don't support it. so this patch removes
> set_pte and __flush_tlb_one which are x86 dependency.
> Instead of it, use local_flush_tlb_kernel_range which are available
> by more architectures. It would be better than supporting only x86
> and last patch in series will enable again with supporting
> local_flush_tlb_kernel_range in x86.
>
> About local_flush_tlb_kernel_range,
> If architecture is very smart, it could flush only tlb entries related to vaddr.
> If architecture is smart, it could flush only tlb entries related to a CPU.
> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
> So, it would be best to support both portability and performance.
>
> Cc: Russell King <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Guan Xuetao <[email protected]>
> Cc: Chen Liqin <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
>
> Need double check about supporting local_flush_tlb_kernel_range
> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
> score maintainers because arch directory in those arch have
> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
> architecture so pass it to maintainers.
> I didn't coded up dumb local_flush_tlb_kernel_range which flush
> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
> Seth might support it in PowerPC. :)
>
>
> drivers/staging/zsmalloc/Kconfig | 6 ++---
> drivers/staging/zsmalloc/zsmalloc-main.c | 36 +++++++++++++++++++++---------
> drivers/staging/zsmalloc/zsmalloc_int.h | 1 -
> 3 files changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
> index a5ab720..def2483 100644
> --- a/drivers/staging/zsmalloc/Kconfig
> +++ b/drivers/staging/zsmalloc/Kconfig
> @@ -1,9 +1,9 @@
> config ZSMALLOC
> tristate "Memory allocator for compressed pages"
> - # X86 dependency is because of the use of __flush_tlb_one and set_pte
> + # arch dependency is because of the use of local_unmap_kernel_range
> # in zsmalloc-main.c.
> - # TODO: convert these to portable functions
> - depends on X86
> + # TODO: implement local_unmap_kernel_range in all architecture.
> + depends on (ARM || MIPS || SUPERH)
I suggest removing the above line, so that if I want to use zsmalloc, I can
enable this configuration easily.
> default n
> help
> zsmalloc is a slab-based memory allocator designed to store
> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
> index 4496737..8a8b08f 100644
> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> @@ -442,7 +442,7 @@ static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
> area = &per_cpu(zs_map_area, cpu);
> if (area->vm)
> break;
> - area->vm = alloc_vm_area(2 * PAGE_SIZE, area->vm_ptes);
> + area->vm = alloc_vm_area(2 * PAGE_SIZE, NULL);
> if (!area->vm)
> return notifier_from_errno(-ENOMEM);
> break;
> @@ -696,13 +696,22 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
> } else {
> /* this object spans two pages */
> struct page *nextp;
> + struct page *pages[2];
> + struct page **page_array = &pages[0];
> + int err;
>
> nextp = get_next_page(page);
> BUG_ON(!nextp);
>
> + page_array[0] = page;
> + page_array[1] = nextp;
>
> - set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
> - set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
> + /*
> + * map_vm_area never fail because we already allocated
> + * pages for page table in alloc_vm_area.
> + */
> + err = map_vm_area(area->vm, PAGE_KERNEL, &page_array);
> + BUG_ON(err);
I think WARN_ON() is better than BUG_ON() here.
>
> /* We pre-allocated VM area so mapping can never fail */
> area->vm_addr = area->vm->addr;
> @@ -712,6 +721,15 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
> }
> EXPORT_SYMBOL_GPL(zs_map_object);
>
> +static void local_unmap_kernel_range(unsigned long addr, unsigned long size)
> +{
> + unsigned long end = addr + size;
> +
> + flush_cache_vunmap(addr, end);
> + unmap_kernel_range_noflush(addr, size);
> + local_flush_tlb_kernel_range(addr, end);
> +}
> +
> void zs_unmap_object(struct zs_pool *pool, void *handle)
> {
> struct page *page;
> @@ -730,14 +748,12 @@ void zs_unmap_object(struct zs_pool *pool, void *handle)
> off = obj_idx_to_offset(page, obj_idx, class->size);
>
> area = &__get_cpu_var(zs_map_area);
> - if (off + class->size <= PAGE_SIZE) {
> + if (off + class->size <= PAGE_SIZE)
> kunmap_atomic(area->vm_addr);
> - } else {
> - set_pte(area->vm_ptes[0], __pte(0));
> - set_pte(area->vm_ptes[1], __pte(0));
> - __flush_tlb_one((unsigned long)area->vm_addr);
> - __flush_tlb_one((unsigned long)area->vm_addr + PAGE_SIZE);
> - }
> + else
> + local_unmap_kernel_range((unsigned long)area->vm->addr,
> + PAGE_SIZE * 2);
> +
> put_cpu_var(zs_map_area);
> }
> EXPORT_SYMBOL_GPL(zs_unmap_object);
> diff --git a/drivers/staging/zsmalloc/zsmalloc_int.h b/drivers/staging/zsmalloc/zsmalloc_int.h
> index 6fd32a9..eaec845 100644
> --- a/drivers/staging/zsmalloc/zsmalloc_int.h
> +++ b/drivers/staging/zsmalloc/zsmalloc_int.h
> @@ -111,7 +111,6 @@ static const int fullness_threshold_frac = 4;
>
> struct mapping_area {
> struct vm_struct *vm;
> - pte_t *vm_ptes[2];
> char *vm_addr;
> };
>
On 05/15/2012 09:05 PM, Minchan Kim wrote:
> Exactly saying, [zram|zcache] should has a dependency with
> zsmalloc, not x86. So replace x86 dependeny with ZSMALLOC.
>
> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> drivers/staging/zcache/Kconfig | 3 +--
> drivers/staging/zram/Kconfig | 3 +--
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
> index 7048e01..ceb7f28 100644
> --- a/drivers/staging/zcache/Kconfig
> +++ b/drivers/staging/zcache/Kconfig
> @@ -2,8 +2,7 @@ config ZCACHE
> bool "Dynamic compression of swap pages and clean pagecache pages"
> # X86 dependency is because zsmalloc uses non-portable pte/tlb
> # functions
> - depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && X86
> - select ZSMALLOC
> + depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC
Sorry Minchan, I should have said this the first time around. I ran
into this issue before with CRYPTO vs CRYPTO=y. ZCACHE is a bool whereas
ZSMALLOC is a tristate. It is not sufficient for ZSMALLOC to be set; it
_must_ be built in, otherwise you get linker errors.
The dependency should be ZSMALLOC=y.
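In other words, presumably something like:

	depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC=y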
Thanks,
Seth
On 05/16/2012 04:28 PM, Guan Xuetao wrote:
> On Wed, 2012-05-16 at 11:05 +0900, Minchan Kim wrote:
>> zsmalloc uses set_pte and __flush_tlb_one for performance but
>> many architecture don't support it. so this patch removes
>> set_pte and __flush_tlb_one which are x86 dependency.
>> Instead of it, use local_flush_tlb_kernel_range which are available
>> by more architectures. It would be better than supporting only x86
>> and last patch in series will enable again with supporting
>> local_flush_tlb_kernel_range in x86.
>>
>> About local_flush_tlb_kernel_range,
>> If architecture is very smart, it could flush only tlb entries related to vaddr.
>> If architecture is smart, it could flush only tlb entries related to a CPU.
>> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
>> So, it would be best to support both portability and performance.
>>
>> Cc: Russell King <[email protected]>
>> Cc: Ralf Baechle <[email protected]>
>> Cc: Paul Mundt <[email protected]>
>> Cc: Guan Xuetao <[email protected]>
>> Cc: Chen Liqin <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>>
>> Need double check about supporting local_flush_tlb_kernel_range
>> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
>> score maintainers because arch directory in those arch have
>> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
>> architecture so pass it to maintainers.
>> I didn't coded up dumb local_flush_tlb_kernel_range which flush
>> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
>> Seth might support it in PowerPC. :)
>>
>>
>> drivers/staging/zsmalloc/Kconfig | 6 ++---
>> drivers/staging/zsmalloc/zsmalloc-main.c | 36 +++++++++++++++++++++---------
>> drivers/staging/zsmalloc/zsmalloc_int.h | 1 -
>> 3 files changed, 29 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
>> index a5ab720..def2483 100644
>> --- a/drivers/staging/zsmalloc/Kconfig
>> +++ b/drivers/staging/zsmalloc/Kconfig
>> @@ -1,9 +1,9 @@
>> config ZSMALLOC
>> tristate "Memory allocator for compressed pages"
>> - # X86 dependency is because of the use of __flush_tlb_one and set_pte
>> + # arch dependency is because of the use of local_unmap_kernel_range
>> # in zsmalloc-main.c.
>> - # TODO: convert these to portable functions
>> - depends on X86
>> + # TODO: implement local_unmap_kernel_range in all architecture.
>> + depends on (ARM || MIPS || SUPERH)
> I suggest removing above line, so if I want to use zsmalloc, I could
> enable this configuration easily.
I don't get it. What do you mean?
If I remove the above line, a compile error will happen on any arch that doesn't support local_unmap_kernel_range.
>
>> default n
>> help
>> zsmalloc is a slab-based memory allocator designed to store
>> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
>> index 4496737..8a8b08f 100644
>> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
>> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
>> @@ -442,7 +442,7 @@ static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
>> area = &per_cpu(zs_map_area, cpu);
>> if (area->vm)
>> break;
>> - area->vm = alloc_vm_area(2 * PAGE_SIZE, area->vm_ptes);
>> + area->vm = alloc_vm_area(2 * PAGE_SIZE, NULL);
>> if (!area->vm)
>> return notifier_from_errno(-ENOMEM);
>> break;
>> @@ -696,13 +696,22 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
>> } else {
>> /* this object spans two pages */
>> struct page *nextp;
>> + struct page *pages[2];
>> + struct page **page_array = &pages[0];
>> + int err;
>>
>> nextp = get_next_page(page);
>> BUG_ON(!nextp);
>>
>> + page_array[0] = page;
>> + page_array[1] = nextp;
>>
>> - set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
>> - set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
>> + /*
>> + * map_vm_area never fail because we already allocated
>> + * pages for page table in alloc_vm_area.
>> + */
>> + err = map_vm_area(area->vm, PAGE_KERNEL, &page_array);
>> + BUG_ON(err);
> I think WARN_ON() is better than BUG_ON() here.
If we don't BUG_ON here, zsmalloc's user can end up using a dangling pointer, which can make the system very
unstable, even cause fatal errors.
>
>>
>> /* We pre-allocated VM area so mapping can never fail */
>> area->vm_addr = area->vm->addr;
>> @@ -712,6 +721,15 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
>> }
>> EXPORT_SYMBOL_GPL(zs_map_object);
>>
>> +static void local_unmap_kernel_range(unsigned long addr, unsigned long size)
>> +{
>> + unsigned long end = addr + size;
>> +
>> + flush_cache_vunmap(addr, end);
>> + unmap_kernel_range_noflush(addr, size);
>> + local_flush_tlb_kernel_range(addr, end);
>> +}
>> +
>> void zs_unmap_object(struct zs_pool *pool, void *handle)
>> {
>> struct page *page;
>> @@ -730,14 +748,12 @@ void zs_unmap_object(struct zs_pool *pool, void *handle)
>> off = obj_idx_to_offset(page, obj_idx, class->size);
>>
>> area = &__get_cpu_var(zs_map_area);
>> - if (off + class->size <= PAGE_SIZE) {
>> + if (off + class->size <= PAGE_SIZE)
>> kunmap_atomic(area->vm_addr);
>> - } else {
>> - set_pte(area->vm_ptes[0], __pte(0));
>> - set_pte(area->vm_ptes[1], __pte(0));
>> - __flush_tlb_one((unsigned long)area->vm_addr);
>> - __flush_tlb_one((unsigned long)area->vm_addr + PAGE_SIZE);
>> - }
>> + else
>> + local_unmap_kernel_range((unsigned long)area->vm->addr,
>> + PAGE_SIZE * 2);
>> +
>> put_cpu_var(zs_map_area);
>> }
>> EXPORT_SYMBOL_GPL(zs_unmap_object);
>> diff --git a/drivers/staging/zsmalloc/zsmalloc_int.h b/drivers/staging/zsmalloc/zsmalloc_int.h
>> index 6fd32a9..eaec845 100644
>> --- a/drivers/staging/zsmalloc/zsmalloc_int.h
>> +++ b/drivers/staging/zsmalloc/zsmalloc_int.h
>> @@ -111,7 +111,6 @@ static const int fullness_threshold_frac = 4;
>>
>> struct mapping_area {
>> struct vm_struct *vm;
>> - pte_t *vm_ptes[2];
>> char *vm_addr;
>> };
>>
>
>
--
Kind regards,
Minchan Kim
On Thu, 2012-05-17 at 09:07 +0900, Minchan Kim wrote:
> On 05/16/2012 04:28 PM, Guan Xuetao wrote:
>
> > On Wed, 2012-05-16 at 11:05 +0900, Minchan Kim wrote:
> >> zsmalloc uses set_pte and __flush_tlb_one for performance but
> >> many architecture don't support it. so this patch removes
> >> set_pte and __flush_tlb_one which are x86 dependency.
> >> Instead of it, use local_flush_tlb_kernel_range which are available
> >> by more architectures. It would be better than supporting only x86
> >> and last patch in series will enable again with supporting
> >> local_flush_tlb_kernel_range in x86.
> >>
> >> About local_flush_tlb_kernel_range,
> >> If architecture is very smart, it could flush only tlb entries related to vaddr.
> >> If architecture is smart, it could flush only tlb entries related to a CPU.
> >> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
> >> So, it would be best to support both portability and performance.
> >>
> >> Cc: Russell King <[email protected]>
> >> Cc: Ralf Baechle <[email protected]>
> >> Cc: Paul Mundt <[email protected]>
> >> Cc: Guan Xuetao <[email protected]>
> >> Cc: Chen Liqin <[email protected]>
> >> Signed-off-by: Minchan Kim <[email protected]>
> >> ---
> >>
> >> Need double check about supporting local_flush_tlb_kernel_range
> >> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
> >> score maintainers because arch directory in those arch have
> >> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
> >> architecture so pass it to maintainers.
> >> I didn't coded up dumb local_flush_tlb_kernel_range which flush
> >> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
> >> Seth might support it in PowerPC. :)
> >>
> >>
> >> drivers/staging/zsmalloc/Kconfig | 6 ++---
> >> drivers/staging/zsmalloc/zsmalloc-main.c | 36 +++++++++++++++++++++---------
> >> drivers/staging/zsmalloc/zsmalloc_int.h | 1 -
> >> 3 files changed, 29 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
> >> index a5ab720..def2483 100644
> >> --- a/drivers/staging/zsmalloc/Kconfig
> >> +++ b/drivers/staging/zsmalloc/Kconfig
> >> @@ -1,9 +1,9 @@
> >> config ZSMALLOC
> >> tristate "Memory allocator for compressed pages"
> >> - # X86 dependency is because of the use of __flush_tlb_one and set_pte
> >> + # arch dependency is because of the use of local_unmap_kernel_range
> >> # in zsmalloc-main.c.
> >> - # TODO: convert these to portable functions
> >> - depends on X86
> >> + # TODO: implement local_unmap_kernel_range in all architecture.
> >> + depends on (ARM || MIPS || SUPERH)
> > I suggest removing above line, so if I want to use zsmalloc, I could
> > enable this configuration easily.
>
>
> I don't get it. What do you mean?
> If I remove above line, compile error will happen if arch doesn't support local_unmap_kernel_range.
If I want to use zsmalloc, I will verify the local_unmap_kernel_range
function. In fact, only local_flush_tlb_kernel_range needs to be
considered. So just keeping the default option 'n' is enough.
>
> >
> >> default n
> >> help
> >> zsmalloc is a slab-based memory allocator designed to store
> >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> index 4496737..8a8b08f 100644
> >> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> @@ -442,7 +442,7 @@ static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
> >> area = &per_cpu(zs_map_area, cpu);
> >> if (area->vm)
> >> break;
> >> - area->vm = alloc_vm_area(2 * PAGE_SIZE, area->vm_ptes);
> >> + area->vm = alloc_vm_area(2 * PAGE_SIZE, NULL);
> >> if (!area->vm)
> >> return notifier_from_errno(-ENOMEM);
> >> break;
> >> @@ -696,13 +696,22 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
> >> } else {
> >> /* this object spans two pages */
> >> struct page *nextp;
> >> + struct page *pages[2];
> >> + struct page **page_array = &pages[0];
> >> + int err;
> >>
> >> nextp = get_next_page(page);
> >> BUG_ON(!nextp);
> >>
> >> + page_array[0] = page;
> >> + page_array[1] = nextp;
> >>
> >> - set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
> >> - set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
> >> + /*
> >> + * map_vm_area never fail because we already allocated
> >> + * pages for page table in alloc_vm_area.
> >> + */
> >> + err = map_vm_area(area->vm, PAGE_KERNEL, &page_array);
> >> + BUG_ON(err);
> > I think WARN_ON() is better than BUG_ON() here.
>
>
> If we don't do BUG_ON, zsmalloc's user can use dangling pointer so that it can make system very
> unstable, even fatal.
The majority of archs treat BUG_ON as a panic, so if zsmalloc hits such an
error, the kernel will be halted immediately. Is that the result you want?
If yes, just ignore my suggestion.
>
> >
> >>
> >> /* We pre-allocated VM area so mapping can never fail */
> >> area->vm_addr = area->vm->addr;
> >> @@ -712,6 +721,15 @@ void *zs_map_object(struct zs_pool *pool, void *handle)
> >> }
> >> EXPORT_SYMBOL_GPL(zs_map_object);
> >>
> >> +static void local_unmap_kernel_range(unsigned long addr, unsigned long size)
> >> +{
> >> + unsigned long end = addr + size;
> >> +
> >> + flush_cache_vunmap(addr, end);
> >> + unmap_kernel_range_noflush(addr, size);
> >> + local_flush_tlb_kernel_range(addr, end);
> >> +}
> >> +
> >> void zs_unmap_object(struct zs_pool *pool, void *handle)
> >> {
> >> struct page *page;
> >> @@ -730,14 +748,12 @@ void zs_unmap_object(struct zs_pool *pool, void *handle)
> >> off = obj_idx_to_offset(page, obj_idx, class->size);
> >>
> >> area = &__get_cpu_var(zs_map_area);
> >> - if (off + class->size <= PAGE_SIZE) {
> >> + if (off + class->size <= PAGE_SIZE)
> >> kunmap_atomic(area->vm_addr);
> >> - } else {
> >> - set_pte(area->vm_ptes[0], __pte(0));
> >> - set_pte(area->vm_ptes[1], __pte(0));
> >> - __flush_tlb_one((unsigned long)area->vm_addr);
> >> - __flush_tlb_one((unsigned long)area->vm_addr + PAGE_SIZE);
> >> - }
> >> + else
> >> + local_unmap_kernel_range((unsigned long)area->vm->addr,
> >> + PAGE_SIZE * 2);
> >> +
> >> put_cpu_var(zs_map_area);
> >> }
> >> EXPORT_SYMBOL_GPL(zs_unmap_object);
> >> diff --git a/drivers/staging/zsmalloc/zsmalloc_int.h b/drivers/staging/zsmalloc/zsmalloc_int.h
> >> index 6fd32a9..eaec845 100644
> >> --- a/drivers/staging/zsmalloc/zsmalloc_int.h
> >> +++ b/drivers/staging/zsmalloc/zsmalloc_int.h
> >> @@ -111,7 +111,6 @@ static const int fullness_threshold_frac = 4;
> >>
> >> struct mapping_area {
> >> struct vm_struct *vm;
> >> - pte_t *vm_ptes[2];
> >> char *vm_addr;
> >> };
> >>
> >
> >
>
>
>
On 05/17/2012 09:56 AM, Guan Xuetao wrote:
> On Thu, 2012-05-17 at 09:07 +0900, Minchan Kim wrote:
>> On 05/16/2012 04:28 PM, Guan Xuetao wrote:
>>
>>> On Wed, 2012-05-16 at 11:05 +0900, Minchan Kim wrote:
>>>> zsmalloc uses set_pte and __flush_tlb_one for performance but
>>>> many architecture don't support it. so this patch removes
>>>> set_pte and __flush_tlb_one which are x86 dependency.
>>>> Instead of it, use local_flush_tlb_kernel_range which are available
>>>> by more architectures. It would be better than supporting only x86
>>>> and last patch in series will enable again with supporting
>>>> local_flush_tlb_kernel_range in x86.
>>>>
>>>> About local_flush_tlb_kernel_range,
>>>> If architecture is very smart, it could flush only tlb entries related to vaddr.
>>>> If architecture is smart, it could flush only tlb entries related to a CPU.
>>>> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
>>>> So, it would be best to support both portability and performance.
>>>>
>>>> Cc: Russell King <[email protected]>
>>>> Cc: Ralf Baechle <[email protected]>
>>>> Cc: Paul Mundt <[email protected]>
>>>> Cc: Guan Xuetao <[email protected]>
>>>> Cc: Chen Liqin <[email protected]>
>>>> Signed-off-by: Minchan Kim <[email protected]>
>>>> ---
>>>>
>>>> Need double check about supporting local_flush_tlb_kernel_range
>>>> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
>>>> score maintainers because arch directory in those arch have
>>>> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
>>>> architecture so pass it to maintainers.
>>>> I didn't coded up dumb local_flush_tlb_kernel_range which flush
>>>> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
>>>> Seth might support it in PowerPC. :)
>>>>
>>>>
>>>> drivers/staging/zsmalloc/Kconfig | 6 ++---
>>>> drivers/staging/zsmalloc/zsmalloc-main.c | 36 +++++++++++++++++++++---------
>>>> drivers/staging/zsmalloc/zsmalloc_int.h | 1 -
>>>> 3 files changed, 29 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
>>>> index a5ab720..def2483 100644
>>>> --- a/drivers/staging/zsmalloc/Kconfig
>>>> +++ b/drivers/staging/zsmalloc/Kconfig
>>>> @@ -1,9 +1,9 @@
>>>> config ZSMALLOC
>>>> tristate "Memory allocator for compressed pages"
>>>> - # X86 dependency is because of the use of __flush_tlb_one and set_pte
>>>> + # arch dependency is because of the use of local_unmap_kernel_range
>>>> # in zsmalloc-main.c.
>>>> - # TODO: convert these to portable functions
>>>> - depends on X86
>>>> + # TODO: implement local_unmap_kernel_range in all architecture.
>>>> + depends on (ARM || MIPS || SUPERH)
>>> I suggest removing above line, so if I want to use zsmalloc, I could
>>> enable this configuration easily.
>>
>>
>> I don't get it. What do you mean?
>> If I remove above line, compile error will happen if arch doesn't support local_unmap_kernel_range.
> If I want to use zsmalloc, I will verify local_unmap_kernel_range
> function. In fact, only local_flush_tlb_kernel_range need to be
> considered. So, just keeping the default option 'n' is enough.
I don't think so.
It's a terrible experience if every user has to look up their arch's local_flush_tlb_kernel_range just to use zsmalloc.
BTW, does unicore32 support that function?
If so, I would like to add unicore32 to the Kconfig.
--
Kind regards,
Minchan Kim
On 05/17/2012 02:11 AM, Seth Jennings wrote:
> On 05/15/2012 09:05 PM, Minchan Kim wrote:
>
>> Exactly saying, [zram|zcache] should has a dependency with
>> zsmalloc, not x86. So replace x86 dependeny with ZSMALLOC.
>>
>> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>> drivers/staging/zcache/Kconfig | 3 +--
>> drivers/staging/zram/Kconfig | 3 +--
>> 2 files changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/staging/zcache/Kconfig b/drivers/staging/zcache/Kconfig
>> index 7048e01..ceb7f28 100644
>> --- a/drivers/staging/zcache/Kconfig
>> +++ b/drivers/staging/zcache/Kconfig
>> @@ -2,8 +2,7 @@ config ZCACHE
>> bool "Dynamic compression of swap pages and clean pagecache pages"
>> # X86 dependency is because zsmalloc uses non-portable pte/tlb
>> # functions
>> - depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && X86
>> - select ZSMALLOC
>> + depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC
>
>
> Sorry Minchan, I should have said this the first time around. I ran
> into this issue before with CRYTPO vs CRYPTO=y. ZCACHE is a bool where
> ZSMALLOC is a tristate. It is not sufficient for ZSMALLOC to be set; it
> _must_ be builtin, otherwise you get linker errors.
>
> The dependency should be ZSMALLOC=y.
Sigh. I should have been more careful.
Thanks. I will fix it in the next spin.
>
> Thanks,
> Seth
>
--
Kind regards,
Minchan Kim
Isn't there anyone with a little time to review this patch? :)
On 05/16/2012 11:05 AM, Minchan Kim wrote:
> The zsmalloc [un]maps non-physical contiguos pages to contiguous
> virual address frequently so it needs frequent tlb-flush.
> Now x86 doesn't support common utility function for flushing just
> a few tlb entries so zsmalloc have been used set_pte and __flush_tlb_one
> which are x86 specific functions. It means zsmalloc have a dependency
> with x86.
>
> This patch adds new function, local_flush_tlb_kernel_range which
> are good candidate for being common utility function because other
> architecture(ex, MIPS, sh, unicore32, arm, score) already have
> supportd it.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: [email protected]
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
> drivers/staging/zsmalloc/Kconfig | 2 +-
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4ece077..6e1253a 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
> flush_tlb_all();
> }
>
> +static inline void local_flush_tlb_kernel_range(unsigned long start,
> + unsigned long end)
> +{
> + if (cpu_has_invlpg) {
> + while (start < end) {
> + __flush_tlb_single(start);
> + start += PAGE_SIZE;
> + }
> + } else
> + local_flush_tlb();
> +}
> +
> #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
> index def2483..29819b8 100644
> --- a/drivers/staging/zsmalloc/Kconfig
> +++ b/drivers/staging/zsmalloc/Kconfig
> @@ -3,7 +3,7 @@ config ZSMALLOC
> # arch dependency is because of the use of local_unmap_kernel_range
> # in zsmalloc-main.c.
> # TODO: implement local_unmap_kernel_range in all architecture.
> - depends on (ARM || MIPS || SUPERH)
> + depends on (ARM || MIPS || SUPERH || X86)
> default n
> help
> zsmalloc is a slab-based memory allocator designed to store
--
Kind regards,
Minchan Kim
On Wed, May 16, 2012 at 11:05:17AM +0900, Minchan Kim wrote:
> About local_flush_tlb_kernel_range,
> If architecture is very smart, it could flush only tlb entries related to vaddr.
> If architecture is smart, it could flush only tlb entries related to a CPU.
> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
> So, it would be best to support both portability and performance.
>
..
> Need double check about supporting local_flush_tlb_kernel_range
> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
> score maintainers because arch directory in those arch have
> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
> architecture so pass it to maintainers.
> I didn't coded up dumb local_flush_tlb_kernel_range which flush
> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
>
One thing you might consider is providing a stubbed definition that wraps
to flush_tlb_kernel_range() in the !SMP case, as this will extend your
testing coverage for staging considerably.
Once you exclude all of the non-SMP platforms, you're left with the
following:
- blackfin: doesn't count, no TLB to worry about.
- hexagon: seems to imply that the SMP case uses thread-based
CPUs that share an MMU, so no additional cost.
- ia64: Does a global flush, which already has a FIXME comment.
- m32r, mn10300: local_flush_tlb_all() could be wrapped.
- parisc: global flush?
- s390: Tests the cpumask to do a local flush, otherwise has a
__tlb_flush_local() that can be wrapped.
- sparc32: global flush
- sparc64: __flush_tlb_kernel_range() looks like a local flush.
- tile: does strange hypervisory things, presumably global.
- x86: has a local_flush_tlb() that could be wrapped.
Which doesn't look quite that bad. You could probably get away with a
Kconfig option for optimized local TLB flushing or something, since
single function Kconfig options seem to be all the rage these days.
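For instance, for an arch in the list above where a per-CPU full flush such as
local_flush_tlb_all() is the best available primitive, the wrapper could
presumably be as trivial as (untested sketch):

	static inline void local_flush_tlb_kernel_range(unsigned long start,
							unsigned long end)
	{
		/* no per-entry primitive: flush this CPU's whole TLB */
		local_flush_tlb_all();
	}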
On 05/17/2012 05:32 PM, Paul Mundt wrote:
> On Wed, May 16, 2012 at 11:05:17AM +0900, Minchan Kim wrote:
>> About local_flush_tlb_kernel_range,
>> If architecture is very smart, it could flush only tlb entries related to vaddr.
>> If architecture is smart, it could flush only tlb entries related to a CPU.
>> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
>> So, it would be best to support both portability and performance.
>>
> ..
>
>> Need double check about supporting local_flush_tlb_kernel_range
>> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
>> score maintainers because arch directory in those arch have
>> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
>> architecture so pass it to maintainers.
>> I didn't coded up dumb local_flush_tlb_kernel_range which flush
>> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
>>
>
> One thing you might consider is providing a stubbed definition that wraps
> to flush_tlb_kernel_range() in the !SMP case, as this will extend your
> testing coverage for staging considerably.
AFAIUC, you mean something like the following:

#ifndef CONFIG_SMP
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	local_flush_tlb_kernel_range(start, end);
}
#endif

I can do it on some arches that I know a little about, but my concern is that
I'm not sure which is more effective, flushing all entries or flushing
each entry one by one, when the range is very big.
It's not the goal of this patch, so I would like to leave it to the arch maintainers.
But I absolutely agree with your comment about test coverage.
>
> Once you exclude all of the non-SMP platforms, you're left with the
> following:
>
> - blackfin: doesn't count, no TLB to worry about.
> - hexagon: seems to imply that the SMP case uses thread-based
> CPUs that share an MMU, so no additional cost.
> - ia64: Does a global flush, which already has a FIXME comment.
> - m32r, mn10300: local_flush_tlb_all() could be wrapped.
> - parisc: global flush?
> - s390: Tests the cpumask to do a local flush, otherwise has a
> __tlb_flush_local() that can be wrapped.
> - sparc32: global flush
> - sparc64: __flush_tlb_kernel_range() looks like a local flush.
> - tile: does strange hypervisory things, presumably global.
> - x86: has a local_flush_tlb() that could be wrapped.
>
> Which doesn't look quite that bad. You could probably get away with a
> Kconfig option for optimized local TLB flushing or something, since
> single function Kconfig options seem to be all the rage these days.
Actually, I didn't want to implement dumb flush functions on every architecture
that flush all entries even though we only need to flush a few entries.
That might make zsmalloc less useful, so I expected each maintainer could implement
it much more efficiently than stupid me and then add their arch to the Kconfig. :(
If this approach is really bad, I'll need time to implement dumb stub functions
on every architecture and to receive acks from all of them. Sigh.
>
--
Kind regards,
Minchan Kim
On 05/17/2012 05:32 PM, Paul Mundt wrote:
> On Wed, May 16, 2012 at 11:05:17AM +0900, Minchan Kim wrote:
>> About local_flush_tlb_kernel_range,
>> If architecture is very smart, it could flush only tlb entries related to vaddr.
>> If architecture is smart, it could flush only tlb entries related to a CPU.
>> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
>> So, it would be best to support both portability and performance.
>>
> ..
>
>> Need double check about supporting local_flush_tlb_kernel_range
>> in ARM, MIPS, SUPERH maintainers. And I will Ccing unicore32 and
>> score maintainers because arch directory in those arch have
>> local_flush_tlb_kernel_range, too but I'm very unfamiliar with those
>> architecture so pass it to maintainers.
>> I didn't coded up dumb local_flush_tlb_kernel_range which flush
>> all cpus. I expect someone need ZSMALLOC will implement it easily in future.
>>
>
> One thing you might consider is providing a stubbed definition that wraps
> to flush_tlb_kernel_range() in the !SMP case, as this will extend your
> testing coverage for staging considerably.
>
> Once you exclude all of the non-SMP platforms, you're left with the
> following:
>
> - blackfin: doesn't count, no TLB to worry about.
> - hexagon: seems to imply that the SMP case uses thread-based
> CPUs that share an MMU, so no additional cost.
> - ia64: Does a global flush, which already has a FIXME comment.
> - m32r, mn10300: local_flush_tlb_all() could be wrapped.
> - parisc: global flush?
> - s390: Tests the cpumask to do a local flush, otherwise has a
> __tlb_flush_local() that can be wrapped.
> - sparc32: global flush
> - sparc64: __flush_tlb_kernel_range() looks like a local flush.
> - tile: does strange hypervisory things, presumably global.
> - x86: has a local_flush_tlb() that could be wrapped.
>
> Which doesn't look quite that bad. You could probably get away with a
> Kconfig option for optimized local TLB flushing or something, since
> single function Kconfig options seem to be all the rage these days.
I missed this sentence.
Thanks for the very helpful comment, Paul!
--
Kind regards,
Minchan Kim
On Thu, May 17, 2012 at 06:06:56PM +0900, Minchan Kim wrote:
> On 05/17/2012 05:32 PM, Paul Mundt wrote:
> > One thing you might consider is providing a stubbed definition that wraps
> > to flush_tlb_kernel_range() in the !SMP case, as this will extend your
> > testing coverage for staging considerably.
>
>
> AFAIUC, you mean following as,
>
> ifndef CONFIG_SMP
> void flush_tlb_kernel_range(unsinged long start, unsigned log end)
> {
> local_flush_tlb_kernel_range(start, end);
> }
> #endif
>
Actually I meant the opposite:
#ifndef CONFIG_SMP
#define local_flush_tlb_kernel_range flush_tlb_kernel_range
#endif
as the UP case is going to be local already. It's a bit hacky, though.
On Thu, May 17, 2012 at 05:11:08PM +0900, Minchan Kim wrote:
> Isn't there anyone for taking a time to review this patch? :)
>
> On 05/16/2012 11:05 AM, Minchan Kim wrote:
<snip>
You want review within 24 hours for a staging tree patch for a feature
that no one uses?
That's very bold of you. Please be realistic.
greg k-h
On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
> > flush_tlb_all();
> > }
> >
> > +static inline void local_flush_tlb_kernel_range(unsigned long start,
> > + unsigned long end)
> > +{
> > + if (cpu_has_invlpg) {
> > + while (start < end) {
> > + __flush_tlb_single(start);
> > + start += PAGE_SIZE;
> > + }
> > + } else
> > + local_flush_tlb();
> > +}
It would be much better if you wait for Alex Shi's patch to mature.
Doing the invlpg thing for ranges is not an unconditional win.
Also, does it even work if the range happens to be backed by huge pages?
IIRC we try and do the identity map with large pages wherever possible.
On Thu, 2012-05-17 at 16:51 +0200, Peter Zijlstra wrote:
>
> Also, does it even work if the range happens to be backed by huge pages?
> IIRC we try and do the identity map with large pages wherever possible.
OK, the Intel SDM seems to suggest it will indeed invalidate ANY mapping
to that linear address, which would include 2M and 1G pages.
On Thu, 2012-05-17 at 17:04 +0900, Minchan Kim wrote:
> ...
> I don't think so.
> It's terrible experience if all users have to look up local_flush_tlb_kernel_range of arch for using zsmalloc.
>
> BTW, does unicore32 support that function?
> If so, I would like to add unicore32 in Kconfig.
>
Yes. Thanks.
Could you give me some material on performance and the test suite?
Guan Xuetao
Hi Greg,
On 05/17/2012 11:46 PM, Greg Kroah-Hartman wrote:
> On Thu, May 17, 2012 at 05:11:08PM +0900, Minchan Kim wrote:
>> Isn't there anyone for taking a time to review this patch? :)
>>
>> On 05/16/2012 11:05 AM, Minchan Kim wrote:
>
> <snip>
>
> You want review within 24 hours for a staging tree patch for a feature
> that no one uses?
>
> That's very bold of you. Please be realistic.
I admit I was in a hurry because I thought the portability issue of zsmalloc might
be an urgent issue for [zram|zsmalloc] getting out of staging, and I want to
move [zsmalloc|zram] into mainline as soon as possible.
I should have had more patience.
As an excuse, strictly speaking, it's not 24 hours, because I sent the first
patch on 5/14 and I haven't changed that patch since then. :)
--
Kind regards,
Minchan Kim
Hi Peter,
On 05/17/2012 11:51 PM, Peter Zijlstra wrote:
> On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
>>> +++ b/arch/x86/include/asm/tlbflush.h
>>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
>>> flush_tlb_all();
>>> }
>>>
>>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
>>> + unsigned long end)
>>> +{
>>> + if (cpu_has_invlpg) {
>>> + while (start < end) {
>>> + __flush_tlb_single(start);
>>> + start += PAGE_SIZE;
>>> + }
>>> + } else
>>> + local_flush_tlb();
>>> +}
>
>
> It would be much better if you wait for Alex Shi's patch to mature.
> doing the invlpg thing for ranges is not an unconditional win.
Thanks for the information. I will watch that patchset.
Thanks.
--
Kind regards,
Minchan Kim
Hi Guan,
On 05/18/2012 10:45 AM, Guan Xuetao wrote:
> On Thu, 2012-05-17 at 17:04 +0900, Minchan Kim wrote:
>
>> ...
>> I don't think so.
>> It's terrible experience if all users have to look up local_flush_tlb_kernel_range of arch for using zsmalloc.
>>
>> BTW, does unicore32 support that function?
>> If so, I would like to add unicore32 in Kconfig.
>>
> Yes. Thanks.
>
> Could you give me some materials on performance and testsuite?
Unfortunately, I don't have them. Nitin, could you help Guan?
--
Kind regards,
Minchan Kim
On Thu, May 17, 2012 at 11:08 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-05-17 at 16:51 +0200, Peter Zijlstra wrote:
>>
>> Also, does it even work if the range happens to be backed by huge pages?
>> IIRC we try and do the identity map with large pages wherever possible.
>
> OK, the Intel SDM seems to suggest it will indeed invalidate ANY mapping
> to that linear address, which would include 2M and 1G pages.
As far as I know, the 1GB TLB entries fold into 2MB ones now, and 2MB pages may
still fold into 4K pages in the TLB flush. We are tracking this.
>
On 05/15/2012 09:05 PM, Minchan Kim wrote:
> zsmalloc uses set_pte and __flush_tlb_one for performance but
> many architecture don't support it. so this patch removes
> set_pte and __flush_tlb_one which are x86 dependency.
> Instead of it, use local_flush_tlb_kernel_range which are available
> by more architectures. It would be better than supporting only x86
> and last patch in series will enable again with supporting
> local_flush_tlb_kernel_range in x86.
>
> About local_flush_tlb_kernel_range,
> If architecture is very smart, it could flush only tlb entries related to vaddr.
> If architecture is smart, it could flush only tlb entries related to a CPU.
> If architecture is _NOT_ smart, it could flush all entries of all CPUs.
> So, it would be best to support both portability and performance.
>
> Cc: Russell King <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Guan Xuetao <[email protected]>
> Cc: Chen Liqin <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
For the zsmalloc changes:
Acked-by: Seth Jennings <[email protected]>
Thanks,
Seth
On 05/17/2012 09:51 AM, Peter Zijlstra wrote:
> On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
>>> +++ b/arch/x86/include/asm/tlbflush.h
>>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
>>> flush_tlb_all();
>>> }
>>>
>>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
>>> + unsigned long end)
>>> +{
>>> + if (cpu_has_invlpg) {
>>> + while (start < end) {
>>> + __flush_tlb_single(start);
>>> + start += PAGE_SIZE;
>>> + }
>>> + } else
>>> + local_flush_tlb();
>>> +}
>
>
> It would be much better if you wait for Alex Shi's patch to mature.
> doing the invlpg thing for ranges is not an unconditional win.
From what I can tell Alex's patches have stalled. The last post was v6
on 5/17 and there wasn't a single reply to them afaict.
According to Alex's investigation of this "tipping point", it seems that
a good generic value is 8. In other words, on most x86 hardware, it is
cheaper to flush up to 8 tlb entries one by one rather than doing a
complete flush.
So we can do something like:
if (cpu_has_invlpg && (end - start)/PAGE_SIZE <= 8) {
while (start < end) {
Would this be acceptable?
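Spelled out, that would be roughly the following (untested, and the cutoff of
8 is just the generic value suggested by Alex's numbers):

	static inline void local_flush_tlb_kernel_range(unsigned long start,
							unsigned long end)
	{
		/* flush single entries only for small ranges */
		if (cpu_has_invlpg && (end - start) / PAGE_SIZE <= 8) {
			while (start < end) {
				__flush_tlb_single(start);
				start += PAGE_SIZE;
			}
		} else
			local_flush_tlb();
	}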
Thanks,
Seth
> From: Seth Jennings [mailto:[email protected]]
> Sent: Friday, June 15, 2012 9:13 AM
> To: Peter Zijlstra
> Cc: Minchan Kim; Greg Kroah-Hartman; Nitin Gupta; Dan Magenheimer; [email protected];
> [email protected]; Thomas Gleixner; Ingo Molnar; Tejun Heo; David Howells; [email protected]; Nick
> Piggin
> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>
> On 05/17/2012 09:51 AM, Peter Zijlstra wrote:
>
> > On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
> >>> +++ b/arch/x86/include/asm/tlbflush.h
> >>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
> >>> flush_tlb_all();
> >>> }
> >>>
> >>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
> >>> + unsigned long end)
> >>> +{
> >>> + if (cpu_has_invlpg) {
> >>> + while (start < end) {
> >>> + __flush_tlb_single(start);
> >>> + start += PAGE_SIZE;
> >>> + }
> >>> + } else
> >>> + local_flush_tlb();
> >>> +}
> >
> > It would be much better if you wait for Alex Shi's patch to mature.
> > doing the invlpg thing for ranges is not an unconditional win.
>
> From what I can tell Alex's patches have stalled. The last post was v6
> on 5/17 and there wasn't a single reply to them afaict.
>
> According to Alex's investigation of this "tipping point", it seems that
> a good generic value is 8. In other words, on most x86 hardware, it is
> cheaper to flush up to 8 tlb entries one by one rather than doing a
> complete flush.
>
> So we can do something like:
>
> if (cpu_has_invlpg && (end - start)/PAGE_SIZE <= 8) {
> while (start < end) {
>
> Would this be acceptable?
Hey Seth, Nitin --
After more work digging around zsmalloc and zbud, I really think
this TLB flushing, as well as the "page pair mapping" code can be
completely eliminated IFF zsmalloc is limited to items PAGE_SIZE or
less. Since this is already true of zram (and in-tree zcache), and
zsmalloc currently has no other users, I think you should seriously
consider limiting zsmalloc in that way, or possibly splitting out
one version of zsmalloc which handles items PAGE_SIZE or less,
and a second version that can handle larger items but has (AFAIK)
no users.
If you consider it an option to have (a version of) zsmalloc
limited to items PAGE_SIZE or less, let me know and we can
get into the details.
Thanks,
Dan
On 06/15/2012 09:35 AM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:[email protected]]
>> Sent: Friday, June 15, 2012 9:13 AM
>> To: Peter Zijlstra
>> Cc: Minchan Kim; Greg Kroah-Hartman; Nitin Gupta; Dan Magenheimer; [email protected];
>> [email protected]; Thomas Gleixner; Ingo Molnar; Tejun Heo; David Howells; [email protected]; Nick
>> Piggin
>> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>>
>> On 05/17/2012 09:51 AM, Peter Zijlstra wrote:
>>
>>> On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
>>>>> +++ b/arch/x86/include/asm/tlbflush.h
>>>>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
>>>>> flush_tlb_all();
>>>>> }
>>>>>
>>>>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
>>>>> + unsigned long end)
>>>>> +{
>>>>> + if (cpu_has_invlpg) {
>>>>> + while (start < end) {
>>>>> + __flush_tlb_single(start);
>>>>> + start += PAGE_SIZE;
>>>>> + }
>>>>> + } else
>>>>> + local_flush_tlb();
>>>>> +}
>>>
>>> It would be much better if you wait for Alex Shi's patch to mature.
>>> doing the invlpg thing for ranges is not an unconditional win.
>>
>> From what I can tell Alex's patches have stalled. The last post was v6
>> on 5/17 and there wasn't a single reply to them afaict.
>>
>> According to Alex's investigation of this "tipping point", it seems that
>> a good generic value is 8. In other words, on most x86 hardware, it is
>> cheaper to flush up to 8 tlb entries one by one rather than doing a
>> complete flush.
>>
>> So we can do something like:
>>
>> if (cpu_has_invlpg && (end - start)/PAGE_SIZE <= 8) {
>> while (start < end) {
>>
>> Would this be acceptable?
>
> Hey Seth, Nitin --
>
> After more work digging around zsmalloc and zbud, I really think
> this TLB flushing, as well as the "page pair mapping" code can be
> completely eliminated IFF zsmalloc is limited to items PAGE_SIZE or
> less. Since this is already true of zram (and in-tree zcache), and
> zsmalloc currently has no other users, I think you should seriously
> consider limiting zsmalloc in that way, or possibly splitting out
> one version of zsmalloc which handles items PAGE_SIZE or less,
> and a second version that can handle larger items but has (AFAIK)
> no users.
>
> If you consider it an option to have (a version of) zsmalloc
> limited to items PAGE_SIZE or less, let me know and we can
> get into the details.
>
zsmalloc is already limited to objects of size PAGE_SIZE or less. This
two-page splitting is for efficiently storing objects in the range
(PAGE_SIZE/2, PAGE_SIZE), which are very common in both zram and zcache.
SLUB achieves this efficiency by allocating higher-order pages, but that
is not an option for zsmalloc.
Thanks,
Nitin
On 06/15/2012 11:35 AM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:[email protected]]
>> Sent: Friday, June 15, 2012 9:13 AM
>> To: Peter Zijlstra
>> Cc: Minchan Kim; Greg Kroah-Hartman; Nitin Gupta; Dan Magenheimer; [email protected];
>> [email protected]; Thomas Gleixner; Ingo Molnar; Tejun Heo; David Howells; [email protected]; Nick
>> Piggin
>> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>>
>> On 05/17/2012 09:51 AM, Peter Zijlstra wrote:
>>
>>> On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
>>>>> +++ b/arch/x86/include/asm/tlbflush.h
>>>>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
>>>>> flush_tlb_all();
>>>>> }
>>>>>
>>>>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
>>>>> + unsigned long end)
>>>>> +{
>>>>> + if (cpu_has_invlpg) {
>>>>> + while (start < end) {
>>>>> + __flush_tlb_single(start);
>>>>> + start += PAGE_SIZE;
>>>>> + }
>>>>> + } else
>>>>> + local_flush_tlb();
>>>>> +}
>>>
>>> It would be much better if you wait for Alex Shi's patch to mature.
>>> doing the invlpg thing for ranges is not an unconditional win.
>>
>> From what I can tell Alex's patches have stalled. The last post was v6
>> on 5/17 and there wasn't a single reply to them afaict.
>>
>> According to Alex's investigation of this "tipping point", it seems that
>> a good generic value is 8. In other words, on most x86 hardware, it is
>> cheaper to flush up to 8 tlb entries one by one rather than doing a
>> complete flush.
>>
>> So we can do something like:
>>
>> if (cpu_has_invlpg && (end - start)/PAGE_SIZE <= 8) {
>> while (start < end) {
>>
>> Would this be acceptable?
>
> Hey Seth, Nitin --
>
> After more work digging around zsmalloc and zbud, I really think
> this TLB flushing, as well as the "page pair mapping" code can be
> completely eliminated IFF zsmalloc is limited to items PAGE_SIZE or
> less.
To add to what Nitin just sent, without the page mapping, zsmalloc and
the late xvmalloc have the same issue. Say you have a whole class of
objects that are 3/4 of a page. Without the mapping, you can't cross
non-contiguous page boundaries and you'll have 25% fragmentation in the
memory pool. This is the whole point of zsmalloc.
--
Seth
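To make Seth's 25% figure concrete, assume (hypothetically) PAGE_SIZE = 4096
and a size class of 3072-byte (3/4-page) objects:

  - one object per page: 4096 - 3072 = 1024 bytes wasted per page, i.e. 25%
  - objects allowed to span a page boundary: three pages hold exactly four
    objects, since 3 * 4096 = 12288 = 4 * 3072, i.e. no per-class waste

This is only an illustration of the packing argument, not a description of
the exact zspage geometry zsmalloc picks for that class.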
> From: Nitin Gupta [mailto:[email protected]]
> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>
> On 06/15/2012 09:35 AM, Dan Magenheimer wrote:
> >> From: Seth Jennings [mailto:[email protected]]
> >> Sent: Friday, June 15, 2012 9:13 AM
> >> To: Peter Zijlstra
> >> Cc: Minchan Kim; Greg Kroah-Hartman; Nitin Gupta; Dan Magenheimer; [email protected];
> >> [email protected]; Thomas Gleixner; Ingo Molnar; Tejun Heo; David Howells; [email protected]; Nick
> >> Piggin
> >> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
> >>
> >> On 05/17/2012 09:51 AM, Peter Zijlstra wrote:
> >>
> >>> On Thu, 2012-05-17 at 17:11 +0900, Minchan Kim wrote:
> >>>>> +++ b/arch/x86/include/asm/tlbflush.h
> >>>>> @@ -172,4 +172,16 @@ static inline void flush_tlb_kernel_range(unsigned long start,
> >>>>> flush_tlb_all();
> >>>>> }
> >>>>>
> >>>>> +static inline void local_flush_tlb_kernel_range(unsigned long start,
> >>>>> + unsigned long end)
> >>>>> +{
> >>>>> + if (cpu_has_invlpg) {
> >>>>> + while (start < end) {
> >>>>> + __flush_tlb_single(start);
> >>>>> + start += PAGE_SIZE;
> >>>>> + }
> >>>>> + } else
> >>>>> + local_flush_tlb();
> >>>>> +}
> >>>
> >>> It would be much better if you wait for Alex Shi's patch to mature.
> >>> doing the invlpg thing for ranges is not an unconditional win.
> >>
> >> From what I can tell Alex's patches have stalled. The last post was v6
> >> on 5/17 and there wasn't a single reply to them afaict.
> >>
> >> According to Alex's investigation of this "tipping point", it seems that
> >> a good generic value is 8. In other words, on most x86 hardware, it is
> >> cheaper to flush up to 8 tlb entries one by one rather than doing a
> >> complete flush.
> >>
> >> So we can do something like:
> >>
> >> if (cpu_has_invlpg && (end - start)/PAGE_SIZE <= 8) {
> >> while (start < end) {
> >>
> >> Would this be acceptable?
> >
> > Hey Seth, Nitin --
> >
> > After more work digging around zsmalloc and zbud, I really think
> > this TLB flushing, as well as the "page pair mapping" code can be
> > completely eliminated IFF zsmalloc is limited to items PAGE_SIZE or
> > less. Since this is already true of zram (and in-tree zcache), and
> > zsmalloc currently has no other users, I think you should seriously
> > consider limiting zsmalloc in that way, or possibly splitting out
> > one version of zsmalloc which handles items PAGE_SIZE or less,
> > and a second version that can handle larger items but has (AFAIK)
> > no users.
> >
> > If you consider it an option to have (a version of) zsmalloc
> > limited to items PAGE_SIZE or less, let me know and we can
> > get into the details.
>
> zsmalloc is already limited to objects of size PAGE_SIZE or less. This
> two-page splitting is for efficiently storing objects in the range
> (PAGE_SIZE/2, PAGE_SIZE), which are very common in both zram and zcache.
>
> SLUB achieves this efficiency by allocating higher order pages but that
> is not an option for zsmalloc.
That's what I thought, but a separate thread about ensuring zsmalloc
was as generic as possible led me to believe that zsmalloc was moving
in the direction of larger sizes.
> From: Seth Jennings [mailto:[email protected]]
>
> To add to what Nitin just sent, without the page mapping, zsmalloc and
> the late xvmalloc have the same issue. Say you have a whole class of
> objects that are 3/4 of a page. Without the mapping, you can't cross
> non-contiguous page boundaries and you'll have 25% fragmentation in the
> memory pool. This is the whole point of zsmalloc.
Yes, understood. This suggestion doesn't change any of that.
It only assumes that no more than one page boundary is crossed.
So, briefly, IIRC the "pair mapping" is what creates the necessity
to do special TLB stuff. That pair mapping is necessary
to create the illusion to the compression/decompression code
(and one other memcpy) that no pageframe boundary is crossed.
Correct?
The compression code already compresses to a per-cpu page-pair
and then that "zpage" is copied into the space allocated
for it by zsmalloc. For that final copy, if the copy code knows
the target may cross a page boundary, has both target pages
kmap'ed, and is smart about doing the copy, the "pair mapping"
can be avoided for compression.
The decompression path calls lzo1x directly and it would be
a huge pain to make lzo1x smart about page boundaries. BUT
since we know that the decompressed result will always fit
into a page (actually exactly a page), you COULD do an extra
copy to the end of the target page (using the same smart-
about-page-boundaries copying code from above) and then do
in-place decompression, knowing that the decompression will
not cross a page boundary. So, with the extra copy, the "pair
mapping" can be avoided for decompression as well.
What about the horrible cost of that extra copy? Well, much
of the cost of a large copy is due to cache effects. Since
you are copying into a page that will immediately be overwritten
by the decompress, I'll bet that cost is much smaller. And
compared to the cost of setting up and tearing down TLB
entries (especially on machines with no local_flush_tlb_kernel_range),
I suspect that special copy may be a LOT cheaper. And
with no special TLB code required, zsmalloc should be a lot
more portable.
Thoughts?
Dan
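A rough sketch of the decompression side of this suggestion, assuming
lzo1x_decompress_safe() really does tolerate this overlapping in-place use
(something the thread later notes needs checking against the LZO
documentation) and that a compressed object straddles at most two
non-contiguous pages; the function name and parameters below are
hypothetical, not existing zsmalloc or zcache code:

#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/string.h>

/*
 * Illustration only: gather the compressed bytes (possibly split across
 * two non-contiguous pages) at the *end* of the destination page, then
 * decompress in place toward the front of that same page.
 */
static int decompress_to_page(struct page *dest, struct page *src0,
                              struct page *src1, unsigned int offset,
                              size_t zlen)
{
        unsigned char *dst = kmap_atomic(dest);
        unsigned char *stage = dst + PAGE_SIZE - zlen;
        size_t first = min_t(size_t, zlen, PAGE_SIZE - offset);
        size_t dlen = PAGE_SIZE;
        unsigned char *s;
        int ret;

        /* the piece that lives in the first source page */
        s = kmap_atomic(src0);
        memcpy(stage, s + offset, first);
        kunmap_atomic(s);

        /* the remainder, if the object crosses a page boundary */
        if (first < zlen) {
                s = kmap_atomic(src1);
                memcpy(stage + first, s, zlen - first);
                kunmap_atomic(s);
        }

        /* output grows from the start of the page and never passes its end */
        ret = lzo1x_decompress_safe(stage, zlen, dst, &dlen);
        kunmap_atomic(dst);
        return ret;
}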
>> From: Seth Jennings [mailto:[email protected]]
>> To add to what Nitin just sent, without the page mapping, zsmalloc and
>> the late xvmalloc have the same issue. Say you have a whole class of
>> objects that are 3/4 of a page. Without the mapping, you can't cross
>> non-contiguous page boundaries and you'll have 25% fragmentation in the
>> memory pool. This is the whole point of zsmalloc.
>
> Yes, understood. This suggestion doesn't change any of that.
> It only assumes that no more than one page boundary is crossed.
>
> So, briefly, IIRC the "pair mapping" is what creates the necessity
> to do special TLB stuff. That pair mapping is necessary
> to create the illusion to the compression/decompression code
> (and one other memcpy) that no pageframe boundary is crossed.
> Correct?
Yes.
> The compression code already compresses to a per-cpu page-pair
> and then that "zpage" is copied into the space allocated
> for it by zsmalloc. For that final copy, if the copy code knows
> the target may cross a page boundary, has both target pages
> kmap'ed, and is smart about doing the copy, the "pair mapping"
> can be avoided for compression.
The problem is that by "smart" you mean "has access to zsmalloc
internals". zcache, or any user, would need the know the kmapped
address of the first page, the offset to start at within that page, and
the kmapped address of the second page in order to do the smart copy
you're talking about. Then the complexity to do the smart copy that
would have to be implemented in each user.
> The decompression path calls lzo1x directly and it would be
> a huge pain to make lzo1x smart about page boundaries. BUT
> since we know that the decompressed result will always fit
> into a page (actually exactly a page), you COULD do an extra
> copy to the end of the target page (using the same smart-
> about-page-boundaries copying code from above) and then do
> in-place decompression, knowing that the decompression will
> not cross a page boundary. So, with the extra copy, the "pair
> mapping" can be avoided for decompression as well.
This is an interesting thought.
But this does result in a copy in the decompression (i.e. page fault)
path, where right now, it is copy free. The compressed data is
decompressed directly from its zsmalloc allocation to the page allocated
in the fault path.
Doing this smart copy stuff would move most of the complexity out of
zsmalloc into the user, which defeats the purpose of abstracting the
functionality out in the first place: so that each user that wants to do
something like this doesn't have to reinvent the wheel.
--
Seth
> From: Seth Jennings [mailto:[email protected]]
> > The compression code already compresses to a per-cpu page-pair
> > and then that "zpage" is copied into the space allocated
> > for it by zsmalloc. For that final copy, if the copy code knows
> > the target may cross a page boundary, has both target pages
> > kmap'ed, and is smart about doing the copy, the "pair mapping"
> > can be avoided for compression.
>
> The problem is that by "smart" you mean "has access to zsmalloc
> internals". zcache, or any user, would need to know the kmapped
> address of the first page, the offset to start at within that page, and
> the kmapped address of the second page in order to do the smart copy
> you're talking about. Then there is the complexity of the smart copy
> that would have to be implemented in each user.
Or simply add a zsmalloc_copy in zsmalloc and require that
it be used by the caller (instead of a memcpy).
> > The decompression path calls lzo1x directly and it would be
> > a huge pain to make lzo1x smart about page boundaries. BUT
> > since we know that the decompressed result will always fit
> > into a page (actually exactly a page), you COULD do an extra
> > copy to the end of the target page (using the same smart-
> > about-page-boundaries copying code from above) and then do
> > in-place decompression, knowing that the decompression will
> > not cross a page boundary. So, with the extra copy, the "pair
> > mapping" can be avoided for decompression as well.
>
> This is an interesting thought.
>
> But this does result in a copy in the decompression (i.e. page fault)
> path, where right now, it is copy free. The compressed data is
> decompressed directly from its zsmalloc allocation to the page allocated
> in the fault path.
The page fault occurs as soon as the lzo1x compression code starts anyway,
as do all the cache faults... both just occur earlier, so the only
additional cost is the actual cpu instructions to move the sequence of
(compressed) bytes from the zsmalloc-allocated area to the end
of the target page.
TLB operations can be very expensive, not to mention (as the
subject of this thread attests) non-portable.
> Doing this smart copy stuff would move most of the complexity out of
> zsmalloc into the user, which defeats the purpose of abstracting the
> functionality out in the first place: so that each user that wants to do
> something like this doesn't have to reinvent the wheel.
See above. It can be buried in zsmalloc.
Again, this is all just a suggestion, admittedly from someone who
doesn't like pure abstractions in kernel code ;-)
Dan
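Along the lines of that zsmalloc_copy suggestion, a sketch of a
boundary-aware copy buried inside the allocator, assuming the allocator
already knows both backing pages and the starting offset of the object;
the name zs_copy_to_object and its signature are hypothetical, not part of
the real zsmalloc interface:

#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/string.h>

/*
 * Hypothetical helper: copy a contiguous source buffer into an object
 * that may straddle two non-contiguous pages, so callers never need to
 * know whether a page boundary is crossed.
 */
static void zs_copy_to_object(struct page *first, struct page *second,
                              unsigned int offset, const void *src,
                              size_t len)
{
        size_t chunk = min_t(size_t, len, PAGE_SIZE - offset);
        void *dst;

        /* the part of the object that fits in the first page */
        dst = kmap_atomic(first);
        memcpy(dst + offset, src, chunk);
        kunmap_atomic(dst);

        /* the remainder spills into the second, non-contiguous page */
        if (chunk < len) {
                dst = kmap_atomic(second);
                memcpy(dst, src + chunk, len - chunk);
                kunmap_atomic(dst);
        }
}

With such a helper, zcache's final copy of the compressed "zpage" into the
zsmalloc allocation would not need the virtually contiguous pair mapping at
all.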
On 06/15/2012 12:39 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:[email protected]]
>>> The compression code already compresses to a per-cpu page-pair
>>> and then that "zpage" is copied into the space allocated
>>> for it by zsmalloc. For that final copy, if the copy code knows
>>> the target may cross a page boundary, has both target pages
>>> kmap'ed, and is smart about doing the copy, the "pair mapping"
>>> can be avoided for compression.
>>
>> The problem is that by "smart" you mean "has access to zsmalloc
>> internals". zcache, or any user, would need to know the kmapped
>> address of the first page, the offset to start at within that page, and
>> the kmapped address of the second page in order to do the smart copy
>> you're talking about. Then there is the complexity of the smart copy
>> that would have to be implemented in each user.
>
> Or simply add a zsmalloc_copy in zsmalloc and require that
> it be used by the caller (instead of a memcpy).
>
>>> The decompression path calls lzo1x directly and it would be
>>> a huge pain to make lzo1x smart about page boundaries. BUT
>>> since we know that the decompressed result will always fit
>>> into a page (actually exactly a page), you COULD do an extra
>>> copy to the end of the target page (using the same smart-
>>> about-page-boundaries copying code from above) and then do
>>> in-place decompression, knowing that the decompression will
>>> not cross a page boundary. So, with the extra copy, the "pair
>>> mapping" can be avoided for decompression as well.
>>
>> This is an interesting thought.
>>
>> But this does result in a copy in the decompression (i.e. page fault)
>> path, where right now, it is copy free. The compressed data is
>> decompressed directly from its zsmalloc allocation to the page allocated
>> in the fault path.
>
> The page fault occurs as soon as the lzo1x compression code starts anyway,
> as do all the cache faults... both just occur earlier, so the only
> additional cost is the actual cpu instructions to move the sequence of
> (compressed) bytes from the zsmalloc-allocated area to the end
> of the target page.
>
> TLB operations can be very expensive, not to mention (as the
> subject of this thread attests) non-portable.
>
Even if you go for copying chunks followed by decompression, it still
requires two kmaps and kunmaps. Each of these requires one local TLB
invlpg. So, a total of 2 local maps + unmaps even with this approach.
The only additional requirement of zsmalloc is that it requires two
mappings which are virtually contiguous. The cost is the same in both
approaches but the current zsmalloc approach presents a much cleaner
interface.
Thanks,
Nitin
> From: Nitin Gupta [mailto:[email protected]]
> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>
> On 06/15/2012 12:39 PM, Dan Magenheimer wrote:
> >> From: Seth Jennings [mailto:[email protected]]
> >>> The decompression path calls lzo1x directly and it would be
> >>> a huge pain to make lzo1x smart about page boundaries. BUT
> >>> since we know that the decompressed result will always fit
> >>> into a page (actually exactly a page), you COULD do an extra
> >>> copy to the end of the target page (using the same smart-
> >>> about-page-boundaries copying code from above) and then do
> >>> in-place decompression, knowing that the decompression will
> >>> not cross a page boundary. So, with the extra copy, the "pair
> >>> mapping" can be avoided for decompression as well.
> >>
> >> This is an interesting thought.
> >>
> >> But this does result in a copy in the decompression (i.e. page fault)
> >> path, where right now, it is copy free. The compressed data is
> >> decompressed directly from its zsmalloc allocation to the page allocated
> >> in the fault path.
> >
> > The page fault occurs as soon as the lzo1x compression code starts anyway,
> > as do all the cache faults... both just occur earlier, so the only
> > additional cost is the actual cpu instructions to move the sequence of
> > (compressed) bytes from the zsmalloc-allocated area to the end
> > of the target page.
> >
> > TLB operations can be very expensive, not to mention (as the
> > subject of this thread attests) non-portable.
>
> Even if you go for copying chunks followed by decompression, it still
> requires two kmaps and kunmaps. Each of these requires one local TLB
> invlpg. So, a total of 2 local maps + unmaps even with this approach.
That may be true for i386, but on a modern (non-highmem) machine those
kmaps/kunmaps are free and "pair mappings" in the TLB are still expensive
and not very portable. Doesn't make sense to me to design for better
performance on highmem and poorer performance on non-highmem.
> The only additional requirement of zsmalloc is that it requires two
> mappings which are virtually contiguous. The cost is the same in both
> approaches but the current zsmalloc approach presents a much cleaner
> interface.
OK, it's your code and I'm just making a suggestion. I will shut up now ;-)
Dan
On 06/15/2012 01:13 PM, Dan Magenheimer wrote:
>> From: Nitin Gupta [mailto:[email protected]]
>> Subject: Re: [PATCH v2 3/3] x86: Support local_flush_tlb_kernel_range
>>
>> On 06/15/2012 12:39 PM, Dan Magenheimer wrote:
>>>> From: Seth Jennings [mailto:[email protected]]
>>>>> The decompression path calls lzo1x directly and it would be
>>>>> a huge pain to make lzo1x smart about page boundaries. BUT
>>>>> since we know that the decompressed result will always fit
>>>>> into a page (actually exactly a page), you COULD do an extra
>>>>> copy to the end of the target page (using the same smart-
>>>>> about-page-boundaries copying code from above) and then do
>>>>> in-place decompression, knowing that the decompression will
>>>>> not cross a page boundary. So, with the extra copy, the "pair
>>>>> mapping" can be avoided for decompression as well.
>>>>
>>>> This is an interesting thought.
>>>>
>>>> But this does result in a copy in the decompression (i.e. page fault)
>>>> path, where right now, it is copy free. The compressed data is
>>>> decompressed directly from its zsmalloc allocation to the page allocated
>>>> in the fault path.
>>>
>>> The page fault occurs as soon as the lzo1x compression code starts anyway,
>>> as do all the cache faults... both just occur earlier, so the only
>>> additional cost is the actual cpu instructions to move the sequence of
>>> (compressed) bytes from the zsmalloc-allocated area to the end
>>> of the target page.
>>>
>>> TLB operations can be very expensive, not to mention (as the
>>> subject of this thread attests) non-portable.
>>
>> Even if you go for copying chunks followed by decompression, it still
>> requires two kmaps and kunmaps. Each of these requires one local TLB
>> invlpg. So, a total of 2 local maps + unmaps even with this approach.
>
> That may be true for i386, but on a modern (non-highmem) machine those
> kmaps/kunmaps are free and "pair mappings" in the TLB are still expensive
> and not very portable. Doesn't make sense to me to design for better
> performance on highmem and poorer performance on non-highmem.
>
True, but it seems hard to believe that mapping+unmapping+two local TLB
invlpg's will be more costly than copying, on average, PAGE_SIZE/2 of data.
Anyway, I just looked up the LZO documentation and it supposedly allows
in-place decompression (hopefully we didn't mess up this feature during
the kernel port). Considering cache effects, this may really make the
copying overhead much less than expected. Additionally, if we go for the
copying approach, we can further simplify compaction since no client
can ever directly reference pool memory.
>> The only additional requirement of zsmalloc is that it requires two
>> mappings which are virtually contiguous. The cost is the same in both
>> approaches but the current zsmalloc approach presents a much cleaner
>> interface.
>
> OK, it's your code and I'm just making a suggestion. I will shut up now ;-)
>
I'm always glad to hear your opinions and was just trying to discuss the
points you raised. I apologize if I sounded rude.
Thanks,
Nitin
On 06/15/2012 04:23 PM, Nitin Gupta wrote:
> On 06/15/2012 01:13 PM, Dan Magenheimer wrote:
>>
>> OK, it's your code and I'm just making a suggestion. I will shut up now ;-)
>>
>
> I'm always glad to hear your opinions and was just trying to discuss the
> points you raised. I apologize if I sounded rude.
Same here. It's always good to go back and rethink your assumptions.
Thanks Dan!
Thanks,
Seth