2008-11-18 14:06:40

by Naval Saini

Subject: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

Hi:


We were facing some issues with O_DIRECT when using XFS (mkfs.xfs) to
format USB disks. We debugged it down to the following comment in the ARM
technical reference manual (which perhaps applies to other
architectures as well):


From the ARM1176 technical reference manual:

1. If multiple virtual addresses are mapped onto the same physical
address then for all mappings of bits [13:12] the virtual addresses
must be equal and the same as bits [13:12] of the physical address.
The same physical address can be mapped by TLB entries of different
page sizes, including page sizes over 4KB. Imposing this requirement
on the virtual address is called page coloring.
2. Alternatively, if all mappings to a physical address are of a page
size equal to 4KB, then the restriction that bits [13:12] of the
virtual address must equal bits [13:12] of the physical address is not
necessary. Bits [13:12] of all virtual address aliases must still be
equal.
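
In other words, with 4KB pages, bits [13:12] of an address form its page
"color", so there are four colors, and every alias of a physical page must
share one color. A small stand-alone illustration (not part of the patch;
the addresses are made up):

#include <stdio.h>

#define PAGE_SHIFT  12
#define COLOR_MASK  0x3UL            /* bits [13:12], i.e. four 4KB colors */

static unsigned long page_color(unsigned long addr)
{
    return (addr >> PAGE_SHIFT) & COLOR_MASK;
}

int main(void)
{
    unsigned long pa  = 0x20003000;  /* physical page, color 3 */
    unsigned long va1 = 0x40007000;  /* mapping with color 3: alias is allowed */
    unsigned long va2 = 0x40004000;  /* mapping with color 0: violates the restriction */

    printf("pa color %lu, va1 color %lu, va2 color %lu\n",
           page_color(pa), page_color(va1), page_color(va2));
    return 0;
}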


I am looking forward to help from the community in making this a
mainline kernel patch. I need help in reviewing it and in fixing a bug
in it (described below).



Our configuration :-

arm 11 architecture / linux-2.6.18.5 / uClibc-0.9.28 .



Proposed Fix :-

Add a mechanism to allocate memory that is page-color aligned. The patch
touches both uClibc and the Linux kernel.


The idea behind the patch / explanation :-


Modify memalign in uClibc to pass a new flag (MAP_COLORALIGN) to the
kernel (via mmap only, never sbrk); this directs the kernel to allocate
memory in which all PAs and VAs are color aligned (see the quote above
for why we need it).

The kernel then creates a vm_area for the above mmap operation (with the
MAP_COLORALIGN flag) and adds VM_COLORALIGN to vm_area->flags.

When __handle_mm_fault is called for this area, the modified functions in
mm/page_alloc.c (such as __alloc_pages, get_page_from_freelist and
buffered_rmqueue) allocate aligned memory as stated above. The parameter
'order' in these functions has been renamed 'align_order'; it carries the
needed alignment in the upper half of the word and the order (size in
pages) in the lower half. This change takes effect only when VM_COLORALIGN
is set in the vma flags.
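
For illustration, here is roughly how the alignment and order are packed
into and unpacked from align_order (a user-space sketch mirroring the
EXTRACT_ORDER / EXTRACT_ALIGN / ALIGNMENT_BITS macros added by the patch;
the constants are assumptions for ARM11, where SHMLBA is 16KB):

#include <stdio.h>

#define PAGE_SHIFT   12
#define SHMLBA       (4 * 4096)      /* 16KB: four page colors, assumed for ARM11 */
#define COLOR_SHIFT  16              /* BITLEN(unsigned short) in the patch */

/* Pack the color of the faulting address into the upper 16 bits,
 * the allocation order into the lower 16 bits. */
static unsigned int pack_align_order(unsigned long addr, unsigned int order)
{
    unsigned int color = (addr & (SHMLBA - 1)) >> PAGE_SHIFT;
    return (color << COLOR_SHIFT) | order;
}

int main(void)
{
    unsigned int align_order = pack_align_order(0x40003000, 0);  /* color 3, order 0 */

    printf("align = %u, order = %u\n",
           align_order >> COLOR_SHIFT,   /* what EXTRACT_ALIGN would return */
           align_order & 0xFFFF);        /* what EXTRACT_ORDER would return */
    return 0;
}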

Finally, we allocate the aligned pages in the function
__rmqueue_aligned_wrapper. It calls another function, __rmqueue_aligned,
which does the actual allocation. In these functions we try to allocate
the memory from larger-order free areas first (see the condition in
__rmqueue_aligned_wrapper) and then from smaller-order areas if that
fails. This is done for optimization reasons.

The extra pages at the front and back of the allocated area are freed
back to the buddy lists in the function expand_num_pages.
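
To sketch the arithmetic involved (plain integers only, no kernel
structures; the numbers are made up): suppose we need one page (order 0)
of color 1 and the scan finds a free 8-page block (current_order 3)
starting at a pfn whose color is 2. Then:

#include <stdio.h>

#define COLORS 4    /* SHMLBA >> PAGE_SHIFT, assumed for ARM11 */

int main(void)
{
    unsigned int order = 0, current_order = 3;   /* want 1 page out of an 8-page block */
    unsigned long pfn  = 0x1002;                 /* block color = pfn % COLORS = 2 */
    unsigned int want  = 1;                      /* required color */

    unsigned int have     = pfn % COLORS;
    unsigned int frontpad = (want >= have) ? (want - have) : (COLORS + want - have);
    unsigned int lastpad  = (1u << current_order) - (1u << order) - frontpad;

    /* frontpad pages and lastpad pages go back to the buddy free lists;
     * the middle run of 2^order pages is returned to the caller */
    printf("frontpad=%u, allocated pfn=%#lx (color %lu), lastpad=%u\n",
           frontpad, pfn + frontpad, (pfn + frontpad) % COLORS, lastpad);
    return 0;
}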

I have also obtained permission from my company's legal department to
post the patch to the community.



Bug in the patch :-

I also need help with a bug in the patch. I stumbled across a crash with
the simple program below, when no USB disk was connected (i.e. /dev/sda
did not have an associated device):

#include <stdlib.h>
#include <fcntl.h>

#define ALIGN (16*1024)
#define BS (512)

int main(void)
{
    void *b2 = NULL, *ob2;
    int sda;

    posix_memalign(&b2, ALIGN, BS); ob2 = b2;
    sda = open("/dev/sda", O_RDWR); /* get the dump here */
    return 0;
}


If I change ALIGN to some lesser value, say 4096, I don't get the crash.
Perhaps there is some buddy allocator concept I need to understand in the
way I free pages in expand_num_pages (I am hoping someone can spot it in
the patch).

Please also suggest how I can do some regression testing here.


The dump from the above program:


./test_odirect1
Unable to handle kernel NULL pointer dereference at virtual address 00000084
pgd = 97044000
[00000084] *pgd=00000000
Internal error: Oops: 17 [#1]
Modules linked in:
CPU: 0
PC is at do_page_fault+0x20/0x244
LR is at do_DataAbort+0x3c/0xa0
pc : [<9706b748>] lr : [<9706bb2c>] Not tainted
sp : 97720058 ip : 176ce431 fp : 9772009c
r10: 00000000 r9 : 40000113 r8 : 97720150
r7 : 00000000 r6 : 9738966c r5 : 00000017 r4 : 973895fc
r3 : 97720000 r2 : 97720150 r1 : 00000017 r0 : 00000084
Flags: nzcv IRQs on FIQs on Mode SVC_32 Segment user
Control: C5387D Table: 17720000 DAC: 00000017
Process modprobe (pid: 83, stack limit = 0x9771e250)
Stack: (0x97720058 to 0x97720000)
Backtrace:
[<9706b728>] (do_page_fault+0x0/0x244) from [<9706bb2c>]
(do_DataAbort+0x3c/0xa0)
[<9706baf0>] (do_DataAbort+0x0/0xa0) from [<9706482c>] (__dabt_svc+0x4c/0x60)
r8 = 97720290 r7 = 00000000 r6 = 9738966C r5 = 97720184
r4 = FFFFFFFF
[<9706b728>] (do_page_fault+0x0/0x244) from [<9706bb2c>]
(do_DataAbort+0x3c/0xa0)
[<9706baf0>] (do_DataAbort+0x0/0xa0) from [<9706482c>] (__dabt_svc+0x4c/0x60)
r8 = 977203D0 r7 = 00000000 r6 = 9738966C r5 = 977202C4
r4 = FFFFFFFF
[<9706b728>] (do_page_fault+0x0/0x244) from [<9706bb2c>]
(do_DataAbort+0x3c/0xa0)

........ (so on)




PATCH 1 -- applies to the kernel
--------------------



--- linux-2.6.18.5/arch/arm/mm/mmap.c 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/arch/arm/mm/mmap.c 2008-11-07 19:29:09.000000000 +0530
@@ -41,7 +41,7 @@ arch_get_unmapped_area(struct file *filp
if (cache_type != read_cpuid(CPUID_ID)) {
aliasing = (cache_type | cache_type >> 12) & (1 << 11);
if (aliasing)
- do_align = filp || flags & MAP_SHARED;
+ do_align = filp || flags & MAP_SHARED || flags & MAP_COLORALIGN;
}
#else
#define do_align 0
--- linux-2.6.18.5/mm/page_alloc.c 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/mm/page_alloc.c 2008-11-07 19:31:43.000000000 +0530
@@ -519,6 +519,32 @@ static inline void expand(struct zone *z
}
}

+/* The order in which pages are added to the free areas should not be changed,
+ * i.e. smaller blocks (with smaller order) towards the front and larger towards the back.
+ */
+static inline void expand_num_pages(struct zone *zone, struct page *page, unsigned int size)
+{
+ unsigned int mask, order;
+ struct free_area *area;
+
+ mask = 1;
+ order = 0;
+ while (size) {
+ if ( mask & size ) {
+ area = zone->free_area + order;
+ BUG_ON(bad_range(zone, page));
+ list_add(&page->lru, &area->free_list);
+ area->nr_free++;
+ set_page_order(page, order);
+
+ page = &page[mask];
+ size &= (~mask);
+ }
+ mask <<= 1;
+ ++order;
+ }
+}
+
/*
* This page is about to be returned from the page allocator
*/
@@ -591,6 +617,110 @@ static struct page *__rmqueue(struct zon
return NULL;
}

+#define log2(n) ffz(~(n))
+
+/*
+ * __rmqueue_aligned :-
+ * Tries to allocate aligned pages of given order, from among pages of current_order.
+ */
+static struct page *__rmqueue_aligned(struct zone *zone, unsigned int align_order, unsigned int current_order)
+{
+ struct free_area *area;
+ struct list_head *pos;
+ struct page *page;
+
+ unsigned int frontpad, padsize, lastpad, page_align;
+ const unsigned int extra = (SHMLBA >> PAGE_SHIFT), extra_order = log2(extra);
+ unsigned int order = EXTRACT_ORDER(__GFP_COLORALIGN,align_order);
+ unsigned int align = EXTRACT_ALIGN(__GFP_COLORALIGN,align_order);
+
+ area = zone->free_area + current_order;
+ if (list_empty(&area->free_list))
+ return NULL;
+
+ list_for_each(pos,&area->free_list) {
+ page = list_entry(pos, struct page, lru);
+ page_align = page_to_pfn(page) & (extra-1);
+
+ if ( align == page_align ) { /* color alignment matches */
+ frontpad = 0;
+ break;
+ }
+
+ if ( current_order > order ) /* check that it is not a tight fit; can't allocate from a tight fit */
+ {
+ frontpad = (align > page_align) ? (align - page_align) : (extra + align - page_align);
+ /* move below at 2 places, for efficiency */
+
+ if ( current_order > extra_order ) /* allocate from inside node */
+ break;
+
+ if ( (unsigned int) ((0x1UL<<current_order) - (0x1UL<<order)) >= frontpad)
+ break;
+ }
+ }
+
+ if ( pos == &area->free_list) /* is true when we did not find any suitable match in the area */
+ return NULL;
+
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ zone->free_pages -= 1UL << order;
+
+ /* Allocation looks as below :-
+ * [current_order pages] = [frontpad pages] + [order pages] + [lastpad pages] */
+ padsize = (0x01UL<<order) + frontpad;
+ lastpad = (0x01UL<<current_order) - padsize;
+
+ printk (" current %lu = ", 0x1UL<<current_order);
+ printk (" frontpad %u + order %lu + lastpad %u \n", frontpad, 0x1UL<<order, lastpad);
+
+ /* we need to remove pages in front and back of allocation */
+ if (lastpad)
+ expand_num_pages (zone, &page[padsize], lastpad);
+ if (frontpad) {
+ expand_num_pages (zone, page, frontpad);
+ page = &page[frontpad];
+ rmv_page_order(page);
+ }
+
+ return page;
+}
+
+/* __rmqueue_aligned_wrapper :-
+ * calls __rmqueue_aligned in a manner optimized for speed. __rmqueue_aligned accepts a parameter
+ * current_order, which is the area we check for free pages. We first check the larger areas, thus
+ * spending less time looping in __rmqueue_aligned.
+ */
+static struct page *__rmqueue_aligned_wrapper(struct zone *zone, unsigned int align_order)
+{
+ struct page *page = NULL;
+ unsigned int optim_order, current_order;
+ const unsigned int extra = (SHMLBA >> PAGE_SHIFT), extra_order = log2(extra);
+ unsigned int order = EXTRACT_ORDER(__GFP_COLORALIGN,align_order);
+ unsigned int align = EXTRACT_ALIGN(__GFP_COLORALIGN,align_order);
+
+ BUG_ON(align >= extra);
+
+ /* starting with higher order areas, results in faster find */
+ if ( order == 0 )
+ optim_order = extra_order;
+ else if ( order <= extra_order )
+ optim_order = extra_order + 1;
+ else
+ optim_order = order + 1;
+
+ for (current_order = optim_order; current_order < MAX_ORDER && page == NULL; current_order++)
+ page = __rmqueue_aligned(zone, align_order, current_order);
+
+ for (current_order = optim_order-1; current_order >= order && page == NULL; current_order--)
+ page = __rmqueue_aligned(zone, align_order, current_order);
+
+ return page;
+}
+
+
/*
* Obtain a specified number of elements from the buddy allocator, all under
* a single hold of the lock, for efficiency. Add them to the supplied list.
@@ -773,16 +903,17 @@ void split_page(struct page *page, unsig
* or two.
*/
static struct page *buffered_rmqueue(struct zonelist *zonelist,
- struct zone *zone, int order, gfp_t gfp_flags)
+ struct zone *zone, int align_order, gfp_t gfp_flags)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
+ unsigned int order = EXTRACT_ORDER(gfp_flags,align_order);

again:
cpu = get_cpu();
- if (likely(order == 0)) {
+ if (likely(order == 0 && !(gfp_flags & __GFP_COLORALIGN))) {
struct per_cpu_pages *pcp;

pcp = &zone_pcp(zone, cpu)->pcp[cold];
@@ -798,9 +929,13 @@ again:
pcp->count--;
} else {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ if ( likely(!(gfp_flags & __GFP_COLORALIGN)) ) {
+ page = __rmqueue(zone, order);
+ } else {
+ page = __rmqueue_aligned_wrapper(zone, align_order);
+ }
spin_unlock(&zone->lock);
- if (!page)
+ if (!page)
goto failed;
}

@@ -864,12 +999,13 @@ int zone_watermark_ok(struct zone *z, in
* a page.
*/
static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, unsigned int align_order,
struct zonelist *zonelist, int alloc_flags)
{
struct zone **z = zonelist->zones;
struct page *page = NULL;
int classzone_idx = zone_idx(*z);
+ unsigned int order = EXTRACT_ORDER(gfp_mask,align_order);

/*
* Go through the zonelist once, looking for a zone with enough free.
@@ -895,7 +1031,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
continue;
}

- page = buffered_rmqueue(zonelist, *z, order, gfp_mask);
+ page = buffered_rmqueue(zonelist, *z, align_order, gfp_mask);
if (page) {
break;
}
@@ -905,9 +1041,13 @@ get_page_from_freelist(gfp_t gfp_mask, u

/*
* This is the 'heart' of the zoned buddy allocator.
+ *
+ * VIPT cache fix :-
+ * The upper half (i.e. the upper 16 bits of a 32-bit integer) of align_order
+ * is used to pass the required page color alignment for the allocated page.
*/
struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
+__alloc_pages(gfp_t gfp_mask, unsigned int align_order,
struct zonelist *zonelist)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -918,6 +1058,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
int do_retry;
int alloc_flags;
int did_some_progress;
+ unsigned int order = EXTRACT_ORDER(gfp_mask,align_order);

might_sleep_if(wait);

@@ -929,7 +1070,7 @@ restart:
return NULL;
}

- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, align_order,
zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
@@ -964,7 +1105,7 @@ restart:
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ page = get_page_from_freelist(gfp_mask, align_order, zonelist, alloc_flags);
if (page)
goto got_pg;

@@ -975,7 +1116,7 @@ restart:
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, align_order,
zonelist, ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
@@ -1008,7 +1149,7 @@ rebalance:
cond_resched();

if (likely(did_some_progress)) {
- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, align_order,
zonelist, alloc_flags);
if (page)
goto got_pg;
@@ -1019,7 +1160,7 @@ rebalance:
* a parallel oom killing, we must fail if we're still
* under heavy pressure.
*/
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, align_order,
zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
goto got_pg;
--- linux-2.6.18.5/include/asm-arm/mman.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/asm-arm/mman.h 2008-11-07 19:29:09.000000000 +0530
@@ -11,6 +11,11 @@
#define MAP_POPULATE 0x8000 /* populate (prefault) page tables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */

+#ifdef CONFIG_CPU_V6
+# undef MAP_COLORALIGN
+# define MAP_COLORALIGN 0x0200 /* For VIVT caches - the alignment should be color aligned */
+#endif
+
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

--- linux-2.6.18.5/include/asm-arm/page.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/asm-arm/page.h 2008-11-07 19:29:09.000000000 +0530
@@ -134,6 +134,17 @@ extern void __cpu_copy_user_page(void *t
#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
extern void copy_page(void *to, const void *from);

+#ifdef CONFIG_CPU_V6
+#define BITLEN(x) (sizeof(x)<<3) /* for bytes to bits, multiply by 8 */
+#define EXTRACT_ORDER(gfp,num) ( (gfp & __GFP_COLORALIGN) ? (num & 0xFFFF) : (num) )
+#define EXTRACT_ALIGN(gfp,num) ( (gfp & __GFP_COLORALIGN) ? (num>>BITLEN(unsigned short)) : (0) )
+
+#define ALIGNMENT_BITS(addr) (((addr&(SHMLBA-1))>>PAGE_SHIFT) << BITLEN(unsigned short))
+#define ARCH_ALIGNORDER(flags,addr,order) ((flags & VM_COLORALIGN)? (ALIGNMENT_BITS(addr) | order) : (0))
+
+#define ARCH_ALIGNGFP(flags) ((flags & VM_COLORALIGN)? (__GFP_COLORALIGN):(0))
+#endif
+
#undef STRICT_MM_TYPECHECKS

#ifdef STRICT_MM_TYPECHECKS
--- linux-2.6.18.5/include/linux/mman.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/linux/mman.h 2008-11-07 19:29:09.000000000 +0530
@@ -63,7 +63,8 @@ calc_vm_flag_bits(unsigned long flags)
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_COLORALIGN, VM_COLORALIGN);
}
#endif /* __KERNEL__ */
#endif /* _LINUX_MMAN_H */
--- linux-2.6.18.5/include/linux/mm.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/linux/mm.h 2008-11-07 19:29:09.000000000 +0530
@@ -160,6 +160,7 @@ extern unsigned int kobjsize(const void
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
#define VM_RESERVED 0x00080000 /* Count as reserved_vm like IO */
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
+#define VM_COLORALIGN 0x00200000 /* The vma is aligned with colour bits for VIVT cache */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
--- linux-2.6.18.5/include/linux/gfp.h 2008-11-07 19:33:55.000000000 +0530
+++ linux-2.6.18.5/include/linux/gfp.h 2008-11-07 19:35:50.000000000 +0530
@@ -47,6 +47,8 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */

+#define __GFP_COLORALIGN ((__force gfp_t)0x40000u) /* Used by processors with VIPT caches */
+
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

@@ -106,8 +108,9 @@ extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
- unsigned int order)
+ unsigned int align_order)
{
+ unsigned int order = EXTRACT_ORDER(gfp_mask,align_order);
if (unlikely(order >= MAX_ORDER))
return NULL;

@@ -115,7 +118,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order,
+ return __alloc_pages(gfp_mask, align_order,
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}

@@ -133,9 +136,11 @@ alloc_pages(gfp_t gfp_mask, unsigned int
extern struct page *alloc_page_vma(gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr);
#else
-#define alloc_pages(gfp_mask, order) \
- alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+
+#define alloc_pages(gfp_mask, align_order) \
+ alloc_pages_node(numa_node_id(), gfp_mask, align_order)
+#define alloc_page_vma(gfp_mask, vma, addr) \
+ alloc_pages(gfp_mask|ARCH_ALIGNGFP(vma->vm_flags), ARCH_ALIGNORDER(vma->vm_flags,addr,0))
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)

--- linux-2.6.18.5/include/linux/highmem.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/linux/highmem.h 2008-11-07 20:11:44.000000000 +0530
@@ -67,6 +67,12 @@ alloc_zeroed_user_highpage(struct vm_are
}
#endif

+#ifndef ARCH_ALIGNORDER
+#define EXTRACT_ORDER(gfp,num) (num)
+#define EXTRACT_ALIGN(gfp,num) (0)
+#define ARCH_ALIGNORDER(flag,align,order) (0)
+#endif
+
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
--- linux-2.6.18.5/include/asm-generic/mman.h 2006-12-02 05:43:05.000000000 +0530
+++ linux-2.6.18.5/include/asm-generic/mman.h 2008-11-07 19:29:09.000000000 +0530
@@ -20,6 +20,9 @@
#define MAP_FIXED 0x10 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x20 /* don't use a file */

+#define MAP_COLORALIGN 0x0 /* can be redefined (undef) to 0x0200 in the arch-specific mman.h,
+ if the processor has a VIVT cache with multiple associativity */
+
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
#define MS_SYNC 4 /* synchronous memory sync */




PATCH 2 -- applies to uclibc ( in buildroot directory toolchain/kernel-headers )
-------------------


--- linux/include/asm-arm/mman.h 2008-10-30 16:16:31.000000000 +0530
+++ linux/include/asm-arm/mman.h 2008-10-30 16:16:46.000000000 +0530
@@ -3,6 +3,8 @@

#include <asm-generic/mman.h>

+#define MAP_COLORALIGN 0x0200 /* For VIVT caches - the alignment should be color aligned */
+
#define MAP_GROWSDOWN 0x0100 /* stack-like segment */
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */




PATCH 3 -- applies to uclibc ( in buildroot directory toolchain/uClibc )
---------------


--- uClibc/libc/stdlib/malloc-standard/malloc.c 2008-10-31 13:03:29.000000000 +0530
+++ uClibc/libc/stdlib/malloc-standard/malloc.c 2008-11-03 16:28:16.000000000 +0530
@@ -342,7 +342,15 @@ void __do_check_malloc_state(void)
space to service request for nb bytes, thus requiring that av->top
be extended or replaced.
*/
-static void* __malloc_alloc(size_t nb, mstate av)
+
+/* To do :-
+ Check if brk can be used to allocate smaller chunks. Maybe sbrk call checks
+ for vma_flags (such as MAP_COLORALIGN flag) and only allocates/expands
+ areas that use the same flags. If such functionality is not there,
+ perhaps it can be added in future. For now, all MAP_COLORALIGN allocations
+ go through mmap2.
+*/
+static void* __malloc_alloc(size_t nb, mstate av, const int map_flags)
{
mchunkptr old_top; /* incoming value of av->top */
size_t old_size; /* its size */
@@ -374,7 +382,7 @@ static void* __malloc_alloc(size_t nb, m
than in malloc proper.
*/

- if (have_fastchunks(av)) {
+ if (have_fastchunks(av) && !(map_flags & MAP_COLORALIGN)) {
assert(in_smallbin_range(nb));
__malloc_consolidate(av);
return malloc(nb - MALLOC_ALIGN_MASK);
@@ -389,7 +397,7 @@ static void* __malloc_alloc(size_t nb, m
*/

if ((unsigned long)(nb) >= (unsigned long)(av->mmap_threshold) &&
- (av->n_mmaps < av->n_mmaps_max)) {
+ (av->n_mmaps < av->n_mmaps_max) || (map_flags & MAP_COLORALIGN)) {

char* mm; /* return value from mmap call*/

@@ -403,7 +411,7 @@ static void* __malloc_alloc(size_t nb, m
/* Don't try if size wraps around 0 */
if ((unsigned long)(size) > (unsigned long)(nb)) {

- mm = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE));
+ mm = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE, map_flags));

if (mm != (char*)(MORECORE_FAILURE)) {

@@ -523,7 +531,7 @@ static void* __malloc_alloc(size_t nb, m
/* Don't try if size wraps around 0 */
if ((unsigned long)(size) > (unsigned long)(nb)) {

- fst_brk = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE));
+ fst_brk = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE, map_flags));

if (fst_brk != (char*)(MORECORE_FAILURE)) {

@@ -802,6 +810,20 @@ static int __malloc_largebin_index(unsig
/* ------------------------------ malloc ------------------------------ */
void* malloc(size_t bytes)
{
+ return __internal_malloc(bytes, 0);
+}
+
+
+/* ---------------------- __internal_malloc ------------------------------ */
+
+/* Why did we add ' map_flags ' to parameter list ?
+ Using map_flags, we can inform the kernel of the following :-
+ - the vma is page color aligned (for example, this is needed for O_DIRECT
+ on a VIVT, multi-way associative cache).
+ The first user of __internal_malloc would be memalign.
+*/
+void* __internal_malloc(size_t bytes, const int map_flags)
+{
mstate av;

size_t nb; /* normalized request size */
@@ -851,6 +873,9 @@ void* malloc(size_t bytes)
goto use_top;
}

+ /* for VIVT caches, if we want color alignment, only use __malloc_alloc */
+ if (map_flags & MAP_COLORALIGN) goto use_top;
+
/*
If the size qualifies as a fastbin, first check corresponding bin.
*/
@@ -927,7 +952,8 @@ void* malloc(size_t bytes)
if (in_smallbin_range(nb) &&
bck == unsorted_chunks(av) &&
victim == av->last_remainder &&
- (unsigned long)(size) > (unsigned long)(nb + MINSIZE)) {
+ (unsigned long)(size) > (unsigned long)(nb + MINSIZE))
+ {

/* split and reattach remainder */
remainder_size = size - nb;
@@ -1142,7 +1168,7 @@ use_top:
victim = av->top;
size = chunksize(victim);

- if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
+ if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE) && !map_flags) {
remainder_size = size - nb;
remainder = chunk_at_offset(victim, nb);
av->top = remainder;
@@ -1155,7 +1181,7 @@ use_top:
}

/* If no space in top, relay to handle system-dependent cases */
- sysmem = __malloc_alloc(nb, av);
+ sysmem = __malloc_alloc(nb, av, map_flags);
retval = sysmem;
DONE:
__MALLOC_UNLOCK;
--- uClibc/libc/stdlib/malloc-standard/malloc.h 2008-10-31 13:03:35.000000000 +0530
+++ uClibc/libc/stdlib/malloc-standard/malloc.h 2008-11-03 17:59:57.000000000 +0530
@@ -347,6 +347,7 @@ __UCLIBC_MUTEX_EXTERN(__malloc_lock);
/* ------------------ MMAP support ------------------ */
#include <fcntl.h>
#include <sys/mman.h>
+#include <asm/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
@@ -354,13 +355,13 @@ __UCLIBC_MUTEX_EXTERN(__malloc_lock);

#ifdef __ARCH_USE_MMU__

-#define MMAP(addr, size, prot) \
- (mmap((addr), (size), (prot), MAP_PRIVATE|MAP_ANONYMOUS, 0, 0))
+#define MMAP(addr, size, prot, map) \
+ (mmap((addr), (size), (prot), MAP_PRIVATE|MAP_ANONYMOUS|map, 0, 0))

#else

-#define MMAP(addr, size, prot) \
- (mmap((addr), (size), (prot), MAP_SHARED|MAP_ANONYMOUS, 0, 0))
+#define MMAP(addr, size, prot, map) \
+ (mmap((addr), (size), (prot), MAP_SHARED|MAP_ANONYMOUS|map, 0, 0))

#endif

@@ -931,6 +932,7 @@ extern struct malloc_state __malloc_stat

/* External internal utilities operating on mstates */
void __malloc_consolidate(mstate) attribute_hidden;
+void* __internal_malloc(size_t bytes, const int map_flags);


/* Debugging support */
--- uClibc/libc/stdlib/malloc-standard/memalign.c 2008-10-31 13:03:47.000000000 +0530
+++ uClibc/libc/stdlib/malloc-standard/memalign.c 2008-11-03 16:27:09.000000000 +0530
@@ -59,9 +59,11 @@ void* memalign(size_t alignment, size_t
* request, and then possibly free the leading and trailing space. */


- /* Call malloc with worst case padding to hit alignment. */
-
- m = (char*)(malloc(nb + alignment + MINSIZE));
+ /* Call __internal_malloc with worst case padding to hit alignment.
+ Note: MAP_COLORALIGN is disregarded by the kernel if the architecture
+ does not require it. For example, it is needed on ARM11 processors.
+ */
+ m = (char*)(__internal_malloc(nb + alignment + MINSIZE, MAP_COLORALIGN));

if (m == 0) {
retval = 0; /* propagate failure */





Regards,
Naval
NXP Semiconductors


2008-11-19 06:40:52

by Nick Piggin

Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

Hi,

On Wednesday 19 November 2008 01:06, Naval Saini wrote:
> Hi:
>
>
> We were facing some issues with O_DIRECT, when using XFS (mkfs.xfs) to
> format the USB disks. We debugged it to following comment in arm
> technical reference manual (which perhaps applies to other
> architectures aswell) :-
>
>
> From arm 1176 technical ref. manual :-
>
> 1. If multiple virtual addresses are mapped onto the same physical
> address then for all mappings of bits [13:12] the virtual addresses
> must be equal and the same as bits [13:12] of the physical address.
> The same physical address can be mapped by TLB entries of different
> page sizes, including page sizes over 4KB. Imposing this requirement
> on the virtual address is called page coloring.
> 2. Alternatively, if all mappings to a physical address are of a page
> size equal to 4KB, then the restriction that bits [13:12] of the
> virtual address must equal bits [13:12] of the physical address is not
> necessary. Bits [13:12] of all virtual address aliases must still be
> equal.
>
>
> I am looking forward to help from the community in making this patch a
> mainline kernel patch. It need help in reviewing it and fixing a bug
> in it (described below).
>
>
>
> Our configuration :-
>
> arm 11 architecture / linux-2.6.18.5 / uClibc-0.9.28 .

It would be interesting to know exactly what problem you are seeing.

ARM I think is supposed to handle aliasing problems by flushing
caches at appropriate points. It would be nice to know what's going
wrong and whether we can cover those holes.

Does the problem show up on the most recent kernel.org kernels?


> Proposed Fix :-
>
> Add a mechanism to allocate memory that is page color aligned. Patch
> in uclibc and linux kernel.

We've traditionally avoided cache colouring in the page allocator for
one reason or another. It should be a workable approach, but I think
the way to get it into mainline is to first fix the problem using the
existing cache flushing approach, and then demonstrate how page
colouring is a better solution.




2008-11-19 20:43:47

by Russell King - ARM Linux

Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Wed, Nov 19, 2008 at 05:40:23PM +1100, Nick Piggin wrote:
> It would be interesting to know exactly what problem you are seeing.
>
> ARM I think is supposed to handle aliasing problems by flushing
> caches at appropriate points. It would be nice to know what's going
> wrong and whether we can cover those holes.

I think there's a problem here: the existing cache handling API is
designed around the MM's manipulation of page tables. It is generally
not designed to handle aliasing between multiple mappings of the same
page, except with one exception: page cache pages mmap'd into userspace,
which is handled via flush_dcache_page().

O_DIRECT on ARM is probably completely untested for the most part. It's
not something that is encountered very often, and as such gets zero
testing. I've certainly never had the tools to be able to test it out,
so even on VIVT it's probably completely buggy. Bear in mind that most
of my modern platforms use MTD (either cramfs or in the rare case jffs2)
so I'm not sure that I could sensibly even test O_DIRECT - isn't O_DIRECT
for use with proper block devices?

As it's probably clear, I've no clue of the O_DIRECT implementation or
how it's supposed to work. At a guess, it probably needs a new cache
handling API to be designed, since the existing flush_cache_(range|page)
are basically no-ops on VIPT. Or maybe we need some way to mark the
userspace pages as being write-through or uncacheable depending on the
features we have available on the processor. Or something. Don't know.

So, what is O_DIRECT, and where can I find some information about it?
I don't see anything in Documentation/ describing it.

2008-11-20 06:59:24

by Nick Piggin

Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thursday 20 November 2008 07:43, Russell King - ARM Linux wrote:
> On Wed, Nov 19, 2008 at 05:40:23PM +1100, Nick Piggin wrote:
> > It would be interesting to know exactly what problem you are seeing.
> >
> > ARM I think is supposed to handle aliasing problems by flushing
> > caches at appropriate points. It would be nice to know what's going
> > wrong and whether we can cover those holes.
>
> I think there's a problem here: the existing cache handling API is
> designed around the MM's manipulation of page tables. It is generally
> not designed to handle aliasing between multiple mappings of the same
> page, except with one exception: page cache pages mmap'd into userspace,
> which is handled via flush_dcache_page().

Right, flush_dcache_page.


> O_DIRECT on ARM is probably completely untested for the most part. It's
> not something that is encountered very often, and as such gets zero
> testing. I've certainly never had the tools to be able to test it out,
> so even on VIVT it's probably completely buggy. Bear in mind that most
> of my modern platforms use MTD (either cramfs or in the rare case jffs2)
> so I'm not sure that I could sensibly even test O_DIRECT - isn't O_DIRECT
> for use with proper block devices?

Yes, although there are other things that use get_user_pages as well
(eg. splice, ptrace), so it would be interesting to know whether they
have corner-case problems as well.


> As it's probably clear, I've no clue of the O_DIRECT implementation or
> how it's supposed to work. At a guess, it probably needs a new cache
> handling API to be designed, since the existing flush_cache_(range|page)
> are basically no-ops on VIPT. Or maybe we need some way to mark the
> userspace pages as being write-through or uncacheable depending on the
> features we have available on the processor. Or something. Don't know.
>
> So, what is O_DIRECT, and where can I find some information about it?
> I don't see anything in Documentation/ describing it.

(open(2), with O_DIRECT flag)

O_DIRECT feeds the block layer with "get_user_pages()" pages (user-mapped
pages) rather than pagecache pages. However, pagecache pages can also be
user mapped, so I'm thinking there should be enough cache flushing APIs
to be able to handle either case.

Basically, an O_DIRECT write involves:

- The program storing into some virtual address, then passing that virtual
address as the buffer to write(2).

- The kernel will get_user_pages() to get the struct page * of that user
virtual address. At this point, get_user_pages does flush_dcache_page.
(Which should write back the user caches?)

- Then the struct page is sent to the block layer (it won't tend to be
touched by the kernel via the kernel linear map, unless we have something
like an "emulated" block device such as 'brd').

- Even if it is read via the kernel linear map, AFAIKS, we should be OK
due to the flush_dcache_page().

An O_DIRECT read involves:

- Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
user mapping should not have any previously dirtied lines around.

- The page is sent to the block layer, which stores into the page. Some
block devices like 'brd' will potentially store via the kernel linear map
here, and they probably don't do enough cache flushing. But a regular
block device should go via DMA, which AFAIK should be OK? (the user address
should remain invalidated because it would be a bug to read from the buffer
before the read has completed)

So I can't see exactly where the problem would be. Can you?
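
To make the write-side sequence above concrete, here is a rough sketch of
how the user buffer gets pinned, assuming the 2.6.18-era get_user_pages()
prototype; pin_user_buffer() is a hypothetical name used only for
illustration, not code from any posted patch.

#include <linux/sched.h>
#include <linux/mm.h>

/*
 * Hypothetical sketch of the O_DIRECT write path described above: pin the
 * user buffer with get_user_pages(), which is also the point where each
 * pinned page gets a flush_dcache_page(), then hand the struct pages to
 * the block layer (bio construction is omitted here).
 */
static int pin_user_buffer(unsigned long uaddr, int nr_pages,
			   struct page **pages)
{
	int pinned;

	down_read(&current->mm->mmap_sem);
	/* 2.6.18 prototype: tsk, mm, start, len, write, force, pages, vmas.
	 * For an O_DIRECT write the kernel only reads the user memory, so
	 * write = 0; an O_DIRECT read would pass write = 1. */
	pinned = get_user_pages(current, current->mm, uaddr, nr_pages,
				0, 0, pages, NULL);
	up_read(&current->mm->mmap_sem);

	return pinned;
}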

2008-11-20 09:24:57

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> Basically, an O_DIRECT write involves:
>
> - The program storing into some virtual address, then passing that virtual
> address as the buffer to write(2).
>
> - The kernel will get_user_pages() to get the struct page * of that user
> virtual address. At this point, get_user_pages does flush_dcache_page.
> (Which should write back the user caches?)
>
> - Then the struct page is sent to the block layer (it won't tend to be
> touched by the kernel via the kernel linear map, unless we have like an
> "emulated" block device block device like 'brd').
>
> - Even if it is read via the kernel linear map, AFAIKS, we should be OK
> due to the flush_dcache_page().

That seems sane, and yes, flush_dcache_page() will write back and
invalidate dirty cache lines in both the kernel and user mappings.

> An O_DIRECT read involves:
>
> - Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
> user mapping should not have any previously dirtied lines around.
>
> - The page is sent to the block layer, which stores into the page. Some
> block devices like 'brd' will potentially store via the kernel linear map
> here, and they probably don't do enough cache flushing. But a regular
> block device should go via DMA, which AFAIK should be OK? (the user address
> should remain invalidated because it would be a bug to read from the buffer
> before the read has completed)

This is where things get icky with lots of drivers - DMA is fine, but
many PIO based drivers don't handle the implications of writing to the
kernel page cache page when there may be CPU cache side effects.

If the cache is in read allocate mode, then in this case there shouldn't
be any dirty cache lines. (That's not always the case though, esp. via
conventional IO.) If the cache is in write allocate mode, PIO data will
sit in the kernel mapping and won't be visible to userspace.

That is a years-old bug, one that I've been unable to run tests for here
(because my platforms don't have the right combinations of CPUs supporting
write alloc and/or a problem block driver.) I've even been accused of
being uncooperative over testing possible bug fixes by various people
(if I don't have hardware which can show the problem, how can I test
possible fixes?) So I've given up with that issue - as far as I'm
concerned, it's a problem for others to sort out.

Do we know what hardware, which IO drivers are being used, and any
relevant configuration of the drivers?

2008-11-20 10:55:45

by Naval Saini

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

Hi Russell/Nick:

Thanks for the replies. I would like to reply to these mails as soon as
possible (out of interest as well), but have been unable to for various
reasons. I do not yet understand everything that has been discussed and
will spend some more time on it before I reply to your suggestions.

We are working on ARM 11 based embedded systems and were debugging an
issue with the mkfs.xfs utility (which uses O_DIRECT). After quite a bit
of debugging, we found a pattern in the addresses of the user space
buffers allocated with memalign. For some of these buffer addresses, the
format would result in a corrupt disk. By applying this patch, I was
able to allocate memory that was page color aligned (as required by the
ARM technical reference manual). We have observed the same problems on
VIVT cache based systems (ARM 9 processor). We also found that we could
make the problem go away by disabling the cache.

Also, the problem was observed not during the O_DIRECT write but at the
time of the read. When the same buffer was used and the user virtual
address (UVA) and kernel virtual address (KVA) did not meet the alignment
requirements, the D-cache invalidate operation (in the consistent_sync
function in consistent.c) on the KVA did not have the desired effect.
When the same memory was then accessed through the UVA (which should have
been invalidated along with the KVA), it corrupted the buffer we had just
read; in the mkfs.xfs utility, that corrupted the disk format. In layman's
terms, it looked like the KVA and UVA occupied separate locations in the
cache when they were used one after the other (specifically the KVA after
the UVA).
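
As a minimal illustration of that aliasing condition (assuming a 16KB VIPT
D-cache with 4KB pages, where virtual address bits [13:12] select the cache
colour; this is not part of the posted patch):

#include <stdio.h>

/* Two virtual mappings of the same physical page can hold separate cache
 * lines if their colour bits [13:12] differ; invalidating one alias then
 * leaves stale data visible through the other. */
#define COLOUR_MASK 0x3000UL	/* bits [13:12], assuming 4KB pages */

static int same_cache_colour(unsigned long uva, unsigned long kva)
{
	return ((uva ^ kva) & COLOUR_MASK) == 0;
}

int main(void)
{
	/* prints 0: the two addresses map to different cache colours */
	printf("%d\n", same_cache_colour(0x40001000UL, 0xc0003000UL));
	return 0;
}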

Hope this is of immediate help. I will comment on the strategies
proposed by Nick in my next mail.

Further, as I have said, our kernel version is linux-2.6.18.5 with
uClibc-0.9.28; the mkfs version is irrelevant.

Thanks,
Naval


On 11/20/08, Russell King - ARM Linux <[email protected]> wrote:
> On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> > Basically, an O_DIRECT write involves:
> >
> > - The program storing into some virtual address, then passing that virtual
> > address as the buffer to write(2).
> >
> > - The kernel will get_user_pages() to get the struct page * of that user
> > virtual address. At this point, get_user_pages does flush_dcache_page.
> > (Which should write back the user caches?)
> >
> > - Then the struct page is sent to the block layer (it won't tend to be
> > touched by the kernel via the kernel linear map, unless we have like an
> > "emulated" block device block device like 'brd').
> >
> > - Even if it is read via the kernel linear map, AFAIKS, we should be OK
> > due to the flush_dcache_page().
>
>
> That seems sane, and yes, flush_dcache_page() will write back and
> invalidate dirty cache lines in both the kernel and user mappings.
>
>
> > An O_DIRECT read involves:
> >
> > - Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
> > user mapping should not have any previously dirtied lines around.
> >
> > - The page is sent to the block layer, which stores into the page. Some
> > block devices like 'brd' will potentially store via the kernel linear map
> > here, and they probably don't do enough cache flushing. But a regular
> > block device should go via DMA, which AFAIK should be OK? (the user address
> > should remain invalidated because it would be a bug to read from the buffer
> > before the read has completed)
>
>
> This is where things get icky with lots of drivers - DMA is fine, but
> many PIO based drivers don't handle the implications of writing to the
> kernel page cache page when there may be CPU cache side effects.
>
> If the cache is in read allocate mode, then in this case there shouldn't
> be any dirty cache lines. (That's not always the case though, esp. via
> conventional IO.) If the cache is in write allocate mode, PIO data will
> sit in the kernel mapping and won't be visible to userspace.
>
> That is a years-old bug, one that I've been unable to run tests for here
> (because my platforms don't have the right combinations of CPUs supporting
> write alloc and/or a problem block driver.) I've even been accused of
> being uncooperative over testing possible bug fixes by various people
> (if I don't have hardware which can show the problem, how can I test
> possible fixes?) So I've given up with that issue - as far as I'm
> concerned, it's a problem for others to sort out.
>
> Do we know what hardware, which IO drivers are being used, and any
> relevent configuration of the drivers?
>

2008-11-20 12:28:21

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

2008/11/20 Nick Piggin <[email protected]>:
>
> [ ... ]
>
> - The page is sent to the block layer, which stores into the page. Some
> block devices like 'brd' will potentially store via the kernel linear map
> here, and they probably don't do enough cache flushing.

btw., if someone is curious, here is another case of what may happen
on VIPT systems when someone uses a "virtual" block device (like
'brd') as, heh, a swap :-)

http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html


--
Best regards,
Dmitry Adamushko

2008-11-20 13:26:10

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thursday 20 November 2008 23:28, Dmitry Adamushko wrote:
> 2008/11/20 Nick Piggin <[email protected]>:
> > [ ... ]
> >
> > - The page is sent to the block layer, which stores into the page. Some
> > block devices like 'brd' will potentially store via the kernel linear
> > map here, and they probably don't do enough cache flushing.
>
> btw., if someone is curious, here is another case of what may happen
> on VIPT systems when someone uses a "virtual" block device (like
> 'brd') as, heh, a swap :-)
>
> http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html

Right... Now I'm lacking knowledge when it comes to devices, but I
think it is probably reasonable for the block device layer to ensure
the physical memory is uptodate after it signals request completion.

That is, there shouldn't be any potentially aliasing dirty lines.
Block devices which do any writeout via the kernel linear address
(eg. brd) should do a flush_dcache_page.
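
A minimal sketch of that suggestion, assuming a hypothetical RAM-backed
driver (this is not brd's actual code, and the function name is made up
for illustration):

#include <linux/highmem.h>
#include <linux/string.h>

/* A ramdisk-style device that fills a page cache page through a kernel
 * mapping should flush that page before completing the request, so no
 * dirty kernel-alias lines shadow what userspace later reads. */
static void ramdisk_copy_to_page(struct page *page, unsigned int offset,
				 const void *src, unsigned int len)
{
	void *dst = kmap_atomic(page, KM_USER0);	/* 2.6.18-era API */

	memcpy(dst + offset, src, len);
	kunmap_atomic(dst, KM_USER0);

	flush_dcache_page(page);	/* resolve D-cache aliasing before completion */
}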

2008-11-20 13:56:31

by Ralf Baechle

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Fri, Nov 21, 2008 at 12:25:39AM +1100, Nick Piggin wrote:

> > > - The page is sent to the block layer, which stores into the page. Some
> > > block devices like 'brd' will potentially store via the kernel linear
> > > map here, and they probably don't do enough cache flushing.
> >
> > btw., if someone is curious, here is another case of what may happen
> > on VIPT systems when someone uses a "virtual" block device (like
> > 'brd') as, heh, a swap :-)
> >
> > http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html
>
> Right... Now I'm lacking knowledge when it comes to devices, but I
> think it is probably reasonable for the block device layer to ensure
> the physical memory is uptodate after it signals request completion.
>
> That is, there shouldn't be any potentially aliasing dirty lines.
> Block devices which do any writeout via the kernel linear address
> (eg. brd) should do a flush_dcache_page.

It's better to avoid aliases than to deal with them by flushing. One way
to avoid aliases: whenever a page is mapped to userspace, create the
kernel's mapping at a carefully chosen address that doesn't alias. On
architectures with software-reloaded TLBs such as MIPS that's very cheap,
and the entire cacheflush with all its associated pains can go away.
Right now MIPS uses such a mechanism:

void *kmap_coherent(struct page *page, unsigned long addr);
void kunmap_coherent(void);

within the architecture-private implementation, but it could be used beyond
that, probably on all architectures, though I know that there would be some
solvable issues on PARISC. It is lightweight, with no ordering constraints
between kernel and userspace accesses, so also no locking needed.
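
A hypothetical usage sketch of that interface, using only the two
prototypes quoted above (this is not actual MIPS code, and the helper
name is made up):

#include <linux/mm.h>
#include <linux/string.h>

/* Prototypes as quoted above (MIPS architecture-private). */
void *kmap_coherent(struct page *page, unsigned long addr);
void kunmap_coherent(void);

/* Copy into a page through a temporary kernel mapping whose colour matches
 * the user address 'uaddr', so the stores land on the same cache lines the
 * user mapping will read - no flush needed afterwards. */
static void copy_to_page_coherent(struct page *page, unsigned long uaddr,
				  const void *src, size_t len)
{
	void *vto = kmap_coherent(page, uaddr);

	memcpy(vto, src, len);
	kunmap_coherent();
}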

Does this look like a possible avenue?

Ralf

2008-11-20 13:59:25

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

2008/11/20 Nick Piggin <[email protected]>:
> On Thursday 20 November 2008 23:28, Dmitry Adamushko wrote:
>> 2008/11/20 Nick Piggin <[email protected]>:
>> > [ ... ]
>> >
>> > - The page is sent to the block layer, which stores into the page. Some
>> > block devices like 'brd' will potentially store via the kernel linear
>> > map here, and they probably don't do enough cache flushing.
>>
>> btw., if someone is curious, here is another case of what may happen
>> on VIPT systems when someone uses a "virtual" block device (like
>> 'brd') as, heh, a swap :-)
>>
>> http://www.linux-mips.org/archives/linux-mips/2008-11/msg00038.html
>
> Right... Now I'm lacking knowledge when it comes to devices, but I
> think it is probably reasonable for the block device layer to ensure
> the physical memory is uptodate after it signals request completion.
>
> That is, there shouldn't be any potentially aliasing dirty lines.
> Block devices which do any writeout via the kernel linear address
> (eg. brd) should do a flush_dcache_page.
>

Yeah, but my point is that flush_dcache_page() in its current
incarnation will _not_ always match the expectations of such block
devices (e.g. brd or compcache) when those devices are used as "swap"
:-)

IOW, although flush_dcache_page() is in place, it won't work as expected.

--
Best regards,
Dmitry Adamushko

2008-11-20 15:27:50

by Matthew Wilcox

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 01:55:58PM +0000, Ralf Baechle wrote:
> It's better to avoid aliases than dealing with them by flushing. A way to
> avoid aliases whenever a page is mapped to userspace, one creates a mapping
> at a carefully choosen address that doesn't alias. On architectures with
> software reload TLBs such as MIPS that's very cheap and the entire
> cacheflush with all it's associated pains can go away. Right now MIPS uses
> such a mechanism:
>
> void *kmap_coherent(struct page *page, unsigned long addr);
> void kunmap_coherent(void);
>
> within the architecture private implementation but it could be use beyond
> that, probably on all architectures though I know that there would be some
> solvable issues on PARISC. Lightweight, no ordering constraints between
> kernel and userspace accesses, so also no locking needed.

I'm not quite sure why you need kmap_coherent(). If a page is mapped into
userspace, you can find what address it's mapped to from
page->mapping->i_mmap and page->index. OTOH, that's potentially
expensive since you need to grab the spinlock, and unless you have all
user addresses coherent with each other (like parisc does), you need to
figure out which process to be coherent with.

I know James Bottomley did an experiment (and did an OLS presentation
...) on unmapping the entire page cache and greatly expanding the kmap
area to do just this kind of thing. I think he even got a speedup.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-20 16:34:49

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 01:55:58PM +0000, Ralf Baechle wrote:
> It's better to avoid aliases than dealing with them by flushing. A way to
> avoid aliases whenever a page is mapped to userspace, one creates a mapping
> at a carefully choosen address that doesn't alias. On architectures with
> software reload TLBs such as MIPS that's very cheap and the entire
> cacheflush with all it's associated pains can go away. Right now MIPS uses
> such a mechanism:
>
> void *kmap_coherent(struct page *page, unsigned long addr);
> void kunmap_coherent(void);

Yes, we effectively already do that for our copy_user_highpage/
clear_user_highpage (note: we really want to override both of those
functions, not just one, which is how the code in include/linux/highmem.h
is currently set up. And I wish that V4L didn't use the non-highmem
clear_user_page() call...)

Why? On ARM, we're starting to support highmem on VIVT and VIPT, and
having *_user_highpage() kmap, and then having *_user_page() set up
yet another mapping for VIPT caches, is just silly.

Which reminds me that I need to get a patch to make clear_user_highpage()
overridable out on the relevant lists...

> within the architecture private implementation but it could be use beyond
> that, probably on all architectures though I know that there would be some
> solvable issues on PARISC. Lightweight, no ordering constraints between
> kernel and userspace accesses, so also no locking needed.

The "no locking needed" depends. If we're going to be asking drivers
to copy use such an interface in order to copy data into page cache
pages, you can't have two drivers calling it from interrupt context
without some form of locking.

Nor if you're SMP and you don't have the mappings setup per-CPU.

However, as davem has said in the past, the result of a device reading
or writing the page cache should result in the same conditions no
matter how the read or write is actually done. That's only common
sense. So ensuring that device PIO accesses are done with the same
cache colour as userspace results in a difference between PIO and DMA,
one which could potentially bite if we end up reading from the kernel
mapping of those pages first.

If we extend the argument that "everything must be the same cache colour"
then we need cache colouring when allocating page cache pages - since
page cache pages are part of the kernel direct mapped memory, their
colour is already fixed from the time that you've set up the initial
kernel memory mappings. In ARM's case of an aliasing VIPT cache, that
means that VA [13:12] must equal PA [13:12] for every single mapping
no matter where...

That would be nice, since we don't then have to worry about aliases
between kernel and userspace anymore, so we don't have to play these
remapping games in kernel space. ;)

2008-11-20 17:18:00

by Ralf Baechle

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:

> I'm not quite sure why you need kmap_coherent(). If a page is mapped into
> userspace, you can find what address it's mapped to from
> page->mapping->i_mmap and page->index. OTOH, that's potentially

Even if we know the userspace address of a page we do not necessarily have
a usable mapping for kernel purposes. The userspace mapping might be r/o
when we need r/w or it might be in another process. kmap_coherent takes
the job of creating a r/w mapping on a suitable kernel virtual address
that will avoid any aliases.

> page->mapping->i_mmap and page->index. OTOH, that's potentially
> expensive since you need to grab the spinlock, and unless you have all
> user addresses coherent with each other (like parisc does), you need to
> figure out which process to be coherent with.

Having all userspace addresses of a page across all processes coherent with
each other is the only practicable solution in Linux; at least I don't see
how a platform could otherwise sanely handle userspace-userspace aliases
within the current kernel framework. So we're talking about extending
this to cover userspace-kernelspace aliases.
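
(For reference, this userspace-userspace coherence is commonly achieved by
colour-aligning shared user mappings when an address is chosen; a sketch of
that idea follows, with SHMLBA as the colouring window. It is a generic
illustration, not any particular architecture's code.)

#include <linux/mm.h>
#include <asm/shmparam.h>	/* SHMLBA: the cache-colouring window */

/* Pick a user address for a shared mapping such that its offset within the
 * SHMLBA window matches the file offset's, so every user mapping of a given
 * page ends up on the same cache colour. */
static unsigned long colour_align(unsigned long addr, unsigned long pgoff)
{
	unsigned long base = (addr + SHMLBA - 1) & ~(SHMLBA - 1);
	unsigned long off  = (pgoff << PAGE_SHIFT) & (SHMLBA - 1);

	return base + off;
}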

The original reason for the introduction of kmap_coherent was avoiding
a cache alias when a multi-threaded process forks. The issue was
debated on lkml in 2006 as part of my submission of a patchset under the
subject of "Fix COW D-cache aliasing on fork". The description is somewhat
lengthy so I omit it here.

One of the ugly parts of kmap_coherent() is that it cannot be used safely
if the page has been marked as dirty by flush_dcache_page(); the callers
know about this and deal with it.

> I know James Bottomley did an experiment (and did an OLS presentation
> ...) on unmapping the entire page cache and greatly expanding the kmap
> area to do just this kind of thing. I think he even got a speedup.

The speedup is no surprise.

Ralf

2008-11-20 17:41:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 05:17:14PM +0000, Ralf Baechle wrote:
> On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:
> > I'm not quite sure why you need kmap_coherent(). If a page is mapped into
> > userspace, you can find what address it's mapped to from
> > page->mapping->i_mmap and page->index. OTOH, that's potentially
>
> Even if we know the userspace address of a page we do not necessarily have
> a usable mapping for kernel purposes. The userspace mapping might be r/o
> when we need r/w or it might be in another process. kmap_coherent takes
> the job of creating a r/w mapping on a suitable kernel virtual address
> that will avoid any aliases.

Ah, I didn't do a good enough job of explaining. My question was why
you needed the 'address' argument to kmap_coherent (and thus had a
different prototype from kmap) instead of just implementing kmap() on
mips.

> The original reason for the introduction of kmap_coherent was avoiding
> a cache alias in when a multi-threaded process forks. The issue has been
> debated on lkml in 2006 as part of my submission of a patchset under the
> subject of "Fix COW D-cache aliasing on fork". The description is somewhat
> lengthy so I omit it here.
>
> One of the ugly parts of kmap_coherent() is that it cannot be used safely
> if the page has been marked as dirty by flush_dcache_page(); the callers
> know about this and deal with it.

If you fix kmap() to only give you addresses that are coherent with
userspace, you no longer need flush_dcache_page().

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-20 19:43:10

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, Nov 20, 2008 at 10:40:54AM -0700, Matthew Wilcox wrote:
> On Thu, Nov 20, 2008 at 05:17:14PM +0000, Ralf Baechle wrote:
> > On Thu, Nov 20, 2008 at 08:27:19AM -0700, Matthew Wilcox wrote:
> > > I'm not quite sure why you need kmap_coherent(). If a page is mapped into
> > > userspace, you can find what address it's mapped to from
> > > page->mapping->i_mmap and page->index. OTOH, that's potentially
> >
> > Even if we know the userspace address of a page we do not necessarily have
> > a usable mapping for kernel purposes. The userspace mapping might be r/o
> > when we need r/w or it might be in another process. kmap_coherent takes
> > the job of creating a r/w mapping on a suitable kernel virtual address
> > that will avoid any aliases.
>
> Ah, I didn't do a good enough job of explaining. My question was why
> you needed the 'address' argument to kmap_coherent (and thus had a
> different prototype from kmap) instead of just implementing kmap() on
> mips.

Ok, having discussed this with Matthew, I'm beginning to really warm
to the idea of always kmapping page cache accesses.

That nicely means that we never touch the page cache via the kernel
direct mapping, which in turn means we can reduce the amount of RAM
mapped, freeing up space for kmap in the virtual address space.

I think this should work really nicely, and make flush_dcache_page()
almost a no-op for aliasing VIPT caches (on ARM, we still have a
problem with instruction/data cache aliasing, but that's a far easier
issue to solve.)

Ralf: were you saying that you already had an implementation similar
to Matthew's? As I see it, if you use page->index to determine the
userspace mapping colouring (which we do on ARM), then kmap() already
has access to the necessary information to correctly place the page...
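
A rough sketch of that idea, assuming the colour really is a pure function
of page->index (NR_COLOURS and the helper are made-up names, not existing
kernel code):

#include <linux/mm.h>

#define NR_COLOURS	4	/* e.g. a 16KB aliasing VIPT cache with 4KB pages */

/* For a file-backed page the userspace colour is fixed by its file offset,
 * so a colour-aware kmap() could derive it from page->index alone and then
 * choose a kernel virtual slot of the same colour for the mapping. */
static inline unsigned int pagecache_colour(struct page *page)
{
	return page->index & (NR_COLOURS - 1);
}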

2008-11-21 17:10:48

by Catalin Marinas

[permalink] [raw]
Subject: Re: O_DIRECT patch for processors with VIPT cache for mainline kernel (specifically arm in our case)

On Thu, 2008-11-20 at 09:19 +0000, Russell King - ARM Linux wrote:
> On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> > - The page is sent to the block layer, which stores into the page. Some
> > block devices like 'brd' will potentially store via the kernel linear map
> > here, and they probably don't do enough cache flushing. But a regular
> > block device should go via DMA, which AFAIK should be OK? (the user address
> > should remain invalidated because it would be a bug to read from the buffer
> > before the read has completed)
>
> This is where things get icky with lots of drivers - DMA is fine, but
> many PIO based drivers don't handle the implications of writing to the
> kernel page cache page when there may be CPU cache side effects.

And for PIO devices, what cache flushing function should be used if the
page isn't available (maybe just a pointer to a buffer) to do a
flush_dcache_page? Should this be done in the filesystem layer?

> If the cache is in read allocate mode, then in this case there shouldn't
> be any dirty cache lines. (That's not always the case though, esp. via
> conventional IO.) If the cache is in write allocate mode, PIO data will
> sit in the kernel mapping and won't be visible to userspace.

This problem re-appeared in our tests when someone started using an ext2
filesystem on mtd+slram with write-allocate caches. Basically, the D-cache
isn't flushed to ensure its coherency with the I-cache.

The very dirty workaround was to use flush_icache_range (I know, it's
meant for something completely different) in mtd_blkdevs.c. slram.c
doesn't have access to any page information either, and I noticed that
there are very few block devices that call flush_dcache_page directly.
Should this be done by the filesystem layer?

It looks to me like some filesystems like cramfs call flush_dcache_page
in their "readpage" functions but ext2 doesn't. My other, less dirty,
workaround for the mtd+slram problem is below. It appears to solve the
problem though I'm not sure it is the right solution (ext2 uses
mpage_readpages).

diff --git a/fs/mpage.c b/fs/mpage.c
index 552b80b..979a4a9 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -51,6 +51,7 @@ static void mpage_end_io_read(struct bio *bio, int err)
 			prefetchw(&bvec->bv_page->flags);
 
 		if (uptodate) {
+			flush_dcache_page(page);
 			SetPageUptodate(page);
 		} else {
 			ClearPageUptodate(page);


Thanks.

--
Catalin