2008-03-14 00:16:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

Hi,

"Yinghai Lu" <[email protected]> writes:

> Index: linux-2.6/mm/bootmem.c
> ===================================================================
> --- linux-2.6.orig/mm/bootmem.c
> +++ linux-2.6/mm/bootmem.c
> @@ -111,44 +111,69 @@ static unsigned long __init init_bootmem
> * might be used for boot-time allocations - or it might get added
> * to the free page pool later on.
> */
> -static int __init reserve_bootmem_core(bootmem_data_t *bdata,
> +static int __init can_reserve_bootmem_core(bootmem_data_t *bdata,
> unsigned long addr, unsigned long size, int flags)
> {
> unsigned long sidx, eidx;
> unsigned long i;
> - int ret;
> +
> + BUG_ON(!size);
> +
> + /* out of range, don't hold other */
> + if (addr >= bdata->node_boot_start && addr < bdata->last_success)
> + return 0;
>
> /*
> - * round up, partially reserved pages are considered
> - * fully reserved.
> + * Round up to index to the range.
> */
> + if (addr > bdata->node_boot_start)
> + sidx= PFN_DOWN(addr - bdata->node_boot_start);
> + else
> + sidx = 0;
> +
> + eidx = PFN_UP(addr + size - bdata->node_boot_start);
> + if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
> + eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
> +
> + for (i = sidx; i < eidx; i++)
> + if (test_bit(i, bdata->node_bootmem_map)) {
> + if (flags & BOOTMEM_EXCLUSIVE)
> + return -EBUSY;
> + }
> +
> + return 0;
> +
> +}
> +static void __init reserve_bootmem_core(bootmem_data_t *bdata,
> + unsigned long addr, unsigned long size, int flags)
> +{
> + unsigned long sidx, eidx;
> + unsigned long i;
> +
> BUG_ON(!size);
> - BUG_ON(PFN_DOWN(addr) >= bdata->node_low_pfn);
> - BUG_ON(PFN_UP(addr + size) > bdata->node_low_pfn);
> - BUG_ON(addr < bdata->node_boot_start);
>
> - sidx = PFN_DOWN(addr - bdata->node_boot_start);
> + /* out of range */
> + if (addr >= bdata->node_boot_start && addr < bdata->last_success)
> + return;
> +
> + /*
> + * Round up to index to the range.
> + */
> + if (addr > bdata->node_boot_start)
> + sidx= PFN_DOWN(addr - bdata->node_boot_start);
> + else
> + sidx = 0;
> +
> eidx = PFN_UP(addr + size - bdata->node_boot_start);
> + if (eidx > bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start))
> + eidx = bdata->node_low_pfn - PFN_DOWN(bdata->node_boot_start);
>
> for (i = sidx; i < eidx; i++)
> if (test_and_set_bit(i, bdata->node_bootmem_map)) {
> #ifdef CONFIG_DEBUG_BOOTMEM
> printk("hm, page %08lx reserved twice.\n", i*PAGE_SIZE);
> #endif
> - if (flags & BOOTMEM_EXCLUSIVE) {
> - ret = -EBUSY;
> - goto err;
> - }
> }
> -
> - return 0;
> -
> -err:
> - /* unreserve memory we accidentally reserved */
> - for (i--; i >= sidx; i--)
> - clear_bit(i, bdata->node_bootmem_map);
> -
> - return ret;
> }
>
> static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
> @@ -407,6 +432,11 @@ unsigned long __init init_bootmem_node(p
> void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
> unsigned long size, int flags)
> {
> + int ret;
> +
> + ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
> + if (ret < 0)
> + return;
> reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);

I don't get it. Sorry. What is the purpose of
can_reserve_bootmem_core()? It does exactly what reserve_bootmem_core
does besides actually setting the bits. All the pre-checking you wanted
to have out of the way is repeated again in reserve_bootmem_core()
(well, almost all).

Hannes

2008-03-14 01:04:17

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, Mar 13, 2008 at 5:13 PM, Johannes Weiner <[email protected]> wrote:
> Hi,
>
> "Yinghai Lu" <[email protected]> writes:
>

> > static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
> > @@ -407,6 +432,11 @@ unsigned long __init init_bootmem_node(p
> > void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
> > unsigned long size, int flags)
> > {
> > + int ret;
> > +
> > + ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
> > + if (ret < 0)
> > + return;
> > reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
>
> I don't get it. Sorry. What is the purpose of
> can_reserve_bootmem_core()? It does exactly what reserve_bootmem_core
> does besides actually setting the bits. All the pre-checking you wanted
> to have out of the way is repeated again in reserve_bootmem_core()
> (well, almost all).

can_reserve_bootmem_core is check if there is some pages is reserved
already with
> > + if (flags & BOOTMEM_EXCLUSIVE)
> > + return -EBUSY;

so it will avoid the restoring later.

YH

2008-03-14 01:30:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, 13 Mar 2008 16:45:42 -0700
"Yinghai Lu" <[email protected]> wrote:
> int __init reserve_bootmem(unsigned long addr, unsigned long size,
> int flags)
> {
> - return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
> + int ret;
> + bootmem_data_t *bdata;
> + list_for_each_entry(bdata, &bdata_list, list) {
> + ret = can_reserve_bootmem_core(bdata, addr, size, flags);
> + if (ret < 0)
> + return ret;
> + }
> + list_for_each_entry(bdata, &bdata_list, list)
> + reserve_bootmem_core(bdata, addr, size, flags);
> + return 0;
> }

why list_for_each twice ?

Thanks,
-Kame

2008-03-14 01:36:11

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, Mar 13, 2008 at 6:34 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 13 Mar 2008 16:45:42 -0700
> "Yinghai Lu" <[email protected]> wrote:
> > int __init reserve_bootmem(unsigned long addr, unsigned long size,
> > int flags)
> > {
> > - return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
> > + int ret;
> > + bootmem_data_t *bdata;
> > + list_for_each_entry(bdata, &bdata_list, list) {
> > + ret = can_reserve_bootmem_core(bdata, addr, size, flags);
> > + if (ret < 0)
> > + return ret;
> > + }
> > + list_for_each_entry(bdata, &bdata_list, list)
> > + reserve_bootmem_core(bdata, addr, size, flags);
> > + return 0;
> > }
>
> why list_for_each twice ?

first_for_each only check if we can reserve that.
second will do the reserve job.

Thanks

Yinghai Lu

2008-03-14 01:51:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, 13 Mar 2008 18:35:51 -0700
"Yinghai Lu" <[email protected]> wrote:

> On Thu, Mar 13, 2008 at 6:34 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Thu, 13 Mar 2008 16:45:42 -0700
> > "Yinghai Lu" <[email protected]> wrote:
> > > int __init reserve_bootmem(unsigned long addr, unsigned long size,
> > > int flags)
> > > {
> > > - return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
> > > + int ret;
> > > + bootmem_data_t *bdata;
> > > + list_for_each_entry(bdata, &bdata_list, list) {
> > > + ret = can_reserve_bootmem_core(bdata, addr, size, flags);
> > > + if (ret < 0)
> > > + return ret;
> > > + }
> > > + list_for_each_entry(bdata, &bdata_list, list)
> > > + reserve_bootmem_core(bdata, addr, size, flags);
> > > + return 0;
> > > }
> >
> > why list_for_each twice ?
>
> first_for_each only check if we can reserve that.
> second will do the reserve job.
>
Hmm, can a call to reserve_bootmem() return a memory region which spread across
prural nodes ?


Thanks,
-Kame

2008-03-14 02:00:43

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, Mar 13, 2008 at 6:55 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 13 Mar 2008 18:35:51 -0700
>
> "Yinghai Lu" <[email protected]> wrote:
>
> > On Thu, Mar 13, 2008 at 6:34 PM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> > > On Thu, 13 Mar 2008 16:45:42 -0700
> > > "Yinghai Lu" <[email protected]> wrote:
> > > > int __init reserve_bootmem(unsigned long addr, unsigned long size,
> > > > int flags)
> > > > {
> > > > - return reserve_bootmem_core(NODE_DATA(0)->bdata, addr, size, flags);
> > > > + int ret;
> > > > + bootmem_data_t *bdata;
> > > > + list_for_each_entry(bdata, &bdata_list, list) {
> > > > + ret = can_reserve_bootmem_core(bdata, addr, size, flags);
> > > > + if (ret < 0)
> > > > + return ret;
> > > > + }
> > > > + list_for_each_entry(bdata, &bdata_list, list)
> > > > + reserve_bootmem_core(bdata, addr, size, flags);
> > > > + return 0;
> > > > }
> > >
> > > why list_for_each twice ?
> >
> > first_for_each only check if we can reserve that.
> > second will do the reserve job.
> >
> Hmm, can a call to reserve_bootmem() return a memory region which spread across
> prural nodes ?

Yes. there is one reserve_bootmem in reserve_crashkernel. and position
is passed via commandline...
there is some chance for ramdisk too. it is called via reserve_bootmem_generic.

YH

2008-03-14 02:33:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, 13 Mar 2008 19:00:33 -0700
"Yinghai Lu" <[email protected]> wrote:
> > Hmm, can a call to reserve_bootmem() return a memory region which spread across
> > prural nodes ?
>
> Yes. there is one reserve_bootmem in reserve_crashkernel. and position
> is passed via commandline...
> there is some chance for ramdisk too. it is called via reserve_bootmem_generic.
>
Hmm, ok. I finally see your point.

please allow me to make clarification

Background:
- reserve_bootmem is called for *reserve* memory before bootmem allocator.
- It specifies <address, size>.
- Because <addrees, size> are argments, memory hole in <address, size> is not
problem here.

Before change,
- reseve_bootmem() only works for Node(0)

After change
- reserve_bootmem() works on sutiable nodes for <address, size>
- It can spread accross among prural nodes.

thank you for your work.

Regards,
-Kame

2008-03-14 02:53:27

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

Hi,

"Yinghai Lu" <[email protected]> writes:

> On Thu, Mar 13, 2008 at 5:13 PM, Johannes Weiner <[email protected]> wrote:
>> Hi,
>>
>> "Yinghai Lu" <[email protected]> writes:
>>
>
>> > static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
>> > @@ -407,6 +432,11 @@ unsigned long __init init_bootmem_node(p
>> > void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
>> > unsigned long size, int flags)
>> > {
>> > + int ret;
>> > +
>> > + ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
>> > + if (ret < 0)
>> > + return;
>> > reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
>>
>> I don't get it. Sorry. What is the purpose of
>> can_reserve_bootmem_core()? It does exactly what reserve_bootmem_core
>> does besides actually setting the bits. All the pre-checking you wanted
>> to have out of the way is repeated again in reserve_bootmem_core()
>> (well, almost all).
>
> can_reserve_bootmem_core is check if there is some pages is reserved
> already with
>> > + if (flags & BOOTMEM_EXCLUSIVE)
>> > + return -EBUSY;
>
> so it will avoid the restoring later.

Yes, I understood that. But you skipped the lower part of my email:

Your current state now is _not_ that you have one function that
prechecks the range and another function that reserves it! You have
_two_ functions checking the range and the second reserving it.

Why double-check most of the things? If you want to have a pre-check
function, _move_ all the pre-checks into another function, not
copy-paste them.

And is the condition of trying to reserve a range twice, the second time
exclusively, so common that it is worth iterating twice over the nodes
(once for checking, once for reserving) instead of just unwinding the
reservation if it fails in between?

On something else: is there a bug when a memory range is reserved with
BOOTMEM_EXCLUSIVE and then again without this flag? The second call
does not return an error then. Is my understanding of
BOOTMEM_EXCLUSIVE's semantics wrong?

Hannes

2008-03-14 03:00:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

Hi,

KAMEZAWA Hiroyuki <[email protected]> writes:

> Background:
> - reserve_bootmem is called for *reserve* memory before bootmem allocator.
> - It specifies <address, size>.
> - Because <addrees, size> are argments, memory hole in <address, size> is not
> problem here.
>
> Before change,
> - reseve_bootmem() only works for Node(0)
>
> After change
> - reserve_bootmem() works on sutiable nodes for <address, size>
> - It can spread accross among prural nodes.

After the change it will iterate over all nodes, reserving the range
`address to address+size' on each of them.

Hannes

2008-03-14 03:03:48

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, Mar 13, 2008 at 7:58 PM, Johannes Weiner <[email protected]> wrote:
> Hi,
>
>
> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > Background:
> > - reserve_bootmem is called for *reserve* memory before bootmem allocator.
> > - It specifies <address, size>.
> > - Because <addrees, size> are argments, memory hole in <address, size> is not
> > problem here.
> >
> > Before change,
> > - reseve_bootmem() only works for Node(0)
> >
> > After change
> > - reserve_bootmem() works on sutiable nodes for <address, size>
> > - It can spread accross among prural nodes.
>
> After the change it will iterate over all nodes, reserving the range
> `address to address+size' on each of them.

+ /* out of range */
+ if (addr >= bdata->node_boot_start && addr < bdata->last_success)
+ return;

out of range will bail out...

YH

2008-03-14 03:09:26

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

On Thu, Mar 13, 2008 at 7:50 PM, Johannes Weiner <[email protected]> wrote:
>
> Hi,
>
> "Yinghai Lu" <[email protected]> writes:
>
> > On Thu, Mar 13, 2008 at 5:13 PM, Johannes Weiner <[email protected]> wrote:
> >> Hi,
> >>
> >> "Yinghai Lu" <[email protected]> writes:
> >>
> >
> >> > static void __init free_bootmem_core(bootmem_data_t *bdata, unsigned long addr,
> >> > @@ -407,6 +432,11 @@ unsigned long __init init_bootmem_node(p
> >> > void __init reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
> >> > unsigned long size, int flags)
> >> > {
> >> > + int ret;
> >> > +
> >> > + ret = can_reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
> >> > + if (ret < 0)
> >> > + return;
> >> > reserve_bootmem_core(pgdat->bdata, physaddr, size, flags);
> >>
> >> I don't get it. Sorry. What is the purpose of
> >> can_reserve_bootmem_core()? It does exactly what reserve_bootmem_core
> >> does besides actually setting the bits. All the pre-checking you wanted
> >> to have out of the way is repeated again in reserve_bootmem_core()
> >> (well, almost all).
> >
> > can_reserve_bootmem_core is check if there is some pages is reserved
> > already with
> >> > + if (flags & BOOTMEM_EXCLUSIVE)
> >> > + return -EBUSY;
> >
> > so it will avoid the restoring later.
>
> Yes, I understood that. But you skipped the lower part of my email:
>
> Your current state now is _not_ that you have one function that
> prechecks the range and another function that reserves it! You have
> _two_ functions checking the range and the second reserving it.

Yes

>
> Why double-check most of the things? If you want to have a pre-check
> function, _move_ all the pre-checks into another function, not
> copy-paste them.

for cross the nodes

>
> And is the condition of trying to reserve a range twice, the second time
> exclusively, so common that it is worth iterating twice over the nodes
> (once for checking, once for reserving) instead of just unwinding the
> reservation if it fails in between?
>
> On something else: is there a bug when a memory range is reserved with
> BOOTMEM_EXCLUSIVE and then again without this flag? The second call
> does not return an error then.

Yes. only provide one direction protection.
so always make sure crash_kernel reserve_bootmem as the last. that is doable.

YH

2008-03-15 03:11:15

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: make reserve_bootmem can crossed the nodes

Hi,

"Yinghai Lu" <[email protected]> writes:

> On Thu, Mar 13, 2008 at 7:58 PM, Johannes Weiner <[email protected]> wrote:
>> Hi,
>>
>>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > Background:
>> > - reserve_bootmem is called for *reserve* memory before bootmem allocator.
>> > - It specifies <address, size>.
>> > - Because <addrees, size> are argments, memory hole in <address, size> is not
>> > problem here.
>> >
>> > Before change,
>> > - reseve_bootmem() only works for Node(0)
>> >
>> > After change
>> > - reserve_bootmem() works on sutiable nodes for <address, size>
>> > - It can spread accross among prural nodes.
>>
>> After the change it will iterate over all nodes, reserving the range
>> `address to address+size' on each of them.
>
> + /* out of range */
> + if (addr >= bdata->node_boot_start && addr < bdata->last_success)
> + return;
>
> out of range will bail out...

Sorry, I did not see that.

Hannes