2006-09-22 04:02:23

by Christoph Lameter

[permalink] [raw]
Subject: [RFC] Initial alpha-0 for new page allocator API

We have repeatedly discussed the problems of devices having varying
address range requirements for doing DMA. We would like for the device
drivers to have the ability to specify exactly which address range is
allowed. Also the NUMA guys would like the ability to specify NUMA-related
information.

So I have put all the important allocation information in a struct
allocation_control and am trying to get a new API developed that will fit
our needs for better control over allocations. Discussion of the exact
implementation needed in the page allocator to supply pages fulfilling
the specified criteria is probably best deferred until we have a
reasonable API.

The implementation given here uses the existing page_allocator in order to
define the behavior of the new functions. The free functions have an _a_
and a _some_ in their names to avoid clashes; these will be removed later.

This is only a basis for discussion. Once we agree on the API then I will
implement that API with minimal effort on top of the existing page
allocator and then we can try to see how a device driver would be working
with this. I envision that we would have one allocation_control structure
in the task structure to control the allocation of pages for a process.
This should allow us to move the memory policy related information into
the allocation_control structure. I would also expect a device driver
to have an allocation_control structure somewhere, with parameters set
up so that allocations using it yield pages satisfying the device's
requirements.

Index: linux-2.6.18-rc6-mm1/include/linux/allocator.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc6-mm1/include/linux/allocator.h 2006-09-21 20:48:30.000000000 -0700
@@ -0,0 +1,35 @@
+/*
+ * Necessary definitions to perform allocations
+ */
+
+#include <linux/mm.h>
+
+struct allocation_control {
+ gfp_t flags;
+ int order;
+#ifdef CONFIG_NUMA
+ int node;
+ struct mempolicy *mpol;
+#endif
+ unsigned long low_boundary;
+ unsigned long high_boundary;
+};
+
+/*
+ * Functions to allocate memory in units of PAGE_SIZE
+ */
+struct page *allocate_pages(struct allocation_control *ac,
+ gfp_t additional_flags);
+
+/*
+ * Free a single page
+ * (which may be of higher order if allocated with GFP_COMP set)
+ */
+void free_a_page(struct page *);
+
+/*
+ * Free a page as allocated with the given allocation control.
+ * This is needed if higher order pages were allocated without GFP_COMP set.
+ */
+void free_some_pages(struct page *, struct allocation_control *ac);
+
Index: linux-2.6.18-rc6-mm1/mm/page_allocator.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc6-mm1/mm/page_allocator.c 2006-09-21 20:49:37.000000000 -0700
@@ -0,0 +1,37 @@
+/*
+ * Standard Page Allocator Definitions
+ */
+#include <linux/allocator.h>
+
+struct page *allocate_pages(struct allocation_control *ac,
+ gfp_t additional_flags)
+{
+ gfp_t gfp_flags = additional_flags | ac->flags;
+
+#ifdef CONFIG_ZONE_DMA32
+ if (ac->high_boundary < MAX_DMA32_ADDRESS)
+ gfp_flags |= __GFP_DMA32;
+ else
+#endif
+#ifdef CONFIG_ZONE_DMA
+ if (ac->high_boundary < MAX_DMA_ADDRESS)
+ gfp_flags |= GFP_DMA;
+#endif
+
+#ifdef CONFIG_NUMA
+ if (ac->node != -1)
+ return alloc_pages_node(ac->node, gfp_flags, ac->order);
+#endif
+
+ return alloc_pages(gfp_flags, ac->order);
+}
+
+void free_a_page(struct page *page)
+{
+ __free_pages(page, 0);
+}
+
+void free_some_pages(struct page *page, struct allocation_control *ac)
+{
+ __free_pages(page, ac->order);
+}
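
To illustrate, a driver might use the proposed API roughly like this
(just a sketch; the 2GB limit and the names are made up):

	static struct allocation_control dev_ac = {
		.flags = GFP_KERNEL,
		.order = 0,
		.low_boundary = 0,
		.high_boundary = 0x80000000,	/* device only reaches 2GB */
	};

	struct page *page = allocate_pages(&dev_ac, __GFP_ZERO);
	if (page) {
		/* set up DMA to the page ... */
		free_some_pages(page, &dev_ac);
	}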


2006-09-22 06:18:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday 22 September 2006 06:02, Christoph Lameter wrote:
> We have repeatedly discussed the problems of devices having varying
> address range requirements for doing DMA.

We already have such an API. dma_alloc_coherent(). Device drivers
are not supposed to mess with GFP_DMA* directly anymore for quite
some time.

> We would like for the device
> drivers to have the ability to specify exactly which address range is
> allowed.

I actually have my doubts it is a good idea to add that now. The devices
with weird requirements are steadily going away.

-Andi

2006-09-22 16:35:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 22 Sep 2006, Andi Kleen wrote:

> On Friday 22 September 2006 06:02, Christoph Lameter wrote:
> > We have repeatedly discussed the problems of devices having varying
> > address range requirements for doing DMA.
>
> We already have such an API. dma_alloc_coherent(). Device drivers
> are not supposed to mess with GFP_DMA* directly anymore for quite
> some time.

Device drivers need to be able to indicate ranges of addresses that may be
different from ZONE_DMA. This is an attempt to come up with a future
scheme that no longer relies on device drivers referring to zones.

> > We would like for the device
> > drivers to have the ability to specify exactly which address range is
> > allowed.
>
> I actually have my doubts it is a good idea to add that now. The devices
> with weird requirements are steadily going away

Hmm.... Martin?

2006-09-22 17:36:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

The problems to be solved are:

1. Have a means to allocate from a range of memory that is defined by the
device driver and *not* by the architecture. Devices currently cannot rely
on GFP_DMA because the range varies according to the architecture.

2. I wish there were some point in the future where we could get
rid of GFP_DMAxx... As Andi notes, most hardware these days is sane, so
there is less need to create VM overhead by managing additional zones.

3. There are issues with memory policies coming from the process
environment that may redirect allocations. We also have additional calls
with xx_node like alloc_pages and alloc_pages_node. A new API could
fix these and allow a complete specification of how the allocation should
proceed without strange side effects from the process (which makes
GFP_THISNODE necessary).

One easy alternate way to support allocating from a range of memory
without reworking the API would be to simply add a new page allocator
call:

struct page *alloc_pages_range(int order, gfp_t gfp_flags,
			unsigned long low, unsigned long high [, node?]);

This would scan through the freelists for available memory in that range
and if not found simply do page reclaim until such memory becomes
available. We could get more sophisticated than that, but this would allow
allocating memory from the ranges needed by broken devices, and it would
penalize only the device that has the problem without impacting the rest
of the system.
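
For a device like aacraid with its 2GB limit the call would then simply
be (a sketch, following the prototype above):

	/* one page that the device can address, i.e. below 2GB */
	struct page *page = alloc_pages_range(0, GFP_KERNEL,
						0UL, 0x80000000UL);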

2006-09-22 19:11:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday 22 September 2006 18:35, Christoph Lameter wrote:
> On Fri, 22 Sep 2006, Andi Kleen wrote:
>
> > On Friday 22 September 2006 06:02, Christoph Lameter wrote:
> > > We have repeatedly discussed the problems of devices having varying
> > > address range requirements for doing DMA.
> >
> > We already have such an API. dma_alloc_coherent(). Device drivers
> > are not supposed to mess with GFP_DMA* directly anymore for quite
> > some time.
>
> Device drivers need to be able to indicate ranges of addresses that may be
> different from ZONE_DMA. This is an attempt to come up with a future
> scheme that no longer relies on device drivers referring to zones.

We already have that scheme. Any existing driver should be already converted
away from GFP_DMA towards dma_*/pci_*. dma_* knows all the magic
how to get memory for the various ranges. No need to mess up the
main allocator.

Anyway, I suppose what could be added as a fallback would be a
really_slow_brute_force_try_to_get_something_in_this_range() allocator
that basically goes through the buddy lists freeing in >O(1)
and does some directed reclaim, but that would likely be a separate
path anyways and not need your new structure to impact the O(1)
allocator.

I am still unconvinced of the real need. The only gaping hole was
GFP_DMA32, which we fixed already.

Ok there is aacraid with its weird 2GB limit, but in case there are
really enough users running into this brokenness then the really_slow_*
thing above would likely be fine. And those cards are slowly going
away too.

If we have managed to resist for this long, now is the wrong time.

> > I actually have my doubts it is a good idea to add that now. The devices
> > with weird requirements are steadily going away

> Hmm.... Martin?

Think of it this way: all the weird slow devices of 5-10 years ago have USB
interfaces today, and USB does 32bit just fine (=GFP_DMA32). And the old
weird devices from 5-10 years ago are usually fine with 16MB of playground only.

Ok now I'm sure someone will come up with a counter example (hi Alan), but:
- Does the device really need more than 16MB?
- How often is it used on systems with >1/2GB with a 64bit kernel?
[consider that 64bit kernels don't support ISA]
- How many users of that particular thing around?


I think my point stands.

-Andi

2006-09-22 19:17:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 22 Sep 2006, Andi Kleen wrote:

> We already have that scheme. Any existing driver should be already converted
> away from GFP_DMA towards dma_*/pci_*. dma_* knows all the magic
> how to get memory for the various ranges. No need to mess up the
> main allocator.

That is not the case. The "magic" ends in arch specific
*_alloc_dma_coherent functions tinkering around with __GFP_DMA and, on
x86_64, additionally with GFP_DMA32.
>
> Anyway, I suppose what could be added as a fallback would be a
> really_slow_brute_force_try_to_get_something_in_this_range() allocator
> that basically goes through the buddy lists freeing in >O(1)
> and does some directed reclaim, but that would likely be a separate
> path anyways and not need your new structure to impact the O(1)
> allocator.

Right.

> I am still unconvinced of the real need. The only gaping hole was
> GFP_DMA32, which we fixed already.

And then there is the assumption that DMA zones are associated with arch
independent memory ranges, which is not the case. GFP_DMA32 just happens
to be defined by a single arch and thus has only one interpretation.

> Ok there is aacraid with its weird 2GB limit, but in case there are
> really enough users running into this brokenness then the really_slow_*
> thing above would likely be fine. And those cards are slowly going
> away too.

I agree.

2006-09-22 19:24:28

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

Andi Kleen wrote:
> On Friday 22 September 2006 18:35, Christoph Lameter wrote:
>
>>On Fri, 22 Sep 2006, Andi Kleen wrote:
>>
>>
>>>On Friday 22 September 2006 06:02, Christoph Lameter wrote:
>>>
>>>>We have repeatedly discussed the problems of devices having varying
>>>>address range requirements for doing DMA.
>>>
>>>We already have such an API. dma_alloc_coherent(). Device drivers
>>>are not supposed to mess with GFP_DMA* directly anymore for quite
>>>some time.
>>
>>Device drivers need to be able to indicate ranges of addresses that may be
>>different from ZONE_DMA. This is an attempt to come up with a future
>>scheme that no longer relies on device drivers referring to zones.
>
>
> We already have that scheme. Any existing driver should be already converted
> away from GFP_DMA towards dma_*/pci_*. dma_* knows all the magic
> how to get memory for the various ranges. No need to mess up the
> main allocator.

mbligh@mbligh:~/linux/views/linux-2.6.18$ grep -r GFP_DMA drivers

drivers/atm/fore200e.c: chunk->alloc_addr =
fore200e_kmalloc(chunk->alloc_size, GFP_KERNEL | GFP_DMA);
drivers/atm/fore200e.c: data = kmalloc(tx_len, GFP_ATOMIC | GFP_DMA);
drivers/atm/fore200e.c: fore200e->stats = fore200e_kmalloc(sizeof(struct
stats), GFP_KERNEL | GFP_DMA);
drivers/atm/fore200e.c: struct prom_data* prom =
fore200e_kmalloc(sizeof(struct prom_data), GFP_KERNEL | GFP_DMA);
drivers/atm/iphase.c: cpcs = kmalloc(sizeof(*cpcs),
GFP_KERNEL|GFP_DMA);
drivers/char/synclink.c: info->intermediate_rxbuffer =
kmalloc(info->max_frame_size, GFP_KERNEL | GFP_DMA);
drivers/isdn/hisax/netjet.c: GFP_KERNEL | GFP_DMA))) {
drivers/isdn/hisax/netjet.c: GFP_KERNEL | GFP_DMA))) {
drivers/media/dvb/dvb-usb/gp8psk.c: buf = kmalloc(512, GFP_KERNEL |
GFP_DMA);
drivers/media/video/arv.c: ar->line_buff = kmalloc(MAX_AR_LINE_BYTES,
GFP_KERNEL | GFP_DMA);
drivers/media/video/planb.c: |GFP_DMA, 0);
drivers/media/video/vino.c: GFP_KERNEL | GFP_DMA);
drivers/media/video/vino.c: get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/media/video/vino.c: GFP_KERNEL | GFP_DMA);
drivers/media/video/vino.c: get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/media/video/vino.c: vino_drvdata->dummy_page =
get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/media/video/vino.c: GFP_KERNEL | GFP_DMA);
drivers/media/video/zr36120_mem.c: mem =
(void*)__get_free_pages(GFP_USER|GFP_DMA,get_order(size));
drivers/mmc/wbsd.c: GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
drivers/net/b44.c: skb = __dev_alloc_skb(RX_PKT_BUF_SZ,GFP_DMA);
drivers/net/b44.c: GFP_ATOMIC|GFP_DMA);
drivers/net/b44.c: insisting on use of GFP_DMA, which is more
restrictive
drivers/net/b44.c: insisting on use of GFP_DMA, which is more
restrictive
drivers/net/gt96100eth.c: ret = (void *)__get_free_pages(GFP_ATOMIC |
GFP_DMA, get_order(size));
drivers/net/hamradio/dmascc.c: info = kmalloc(sizeof(struct scc_info),
GFP_KERNEL | GFP_DMA);
drivers/net/hp100.c: * PCI cards can access the whole PC memory.
Therefore GFP_DMA is not
drivers/net/irda/au1k_ir.c: int gfp = GFP_ATOMIC | GFP_DMA;
drivers/net/irda/pxaficp_ir.c: io->head = kmalloc(size, GFP_KERNEL |
GFP_DMA);
drivers/net/irda/sa1100_ir.c: io->head = kmalloc(size, GFP_KERNEL |
GFP_DMA);
drivers/net/irda/vlsi_ir.c: rd->buf = kmalloc(len, GFP_KERNEL|GFP_DMA);
drivers/net/lance.c: lp = kmalloc(sizeof(*lp), GFP_DMA | GFP_KERNEL);
drivers/net/lance.c: GFP_DMA | GFP_KERNEL);
drivers/net/lance.c: GFP_DMA | GFP_KERNEL);
drivers/net/lance.c: skb = alloc_skb(PKT_BUF_SZ, GFP_DMA | gfp);
drivers/net/lance.c: rx_buff = kmalloc(PKT_BUF_SZ, GFP_DMA | gfp);
drivers/net/macmace.c: mp->rx_ring = (void *)
__get_free_pages(GFP_KERNEL | GFP_DMA, N_RX_PAGES);
drivers/net/macmace.c: mp->tx_ring = (void *)
__get_free_pages(GFP_KERNEL | GFP_DMA, 0);
drivers/net/meth.c: skb = alloc_skb(METH_RX_BUFF_SIZE, GFP_ATOMIC |
GFP_DMA);
drivers/net/ni65.c: ret = skb = alloc_skb(2+16+size,GFP_KERNEL|GFP_DMA);
drivers/net/ni65.c: ret = ptr = kmalloc(T_BUF_SIZE,GFP_KERNEL | GFP_DMA);
drivers/net/tokenring/3c359.c: xl_priv->xl_tx_ring =
kmalloc((sizeof(struct xl_tx_desc) * XL_TX_RING_SIZE) + 7, GFP_DMA |
GFP_KERNEL) ;
drivers/net/tokenring/3c359.c: xl_priv->xl_rx_ring =
kmalloc((sizeof(struct xl_rx_desc) * XL_RX_RING_SIZE) +7, GFP_DMA |
GFP_KERNEL) ;
drivers/net/wan/cosa.c: cosa->bouncebuf = kmalloc(COSA_MTU,
GFP_KERNEL|GFP_DMA);
drivers/net/wan/cosa.c: if ((chan->rxdata = kmalloc(COSA_MTU,
GFP_DMA|GFP_KERNEL)) == NULL) {
drivers/net/wan/cosa.c: if ((kbuf = kmalloc(count, GFP_KERNEL|GFP_DMA))
== NULL) {
drivers/net/wan/z85230.c: c->rx_buf[0]=(void
*)get_zeroed_page(GFP_KERNEL|GFP_DMA);
drivers/net/wan/z85230.c: c->tx_dma_buf[0]=(void
*)get_zeroed_page(GFP_KERNEL|GFP_DMA);
drivers/net/wan/z85230.c: c->tx_dma_buf[0]=(void
*)get_zeroed_page(GFP_KERNEL|GFP_DMA);
drivers/net/znet.c: if (!(znet->rx_start = kmalloc (DMA_BUF_SIZE,
GFP_KERNEL | GFP_DMA)))
drivers/net/znet.c: if (!(znet->tx_start = kmalloc (DMA_BUF_SIZE,
GFP_KERNEL | GFP_DMA)))
drivers/s390/block/dasd.c: device->ccw_mem = (void *)
__get_free_pages(GFP_ATOMIC | GFP_DMA, 1);
drivers/s390/block/dasd.c: device->erp_mem = (void *)
get_zeroed_page(GFP_ATOMIC | GFP_DMA);
drivers/s390/block/dasd.c: GFP_ATOMIC | GFP_DMA);
drivers/s390/block/dasd.c: cqr->data = kzalloc(datasize, GFP_ATOMIC |
GFP_DMA);
drivers/s390/block/dasd_eckd.c: GFP_KERNEL | GFP_DMA);
drivers/s390/char/con3215.c: RAW3215_INBUF_SIZE, GFP_KERNEL|GFP_DMA);
drivers/s390/char/con3215.c: GFP_KERNEL|GFP_DMA);
drivers/s390/char/raw3270.c: rq = kzalloc(sizeof(struct
raw3270_request), GFP_KERNEL | GFP_DMA);
drivers/s390/char/raw3270.c: rq->buffer = kmalloc(size, GFP_KERNEL |
GFP_DMA);
drivers/s390/char/raw3270.c: rp = kmalloc(sizeof(struct raw3270),
GFP_KERNEL | GFP_DMA);
drivers/s390/char/sclp_cpi.c: sccb = (struct cpi_sccb *)
__get_free_page(GFP_KERNEL | GFP_DMA);
drivers/s390/char/sclp_tty.c: page = (void *)
get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/s390/char/sclp_vt220.c: page = (void *)
get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/s390/char/tape_3590.c: GFP_KERNEL | GFP_DMA);
drivers/s390/char/tape_core.c: device->modeset_byte = kmalloc(1,
GFP_KERNEL | GFP_DMA);
drivers/s390/char/tape_core.c: GFP_ATOMIC | GFP_DMA);
drivers/s390/char/tape_core.c: request->cpdata = kzalloc(datasize,
GFP_KERNEL | GFP_DMA);
drivers/s390/char/tty3270.c: __get_free_pages(GFP_KERNEL|GFP_DMA, 0);
drivers/s390/char/vmcp.c: | __GFP_REPEAT | GFP_DMA,
drivers/s390/cio/chsc.c: page = (void *)get_zeroed_page(GFP_KERNEL |
GFP_DMA);
drivers/s390/cio/chsc.c: secm_area = (void *)get_zeroed_page(GFP_KERNEL
| GFP_DMA);
drivers/s390/cio/chsc.c: css->cub_addr1 = (void
*)get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/s390/cio/chsc.c: css->cub_addr2 = (void
*)get_zeroed_page(GFP_KERNEL | GFP_DMA);
drivers/s390/cio/chsc.c: scpd_area = (void *)get_zeroed_page(GFP_KERNEL
| GFP_DMA);
drivers/s390/cio/chsc.c: scmc_area = (void *)get_zeroed_page(GFP_KERNEL
| GFP_DMA);
drivers/s390/cio/chsc.c: sei_page = (void *)get_zeroed_page(GFP_KERNEL |
GFP_DMA);
drivers/s390/cio/chsc.c: sda_area = (void
*)get_zeroed_page(GFP_KERNEL|GFP_DMA);
drivers/s390/cio/chsc.c: scsc_area = (void *)get_zeroed_page(GFP_KERNEL
| GFP_DMA);
drivers/s390/cio/cmf.c: mem = (void*)__get_free_pages(GFP_KERNEL | GFP_DMA,
drivers/s390/cio/css.c: sch = kmalloc (sizeof (*sch), GFP_KERNEL | GFP_DMA);
drivers/s390/cio/device.c: GFP_KERNEL | GFP_DMA);
drivers/s390/cio/device_ops.c: rdc_ccw = kzalloc(sizeof(struct ccw1),
GFP_KERNEL | GFP_DMA);
drivers/s390/cio/device_ops.c: rcd_ccw = kzalloc(sizeof(struct ccw1),
GFP_KERNEL | GFP_DMA);
drivers/s390/cio/device_ops.c: rcd_buf = kzalloc(ciw->count, GFP_KERNEL
| GFP_DMA);
drivers/s390/cio/device_ops.c: buf = kmalloc(32*sizeof(char),
GFP_DMA|GFP_KERNEL);
drivers/s390/cio/device_ops.c: buf2 = kmalloc(32*sizeof(char),
GFP_DMA|GFP_KERNEL);
drivers/s390/cio/qdio.c: irq_ptr = (void *) get_zeroed_page(GFP_KERNEL |
GFP_DMA);
drivers/s390/cio/qdio.c: irq_ptr->qdr=kmalloc(sizeof(struct qdr),
GFP_KERNEL | GFP_DMA);
drivers/s390/cio/qdio.c: return (void *) get_zeroed_page(gfp_mask|GFP_DMA);
drivers/s390/net/claw.c: (void *)__get_free_pages(__GFP_DMA,
drivers/s390/net/claw.c: (void *)__get_free_pages(__GFP_DMA,
drivers/s390/net/claw.c: p_buff=(void
*)__get_free_pages(__GFP_DMA,
drivers/s390/net/claw.c: (void *)__get_free_pages(__GFP_DMA,
drivers/s390/net/claw.c: p_buff = (void
*)__get_free_pages(__GFP_DMA,
drivers/s390/net/ctcmain.c: GFP_ATOMIC | GFP_DMA);
drivers/s390/net/ctcmain.c: GFP_KERNEL | GFP_DMA)) == NULL) {
drivers/s390/net/ctcmain.c: nskb = alloc_skb(skb->len, GFP_ATOMIC |
GFP_DMA);
drivers/s390/net/iucv.c: /* Note: GFP_DMA used used to get memory below
2G */
drivers/s390/net/iucv.c: GFP_KERNEL|GFP_DMA);
drivers/s390/net/iucv.c: GFP_KERNEL|GFP_DMA);
drivers/s390/net/lcs.c: kzalloc(LCS_IOBUFFERSIZE, GFP_DMA | GFP_KERNEL);
drivers/s390/net/lcs.c: card = kzalloc(sizeof(struct lcs_card),
GFP_KERNEL | GFP_DMA);
drivers/s390/net/lcs.c: * Note: we have allocated the buffer with
GFP_DMA, so
drivers/s390/net/lcs.c: * Note: we have allocated the buffer with
GFP_DMA, so
drivers/s390/net/netiucv.c: NETIUCV_HDRLEN, GFP_ATOMIC | GFP_DMA);
drivers/s390/net/netiucv.c: GFP_KERNEL | GFP_DMA);
drivers/s390/net/netiucv.c: GFP_KERNEL | GFP_DMA);
drivers/s390/net/qeth_main.c: kmalloc(QETH_BUFSIZE, GFP_DMA|GFP_KERNEL);
drivers/s390/net/qeth_main.c: card = kzalloc(sizeof(struct qeth_card),
GFP_DMA|GFP_KERNEL);
drivers/s390/net/qeth_main.c: ptr = (void *)
__get_free_page(GFP_KERNEL|GFP_DMA);
drivers/s390/net/qeth_main.c: GFP_KERNEL|GFP_DMA);
drivers/s390/net/qeth_main.c: GFP_KERNEL|GFP_DMA);
drivers/s390/net/smsgiucv.c: msg = kmalloc(len + 1, GFP_ATOMIC|GFP_DMA);
drivers/scsi/53c7xx.c:/* FIXME: for ISA bus '7xx chips, we need to or
GFP_DMA in here */
drivers/scsi/aacraid/commctrl.c: /* Does this really need to be
GFP_DMA? */
drivers/scsi/aacraid/commctrl.c: p =
kmalloc(usg->sg[i].count,GFP_KERNEL|__GFP_DMA);
drivers/scsi/aha1542.c: SCpnt->host_scribble = (unsigned char *)
kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/ch.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/eata.c: gfp_t gfp_mask = (shost->unchecked_isa_dma ?
GFP_DMA : 0) | GFP_ATOMIC;
drivers/scsi/hosts.c: gfp_mask |= __GFP_DMA;
drivers/scsi/initio.c: if ((tul_scb = (SCB *) kmalloc(i, GFP_ATOMIC |
GFP_DMA)) != NULL)
drivers/scsi/ips.c:/* 4.71.00 - Change all memory allocations to not
use GFP_DMA flag */
drivers/scsi/osst.c: priority |= GFP_DMA;
drivers/scsi/pluto.c: fcs = (struct ctrl_inquiry *) kmalloc (sizeof
(struct ctrl_inquiry) * fcscount, GFP_DMA);
drivers/scsi/scsi.c: .gfp_mask = __GFP_DMA,
drivers/scsi/scsi_error.c: gfp_mask |= __GFP_DMA;
drivers/scsi/scsi_scan.c: ((shost->unchecked_isa_dma) ? __GFP_DMA : 0));
drivers/scsi/scsi_scan.c: (sdev->host->unchecked_isa_dma ?
__GFP_DMA : 0));
drivers/scsi/sd.c: buffer = kmalloc(SD_BUF_SIZE, GFP_KERNEL | __GFP_DMA);
drivers/scsi/sg.c: * XXX(hch): we shouldn't need GFP_DMA for the actual
S/G list.
drivers/scsi/sg.c: gfp_flags |= GFP_DMA;
drivers/scsi/sg.c: page_mask = GFP_ATOMIC | GFP_DMA | __GFP_COMP |
__GFP_NOWARN;
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/sr.c: buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
drivers/scsi/sr_ioctl.c:/* primitive to determine whether we need to
have GFP_DMA set based on
drivers/scsi/sr_ioctl.c:#define SR_GFP_DMA(cd)
(((cd)->device->host->unchecked_isa_dma) ? GFP_DMA : 0)
drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: buffer = kmalloc(32, GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: char *buffer = kmalloc(32, GFP_KERNEL |
SR_GFP_DMA(cd));
drivers/scsi/sr_ioctl.c: raw_sector = (unsigned char *) kmalloc(2048,
GFP_KERNEL | SR_GFP_DMA(cd));
drivers/scsi/sr_vendor.c: buffer = (unsigned char *) kmalloc(512,
GFP_KERNEL | GFP_DMA);
drivers/scsi/sr_vendor.c: buffer = (unsigned char *) kmalloc(512,
GFP_KERNEL | GFP_DMA);
drivers/scsi/st.c: priority |= GFP_DMA;
drivers/scsi/u14-34f.c: (sh[j]->unchecked_isa_dma ? GFP_DMA :
0) | GFP_ATOMIC))) {
drivers/usb/gadget/lh7a40x_udc.c: retval = kmalloc(bytes, gfp_flags &
~(__GFP_DMA | __GFP_HIGHMEM));
drivers/usb/gadget/pxa2xx_udc.c: retval = kmalloc (bytes, gfp_flags &
~(__GFP_DMA|__GFP_HIGHMEM));

2006-09-22 19:47:28

by Alan

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 2006-09-22 at 21:10 +0200, Andi Kleen wrote:
> We already have that scheme. Any existing driver should be already converted
> away from GFP_DMA towards dma_*/pci_*. dma_* knows all the magic
> how to get memory for the various ranges. No need to mess up the
> main allocator.

Add an isa_device class and that'll fall into place nicely. isa_alloc_*
will end up asking for 20bit DMA and it will work nicely.
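
Roughly like this, perhaps (a hypothetical sketch, no isa_alloc_* exists
yet; the right mask depends on the DMA channel width):

	void *isa_alloc_coherent(struct device *dev, size_t size,
				 dma_addr_t *handle, gfp_t gfp)
	{
		/* 24-bit mask = low 16MB; an 8-bit channel needs 20 bits */
		dev->coherent_dma_mask = 0x00ffffff;
		return dma_alloc_coherent(dev, size, handle, gfp);
	}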

> Anyway, I suppose what could be added as a fallback would be a
> really_slow_brute_force_try_to_get_something_in_this_range() allocator

Implementation detail, although I note that the defrag/antifrag proposal
made at the VM summit would mean it mostly comes out for free. If we have
an isa_dma_* API then the detail is platform specific.

> that basically goes through the buddy lists freeing in >O(1)
> and does some directed reclaim, but that would likely be a separate
> path anyways and not need your new structure to impact the O(1)
> allocator.

Just search within the candidate 4MB (or whatever it is these days)
chunks.

> I am still unconvinced of the real need. The only gaping hole was
> GFP_DMA32, which we fixed already.

Various devices are 30 and 31bit today - some broadcom for example.

> Ok there is aacraid with its weird 2GB limit,
> Ok now I'm sure someone will come up with a counter example (hi Alan), but:
> - Does the device really need more than 16MB?
> - How often is it used on systems with >1/2GB with a 64bit kernel?
> [consider that 64bit kernels don't support ISA]
> - How many users of that particular thing around?

Ok the examples I know about are
- ESS Maestro series audio - PCI, common on 32bit boxes a few years ago,
no longer shipped and unlikely to be met on 64bit. Also slow allocations
are fine.
- Some aacraid, mostly only for control structures. Those found on 64bit
are probably fine with slow alloc.
- Broadcom stuff - not sure if 30 or 31bit, around today and on 64bit
- Floppy controller

> I think my point stands.

I think it's worthy of discussion.

2006-09-22 20:02:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday 22 September 2006 22:10, Alan Cox wrote:
> On Fri, 2006-09-22 at 21:10 +0200, Andi Kleen wrote:
> > We already have that scheme. Any existing driver should be already converted
> > away from GFP_DMA towards dma_*/pci_*. dma_* knows all the magic
> > how to get memory for the various ranges. No need to mess up the
> > main allocator.
>
> Add an isa_device class and that'll fall into place nicely. isa_alloc_*
> will end up asking for 20bit DMA and it will work nicely.


The old school way is to pass NULL to pci_alloc_coherent()
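
i.e. something like:

	/* a NULL pci_dev makes the arch code apply the ISA-safe limits */
	dma_addr_t handle;
	void *buf = pci_alloc_coherent(NULL, 4096, &handle);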

> > that basically goes through the buddy lists freeing in >O(1)
> > and does some directed reclaim, but that would likely be a separate
> > path anyways and not need your new structure to impact the O(1)
> > allocator.
>
> Just search within the candidate 4MB (or whatever it is these days)
> chunks.
>

What chunks?

> Ok the examples I know about are
> - ESS Maestro series audio - PCI, common on 32bit boxes a few years ago,
> no longer shipped and unlikely to be met on 64bit. Also slow allocations
> is fine.

And it is fine with 16MB anyway, I think.

> - Some aacraid, mostly only for control structures. Those found on 64bit
> are probably fine with slow alloc.

That is the only case where there are rumours they are not fine with 16MB.

> - Broadcom stuff - not sure if 30 or 31bit, around today and on 64bit

b44 is 30bit. That's true. I even got one here.

But it doesn't really count because we can handle it fine with the existing
16MB GFP_DMA.

> - Floppy controller

That one only needs one page or so. In the worst case memory could be preallocated
in .bss for it.

-Andi

2006-09-22 20:14:24

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API


> And is fine with 16MB anyways I think.
>
>
>>- Some aacraid, mostly only for control structures. Those found on 64bit
>>are probably fine with slow alloc.
>
>
> That is the only case where there are rumours they are not fine with 16MB.
>
>
>>- Broadcom stuff - not sure if 30 or 31bit, around today and on 64bit
>
>
> b44 is 30bit. That's true. I even got one here.
>
> But it doesn't count really because we can handle it fine with existing
> 16MB GFP_DMA

The problem is that GFP_DMA does not mean 16MB on all architectures.

M.

2006-09-22 20:23:54

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

Here is an initial patch of alloc_pages_range (untested, compiles).
Directed reclaim missing. Feedback wanted. There are some comments in the
patch where I am at the boundary of my knowledge and it would be good if
someone could supply the info needed.

Index: linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c
===================================================================
--- linux-2.6.18-rc7-mm1.orig/arch/i386/kernel/pci-dma.c 2006-09-22 15:10:42.246731179 -0500
+++ linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c 2006-09-22 15:11:10.449709078 -0500
@@ -26,6 +26,8 @@ void *dma_alloc_coherent(struct device *
dma_addr_t *dma_handle, gfp_t gfp)
{
void *ret;
+ unsigned long low = 0L;
+ unsigned long high = 0xffffffff;
struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
int order = get_order(size);
/* ignore region specifiers */
@@ -44,10 +46,14 @@ void *dma_alloc_coherent(struct device *
return NULL;
}

- if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
- gfp |= GFP_DMA;
+ if (dev == NULL)
+ /* Apply safe ISA LIMITS */
+ high = 16*1024*1024L;
+ else
+ if (dev->coherent_dma_mask < 0xffffffff)
+ high = dev->coherent_dma_mask;

- ret = (void *)__get_free_pages(gfp, order);
+ ret = page_address(alloc_pages_range(low, high, -1, gfp, order));

if (ret != NULL) {
memset(ret, 0, size);
Index: linux-2.6.18-rc7-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.18-rc7-mm1.orig/include/linux/gfp.h 2006-09-22 15:10:42.235994626 -0500
+++ linux-2.6.18-rc7-mm1/include/linux/gfp.h 2006-09-22 15:11:10.462397735 -0500
@@ -136,6 +136,9 @@ static inline struct page *alloc_pages_n
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}

+extern struct page *alloc_pages_range(unsigned long low, unsigned long high,
+ int nid, gfp_t gfp_mask, unsigned int order);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

Index: linux-2.6.18-rc7-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.18-rc7-mm1.orig/mm/page_alloc.c 2006-09-22 15:10:53.973976539 -0500
+++ linux-2.6.18-rc7-mm1/mm/page_alloc.c 2006-09-22 15:19:59.996440889 -0500
@@ -1195,9 +1195,119 @@ got_pg:
#endif
return page;
}
-
EXPORT_SYMBOL(__alloc_pages);

+static struct page *rmqueue_range(unsigned long low, unsigned long high,
+ struct zone *zone, unsigned int order)
+{
+ struct free_area * area;
+ unsigned int current_order;
+ struct page *page;
+
+ for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ area = zone->free_area + current_order;
+ if (list_empty(&area->free_list))
+ continue;
+
+ list_for_each_entry(page, &area->free_list, lru) {
+ unsigned long addr = (unsigned long)page_address(page);
+
+ if (addr >= low &&
+ addr < high - (PAGE_SIZE << order))
+ goto found_match;
+ }
+ continue;
+found_match:
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ zone->free_pages -= 1UL << order;
+ expand(zone, page, order, current_order, area);
+ return page;
+ }
+ return NULL;
+}
+
+struct page *alloc_pages_range(unsigned long low, unsigned long high, int node,
+ gfp_t gfp_flags, unsigned int order)
+{
+ const gfp_t wait = gfp_flags & __GFP_WAIT;
+ struct zonelist *zl;
+ struct zone **z;
+ struct page *page;
+
+#ifdef CONFIG_ZONE_DMA
+ if (high < MAX_DMA_ADDRESS)
+ return alloc_pages(gfp_flags | __GFP_DMA, order);
+#endif
+#ifdef CONFIG_ZONE_DMA32
+ if (high < MAX_DMA32_ADDRESS)
+ return alloc_pages(gfp_flags | __GFP_DMA32, order);
+#endif
+ /*
+ * Is there an upper/lower limit of installed memory that we could
+ * check against instead of -1 ? The less memory installed the less
+ * the chance that we would have to do the expensive range search.
+ */
+ if (high == -1L && low == 0L)
+ return alloc_pages(gfp_flags, order);
+
+ if (node == -1)
+ node = numa_node_id();
+
+ /*
+ * Scan in the page allocator for memory.
+ * We skip all the niceties of the page allocator since this is
+ * used for device allocations that require memory from a limited
+ * address range.
+ */
+
+ might_sleep_if(wait);
+
+ zl = &NODE_DATA(node)->node_zonelists[gfp_zone(gfp_flags)];
+
+ z = zl->zones;
+
+ if (unlikely(*z == NULL))
+ /* Should this ever happen?? */
+ return NULL;
+
+ do {
+ struct zone *zone = *z;
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ page = rmqueue_range(low, high, zone, order);
+ spin_unlock(&zone->lock);
+ if (!page) {
+ local_irq_restore(flags);
+ continue;
+ }
+ __count_zone_vm_events(PGALLOC, zone, 1 << order);
+ zone_statistics(zl, zone);
+ local_irq_restore(flags);
+
+ VM_BUG_ON(bad_range(zone, page));
+ if (!prep_new_page(page, order, gfp_flags))
+ goto got_pg;
+
+ } while (*(++z) != NULL);
+
+ /*
+ * For now just give up. In the future we need something like
+ * directed reclaim here.
+ */
+ page = NULL;
+got_pg:
+#ifdef CONFIG_PAGE_OWNER
+ if (page)
+ set_page_owner(page, order, gfp_flags);
+#endif
+ return page;
+}
+
/*
* Common helper functions.
*/

2006-09-22 20:43:12

by Jesse Barnes

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday, September 22, 2006 1:23 pm, Christoph Lameter wrote:
> Here is an initial patch of alloc_pages_range (untested, compiles).
> Directed reclaim missing. Feedback wanted. There are some comments in
> the patch where I am at the boundary of my knowledge and it would be
> good if someone could supply the info needed.
>
> Index: linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c
> ===================================================================
> --- linux-2.6.18-rc7-mm1.orig/arch/i386/kernel/pci-dma.c 2006-09-22 15:10:42.246731179 -0500
> +++ linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c 2006-09-22 15:11:10.449709078 -0500
> @@ -26,6 +26,8 @@ void *dma_alloc_coherent(struct device *
> dma_addr_t *dma_handle, gfp_t gfp)
> {
> void *ret;
> + unsigned long low = 0L;
> + unsigned long high = 0xffffffff;
> struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
> int order = get_order(size);
> /* ignore region specifiers */
> @@ -44,10 +46,14 @@ void *dma_alloc_coherent(struct device *
> return NULL;
> }
>
> - if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
> - gfp |= GFP_DMA;
> + if (dev == NULL)
> + /* Apply safe ISA LIMITS */
> + high = 16*1024*1024L;
> + else
> + if (dev->coherent_dma_mask < 0xffffffff)
> + high = dev->coherent_dma_mask;

With your alloc_pages_range this check can go away. I think only the dev
== NULL check is needed with this scheme since it looks like there's no
way (currently) for ISA devices to store their masks for later
consultation by arch code?

> + /*
> + * Is there an upper/lower limit of installed memory that we could
> + * check against instead of -1 ? The less memory installed the less
> + * the chance that we would have to do the expensive range search.
> + */
> + if (high == -1L && low == 0L)
> + return alloc_pages(gfp_flags, order);

There's max_pfn, but on machines with large memory holes using it might not
help much.

Jesse

2006-09-22 20:48:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday 22 September 2006 22:23, Christoph Lameter wrote:
> Here is an initial patch of alloc_pages_range (untested, compiles).
> Directed reclaim missing. Feedback wanted. There are some comments in the
> patch where I am at the boundary of my knowledge and it would be good if
> someone could supply the info needed.

Looks like a good start. Surprising how little additional code it is.

-Andi

2006-09-22 21:02:15

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 22 Sep 2006, Jesse Barnes wrote:

> > + if (dev->coherent_dma_mask < 0xffffffff)
> > + high = dev->coherent_dma_mask;
>
> With your alloc_pages_range this check can go away. I think only the dev
> == NULL check is needed with this scheme since it looks like there's no
> way (currently) for ISA devices to store their masks for later
> consultation by arch code?

This check is necessary to set up the correct high boundary for
alloc_pages_range.

> > + if (high == -1L && low == 0L)
> > + return alloc_pages(gfp_flags, order);
>
> There's max_pfn, but on machines with large memory holes using it might not
> help much.

I found node_start_pfn and node_spanned_pages in the node structure. That
gives me the boundaries for a node and I think I can work with that.

2006-09-22 21:13:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

Next try.

- Drop node parameter since nodes have physical address spaces and
we can match on those using the high / low parameters.

- Check the boundaries of a node before searching the zones in the
node. This includes checking the upper / lower
boundary of present memory. So we can simply fall back to regular
alloc_pages if e.g. we have an x86_64 with all memory below 4GB and
ZONE_DMA and ZONE_DMA32 configured off.

- Still no reclaim.

- Hmmm... I have no floppy drive....

Index: linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c
===================================================================
--- linux-2.6.18-rc7-mm1.orig/arch/i386/kernel/pci-dma.c 2006-09-22 15:10:42.246731179 -0500
+++ linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c 2006-09-22 15:37:41.464093162 -0500
@@ -26,6 +26,8 @@ void *dma_alloc_coherent(struct device *
dma_addr_t *dma_handle, gfp_t gfp)
{
void *ret;
+ unsigned long low = 0L;
+ unsigned long high = 0xffffffff;
struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
int order = get_order(size);
/* ignore region specifiers */
@@ -44,10 +46,14 @@ void *dma_alloc_coherent(struct device *
return NULL;
}

- if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
- gfp |= GFP_DMA;
+ if (dev == NULL)
+ /* Apply safe ISA LIMITS */
+ high = 16*1024*1024L;
+ else
+ if (dev->coherent_dma_mask < 0xffffffff)
+ high = dev->coherent_dma_mask;

- ret = (void *)__get_free_pages(gfp, order);
+ ret = page_address(alloc_pages_range(low, high, gfp, order));

if (ret != NULL) {
memset(ret, 0, size);
Index: linux-2.6.18-rc7-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.18-rc7-mm1.orig/include/linux/gfp.h 2006-09-22 15:10:42.235994626 -0500
+++ linux-2.6.18-rc7-mm1/include/linux/gfp.h 2006-09-22 15:58:53.385391317 -0500
@@ -136,6 +136,9 @@ static inline struct page *alloc_pages_n
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}

+extern struct page *alloc_pages_range(unsigned long low, unsigned long high,
+ gfp_t gfp_mask, unsigned int order);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

Index: linux-2.6.18-rc7-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.18-rc7-mm1.orig/mm/page_alloc.c 2006-09-22 15:10:53.973976539 -0500
+++ linux-2.6.18-rc7-mm1/mm/page_alloc.c 2006-09-22 16:10:13.940439657 -0500
@@ -1195,9 +1195,145 @@ got_pg:
#endif
return page;
}
-
EXPORT_SYMBOL(__alloc_pages);

+static struct page *rmqueue_range(unsigned long low, unsigned long high,
+ struct zone *zone, unsigned int order)
+{
+ struct free_area * area;
+ unsigned int current_order;
+ struct page *page;
+
+ for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ area = zone->free_area + current_order;
+ if (list_empty(&area->free_list))
+ continue;
+
+ list_for_each_entry(page, &area->free_list, lru) {
+ unsigned long addr = (unsigned long)page_address(page);
+
+ if (addr >= low &&
+ addr < high - (PAGE_SIZE << order))
+ goto found_match;
+ }
+ continue;
+found_match:
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ zone->free_pages -= 1UL << order;
+ expand(zone, page, order, current_order, area);
+ return page;
+ }
+ return NULL;
+}
+
+static struct page *zonelist_alloc_range(unsigned long low, unsigned long high,
+ gfp_t gfp_flags, unsigned int order,
+ struct zonelist *zl)
+{
+ struct zone **z = zl->zones;
+ struct page *page;
+
+ if (unlikely(*z == NULL))
+ /* Should this ever happen?? */
+ return NULL;
+
+ do {
+ struct zone *zone = *z;
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ page = rmqueue_range(low, high, zone, order);
+ spin_unlock(&zone->lock);
+ if (!page) {
+ local_irq_restore(flags);
+ continue;
+ }
+ __count_zone_vm_events(PGALLOC, zone, 1 << order);
+ zone_statistics(zl, zone);
+ local_irq_restore(flags);
+
+ VM_BUG_ON(bad_range(zone, page));
+ if (!prep_new_page(page, order, gfp_flags))
+ goto got_pg;
+
+ } while (*(++z) != NULL);
+
+ /*
+ * For now just give up. In the future we need something like
+ * directed reclaim here.
+ */
+ page = NULL;
+got_pg:
+#ifdef CONFIG_PAGE_OWNER
+ if (page)
+ set_page_owner(page, order, gfp_flags);
+#endif
+ return page;
+}
+
+struct page *alloc_pages_range(unsigned long low, unsigned long high,
+ gfp_t gfp_flags, unsigned int order)
+{
+ const gfp_t wait = gfp_flags & __GFP_WAIT;
+ struct page *page = NULL;
+ struct pglist_data *lastpgdat;
+ int node;
+
+#ifdef CONFIG_ZONE_DMA
+ if (high < MAX_DMA_ADDRESS)
+ return alloc_pages(gfp_flags | __GFP_DMA, order);
+#endif
+#ifdef CONFIG_ZONE_DMA32
+ if (high < MAX_DMA32_ADDRESS)
+ return alloc_pages(gfp_flags | __GFP_DMA32, order);
+#endif
+ /*
+ * Is there an upper/lower limit of installed memory that we could
+ * check against instead of -1 ? The less memory installed the less
+ * the chance that we would have to do the expensive range search.
+ */
+
+ /* This probably should check against the last online node in the future */
+ lastpgdat = NODE_DATA(MAX_NUMNODES -1);
+
+ if (high >= ((lastpgdat->node_start_pfn + lastpgdat->node_spanned_pages) << PAGE_SHIFT) &&
+ low <= (NODE_DATA(0)->node_start_pfn << PAGE_SHIFT))
+ return alloc_pages(gfp_flags, order);
+
+ /*
+ * Scan in the page allocator for memory.
+ * We skip all the niceties of the page allocator since this is
+ * used for device allocations that require memory from a limited
+ * address range.
+ */
+
+ might_sleep_if(wait);
+
+ for_each_online_node(node) {
+ struct pglist_data *pgdat = NODE_DATA(node);
+
+ if (low > ((pgdat->node_start_pfn +
+ pgdat->node_spanned_pages) << PAGE_SHIFT))
+ continue;
+
+ /*
+ * This check assumes that increasing node numbers go
+ * along with increasing addresses!
+ */
+ if (high < (pgdat->node_start_pfn << PAGE_SHIFT))
+ break;
+
+ page = zonelist_alloc_range(low, high, gfp_flags, order,
+ NODE_DATA(node)->node_zonelists + gfp_zone(gfp_flags));
+ if (page)
+ break;
+ }
+ return page;
+}
/*
* Common helper functions.
*/

2006-09-22 21:13:23

by Jesse Barnes

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Friday, September 22, 2006 2:01 pm, Christoph Lameter wrote:
> On Fri, 22 Sep 2006, Jesse Barnes wrote:
> > > + if (dev->coherent_dma_mask < 0xffffffff)
> > > + high = dev->coherent_dma_mask;
> >
> > With your alloc_pages_range this check can go away. I think only the
> > dev == NULL check is needed with this scheme since it looks like
> > there's no way (currently) for ISA devices to store their masks for
> > later consultation by arch code?
>
> This check is necessary to set up the correct high boundary for
> alloc_pages_range.

I was suggesting something like:

high = dev ? dev->coherent_dma_mask : 16*1024*1024;

instead. May as well combine your NULL check and your assignment. It'll
also do the right thing for 64 bit devices so we don't put unnecessary
pressure on the 32 bit range. Or am I spacing out and reading the code
wrong?

> > There's max_pfn, but on machines with large memory holes using it
> > might not help much.
>
> I found node_start_pfn and node_spanned_pages in the node structure.
> That gives me the boundaries for a node and I think I can work with
> that.

Even better.

Jesse

2006-09-22 21:22:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 22 Sep 2006, Jesse Barnes wrote:

> I was suggesting something like:
>
> high = dev ? dev->coherent_dma_mask : 16*1024*1024;
>
> instead. May as well combine your NULL check and your assignment. It'll
> also do the right thing for 64 bit devices so we don't put unnecessary
> pressure on the 32 bit range. Or am I spacing out and reading the code
> wrong?

Ahh.. Yes something like this will save a lot of lines:

Index: linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c
===================================================================
--- linux-2.6.18-rc7-mm1.orig/arch/i386/kernel/pci-dma.c 2006-09-22 15:37:41.000000000 -0500
+++ linux-2.6.18-rc7-mm1/arch/i386/kernel/pci-dma.c 2006-09-22 16:20:49.849799156 -0500
@@ -26,8 +26,6 @@ void *dma_alloc_coherent(struct device *
dma_addr_t *dma_handle, gfp_t gfp)
{
void *ret;
- unsigned long low = 0L;
- unsigned long high = 0xffffffff;
struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
int order = get_order(size);
/* ignore region specifiers */
@@ -46,14 +44,9 @@ void *dma_alloc_coherent(struct device *
return NULL;
}

- if (dev == NULL)
- /* Apply safe ISA LIMITS */
- high = 16*1024*1024L;
- else
- if (dev->coherent_dma_mask < 0xffffffff)
- high = dev->coherent_dma_mask;
-
- ret = page_address(alloc_pages_range(low, high, gfp, order));
+ ret = page_address(alloc_pages_range(0L,
+ dev ? dev->coherent_dma_mask : 16*1024*1024,
+ gfp, order));

if (ret != NULL) {
memset(ret, 0, size);

2006-09-22 21:32:53

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] Initial alpha-0 for new page allocator API

On Fri, 22 Sep 2006, Andi Kleen wrote:

> Looks like a good start. Surprising how little additional code it is.

Gosh. I just looked at x86_64 dma_alloc_coherent. There is a mind-boggling
series of tricks with __GFP_DMA and GFP_DMA32 going on. Could you get me a
patch that sorts this out once we have alloc_pages_range()? I would expect
that it will become much simpler.
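
Roughly like this is what I would hope for (an untested sketch that
ignores the IOMMU and swiotlb paths the real function also handles):

	void *dma_alloc_coherent(struct device *dev, size_t size,
				 dma_addr_t *dma_handle, gfp_t gfp)
	{
		unsigned long high = dev ? dev->coherent_dma_mask
					 : 16*1024*1024;
		struct page *page = alloc_pages_range(0L, high, gfp,
						get_order(size));

		if (!page)
			return NULL;
		memset(page_address(page), 0, size);
		*dma_handle = page_to_phys(page);
		return page_address(page);
	}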

2006-09-22 23:35:09

by Andi Kleen

[permalink] [raw]
Subject: More thoughts on getting rid of ZONE_DMA

On Friday 22 September 2006 22:23, Christoph Lameter wrote:
> Here is an initial patch of alloc_pages_range (untested, compiles).
> Directed reclaim missing. Feedback wanted. There are some comments in the
> patch where I am at the boundary of my knowledge and it would be good if
> someone could supply the info needed.


Christoph,

I thought a little more about the problem.

Currently I don't think we can get rid of ZONE_DMA even with your patch.

The problem is that if someone has a workload with lots of pinned pages
(e.g. lots of mlock) then the first 16MB might fill up completely and there
is no chance at all to free it because it's pinned.

This is not theoretical: Andrea originally implemented the "keep lower
zones free" heuristics exactly because this happened in the field.

So we need some way to reserve some low memory pages (a "low mem mempool" so to
speak). Otherwise we could always run into deadlocks later under load.

As I understand it your goal is to remove knowledge of the DMA zones from
the generic VM to save some cache lines in hot paths.

First ZONE_DMA32 likely needs to be kept in the normal allocator because there
are just too many potential users of it, and some of them even need fast memory allocation.

But AFAIK none of the 16MB ZONE_DMA users need fast allocation, so being
a bit slower for them is ok.

What we could do instead is to have a configurable pool starting at zero with
a special allocator that can allocate ranges in there. This wouldn't need to
be a 16MB pool, but could be a kernel boot parameter. This would keep
it completely out of the fast VM path and reach your original goals.
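
Wiring up the boot parameter would be trivial, something like this
(a sketch; the dma_pool= name is made up):

	static unsigned long dma_pool_size = 16 * 1024 * 1024;

	static int __init dma_pool_setup(char *str)
	{
		dma_pool_size = memparse(str, &str);
		return 1;
	}
	__setup("dma_pool=", dma_pool_setup);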

This would also fix aacraid because users of it could just configure a larger
pool (we could potentially even have a heuristic to size it based on PCI IDs;
this wouldn't deal with hotplug but would still be much better than shifting
it completely to the user)

-Andi

2006-09-23 00:24:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

On Sat, 23 Sep 2006, Andi Kleen wrote:

> The problem is that if someone has a workload with lots of pinned pages
> (e.g. lots of mlock) then the first 16MB might fill up completely and there
> is no chance at all to free it because it's pinned

Ok. That may be a problem for i386. After the removal of the GFP_DMA
and ZONE_DMA stuff it would then be possible to redefine ZONE_DMA (or
whatever we may call it, ZONE_RESERVE?) to an arbitrary size at the
beginning of memory. Then alloc_pages_range() can dynamically decide to
tap that pool if necessary. I already have checks for ZONE_DMA and
ZONE_DMA32 in there. If we just rename those then what you wanted would
be there. If additional memory pools are available then they
are used when the allocation restrictions fit, avoiding a lengthy search.

This may mean that i386 and x86_64 will still have two zones. It's somewhat
better.

However, on IA64 we would not need this since our DMA limit has been
4GB in the past.

2006-09-23 00:26:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

Another solution may be to favor high addresses in the page allocator?

2006-09-23 00:40:22

by Andi Kleen

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

On Saturday 23 September 2006 02:25, Christoph Lameter wrote:
> Another solution may be to favor high addresses in the page allocator?

We used to do that, but it got changed because IO request merging without
IOMMU works much better if you start low and go up instead of the other
way round.

-Andi

2006-09-23 00:40:23

by Andi Kleen

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

On Saturday 23 September 2006 02:23, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
>
> > The problem is that if someone has a workload with lots of pinned pages
> > (e.g. lots of mlock) then the first 16MB might fill up completely and there
> > is no chance at all to free it because it's pinned
>
> Ok. That may be a problem for i386. After the removal of the GFP_DMA
> and ZONE_DMA stuff it would then be possible to redefine ZONE_DMA (or
> whatever we may call it, ZONE_RESERVE?) to an arbitrary size at the
> beginning of memory. Then alloc_pages_range() can dynamically decide to
> tap that pool if necessary.

That should work, yes. We just need the pool.

-Andi

2006-09-24 02:14:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

On Sat, 23 Sep 2006, Andi Kleen wrote:

> The problem is that if someone has a workload with lots of pinned pages
> (e.g. lots of mlock) then the first 16MB might fill up completely and there
> is no chance at all to free it because it's pinned

Note that mlock'ed pages are movable. mlock only specifies that pages
must stay in memory. It does not say that they cannot be moved. So
page migration could help there.

This brings up a possible problem spot in the current kernel: It seems
that the VM is capable of migrating pages from ZONE_DMA to
ZONE_NORMAL! So once pages are in memory then they may move out of the
DMA-able area.

I assume the writeback paths have some means of detecting that a
page is out of range during writeback and then do page bouncing?

If that is the case then we could simply move movable pages out
if necessary. That would be a kind of bouncing logic there that
would only kick in if necessary.

2006-09-24 02:37:23

by Martin Bligh

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
>
>> The problem is that if someone has a workload with lots of pinned pages
>> (e.g. lots of mlock) then the first 16MB might fill up completely and there
>> is no chance at all to free it because it's pinned
>
> Note that mlock'ed pages are movable. mlock only specifies that pages
> must stay in memory. It does not say that they cannot be moved. So
> page migration could help there.
>
> This brings up a possible problem spot in the current kernel: It seems
> that the VM is capable of migrating pages from ZONE_DMA to
> ZONE_NORMAL! So once pages are in memory then they may move out of the
> DMA-able area.
>
> I assume the writeback paths have some means of detecting that a
> page is out of range during writeback and then do page bouncing?
>
> If that is the case then we could simply move movable pages out
> if necessary. That would be a kind of bouncing logic there that
> would only kick in if necessary.

If it's the 16MB DMA window for ia32 we're talking about, wouldn't
it be easier just to remove it from the fallback lists? (assuming
you have at least 128MB of memory or something, blah, blah). Saves
doing migration later.

M.

2006-09-24 07:19:36

by Andi Kleen

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA

On Sunday 24 September 2006 04:13, Christoph Lameter wrote:
> On Sat, 23 Sep 2006, Andi Kleen wrote:
> > The problem is that if someone has a workload with lots of pinned pages
> > (e.g. lots of mlock) then the first 16MB might fill up completely and
> > there is no chance at all to free it because it's pinned
>
> Note that mlock'ed pages are movable. mlock only specifies that pages
> must stay in memory. It does not say that they cannot be moved. So
> page migration could help there.

There are still other cases where that is not true, e.g. long term
pinned IO, or kernel workloads that allocate unfreeable objects.
Not doing any reservation would just seem risky and fragile to me.

Also I'm not sure we want to have the quite large page migration code in small
i386 kernels.

I think what would be a good short term idea would be to remove ZONE_DMA
from the normal zone lists. Possibly even allocate its metadata somewhere
else to give the cache advantages you were looking for.

Then make its size configurable and only
allow allocating from it using some variant of your range allocator
via dma_alloc_*(). Actually in this case it might be better to write
a new specialized small allocator that is more optimized for the "range" task
than buddy. But a brute force version of buddy would
likely work too, at least short term.

Then if that works swiotlb could be converted over to use it too. Once
there is such an allocator there is really no reason it still needs a separate
pool. Ok it is effectively GFP_ATOMIC so perhaps the thresholds for keeping
free memory for atomic purposes would need to be a little more aggressive
on the DMA pool.

In terms of memory consumption it should be similar to now because
the current VM usually keeps ZONE_DMA free too. But with a configurable
size it might make more people happy (both those for which
16MB is too much and for those where 16MB is too little). Although
to be honest the 16MB default seems to work for most people anyway,
so it's not really a very urgently needed change.

But this would require getting rid of all GFP_DMA users first
and converting them over to dma_alloc_*. Unfortunately
there are still quite a few left.

$ grep -r GFP_DMA drivers/* | wc -l
148

Anyone volunteering for auditing and fixing them?

Most drivers are probably relatively easy (except that testing
them might be difficult), but there are cases where it is used
in the block layer for generic allocations and finding everyone who
relies on that might be nasty.

Most of the culprits are likely CONFIG_ISA so they don't matter
for x86-64, but i386 unfortunately still needs to support that.

I suppose it would also be a good opportunity to get rid of some
really old broken drivers, e.g. supporting ISA doesn't need to mean
still keeping drivers that haven't compiled for years.

-Andi

2006-09-24 07:26:13

by Andi Kleen

[permalink] [raw]
Subject: Re: More thoughts on getting rid of ZONE_DMA


> If it's the 16MB DMA window for ia32 we're talking about, wouldn't
> it be easier just to remove it from the fallback lists? (assuming
> you have at least 128MB of memory or something, blah, blah). Saves
> doing migration later.

That is essentially already the case because the mm has special
heuristics to preserve lower zones. Usually those tend to keep the
16MB mostly free unless you really use GFP_DMA.

-Andi