This patch series enhances the existing virtio-balloon with the following
new features:
1) fast ballooning: transfer ballooned pages between the guest and host in
chunks using sgs, instead of one array each time; and
2) free page block reporting: a new virtqueue to report guest free pages
to the host.
The second feature can be used to accelerate live migration of VMs. Here
are some details:
Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.
The second feature enables the optimization of the 1st round memory
transfer - the hypervisor can skip the transfer of guest free pages in the
1st round. It is not concerned that the memory pages are used after they
are given to the hypervisor as a hint of the free pages, because they will
be tracked by the hypervisor and transferred in the next round if they are
used and written.
ChangeLog:
v18->v19:
1) patch 3:
- xb_clear_bit_range and xb_find_next_bit will deal with range [start,
end), where end is changed to be exclusive of the range.
- add overflow checks at the end of xb_clear_bit_range and
xb_find_next_bit
- add overflow related test cases
2) patch 4:
- change back to the previous add_one_sg methond, which is based on the
scatterlist struct
- tell_host_sgs: use "uint64_t len" to avoid overflow
- batch_balloon_page_sg: a simpler function to implement the batching of
sgs
3) patch 6: batch_free_page_sg: batch sgs using the previous scatterlist struct
4) patch 7: add a config field, poison_val, to tell the host about the poison
value
v17->v18:
1) patch 1-2: new to solve some tools related compilation issues
2) patch 3: revert to the original xbitmap implementation from Matthew
Wilcox with some minor changes (e.g. comments added to the exported
functions)
3) patch 4: summarize the changes we want to make to patch 3
4) patch 5: add the developer notes as a reminder for users to avoid
concurrent accesses to the ida bitmap
5) patch 6: a new vring API to allow users to directly pass in a physical
address to a vring desc
6) patch 7: ballooning time is reduced from ~490ms to ~440ms with the new
implementation
- use the new API from patch 6 to send balloon pages
- xb_preload with "GFP_NOWAIT | __GFP_NOWARN" flag;
- handle the case when xb_set_page() fails to avoid memory leak;
- put xb_set_page() under the balloon lock
7) patch 9: simper implementation
- start free page reporting by sending a new cmd id from the host
- guest acks the start or stop via adding a cmd id to the free page vq
- use vb->report_free_page, instead of vb->report_free_page_stop
- use WRITE_ONCE/READ_ONCE to access vb->report_free_page
- use the new API from patch 6 to send free pages to avoid the
unnecessary use of kaddr.
8) patch 10: new patch to solve the page posioning issue reported by
Michael S. Tsirkin
v16->v17:
1) patch 1: please check the commit log there;
2) patch 3: included Michael S. Tsirkin patch to fix the potential
deadlock issue;
3) patch 4: use BUG_ON if virtqueue_add_ returns error, which is
expected never to happen;
4) patch 4: add leak_balloon_sg_oom, which is used in the oom case when
VIRTIO_BALLOON_F_SG is in use;
5) patch 6: use config registers, instead of a vq, as the command channel
between the host and guest;
6) patch 6: add the command sequence id support.
v15->v16:
1) mm: stop reporting the free pfn range if the callback returns false;
2) mm: move some implementaion of walk_free_mem_block into a function to
make the code layout looks better;
3) xbitmap: added some optimizations suggested by Matthew, please refer to
the ChangLog in the xbitmap patch for details.
4) xbitmap: added a test suite
5) virtio-balloon: bail out with a warning when virtqueue_add_inbuf returns
an error
6) virtio-balloon: some small code re-arrangement, e.g. detachinf used buf
from the vq before adding a new buf
v14->v15:
1) mm: make the report callback return a bool value - returning 1 to stop
walking through the free page list.
2) virtio-balloon: batching sgs of balloon pages till the vq is full
3) virtio-balloon: create a new workqueue, rather than using the default
system_wq, to queue the free page reporting work item.
4) virtio-balloon: add a ctrl_vq to be a central control plane which will
handle all the future control related commands between the host and guest.
Add free page report as the first feature controlled under ctrl_vq, and
the free_page_vq is a data plane vq dedicated to the transmission of free
page blocks.
v13->v14:
1) xbitmap: move the code from lib/radix-tree.c to lib/xbitmap.c.
2) xbitmap: consolidate the implementation of xb_bit_set/clear/test into
one xb_bit_ops.
3) xbitmap: add documents for the exported APIs.
4) mm: rewrite the function to walk through free page blocks.
5) virtio-balloon: when reporting a free page blcok to the device, if the
vq is full (less likey to happen in practice), just skip reporting this
block, instead of busywaiting till an entry gets released.
6) virtio-balloon: fail the probe function if adding the signal buf in
init_vqs fails.
v12->v13:
1) mm: use a callback function to handle the the free page blocks from the
report function. This avoids exposing the zone internal to a kernel
module.
2) virtio-balloon: send balloon pages or a free page block using a single
sg each time. This has the benefits of simpler implementation with no new
APIs.
3) virtio-balloon: the free_page_vq is used to report free pages only (no
multiple usages interleaving)
4) virtio-balloon: Balloon pages and free page blocks are sent via input
sgs, and the completion signal to the host is sent via an output sg.
v11->v12:
1) xbitmap: use the xbitmap from Matthew Wilcox to record ballooned pages.
2) virtio-ring: enable the driver to build up a desc chain using vring
desc.
3) virtio-ring: Add locking to the existing START_USE() and END_USE()
macro to lock/unlock the vq when a vq operation starts/ends.
4) virtio-ring: add virtqueue_kick_sync() and virtqueue_kick_async()
5) virtio-balloon: describe chunks of ballooned pages and free pages
blocks directly using one or more chains of desc from the vq.
v10->v11:
1) virtio_balloon: use vring_desc to describe a chunk;
2) virtio_ring: support to add an indirect desc table to virtqueue;
3) virtio_balloon: use cmdq to report guest memory statistics.
v9->v10:
1) mm: put report_unused_page_block() under CONFIG_VIRTIO_BALLOON;
2) virtio-balloon: add virtballoon_validate();
3) virtio-balloon: msg format change;
4) virtio-balloon: move miscq handling to a task on system_freezable_wq;
5) virtio-balloon: code cleanup.
v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.
v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org
Matthew Wilcox (1):
xbitmap: Introduce xbitmap
Wei Wang (6):
xbitmap: potential improvement
xbitmap: add more operations
virtio-balloon: VIRTIO_BALLOON_F_SG
mm: support reporting free page blocks
virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
virtio-balloon: don't report free pages when page poisoning is enabled
drivers/virtio/virtio_balloon.c | 442 +++++++++++++++++++++++++++----
include/linux/mm.h | 6 +
include/linux/radix-tree.h | 2 +
include/linux/xbitmap.h | 55 ++++
include/uapi/linux/virtio_balloon.h | 7 +
lib/Makefile | 2 +-
lib/radix-tree.c | 41 ++-
lib/xbitmap.c | 410 ++++++++++++++++++++++++++++
mm/page_alloc.c | 91 +++++++
tools/include/linux/bitmap.h | 34 +++
tools/include/linux/kernel.h | 2 +
tools/testing/radix-tree/Makefile | 12 +-
tools/testing/radix-tree/linux/xbitmap.h | 1 +
tools/testing/radix-tree/main.c | 4 +
tools/testing/radix-tree/test.h | 1 +
15 files changed, 1056 insertions(+), 54 deletions(-)
create mode 100644 include/linux/xbitmap.h
create mode 100644 lib/xbitmap.c
create mode 100644 tools/testing/radix-tree/linux/xbitmap.h
--
2.7.4
Add a new feature, VIRTIO_BALLOON_F_SG, which enables the transfer of
balloon (i.e. inflated/deflated) pages using scatter-gather lists to the
host.
The implementation of the previous virtio-balloon is not very efficient,
because the balloon pages are transferred to the host by one array each
time. Here is the breakdown of the time in percentage spent on each step
of the balloon inflating process (inflating 7GB of an 8GB idle guest).
1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)
It takes about 4126ms for the inflating process to complete. The above
profiling shows that the bottlenecks are stage 2) and stage 4).
This patch optimizes step 2) by transferring pages to host in sgs. An sg
describes a chunk of guest physically continuous pages. With this
mechanism, step 4) can also be optimized by doing address translation and
madvise() in chunks rather than page by page.
With this new feature, the above ballooning process takes ~490ms resulting
in an improvement of ~88%.
TODO:
- optimize stage 1) by allocating/freeing a chunk of pages instead of a
single page each time.
- sort the internal balloon page queue.
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Liang Li <[email protected]>
Suggested-by: Michael S. Tsirkin <[email protected]>
Cc: Tetsuo Handa <[email protected]>
---
drivers/virtio/virtio_balloon.c | 232 +++++++++++++++++++++++++++++++++---
include/uapi/linux/virtio_balloon.h | 1 +
2 files changed, 215 insertions(+), 18 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a1fb52c..8e840a2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -32,6 +32,8 @@
#include <linux/mm.h>
#include <linux/mount.h>
#include <linux/magic.h>
+#include <linux/xbitmap.h>
+#include <asm/page.h>
/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -79,6 +81,9 @@ struct virtio_balloon {
/* Synchronize access/update to this struct virtio_balloon elements */
struct mutex balloon_lock;
+ /* The xbitmap used to record balloon pages */
+ struct xb page_xb;
+
/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -141,15 +146,127 @@ static void set_page_pfns(struct virtio_balloon *vb,
page_to_balloon_pfn(page) + i);
}
+static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
+{
+ unsigned int len;
+
+ virtqueue_kick(vq);
+ wait_event(wq_head, virtqueue_get_buf(vq, &len));
+}
+
+static void add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
+{
+ struct scatterlist sg;
+ unsigned int unused;
+ int err;
+
+ sg_init_table(&sg, 1);
+ sg_set_page(&sg, pfn_to_page(pfn), len, 0);
+
+ /* Detach all the used buffers from the vq */
+ while (virtqueue_get_buf(vq, &unused))
+ ;
+
+ err = virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
+ /*
+ * This is expected to never fail: there is always at least 1 entry
+ * available on the vq, because when the vq is full the worker thread
+ * that adds the sg will be put into sleep until at least 1 entry is
+ * available to use.
+ */
+ BUG_ON(err);
+}
+
+static void batch_balloon_page_sg(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ unsigned long pfn,
+ uint32_t len)
+{
+ add_one_sg(vq, pfn, len);
+
+ /* Batch till the vq is full */
+ if (!vq->num_free)
+ kick_and_wait(vq, vb->acked);
+}
+
+/*
+ * Send balloon pages in sgs to host. The balloon pages are recorded in the
+ * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
+ * The page xbitmap is searched for continuous "1" bits, which correspond
+ * to continuous pages, to chunk into sgs.
+ *
+ * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
+ * need to be searched.
+ */
+static void tell_host_sgs(struct virtio_balloon *vb,
+ struct virtqueue *vq,
+ unsigned long page_xb_start,
+ unsigned long page_xb_end)
+{
+ unsigned long pfn_start, pfn_end;
+ uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
+ uint64_t len;
+
+ pfn_start = page_xb_start;
+ while (pfn_start < page_xb_end) {
+ pfn_start = xb_find_next_set_bit(&vb->page_xb, pfn_start,
+ page_xb_end + 1);
+ if (pfn_start == page_xb_end + 1)
+ break;
+ pfn_end = xb_find_next_zero_bit(&vb->page_xb,
+ pfn_start,
+ page_xb_end + 1);
+ len = (pfn_end - pfn_start) << PAGE_SHIFT;
+ while (len > max_len) {
+ batch_balloon_page_sg(vb, vq, pfn_start, max_len);
+ pfn_start += max_len >> PAGE_SHIFT;
+ len -= max_len;
+ }
+ batch_balloon_page_sg(vb, vq, pfn_start, (uint32_t)len);
+ pfn_start = pfn_end + 1;
+ }
+
+ /*
+ * The last few sgs may not reach the batch size, but need a kick to
+ * notify the device to handle them.
+ */
+ if (vq->num_free != virtqueue_get_vring_size(vq))
+ kick_and_wait(vq, vb->acked);
+
+ xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
+}
+
+static inline int xb_set_page(struct virtio_balloon *vb,
+ struct page *page,
+ unsigned long *pfn_min,
+ unsigned long *pfn_max)
+{
+ unsigned long pfn = page_to_pfn(page);
+ int ret;
+
+ *pfn_min = min(pfn, *pfn_min);
+ *pfn_max = max(pfn, *pfn_max);
+
+ do {
+ ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
+ GFP_NOWAIT | __GFP_NOWARN);
+ } while (unlikely(ret == -EAGAIN));
+
+ return ret;
+}
+
static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
{
unsigned num_allocated_pages;
unsigned num_pfns;
struct page *page;
LIST_HEAD(pages);
+ bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+ unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
/* We can only do one array worth at a time. */
- num = min(num, ARRAY_SIZE(vb->pfns));
+ if (!use_sg)
+ num = min(num, ARRAY_SIZE(vb->pfns));
for (num_pfns = 0; num_pfns < num;
num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
@@ -173,8 +290,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
while ((page = balloon_page_pop(&pages))) {
balloon_page_enqueue(&vb->vb_dev_info, page);
+ if (use_sg) {
+ if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+ __free_page(page);
+ continue;
+ }
+ } else {
+ set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ }
- set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -184,8 +308,12 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
num_allocated_pages = vb->num_pfns;
/* Did we get any? */
- if (vb->num_pfns != 0)
- tell_host(vb, vb->inflate_vq);
+ if (vb->num_pfns) {
+ if (use_sg)
+ tell_host_sgs(vb, vb->inflate_vq, pfn_min, pfn_max);
+ else
+ tell_host(vb, vb->inflate_vq);
+ }
mutex_unlock(&vb->balloon_lock);
return num_allocated_pages;
@@ -211,9 +339,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
struct page *page;
struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
LIST_HEAD(pages);
+ bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
+ unsigned long pfn_max = 0, pfn_min = ULONG_MAX;
- /* We can only do one array worth at a time. */
- num = min(num, ARRAY_SIZE(vb->pfns));
+ /* Traditionally, we can only do one array worth at a time. */
+ if (!use_sg)
+ num = min(num, ARRAY_SIZE(vb->pfns));
mutex_lock(&vb->balloon_lock);
/* We can't release more pages than taken */
@@ -223,7 +354,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
page = balloon_page_dequeue(vb_dev_info);
if (!page)
break;
- set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ if (use_sg) {
+ if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+ balloon_page_enqueue(&vb->vb_dev_info, page);
+ break;
+ }
+ } else {
+ set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ }
list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
@@ -234,13 +372,55 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
* virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
* is true, we *have* to do it in this order
*/
- if (vb->num_pfns != 0)
- tell_host(vb, vb->deflate_vq);
+ if (vb->num_pfns) {
+ if (use_sg)
+ tell_host_sgs(vb, vb->deflate_vq, pfn_min, pfn_max);
+ else
+ tell_host(vb, vb->deflate_vq);
+ }
release_pages_balloon(vb, &pages);
mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
}
+/*
+ * The regular leak_balloon() with VIRTIO_BALLOON_F_SG needs memory allocation
+ * for xbitmap, which is not suitable for the oom case. This function does not
+ * use xbitmap to chunk pages, so it can be used by oom notifier to deflate
+ * pages when VIRTIO_BALLOON_F_SG is negotiated.
+ */
+static unsigned int leak_balloon_sg_oom(struct virtio_balloon *vb)
+{
+ unsigned int n;
+ struct page *page;
+ struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+ struct virtqueue *vq = vb->deflate_vq;
+ LIST_HEAD(pages);
+
+ mutex_lock(&vb->balloon_lock);
+ for (n = 0; n < oom_pages; n++) {
+ page = balloon_page_dequeue(vb_dev_info);
+ if (!page)
+ break;
+
+ list_add(&page->lru, &pages);
+ vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
+ batch_balloon_page_sg(vb, vb->deflate_vq, page_to_pfn(page),
+ PAGE_SIZE);
+ release_pages_balloon(vb, &pages);
+ }
+
+ /*
+ * The last few sgs may not reach the batch size, but need a kick to
+ * notify the device to handle them.
+ */
+ if (vq->num_free != virtqueue_get_vring_size(vq))
+ kick_and_wait(vq, vb->acked);
+ mutex_unlock(&vb->balloon_lock);
+
+ return n;
+}
+
static inline void update_stat(struct virtio_balloon *vb, int idx,
u16 tag, u64 val)
{
@@ -380,7 +560,10 @@ static int virtballoon_oom_notify(struct notifier_block *self,
return NOTIFY_OK;
freed = parm;
- num_freed_pages = leak_balloon(vb, oom_pages);
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG))
+ num_freed_pages = leak_balloon_sg_oom(vb);
+ else
+ num_freed_pages = leak_balloon(vb, oom_pages);
update_balloon_size(vb);
*freed += num_freed_pages;
@@ -477,6 +660,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
{
struct virtio_balloon *vb = container_of(vb_dev_info,
struct virtio_balloon, vb_dev_info);
+ bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
unsigned long flags;
/*
@@ -498,16 +682,24 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
vb_dev_info->isolated_pages--;
__count_vm_event(BALLOON_MIGRATE);
spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
- vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
- set_page_pfns(vb, vb->pfns, newpage);
- tell_host(vb, vb->inflate_vq);
-
+ if (use_sg) {
+ add_one_sg(vb->inflate_vq, page_to_pfn(newpage), PAGE_SIZE);
+ kick_and_wait(vb->inflate_vq, vb->acked);
+ } else {
+ vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+ set_page_pfns(vb, vb->pfns, newpage);
+ tell_host(vb, vb->inflate_vq);
+ }
/* balloon's page migration 2nd step -- deflate "page" */
balloon_page_delete(page);
- vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
- set_page_pfns(vb, vb->pfns, page);
- tell_host(vb, vb->deflate_vq);
-
+ if (use_sg) {
+ add_one_sg(vb->deflate_vq, page_to_pfn(page), PAGE_SIZE);
+ kick_and_wait(vb->deflate_vq, vb->acked);
+ } else {
+ vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+ set_page_pfns(vb, vb->pfns, page);
+ tell_host(vb, vb->deflate_vq);
+ }
mutex_unlock(&vb->balloon_lock);
put_page(page); /* balloon reference */
@@ -566,6 +758,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_free_vb;
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
+ xb_init(&vb->page_xb);
+
vb->nb.notifier_call = virtballoon_oom_notify;
vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
err = register_oom_notifier(&vb->nb);
@@ -682,6 +877,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+ VIRTIO_BALLOON_F_SG,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..37780a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
--
2.7.4
The guest free pages should not be discarded by the live migration thread
when page poisoning is enabled with PAGE_POISONING_NO_SANITY=n, because
skipping the transfer of such poisoned free pages will trigger false
positive when new pages are allocated and checked on the destination.
This patch adds a config field, poison_val. Guest writes to the config
field to tell the host about the poisoning value. The value will be 0 in
the following cases:
1) PAGE_POISONING_NO_SANITY is enabled;
2) page poisoning is disabled; or
3) PAGE_POISONING_ZERO is enabled.
Signed-off-by: Wei Wang <[email protected]>
Suggested-by: Michael S. Tsirkin <[email protected]>
Cc: Michal Hocko <[email protected]>
---
drivers/virtio/virtio_balloon.c | 8 ++++++++
include/uapi/linux/virtio_balloon.h | 2 ++
2 files changed, 10 insertions(+)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6e4f359..3050c9f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -858,6 +858,7 @@ static struct file_system_type balloon_fs = {
static int virtballoon_probe(struct virtio_device *vdev)
{
struct virtio_balloon *vb;
+ __u32 poison_val;
int err;
if (!vdev->config->get) {
@@ -895,6 +896,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
INIT_WORK(&vb->report_free_page_work, report_free_page);
vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+ if (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+ !page_poisoning_enabled())
+ poison_val = 0;
+ else
+ poison_val = PAGE_POISON;
+ virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
}
vb->nb.notifier_call = virtballoon_oom_notify;
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 58f1274..f270e9e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -48,6 +48,8 @@ struct virtio_balloon_config {
__u32 actual;
/* Free page report command id, readonly by guest */
__u32 free_page_report_cmd_id;
+ /* Stores PAGE_POISON if page poisoning with sanity check is in use */
+ __u32 poison_val;
};
#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
--
2.7.4
Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature indicates the
support of reporting hints of guest free pages to host via virtio-balloon.
Host requests the guest to report free pages by sending a new cmd
id to the guest via the free_page_report_cmd_id configuration register.
When the guest starts to report, the first element added to the free page
vq is the cmd id given by host. When the guest finishes the reporting
of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
to the vq to tell host that the reporting is done. Host may also requests
the guest to stop the reporting in advance by sending the stop cmd id to
the guest via the configuration register.
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Liang Li <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Michal Hocko <[email protected]>
---
drivers/virtio/virtio_balloon.c | 202 ++++++++++++++++++++++++++++++------
include/uapi/linux/virtio_balloon.h | 4 +
2 files changed, 174 insertions(+), 32 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8e840a2..6e4f359 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -55,7 +55,12 @@ static struct vfsmount *balloon_mnt;
struct virtio_balloon {
struct virtio_device *vdev;
- struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+ struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+ /* Balloon's own wq for cpu-intensive work items */
+ struct workqueue_struct *balloon_wq;
+ /* The free page reporting work item submitted to the balloon wq */
+ struct work_struct report_free_page_work;
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -65,6 +70,13 @@ struct virtio_balloon {
spinlock_t stop_update_lock;
bool stop_update;
+ /* Start to report free pages */
+ bool report_free_page;
+ /* Stores the cmd id given by host to start the free page reporting */
+ uint32_t start_cmd_id;
+ /* Stores STOP_ID as a sign to tell host that the reporting is done */
+ uint32_t stop_cmd_id;
+
/* Waiting for host to ack the pages we released. */
wait_queue_head_t acked;
@@ -189,6 +201,28 @@ static void batch_balloon_page_sg(struct virtio_balloon *vb,
kick_and_wait(vq, vb->acked);
}
+static void batch_free_page_sg(struct virtqueue *vq,
+ unsigned long pfn,
+ uint32_t len)
+{
+ add_one_sg(vq, pfn, len);
+
+ /* Batch till the vq is full */
+ if (!vq->num_free)
+ virtqueue_kick(vq);
+}
+
+static void send_cmd_id(struct virtio_balloon *vb, void *addr)
+{
+ struct scatterlist sg;
+ int err;
+
+ sg_init_one(&sg, addr, sizeof(uint32_t));
+ err = virtqueue_add_outbuf(vb->free_page_vq, &sg, 1, vb, GFP_KERNEL);
+ BUG_ON(err);
+ virtqueue_kick(vb->free_page_vq);
+}
+
/*
* Send balloon pages in sgs to host. The balloon pages are recorded in the
* page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
@@ -496,17 +530,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
virtqueue_kick(vq);
}
-static void virtballoon_changed(struct virtio_device *vdev)
-{
- struct virtio_balloon *vb = vdev->priv;
- unsigned long flags;
-
- spin_lock_irqsave(&vb->stop_update_lock, flags);
- if (!vb->stop_update)
- queue_work(system_freezable_wq, &vb->update_balloon_size_work);
- spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
static inline s64 towards_target(struct virtio_balloon *vb)
{
s64 target;
@@ -523,6 +546,36 @@ static inline s64 towards_target(struct virtio_balloon *vb)
return target - vb->num_pages;
}
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+ struct virtio_balloon *vb = vdev->priv;
+ unsigned long flags;
+ __u32 cmd_id;
+ s64 diff = towards_target(vb);
+
+ if (diff) {
+ spin_lock_irqsave(&vb->stop_update_lock, flags);
+ if (!vb->stop_update)
+ queue_work(system_freezable_wq,
+ &vb->update_balloon_size_work);
+ spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+ }
+
+ virtio_cread(vb->vdev, struct virtio_balloon_config,
+ free_page_report_cmd_id, &cmd_id);
+ if (cmd_id == VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
+ WRITE_ONCE(vb->report_free_page, false);
+ } else if (cmd_id != vb->start_cmd_id) {
+ /*
+ * Host requests to start the reporting by sending a new cmd
+ * id.
+ */
+ WRITE_ONCE(vb->report_free_page, true);
+ vb->start_cmd_id = cmd_id;
+ queue_work(vb->balloon_wq, &vb->report_free_page_work);
+ }
+}
+
static void update_balloon_size(struct virtio_balloon *vb)
{
u32 actual = vb->num_pages;
@@ -600,40 +653,116 @@ static void update_balloon_size_func(struct work_struct *work)
static int init_vqs(struct virtio_balloon *vb)
{
- struct virtqueue *vqs[3];
- vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
- static const char * const names[] = { "inflate", "deflate", "stats" };
- int err, nvqs;
+ struct virtqueue **vqs;
+ vq_callback_t **callbacks;
+ const char **names;
+ struct scatterlist sg;
+ int i, nvqs, err = -ENOMEM;
+
+ /* Inflateq and deflateq are used unconditionally */
+ nvqs = 2;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+ nvqs++;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+ nvqs++;
+
+ /* Allocate space for find_vqs parameters */
+ vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+ if (!vqs)
+ goto err_vq;
+ callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+ if (!callbacks)
+ goto err_callback;
+ names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+ if (!names)
+ goto err_names;
+
+ callbacks[0] = balloon_ack;
+ names[0] = "inflate";
+ callbacks[1] = balloon_ack;
+ names[1] = "deflate";
+
+ i = 2;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+ callbacks[i] = stats_request;
+ names[i] = "stats";
+ i++;
+ }
- /*
- * We expect two virtqueues: inflate and deflate, and
- * optionally stat.
- */
- nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
- err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+ callbacks[i] = NULL;
+ names[i] = "free_page_vq";
+ }
+
+ err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names,
+ NULL, NULL);
if (err)
- return err;
+ goto err_find;
vb->inflate_vq = vqs[0];
vb->deflate_vq = vqs[1];
+ i = 2;
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
- struct scatterlist sg;
- unsigned int num_stats;
- vb->stats_vq = vqs[2];
-
+ vb->stats_vq = vqs[i++];
/*
* Prime this virtqueue with one buffer so the hypervisor can
* use it to signal us later (it can't be broken yet!).
*/
- num_stats = update_balloon_stats(vb);
-
- sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
+ sg_init_one(&sg, vb->stats, sizeof(vb->stats));
if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
- < 0)
- BUG();
+ < 0) {
+ dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+ __func__);
+ goto err_find;
+ }
virtqueue_kick(vb->stats_vq);
}
+
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ))
+ vb->free_page_vq = vqs[i];
+
+ kfree(names);
+ kfree(callbacks);
+ kfree(vqs);
return 0;
+
+err_find:
+ kfree(names);
+err_names:
+ kfree(callbacks);
+err_callback:
+ kfree(vqs);
+err_vq:
+ return err;
+}
+
+static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
+ unsigned long nr_pages)
+{
+ struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
+ uint32_t len = nr_pages << PAGE_SHIFT;
+
+ if (!READ_ONCE(vb->report_free_page))
+ return false;
+
+ batch_free_page_sg(vb->free_page_vq, pfn, len);
+
+ return true;
+}
+
+static void report_free_page(struct work_struct *work)
+{
+ struct virtio_balloon *vb;
+
+ vb = container_of(work, struct virtio_balloon, report_free_page_work);
+ /* Start by sending the obtained cmd id to the host with an outbuf */
+ send_cmd_id(vb, &vb->start_cmd_id);
+ walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
+ /*
+ * End by sending the stop id to the host with an outbuf. Use the
+ * non-batching mode here to trigger a kick after adding the stop id.
+ */
+ send_cmd_id(vb, &vb->stop_cmd_id);
}
#ifdef CONFIG_BALLOON_COMPACTION
@@ -761,6 +890,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_SG))
xb_init(&vb->page_xb);
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_VQ)) {
+ vb->balloon_wq = alloc_workqueue("balloon-wq",
+ WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+ INIT_WORK(&vb->report_free_page_work, report_free_page);
+ vb->stop_cmd_id = VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID;
+ }
+
vb->nb.notifier_call = virtballoon_oom_notify;
vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
err = register_oom_notifier(&vb->nb);
@@ -825,6 +961,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
spin_unlock_irq(&vb->stop_update_lock);
cancel_work_sync(&vb->update_balloon_size_work);
cancel_work_sync(&vb->update_balloon_stats_work);
+ cancel_work_sync(&vb->report_free_page_work);
remove_common(vb);
#ifdef CONFIG_BALLOON_COMPACTION
@@ -878,6 +1015,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_STATS_VQ,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_SG,
+ VIRTIO_BALLOON_F_FREE_PAGE_VQ,
};
static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 37780a7..58f1274 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,15 +35,19 @@
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_SG 3 /* Use sg instead of PFN lists */
+#define VIRTIO_BALLOON_F_FREE_PAGE_VQ 4 /* VQ to report free pages */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
+#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID 0
struct virtio_balloon_config {
/* Number of pages host wants Guest to give up. */
__u32 num_pages;
/* Number of pages we've actually got in balloon. */
__u32 actual;
+ /* Free page report command id, readonly by guest */
+ __u32 free_page_report_cmd_id;
};
#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
--
2.7.4
This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.
One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Liang Li <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 6 ++++
mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff..b3077dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
extern void free_initmem(void);
+extern void walk_free_mem_block(void *opaque,
+ int min_order,
+ bool (*report_pfn_range)(void *opaque,
+ unsigned long pfn,
+ unsigned long num));
+
/*
* Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
* into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73f5d45..0de461d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4888,6 +4888,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
show_swap_cache_info();
}
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return false if the callback requests to stop reporting. Otherwise,
+ * return true.
+ */
+static bool walk_free_page_list(void *opaque,
+ struct zone *zone,
+ int order,
+ enum migratetype mt,
+ bool (*report_pfn_range)(void *,
+ unsigned long,
+ unsigned long))
+{
+ struct page *page;
+ struct list_head *list;
+ unsigned long pfn, flags;
+ bool ret;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ list = &zone->free_area[order].free_list[mt];
+ list_for_each_entry(page, list, lru) {
+ pfn = page_to_pfn(page);
+ ret = report_pfn_range(opaque, pfn, 1 << order);
+ if (!ret)
+ break;
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns false, stop iterating the list of free page blocks.
+ * Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ */
+void walk_free_mem_block(void *opaque,
+ int min_order,
+ bool (*report_pfn_range)(void *opaque,
+ unsigned long pfn,
+ unsigned long num))
+{
+ struct zone *zone;
+ int order;
+ enum migratetype mt;
+ bool ret;
+
+ for_each_populated_zone(zone) {
+ for (order = MAX_ORDER - 1; order >= min_order; order--) {
+ for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+ ret = walk_free_page_list(opaque, zone,
+ order, mt,
+ report_pfn_range);
+ if (!ret)
+ return;
+ }
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
{
zoneref->zone = zone;
--
2.7.4
From: Matthew Wilcox <[email protected]>
The eXtensible Bitmap is a sparse bitmap representation which is
efficient for set bits which tend to cluster. It supports up to
'unsigned long' worth of bits, and this commit adds the bare bones --
xb_set_bit(), xb_clear_bit() and xb_test_bit().
Signed-off-by: Wei Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Tetsuo Handa <[email protected]>
---
include/linux/radix-tree.h | 2 +
include/linux/xbitmap.h | 52 +++++++++
lib/Makefile | 2 +-
lib/radix-tree.c | 26 ++++-
lib/xbitmap.c | 179 +++++++++++++++++++++++++++++++
tools/testing/radix-tree/Makefile | 12 ++-
tools/testing/radix-tree/linux/xbitmap.h | 1 +
tools/testing/radix-tree/main.c | 4 +
tools/testing/radix-tree/test.h | 1 +
9 files changed, 275 insertions(+), 4 deletions(-)
create mode 100644 include/linux/xbitmap.h
create mode 100644 lib/xbitmap.c
create mode 100644 tools/testing/radix-tree/linux/xbitmap.h
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 23a9c89..fe44f4b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -315,6 +315,8 @@ void radix_tree_iter_delete(struct radix_tree_root *,
struct radix_tree_iter *iter, void __rcu **slot);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete(struct radix_tree_root *r, struct radix_tree_node *n,
+ void **slot);
void radix_tree_clear_tags(struct radix_tree_root *, struct radix_tree_node *,
void __rcu **slot);
unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 0000000..ed75d87
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,52 @@
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ */
+
+#include <linux/idr.h>
+
+struct xb {
+ struct radix_tree_root xbrt;
+};
+
+#define XB_INIT { \
+ .xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT), \
+}
+#define DEFINE_XB(name) struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+ INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+int xb_clear_bit(struct xb *xb, unsigned long bit);
+
+int xb_zero(struct xb *xb, unsigned long start, unsigned long nbits);
+int xb_fill(struct xb *xb, unsigned long start, unsigned long nbits);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+ return radix_tree_empty(&xb->xbrt);
+}
+
+void xb_preload(gfp_t);
+
+static inline void xb_preload_end(void)
+{
+ preempt_enable();
+}
diff --git a/lib/Makefile b/lib/Makefile
index d11c48e..08a8183 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o timerqueue.o\
- idr.o int_sqrt.o extable.o \
+ idr.o xbitmap.o int_sqrt.o extable.o \
sha1.o chacha20.o irq_regs.o argv_split.o \
flex_proportions.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c8d5556..7000ad6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -78,6 +78,14 @@ static struct kmem_cache *radix_tree_node_cachep;
#define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1)
/*
+ * The XB can go up to unsigned long, but also uses a bitmap.
+ */
+#define XB_INDEX_BITS (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH (DIV_ROUND_UP(XB_INDEX_BITS, \
+ RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE (XB_MAX_PATH * 2 - 1)
+
+/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
@@ -839,6 +847,8 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
offset, 0, 0);
if (!child)
return -ENOMEM;
+ if (is_idr(root))
+ all_tag_set(child, IDR_FREE);
rcu_assign_pointer(*slot, node_to_entry(child));
if (node)
node->count++;
@@ -1982,7 +1992,7 @@ void __radix_tree_delete_node(struct radix_tree_root *root,
delete_node(root, node, update_node);
}
-static bool __radix_tree_delete(struct radix_tree_root *root,
+bool __radix_tree_delete(struct radix_tree_root *root,
struct radix_tree_node *node, void __rcu **slot)
{
void *old = rcu_dereference_raw(*slot);
@@ -2135,6 +2145,20 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
}
EXPORT_SYMBOL(ida_pre_get);
+void xb_preload(gfp_t gfp)
+{
+ __radix_tree_preload(gfp, XB_PRELOAD_SIZE);
+ if (!this_cpu_read(ida_bitmap)) {
+ struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
+
+ if (!bitmap)
+ return;
+ bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
+ kfree(bitmap);
+ }
+}
+EXPORT_SYMBOL(xb_preload);
+
void __rcu **idr_get_free_cmn(struct radix_tree_root *root,
struct radix_tree_iter *iter, gfp_t gfp,
unsigned long max)
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
new file mode 100644
index 0000000..2b547a73
--- /dev/null
+++ b/lib/xbitmap.c
@@ -0,0 +1,179 @@
+#include <linux/export.h>
+#include <linux/xbitmap.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+
+/**
+ * xb_set_bit - set a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * This function is used to set a bit in the xbitmap. If the bitmap that @bit
+ * resides in is not there, the per-cpu ida_bitmap will be taken.
+ *
+ * Returns: 0 on success. %-EAGAIN indicates that @bit was not set.
+ */
+int xb_set_bit(struct xb *xb, unsigned long bit)
+{
+ int err;
+ unsigned long index = bit / IDA_BITMAP_BITS;
+ struct radix_tree_root *root = &xb->xbrt;
+ struct radix_tree_node *node;
+ void **slot;
+ struct ida_bitmap *bitmap;
+ unsigned long ebit;
+
+ bit %= IDA_BITMAP_BITS;
+ ebit = bit + 2;
+
+ err = __radix_tree_create(root, index, 0, &node, &slot);
+ if (err)
+ return err;
+ bitmap = rcu_dereference_raw(*slot);
+ if (radix_tree_exception(bitmap)) {
+ unsigned long tmp = (unsigned long)bitmap;
+
+ if (ebit < BITS_PER_LONG) {
+ tmp |= 1UL << ebit;
+ rcu_assign_pointer(*slot, (void *)tmp);
+ return 0;
+ }
+ bitmap = this_cpu_xchg(ida_bitmap, NULL);
+ if (!bitmap)
+ return -EAGAIN;
+ memset(bitmap, 0, sizeof(*bitmap));
+ bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+ rcu_assign_pointer(*slot, bitmap);
+ }
+
+ if (!bitmap) {
+ if (ebit < BITS_PER_LONG) {
+ bitmap = (void *)((1UL << ebit) |
+ RADIX_TREE_EXCEPTIONAL_ENTRY);
+ __radix_tree_replace(root, node, slot, bitmap, NULL);
+ return 0;
+ }
+ bitmap = this_cpu_xchg(ida_bitmap, NULL);
+ if (!bitmap)
+ return -EAGAIN;
+ memset(bitmap, 0, sizeof(*bitmap));
+ __radix_tree_replace(root, node, slot, bitmap, NULL);
+ }
+
+ __set_bit(bit, bitmap->bitmap);
+ return 0;
+}
+EXPORT_SYMBOL(xb_set_bit);
+
+/**
+ * xb_clear_bit - clear a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to clear
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+int xb_clear_bit(struct xb *xb, unsigned long bit)
+{
+ unsigned long index = bit / IDA_BITMAP_BITS;
+ struct radix_tree_root *root = &xb->xbrt;
+ struct radix_tree_node *node;
+ void **slot;
+ struct ida_bitmap *bitmap;
+ unsigned long ebit;
+
+ bit %= IDA_BITMAP_BITS;
+ ebit = bit + 2;
+
+ bitmap = __radix_tree_lookup(root, index, &node, &slot);
+ if (radix_tree_exception(bitmap)) {
+ unsigned long tmp = (unsigned long)bitmap;
+
+ if (ebit >= BITS_PER_LONG)
+ return 0;
+ tmp &= ~(1UL << ebit);
+ if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+ __radix_tree_delete(root, node, slot);
+ else
+ rcu_assign_pointer(*slot, (void *)tmp);
+ return 0;
+ }
+
+ if (!bitmap)
+ return 0;
+
+ __clear_bit(bit, bitmap->bitmap);
+ if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+ kfree(bitmap);
+ __radix_tree_delete(root, node, slot);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(xb_clear_bit);
+
+/**
+ * xb_test_bit - test a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to test
+ *
+ * This function is used to test a bit in the xbitmap.
+ *
+ * Returns: true if the bit is set, or false otherwise.
+ */
+bool xb_test_bit(const struct xb *xb, unsigned long bit)
+{
+ unsigned long index = bit / IDA_BITMAP_BITS;
+ const struct radix_tree_root *root = &xb->xbrt;
+ struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
+
+ bit %= IDA_BITMAP_BITS;
+
+ if (!bitmap)
+ return false;
+ if (radix_tree_exception(bitmap)) {
+ bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
+ if (bit > BITS_PER_LONG)
+ return false;
+ return (unsigned long)bitmap & (1UL << bit);
+ }
+ return test_bit(bit, bitmap->bitmap);
+}
+EXPORT_SYMBOL(xb_test_bit);
+
+#ifndef __KERNEL__
+
+static DEFINE_XB(xb1);
+
+void xbitmap_check_bit(unsigned long bit)
+{
+ xb_preload(GFP_KERNEL);
+ assert(!xb_test_bit(&xb1, bit));
+ assert(xb_set_bit(&xb1, bit) == 0);
+ assert(xb_test_bit(&xb1, bit));
+ assert(xb_clear_bit(&xb1, bit) == 0);
+ assert(xb_empty(&xb1));
+ assert(xb_clear_bit(&xb1, bit) == 0);
+ assert(xb_empty(&xb1));
+ xb_preload_end();
+}
+
+void xbitmap_checks(void)
+{
+ xb_init(&xb1);
+ xbitmap_check_bit(0);
+ xbitmap_check_bit(30);
+ xbitmap_check_bit(31);
+ xbitmap_check_bit(1023);
+ xbitmap_check_bit(1024);
+ xbitmap_check_bit(1025);
+ xbitmap_check_bit((1UL << 63) | (1UL << 24));
+ xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+}
+
+int __weak main(void)
+{
+ radix_tree_init();
+ xbitmap_checks();
+}
+#endif
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee36..34ece78 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -6,7 +6,8 @@ LDLIBS+= -lpthread -lurcu
TARGETS = main idr-test multiorder
CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
- tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+ tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+ xbitmap.o
ifndef SHIFT
SHIFT=3
@@ -25,8 +26,11 @@ idr-test: idr-test.o $(CORE_OFILES)
multiorder: multiorder.o $(CORE_OFILES)
+xbitmap: xbitmap.o $(CORE_OFILES)
+ $(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
+
clean:
- $(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h
+ $(RM) $(TARGETS) *.o radix-tree.c idr.c xbitmap.c generated/map-shift.h
vpath %.c ../../lib
@@ -34,6 +38,7 @@ $(OFILES): Makefile *.h */*.h generated/map-shift.h \
../../include/linux/*.h \
../../include/asm/*.h \
../../../include/linux/radix-tree.h \
+ ../../../include/linux/xbitmap.h \
../../../include/linux/idr.h
radix-tree.c: ../../../lib/radix-tree.c
@@ -42,6 +47,9 @@ radix-tree.c: ../../../lib/radix-tree.c
idr.c: ../../../lib/idr.c
sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
+xbitmap.c: ../../../lib/xbitmap.c
+ sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
+
.PHONY: mapshift
mapshift:
diff --git a/tools/testing/radix-tree/linux/xbitmap.h b/tools/testing/radix-tree/linux/xbitmap.h
new file mode 100644
index 0000000..61de214
--- /dev/null
+++ b/tools/testing/radix-tree/linux/xbitmap.h
@@ -0,0 +1 @@
+#include "../../../../include/linux/xbitmap.h"
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index 257f3f8..d112363 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -326,6 +326,10 @@ static void single_thread_tests(bool long_run)
rcu_barrier();
printv(2, "after idr_checks: %d allocated, preempt %d\n",
nr_allocated, preempt_count);
+ xbitmap_checks();
+ rcu_barrier();
+ printv(2, "after xbitmap_checks: %d allocated, preempt %d\n",
+ nr_allocated, preempt_count);
big_gang_check(long_run);
rcu_barrier();
printv(2, "after big_gang_check: %d allocated, preempt %d\n",
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index d9c031d..8175d6b 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -38,6 +38,7 @@ void benchmark(void);
void idr_checks(void);
void ida_checks(void);
void ida_thread_tests(void);
+void xbitmap_checks(void);
struct item *
item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);
--
2.7.4
This patch adds support to find next 1 or 0 bit in a xbmitmap range and
clear a range of bits.
More possible optimizations to add in the future:
1) xb_set_bit_range: set a range of bits.
2) when searching a bit, if the bit is not found in the slot, move on to
the next slot directly.
3) add tags to help searching.
Signed-off-by: Wei Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
---
include/linux/xbitmap.h | 8 +-
lib/xbitmap.c | 229 +++++++++++++++++++++++++++++++++++++++++++
tools/include/linux/bitmap.h | 34 +++++++
tools/include/linux/kernel.h | 2 +
4 files changed, 272 insertions(+), 1 deletion(-)
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index b4d8375..eddf0d5e 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -33,8 +33,14 @@ static inline void xb_init(struct xb *xb)
}
int xb_set_bit(struct xb *xb, unsigned long bit);
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
bool xb_test_bit(const struct xb *xb, unsigned long bit);
-int xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+ unsigned long end);
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+ unsigned long end);
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
static inline bool xb_empty(const struct xb *xb)
{
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 182aa29..10df879 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -3,6 +3,13 @@
#include <linux/bitmap.h>
#include <linux/slab.h>
+/*
+ * Developer notes: locks are required to gurantee there is no concurrent
+ * calls of xb_set_bit, xb_clear_bit, xb_clear_bit_range, xb_test_bit,
+ * xb_find_next_set_bit, or xb_find_next_clear_bit to operate on the same
+ * ida bitamp.
+ */
+
/**
* xb_set_bit - set a bit in the xbitmap
* @xb: the xbitmap tree used to record the bit
@@ -70,6 +77,28 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
EXPORT_SYMBOL(xb_set_bit);
/**
+ * xb_preload_and_set_bit - preload the memory and set a bit in the xbitmap
+ * @xb: the xbitmap tree used to record the bit
+ * @bit: index of the bit to set
+ *
+ * A wrapper of the xb_preload() and xb_set_bit().
+ * Returns: 0 on success; -EAGAIN or -ENOMEM on error.
+ */
+int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp)
+{
+ int ret = 0;
+
+ if (!xb_preload(gfp))
+ return -ENOMEM;
+
+ ret = xb_set_bit(xb, bit);
+ xb_preload_end();
+
+ return ret;
+}
+EXPORT_SYMBOL(xb_preload_and_set_bit);
+
+/**
* xb_clear_bit - clear a bit in the xbitmap
* @xb: the xbitmap tree used to record the bit
* @bit: index of the bit to clear
@@ -115,6 +144,63 @@ void xb_clear_bit(struct xb *xb, unsigned long bit)
EXPORT_SYMBOL(xb_clear_bit);
/**
+ * xb_clear_bit_range - clear a range of bits in the xbitmap
+ * @start: the start of the bit range, inclusive
+ * @end: the end of the bit range, exclusive
+ *
+ * This function is used to clear a bit in the xbitmap. If all the bits of the
+ * bitmap are 0, the bitmap will be freed.
+ */
+void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
+{
+ struct radix_tree_root *root = &xb->xbrt;
+ struct radix_tree_node *node;
+ void **slot;
+ struct ida_bitmap *bitmap;
+ unsigned int nbits;
+
+ for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+ unsigned long index = start / IDA_BITMAP_BITS;
+ unsigned long bit = start % IDA_BITMAP_BITS;
+
+ bitmap = __radix_tree_lookup(root, index, &node, &slot);
+ if (radix_tree_exception(bitmap)) {
+ unsigned long ebit = bit + 2;
+ unsigned long tmp = (unsigned long)bitmap;
+
+ nbits = min(end - start + 1, BITS_PER_LONG - ebit);
+
+ if (ebit >= BITS_PER_LONG)
+ continue;
+ bitmap_clear(&tmp, ebit, nbits);
+ if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
+ __radix_tree_delete(root, node, slot);
+ else
+ rcu_assign_pointer(*slot, (void *)tmp);
+ } else if (bitmap) {
+ nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
+
+ if (nbits != IDA_BITMAP_BITS)
+ bitmap_clear(bitmap->bitmap, bit, nbits);
+
+ if (nbits == IDA_BITMAP_BITS ||
+ bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+ kfree(bitmap);
+ __radix_tree_delete(root, node, slot);
+ }
+ }
+
+ /*
+ * Already reached the last usable ida bitmap, so just return,
+ * otherwise overflow will happen.
+ */
+ if (index == ULONG_MAX / IDA_BITMAP_BITS)
+ break;
+ }
+}
+EXPORT_SYMBOL(xb_clear_bit_range);
+
+/**
* xb_test_bit - test a bit in the xbitmap
* @xb: the xbitmap tree used to record the bit
* @bit: index of the bit to test
@@ -143,6 +229,87 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
}
EXPORT_SYMBOL(xb_test_bit);
+static unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
+ unsigned long end, bool set)
+{
+ struct radix_tree_root *root = &xb->xbrt;
+ struct radix_tree_node *node;
+ void **slot;
+ struct ida_bitmap *bmap;
+ unsigned long ret = end;
+
+ for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
+ unsigned long index = start / IDA_BITMAP_BITS;
+ unsigned long bit = start % IDA_BITMAP_BITS;
+
+ bmap = __radix_tree_lookup(root, index, &node, &slot);
+ if (radix_tree_exception(bmap)) {
+ unsigned long tmp = (unsigned long)bmap;
+ unsigned long ebit = bit + 2;
+
+ if (ebit >= BITS_PER_LONG)
+ continue;
+ if (set)
+ ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+ else
+ ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+ ebit);
+ if (ret < BITS_PER_LONG)
+ return ret - 2 + IDA_BITMAP_BITS * index;
+ } else if (bmap) {
+ if (set)
+ ret = find_next_bit(bmap->bitmap,
+ IDA_BITMAP_BITS, bit);
+ else
+ ret = find_next_zero_bit(bmap->bitmap,
+ IDA_BITMAP_BITS, bit);
+ if (ret < IDA_BITMAP_BITS)
+ return ret + index * IDA_BITMAP_BITS;
+ } else if (!bmap && !set) {
+ return start;
+ }
+
+ /*
+ * Already reached the last searchable ida bitmap. Return
+ * ULONG_MAX, otherwise overflow will happen.
+ */
+ if (index == ULONG_MAX / IDA_BITMAP_BITS)
+ return ULONG_MAX;
+ }
+
+ return ret;
+}
+
+/**
+ * xb_find_next_set_bit - find the next set bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, exclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
+ unsigned long end)
+{
+ return xb_find_next_bit(xb, start, end, 1);
+}
+EXPORT_SYMBOL(xb_find_next_set_bit);
+
+/**
+ * xb_find_next_zero_bit - find the next zero bit in a range
+ * @xb: the xbitmap to search
+ * @start: the start of the range, inclusive
+ * @end: the end of the range, exclusive
+ *
+ * Returns: the index of the found bit, or @end + 1 if no such bit is found.
+ */
+unsigned long xb_find_next_zero_bit(struct xb *xb, unsigned long start,
+ unsigned long end)
+{
+ return xb_find_next_bit(xb, start, end, 0);
+}
+EXPORT_SYMBOL(xb_find_next_zero_bit);
+
#ifndef __KERNEL__
static DEFINE_XB(xb1);
@@ -160,6 +327,66 @@ void xbitmap_check_bit(unsigned long bit)
xb_preload_end();
}
+static void xbitmap_check_bit_range(void)
+{
+ /*
+ * Regular tests
+ * ebit tests: set 1030, 1031, 1034, 1035
+ * Next 1 in [0, 10000) --> 1030
+ * Next 1 in [1030, 1034) --> 1030
+ * Next 1 in [1032, 1034) --> none (1034)
+ * Next 0 in [1030, 1032) --> none (1032)
+ * Next 0 in [1030, 1033) --> 1032
+ *
+ * ida bitmap tests: set 8260, 8261, 8264, 8265
+ * Next 1 in [2000, 10000) --> 8260
+ * Next 1 in [8260, 8264) --> 8260
+ * Next 1 in [8262, 8264) --> none (8264)
+ * Next 0 in [8260, 8262) --> none (8262)
+ * Next 0 in [8260, 8263) --> 8262
+ */
+ assert(!xb_preload_and_set_bit(&xb1, 1030, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 1031, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 1034, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 1035, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 8260, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 8261, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 8264, GFP_KERNEL));
+ assert(!xb_preload_and_set_bit(&xb1, 8265, GFP_KERNEL));
+
+ assert(xb_find_next_set_bit(&xb1, 0, 10000) == 1030);
+ assert(xb_find_next_set_bit(&xb1, 1030, 1034) == 1030);
+ assert(xb_find_next_set_bit(&xb1, 1032, 1034) == 1034);
+ assert(xb_find_next_zero_bit(&xb1, 1030, 1032) == 1032);
+ assert(xb_find_next_zero_bit(&xb1, 1030, 1033) == 1032);
+
+ assert(xb_find_next_set_bit(&xb1, 2000, 10000) == 8260);
+ assert(xb_find_next_set_bit(&xb1, 8260, 8264) == 8260);
+ assert(xb_find_next_set_bit(&xb1, 8262, 8264) == 8264);
+ assert(xb_find_next_zero_bit(&xb1, 8260, 8262) == 8262);
+ assert(xb_find_next_zero_bit(&xb1, 8260, 8263) == 8262);
+
+ xb_clear_bit_range(&xb1, 0, 10000);
+ assert(xb_find_next_set_bit(&xb1, 0, 10000) == 10000);
+
+ /*
+ * Overflow tests:
+ * Next 1 in [0, ULONG_MAX) --> none (ULONG_MAX)
+ * Set ULONG_MAX - 4
+ * Next 1 in [0, ULONG_MAX) --> ULONG_MAX - 4
+ * Next 1 in [ULONG_MAX - 3, ULONG_MAX) --> none (ULONG_MAX)
+ * Next 0 in [ULONG_MAX - 4, ULONG_MAX) --> ULONG_MAX - 3
+ */
+ assert(!xb_preload_and_set_bit(&xb1, ULONG_MAX - 4, GFP_KERNEL));
+ assert(xb_find_next_set_bit(&xb1, ULONG_MAX - 10, ULONG_MAX) ==
+ ULONG_MAX - 4);
+ assert(xb_find_next_set_bit(&xb1, ULONG_MAX - 3, ULONG_MAX) ==
+ ULONG_MAX);
+ assert(xb_find_next_zero_bit(&xb1, ULONG_MAX - 4, ULONG_MAX) ==
+ ULONG_MAX - 3);
+ xb_clear_bit_range(&xb1, ULONG_MAX - 10, ULONG_MAX);
+}
+
void xbitmap_checks(void)
{
xb_init(&xb1);
@@ -171,6 +398,8 @@ void xbitmap_checks(void)
xbitmap_check_bit(1025);
xbitmap_check_bit((1UL << 63) | (1UL << 24));
xbitmap_check_bit((1UL << 63) | (1UL << 24) | 70);
+
+ xbitmap_check_bit_range();
}
int __weak main(void)
diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index ca16027..8d0bc1b 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -37,6 +37,40 @@ static inline void bitmap_zero(unsigned long *dst, int nbits)
}
}
+static inline void __bitmap_clear(unsigned long *map, unsigned int start,
+ int len)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const unsigned int size = start + len;
+ int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+ while (len - bits_to_clear >= 0) {
+ *p &= ~mask_to_clear;
+ len -= bits_to_clear;
+ bits_to_clear = BITS_PER_LONG;
+ mask_to_clear = ~0UL;
+ p++;
+ }
+ if (len) {
+ mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+ *p &= ~mask_to_clear;
+ }
+}
+
+static inline __always_inline void bitmap_clear(unsigned long *map,
+ unsigned int start,
+ unsigned int nbits)
+{
+ if (__builtin_constant_p(nbits) && nbits == 1)
+ __clear_bit(start, map);
+ else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+ __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+ memset((char *)map + start / 8, 0, nbits / 8);
+ else
+ __bitmap_clear(map, start, nbits);
+}
+
static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
{
unsigned int nlongs = BITS_TO_LONGS(nbits);
diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 0ad8844..3c992ae 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -13,6 +13,8 @@
#define UINT_MAX (~0U)
#endif
+#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
+
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
#define PERF_ALIGN(x, a) __PERF_ALIGN_MASK(x, (typeof(x))(a)-1)
--
2.7.4
This patch made some changes to the original xbitmap implementation from
the linux-dax tree:
- remove xb_fill() and xb_zero() from xbitmap.h since they are not
implemented;
- xb_test_bit: changed "ebit > BITS_PER_LONG" to "ebit >= BITS_PER_LONG",
because bit 64 beyonds the "unsigned long" exceptional entry (0 to 63);
- xb_set_bit: delete the new inserted radix_tree_node when failing to
get the per cpu ida bitmap, this avoids the kind of memory leak of the
unused radix tree node left in the tree.
- xb_clear_bit: change it to be a void function, since the original
implementation reurns nothing than a 0.
- remove the comment above "#define XB_INDEX_BITS", because it causes
confusion based on the feedbacks from the previous discussion;
- xb_preload: with the original implementation, the CPU that successfully
do __radix_tree_preload() may get into sleep by kmalloc(), which has a
risk of getting the caller of xb_preload() scheduled to another CPU
after waken up, and the new CPU may not have radix_tree_node
pre-allocated there, this will be a problem when inserting a node to
the tree later. This patch moves __radix_tree_preload() after kmalloc()
and returns a boolean to indicate the success or failure. Also, add the
__must_check annotation to xb_preload for prudence purpose.
Signed-off-by: Wei Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Tetsuo Handa <[email protected]>
---
include/linux/xbitmap.h | 5 +----
lib/radix-tree.c | 27 +++++++++++++++++++++------
lib/xbitmap.c | 24 +++++++++++++-----------
3 files changed, 35 insertions(+), 21 deletions(-)
diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index ed75d87..b4d8375 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -36,15 +36,12 @@ int xb_set_bit(struct xb *xb, unsigned long bit);
bool xb_test_bit(const struct xb *xb, unsigned long bit);
int xb_clear_bit(struct xb *xb, unsigned long bit);
-int xb_zero(struct xb *xb, unsigned long start, unsigned long nbits);
-int xb_fill(struct xb *xb, unsigned long start, unsigned long nbits);
-
static inline bool xb_empty(const struct xb *xb)
{
return radix_tree_empty(&xb->xbrt);
}
-void xb_preload(gfp_t);
+bool xb_preload(gfp_t);
static inline void xb_preload_end(void)
{
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7000ad6..a039588 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -77,9 +77,6 @@ static struct kmem_cache *radix_tree_node_cachep;
RADIX_TREE_MAP_SHIFT))
#define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1)
-/*
- * The XB can go up to unsigned long, but also uses a bitmap.
- */
#define XB_INDEX_BITS (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
#define XB_MAX_PATH (DIV_ROUND_UP(XB_INDEX_BITS, \
RADIX_TREE_MAP_SHIFT))
@@ -2145,17 +2142,35 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
}
EXPORT_SYMBOL(ida_pre_get);
-void xb_preload(gfp_t gfp)
+/**
+ * xb_preload - preload for xb_set_bit()
+ * @gfp_mask: allocation mask to use for preloading
+ *
+ * Preallocate memory to use for the next call to xb_set_bit(). On success,
+ * return true, with preemption disabled. On error, return false with
+ * preemption not disabled.
+ */
+__must_check bool xb_preload(gfp_t gfp)
{
- __radix_tree_preload(gfp, XB_PRELOAD_SIZE);
if (!this_cpu_read(ida_bitmap)) {
struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
if (!bitmap)
- return;
+ return false;
+ /*
+ * The per-CPU variable is updated with preemption enabled.
+ * If the calling task is unlucky to be scheduled to another
+ * CPU which has no ida_bitmap allocation, it will be detected
+ * when setting a bit (i.e. __xb_set_bit()).
+ */
bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
kfree(bitmap);
}
+
+ if (__radix_tree_preload(gfp, XB_PRELOAD_SIZE) < 0)
+ return false;
+
+ return true;
}
EXPORT_SYMBOL(xb_preload);
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 2b547a73..182aa29 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -39,8 +39,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
return 0;
}
bitmap = this_cpu_xchg(ida_bitmap, NULL);
- if (!bitmap)
+ if (!bitmap) {
+ __radix_tree_delete(root, node, slot);
return -EAGAIN;
+ }
memset(bitmap, 0, sizeof(*bitmap));
bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
rcu_assign_pointer(*slot, bitmap);
@@ -54,8 +56,10 @@ int xb_set_bit(struct xb *xb, unsigned long bit)
return 0;
}
bitmap = this_cpu_xchg(ida_bitmap, NULL);
- if (!bitmap)
+ if (!bitmap) {
+ __radix_tree_delete(root, node, slot);
return -EAGAIN;
+ }
memset(bitmap, 0, sizeof(*bitmap));
__radix_tree_replace(root, node, slot, bitmap, NULL);
}
@@ -73,7 +77,7 @@ EXPORT_SYMBOL(xb_set_bit);
* This function is used to clear a bit in the xbitmap. If all the bits of the
* bitmap are 0, the bitmap will be freed.
*/
-int xb_clear_bit(struct xb *xb, unsigned long bit)
+void xb_clear_bit(struct xb *xb, unsigned long bit)
{
unsigned long index = bit / IDA_BITMAP_BITS;
struct radix_tree_root *root = &xb->xbrt;
@@ -90,25 +94,23 @@ int xb_clear_bit(struct xb *xb, unsigned long bit)
unsigned long tmp = (unsigned long)bitmap;
if (ebit >= BITS_PER_LONG)
- return 0;
+ return;
tmp &= ~(1UL << ebit);
if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
__radix_tree_delete(root, node, slot);
else
rcu_assign_pointer(*slot, (void *)tmp);
- return 0;
+ return;
}
if (!bitmap)
- return 0;
+ return;
__clear_bit(bit, bitmap->bitmap);
if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
kfree(bitmap);
__radix_tree_delete(root, node, slot);
}
-
- return 0;
}
EXPORT_SYMBOL(xb_clear_bit);
@@ -133,7 +135,7 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit)
return false;
if (radix_tree_exception(bitmap)) {
bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
- if (bit > BITS_PER_LONG)
+ if (bit >= BITS_PER_LONG)
return false;
return (unsigned long)bitmap & (1UL << bit);
}
@@ -151,9 +153,9 @@ void xbitmap_check_bit(unsigned long bit)
assert(!xb_test_bit(&xb1, bit));
assert(xb_set_bit(&xb1, bit) == 0);
assert(xb_test_bit(&xb1, bit));
- assert(xb_clear_bit(&xb1, bit) == 0);
+ xb_clear_bit(&xb1, bit);
assert(xb_empty(&xb1));
- assert(xb_clear_bit(&xb1, bit) == 0);
+ xb_clear_bit(&xb1, bit);
assert(xb_empty(&xb1));
xb_preload_end();
}
--
2.7.4
Wei Wang wrote:
> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
> +{
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void **slot;
> + struct ida_bitmap *bitmap;
> + unsigned int nbits;
> +
> + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> + unsigned long index = start / IDA_BITMAP_BITS;
> + unsigned long bit = start % IDA_BITMAP_BITS;
> +
> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
> + if (radix_tree_exception(bitmap)) {
> + unsigned long ebit = bit + 2;
> + unsigned long tmp = (unsigned long)bitmap;
> +
> + nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> +
> + if (ebit >= BITS_PER_LONG)
What happens if we hit this "continue;" when "index == ULONG_MAX / IDA_BITMAP_BITS" ?
Can you eliminate exception path and fold all xbitmap patches into one, and
post only one xbitmap patch without virtio-baloon changes? If exception path
is valuable, you can add exception path after minimum version is merged.
This series is too difficult for me to close corner cases.
> + continue;
> + bitmap_clear(&tmp, ebit, nbits);
> + if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
> + __radix_tree_delete(root, node, slot);
> + else
> + rcu_assign_pointer(*slot, (void *)tmp);
> + } else if (bitmap) {
> + nbits = min(end - start + 1, IDA_BITMAP_BITS - bit);
> +
> + if (nbits != IDA_BITMAP_BITS)
> + bitmap_clear(bitmap->bitmap, bit, nbits);
> +
> + if (nbits == IDA_BITMAP_BITS ||
> + bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
> + kfree(bitmap);
> + __radix_tree_delete(root, node, slot);
> + }
> + }
> +
> + /*
> + * Already reached the last usable ida bitmap, so just return,
> + * otherwise overflow will happen.
> + */
> + if (index == ULONG_MAX / IDA_BITMAP_BITS)
> + break;
> + }
> +}
> +/**
> + * xb_find_next_set_bit - find the next set bit in a range
> + * @xb: the xbitmap to search
> + * @start: the start of the range, inclusive
> + * @end: the end of the range, exclusive
> + *
> + * Returns: the index of the found bit, or @end + 1 if no such bit is found.
> + */
> +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
> + unsigned long end)
> +{
> + return xb_find_next_bit(xb, start, end, 1);
> +}
Won't "exclusive" loose ability to handle ULONG_MAX ? Since this is a
library module, missing ability to handle ULONG_MAX sounds like an omission.
Shouldn't we pass (or return) whether "found or not" flag (e.g. strtoul() in
C library function)?
bool xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, unsigned long *result);
unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, bool *found);
Matthew, Wei,
On Tue, Dec 12, 2017 at 12:55 PM, Wei Wang <[email protected]> wrote:
> From: Matthew Wilcox <[email protected]>
>
> The eXtensible Bitmap is a sparse bitmap representation which is
> efficient for set bits which tend to cluster. It supports up to
> 'unsigned long' worth of bits, and this commit adds the bare bones --
> xb_set_bit(), xb_clear_bit() and xb_test_bit().
>
> Signed-off-by: Wei Wang <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
[...]
> --- /dev/null
> +++ b/include/linux/xbitmap.h
> @@ -0,0 +1,52 @@
> +/*
> + * eXtensible Bitmaps
> + * Copyright (c) 2017 Microsoft Corporation <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of the
> + * License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
> + * All bits are initially zero.
> + */
Have you considered using the new SPDX ids here instead? eg. this
would come out as a top line this way:
> +/* SPDX-License-Identifer: GPL-2.0+ */
Overall you get less boilerplate comment and more code, so this is a
win-win for everyone. This would nicely remove the legalese
boilerplate with the same effect, unless you are a legalese lover of
course. See Thomas doc patches for extra details.... and while you are
it if you could spread the words in your team and use it for all past,
present and future contributions, that would be much appreciated.
And as a side benefit to me, it will help me save on paper and ink
ribbons supplies for my dot matrix line printer each time I print the
source code of the whole kernel. ;)
Thanks for your consideration there.
--
Cordially
Philippe Ombredanne
On 12/12/2017 09:20 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
>> +{
>> + struct radix_tree_root *root = &xb->xbrt;
>> + struct radix_tree_node *node;
>> + void **slot;
>> + struct ida_bitmap *bitmap;
>> + unsigned int nbits;
>> +
>> + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>> + unsigned long index = start / IDA_BITMAP_BITS;
>> + unsigned long bit = start % IDA_BITMAP_BITS;
>> +
>> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
>> + if (radix_tree_exception(bitmap)) {
>> + unsigned long ebit = bit + 2;
>> + unsigned long tmp = (unsigned long)bitmap;
>> +
>> + nbits = min(end - start + 1, BITS_PER_LONG - ebit);
>> +
>> + if (ebit >= BITS_PER_LONG)
> What happens if we hit this "continue;" when "index == ULONG_MAX / IDA_BITMAP_BITS" ?
Thanks. I also improved the test case for this. I plan to change the
implementation a little bit to avoid such overflow (has passed the test
case that I have, just post out for another set of eyes):
{
...
unsigned long idx = start / IDA_BITMAP_BITS;
unsigned long bit = start % IDA_BITMAP_BITS;
unsigned long idx_end = end / IDA_BITMAP_BITS;
unsigned long ret;
for (idx = start / IDA_BITMAP_BITS; idx <= idx_end; idx++) {
unsigned long ida_start = idx * IDA_BITMAP_BITS;
bitmap = __radix_tree_lookup(root, idx, &node, &slot);
if (radix_tree_exception(bitmap)) {
unsigned long tmp = (unsigned long)bitmap;
unsigned long ebit = bit + 2;
if (ebit >= BITS_PER_LONG)
continue;
if (set)
ret = find_next_bit(&tmp,
BITS_PER_LONG, ebit);
else
ret = find_next_zero_bit(&tmp,
BITS_PER_LONG,
ebit);
if (ret < BITS_PER_LONG)
return ret - 2 + ida_start;
} else if (bitmap) {
if (set)
ret = find_next_bit(bitmap->bitmap,
IDA_BITMAP_BITS, bit);
else
ret = find_next_zero_bit(bitmap->bitmap,
IDA_BITMAP_BITS, bit);
if (ret < IDA_BITMAP_BITS)
return ret + ida_start;
} else if (!bitmap && !set) {
return bit + IDA_BITMAP_BITS * idx;
}
bit = 0;
}
return end;
}
>
> Can you eliminate exception path and fold all xbitmap patches into one, and
> post only one xbitmap patch without virtio-baloon changes? If exception path
> is valuable, you can add exception path after minimum version is merged.
> This series is too difficult for me to close corner cases.
That exception path is claimed to save memory, and I don't have a strong
reason to remove that part.
Matthew, could we get your feedback on this?
>
>> +/**
>> + * xb_find_next_set_bit - find the next set bit in a range
>> + * @xb: the xbitmap to search
>> + * @start: the start of the range, inclusive
>> + * @end: the end of the range, exclusive
>> + *
>> + * Returns: the index of the found bit, or @end + 1 if no such bit is found.
>> + */
>> +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
>> + unsigned long end)
>> +{
>> + return xb_find_next_bit(xb, start, end, 1);
>> +}
> Won't "exclusive" loose ability to handle ULONG_MAX ? Since this is a
> library module, missing ability to handle ULONG_MAX sounds like an omission.
> Shouldn't we pass (or return) whether "found or not" flag (e.g. strtoul() in
> C library function)?
>
> bool xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, unsigned long *result);
> unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, bool *found);
Yes, ULONG_MAX needs to be tested by xb_test_bit(). Compared to checking
the return value, would it be the same to let the caller check for the
ULONG_MAX boundary?
Best,
Wei
Wei Wang wrote:
> On 12/12/2017 09:20 PM, Tetsuo Handa wrote:
> > Wei Wang wrote:
> >> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
> >> +{
> >> + struct radix_tree_root *root = &xb->xbrt;
> >> + struct radix_tree_node *node;
> >> + void **slot;
> >> + struct ida_bitmap *bitmap;
> >> + unsigned int nbits;
> >> +
> >> + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> >> + unsigned long index = start / IDA_BITMAP_BITS;
> >> + unsigned long bit = start % IDA_BITMAP_BITS;
> >> +
> >> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
> >> + if (radix_tree_exception(bitmap)) {
> >> + unsigned long ebit = bit + 2;
> >> + unsigned long tmp = (unsigned long)bitmap;
> >> +
> >> + nbits = min(end - start + 1, BITS_PER_LONG - ebit);
> >> +
> >> + if (ebit >= BITS_PER_LONG)
> > What happens if we hit this "continue;" when "index == ULONG_MAX / IDA_BITMAP_BITS" ?
>
> Thanks. I also improved the test case for this. I plan to change the
> implementation a little bit to avoid such overflow (has passed the test
> case that I have, just post out for another set of eyes):
>
> {
> ...
> unsigned long idx = start / IDA_BITMAP_BITS;
> unsigned long bit = start % IDA_BITMAP_BITS;
> unsigned long idx_end = end / IDA_BITMAP_BITS;
> unsigned long ret;
>
> for (idx = start / IDA_BITMAP_BITS; idx <= idx_end; idx++) {
> unsigned long ida_start = idx * IDA_BITMAP_BITS;
>
> bitmap = __radix_tree_lookup(root, idx, &node, &slot);
> if (radix_tree_exception(bitmap)) {
> unsigned long tmp = (unsigned long)bitmap;
> unsigned long ebit = bit + 2;
>
> if (ebit >= BITS_PER_LONG)
> continue;
Will you please please do eliminate exception path?
I can't interpret what "ebit >= BITS_PER_LONG" means.
The reason you "continue;" is that all bits beyond are "0", isn't it?
Then, it would make sense to "continue;" when finding next "1" because
all bits beyond are "0". But how does it make sense to "continue;" when
finding next "0" despite all bits beyond are "0"?
> if (set)
> ret = find_next_bit(&tmp,
> BITS_PER_LONG, ebit);
> else
> ret = find_next_zero_bit(&tmp,
> BITS_PER_LONG,
> ebit);
> if (ret < BITS_PER_LONG)
> return ret - 2 + ida_start;
> } else if (bitmap) {
> if (set)
> ret = find_next_bit(bitmap->bitmap,
> IDA_BITMAP_BITS, bit);
> else
> ret = find_next_zero_bit(bitmap->bitmap,
> IDA_BITMAP_BITS, bit);
"bit" may not be 0 for the first round and "bit" is always 0 afterwords.
But where is the guaranteed that "end" is a multiple of IDA_BITMAP_BITS ?
Please explain why it is correct to use IDA_BITMAP_BITS unconditionally
for the last round.
> if (ret < IDA_BITMAP_BITS)
> return ret + ida_start;
> } else if (!bitmap && !set) {
At this point bitmap == NULL is guaranteed. Thus, "!bitmap && " is pointless.
> return bit + IDA_BITMAP_BITS * idx;
> }
> bit = 0;
> }
>
> return end;
> }
>
>
> >
> >> +/**
> >> + * xb_find_next_set_bit - find the next set bit in a range
> >> + * @xb: the xbitmap to search
> >> + * @start: the start of the range, inclusive
> >> + * @end: the end of the range, exclusive
> >> + *
> >> + * Returns: the index of the found bit, or @end + 1 if no such bit is found.
> >> + */
> >> +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
> >> + unsigned long end)
> >> +{
> >> + return xb_find_next_bit(xb, start, end, 1);
> >> +}
> > Won't "exclusive" loose ability to handle ULONG_MAX ? Since this is a
> > library module, missing ability to handle ULONG_MAX sounds like an omission.
> > Shouldn't we pass (or return) whether "found or not" flag (e.g. strtoul() in
> > C library function)?
> >
> > bool xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, unsigned long *result);
> > unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, bool *found);
>
> Yes, ULONG_MAX needs to be tested by xb_test_bit(). Compared to checking
> the return value, would it be the same to let the caller check for the
> ULONG_MAX boundary?
>
Why the caller needs to care about whether it is ULONG_MAX or not?
Also, one more thing you need to check. Have you checked how long does
xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
If it causes soft lockup warning, should we add cond_resched() ?
If yes, you have to document that this API might sleep. If no, you
have to document that the caller of this API is responsible for
not to pass such a large value range.
On 12/13/2017 10:16 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 12/12/2017 09:20 PM, Tetsuo Handa wrote:
>>> Wei Wang wrote:
>>>> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end)
>>>> +{
>>>> + struct radix_tree_root *root = &xb->xbrt;
>>>> + struct radix_tree_node *node;
>>>> + void **slot;
>>>> + struct ida_bitmap *bitmap;
>>>> + unsigned int nbits;
>>>> +
>>>> + for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
>>>> + unsigned long index = start / IDA_BITMAP_BITS;
>>>> + unsigned long bit = start % IDA_BITMAP_BITS;
>>>> +
>>>> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
>>>> + if (radix_tree_exception(bitmap)) {
>>>> + unsigned long ebit = bit + 2;
>>>> + unsigned long tmp = (unsigned long)bitmap;
>>>> +
>>>> + nbits = min(end - start + 1, BITS_PER_LONG - ebit);
>>>> +
>>>> + if (ebit >= BITS_PER_LONG)
>>> What happens if we hit this "continue;" when "index == ULONG_MAX / IDA_BITMAP_BITS" ?
>> Thanks. I also improved the test case for this. I plan to change the
>> implementation a little bit to avoid such overflow (has passed the test
>> case that I have, just post out for another set of eyes):
>>
>> {
>> ...
>> unsigned long idx = start / IDA_BITMAP_BITS;
>> unsigned long bit = start % IDA_BITMAP_BITS;
>> unsigned long idx_end = end / IDA_BITMAP_BITS;
>> unsigned long ret;
>>
>> for (idx = start / IDA_BITMAP_BITS; idx <= idx_end; idx++) {
>> unsigned long ida_start = idx * IDA_BITMAP_BITS;
>>
>> bitmap = __radix_tree_lookup(root, idx, &node, &slot);
>> if (radix_tree_exception(bitmap)) {
>> unsigned long tmp = (unsigned long)bitmap;
>> unsigned long ebit = bit + 2;
>>
>> if (ebit >= BITS_PER_LONG)
>> continue;
> Will you please please do eliminate exception path?
Please first see my explanations below, I'll try to help you understand
it thoroughly. If it is really too complex to understand it finally,
then I think we can start from the fundamental part by removing the
exceptional path if no objections from others.
> I can't interpret what "ebit >= BITS_PER_LONG" means.
> The reason you "continue;" is that all bits beyond are "0", isn't it?
> Then, it would make sense to "continue;" when finding next "1" because
> all bits beyond are "0". But how does it make sense to "continue;" when
> finding next "0" despite all bits beyond are "0"?
Not the case actually. Please see this example:
1) xb_set_bit(10); // bit 10 is set, so an exceptional entry (i.e.
[0:62]) is used
2) xb_clear_bit_range(66, 2048);
- One ida bitmap size is 1024 bits, so this clear will be performed
with 2 loops, first to clear [66, 1024), second to clear [1024, 2048)
- When the first loop clears [66, 1024), and finds that it is an
exception entry (because bit 10 is set, and the 62 bit entry is enough
to cover). Another point we have to remember is that an exceptional
entry implies that the rest of bits [63, 1024) are all 0s.
- The starting bit 66 already exceeds the the exceptional entry bit
range [0, 62], and with the fact that the rest of bits are all 0s, so it
is time to just "continue", which goes to the second range [1024, 2048)
I used the example of xb_clear_bit_range(), and xb_find_next_bit() is
the same fundamentally. Please let me know if anywhere still looks fuzzy.
>
>> if (set)
>> ret = find_next_bit(&tmp,
>> BITS_PER_LONG, ebit);
>> else
>> ret = find_next_zero_bit(&tmp,
>> BITS_PER_LONG,
>> ebit);
>> if (ret < BITS_PER_LONG)
>> return ret - 2 + ida_start;
>> } else if (bitmap) {
>> if (set)
>> ret = find_next_bit(bitmap->bitmap,
>> IDA_BITMAP_BITS, bit);
>> else
>> ret = find_next_zero_bit(bitmap->bitmap,
>> IDA_BITMAP_BITS, bit);
> "bit" may not be 0 for the first round and "bit" is always 0 afterwords.
> But where is the guaranteed that "end" is a multiple of IDA_BITMAP_BITS ?
> Please explain why it is correct to use IDA_BITMAP_BITS unconditionally
> for the last round.
There missed something here, it will be:
nbits = min(end - ida_start + 1, IDA_BITMAP_BITS - bit);
if (set)
ret = find_next_bit(bitmap->bitmap, nbits, bit);
else
ret = find_next_zero_bit(bitmap->bitmap,
nbits, bit);
if (ret < nbits)
return ret + ida_start;
>>>> +/**
>>>> + * xb_find_next_set_bit - find the next set bit in a range
>>>> + * @xb: the xbitmap to search
>>>> + * @start: the start of the range, inclusive
>>>> + * @end: the end of the range, exclusive
>>>> + *
>>>> + * Returns: the index of the found bit, or @end + 1 if no such bit is found.
>>>> + */
>>>> +unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
>>>> + unsigned long end)
>>>> +{
>>>> + return xb_find_next_bit(xb, start, end, 1);
>>>> +}
>>> Won't "exclusive" loose ability to handle ULONG_MAX ? Since this is a
>>> library module, missing ability to handle ULONG_MAX sounds like an omission.
>>> Shouldn't we pass (or return) whether "found or not" flag (e.g. strtoul() in
>>> C library function)?
>>>
>>> bool xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, unsigned long *result);
>>> unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start, unsigned long end, bool *found);
>> Yes, ULONG_MAX needs to be tested by xb_test_bit(). Compared to checking
>> the return value, would it be the same to let the caller check for the
>> ULONG_MAX boundary?
>>
> Why the caller needs to care about whether it is ULONG_MAX or not?
I don't stick with this one, and will use the method that you suggested.
Thanks for the review.
>
> Also, one more thing you need to check. Have you checked how long does
> xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
> If it causes soft lockup warning, should we add cond_resched() ?
> If yes, you have to document that this API might sleep. If no, you
> have to document that the caller of this API is responsible for
> not to pass such a large value range.
Yes, that will take too long time. Probably we can document some
comments as a reminder for the callers.
Best,
Wei
On 12/14/2017 11:47 AM, Wei Wang wrote:
> On 12/13/2017 10:16 PM, Tetsuo Handa wrote:
>
>
>>
>>> if (set)
>>> ret = find_next_bit(&tmp,
>>> BITS_PER_LONG, ebit);
>>> else
>>> ret = find_next_zero_bit(&tmp,
>>> BITS_PER_LONG,
>>> ebit);
>>> if (ret < BITS_PER_LONG)
>>> return ret - 2 + ida_start;
>>> } else if (bitmap) {
>>> if (set)
>>> ret = find_next_bit(bitmap->bitmap,
>>> IDA_BITMAP_BITS, bit);
>>> else
>>> ret =
>>> find_next_zero_bit(bitmap->bitmap,
>>> IDA_BITMAP_BITS, bit);
>> "bit" may not be 0 for the first round and "bit" is always 0 afterwords.
>> But where is the guaranteed that "end" is a multiple of
>> IDA_BITMAP_BITS ?
>> Please explain why it is correct to use IDA_BITMAP_BITS unconditionally
>> for the last round.
>
> There missed something here, it will be:
>
> nbits = min(end - ida_start + 1, IDA_BITMAP_BITS - bit);
captured a bug here, should be:
nbits = min(end - ida_start + 1, (unsigned long)IDA_BITMAP_BITS);
> if (set)
> ret = find_next_bit(bitmap->bitmap, nbits, bit);
> else
> ret = find_next_zero_bit(bitmap->bitmap,
> nbits, bit);
> if (ret < nbits)
> return ret + ida_start;
>
>
Best,
Wei
On Wed, Dec 13, 2017 at 08:26:06PM +0800, Wei Wang wrote:
> On 12/12/2017 09:20 PM, Tetsuo Handa wrote:
> > Can you eliminate exception path and fold all xbitmap patches into one, and
> > post only one xbitmap patch without virtio-baloon changes? If exception path
> > is valuable, you can add exception path after minimum version is merged.
> > This series is too difficult for me to close corner cases.
>
> That exception path is claimed to save memory, and I don't have a strong
> reason to remove that part.
> Matthew, could we get your feedback on this?
Sure. This code is derived from the IDA code in lib/idr.c. Eventually,
I intend to reunite them. For IDA, it clearly makes sense; the first 62
entries result in allocating no memory at all, which is going to be 99%
of users. After that, we allocate 128 bytes which will serve the first
1024 users.
The xbitmap, as used by Wei's patches here is going to be used somewhat
differently from that. I understand why Tetsuo wants the exceptional
path removed; I'm not sure the gains will be as important. But if we're
going to rebuild the IDA on top of the xbitmap, we need to keep them.
I really want to pay more attention to this, but I need to focus on
getting the XArray finished.
Wei Wang wrote:
> I used the example of xb_clear_bit_range(), and xb_find_next_bit() is
> the same fundamentally. Please let me know if anywhere still looks fuzzy.
I don't think it is the same for xb_find_next_bit() with set == 0.
+ if (radix_tree_exception(bmap)) {
+ unsigned long tmp = (unsigned long)bmap;
+ unsigned long ebit = bit + 2;
+
+ if (ebit >= BITS_PER_LONG)
+ continue;
+ if (set)
+ ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
+ else
+ ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
+ ebit);
+ if (ret < BITS_PER_LONG)
+ return ret - 2 + IDA_BITMAP_BITS * index;
What I'm saying is that find_next_zero_bit() will not be called if you do
"if (ebit >= BITS_PER_LONG) continue;" before calling find_next_zero_bit().
When scanning "0000000000000000000000000000000000000000000000000000000000000001",
"bit < BITS_PER_LONG - 2" case finds "0" in this word but
"bit >= BITS_PER_LONG - 2" case finds "0" in next word or segment.
I can't understand why this is correct behavior. It is too much puzzling.
> > Also, one more thing you need to check. Have you checked how long does
> > xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
> > If it causes soft lockup warning, should we add cond_resched() ?
> > If yes, you have to document that this API might sleep. If no, you
> > have to document that the caller of this API is responsible for
> > not to pass such a large value range.
>
> Yes, that will take too long time. Probably we can document some
> comments as a reminder for the callers.
>
Then, I feel that API is poorly implemented. There is no need to brute-force
when scanning [0, ULONG_MAX] range. If you eliminate exception path and
redesign the data structure, xbitmap will become as simple as a sample
implementation shown below. Not tested yet, but I think that this will be
sufficient for what virtio-baloon wants to do; i.e. find consecutive "1" bits
quickly. I didn't test whether finding "struct ulong_list_data" using radix
tree can improve performance.
----------------------------------------
#include <linux/module.h>
#include <linux/slab.h>
#define BITMAP_LEN 1024
struct ulong_list_data {
struct list_head list;
unsigned long segment; /* prev->segment < segment < next->segment */
unsigned long bits; /* Number of bits set in this offset bitmap. */
unsigned long *bitmap; /* Offset bitmap of BITMAP_LEN bits. */
};
static struct ulong_list_data null_ulong_list = {
{ NULL, NULL }, ULONG_MAX, 0, NULL
};
struct ulong_list_head {
struct list_head list;
struct ulong_list_data *last_used;
};
static void init_ulong(struct ulong_list_head *head)
{
INIT_LIST_HEAD(&head->list);
head->last_used = &null_ulong_list;
}
static bool set_ulong(struct ulong_list_head *head, const unsigned long value)
{
struct ulong_list_data *ptr = head->last_used;
struct list_head *list = &head->list;
const unsigned long segment = value / BITMAP_LEN;
const unsigned long offset = value % BITMAP_LEN;
bool found = false;
if (ptr->segment == segment)
goto shortcut;
list_for_each_entry(ptr, &head->list, list) {
if (ptr->segment < segment) {
list = &ptr->list;
continue;
}
found = ptr->segment == segment;
break;
}
if (!found) {
ptr = kzalloc(sizeof(*ptr), GFP_NOWAIT | __GFP_NOWARN);
if (!ptr)
return false;
ptr->bitmap = kzalloc(BITMAP_LEN / 8,
GFP_NOWAIT | __GFP_NOWARN);
if (!ptr->bitmap) {
kfree(ptr);
return false;
}
ptr->segment = segment;
list_add(&ptr->list, list);
}
head->last_used = ptr;
shortcut:
if (!test_bit(offset, ptr->bitmap)) {
__set_bit(offset, ptr->bitmap);
ptr->bits++;
}
return true;
}
static void clear_ulong(struct ulong_list_head *head, const unsigned long value)
{
struct ulong_list_data *ptr = head->last_used;
const unsigned long segment = value / BITMAP_LEN;
const unsigned long offset = value % BITMAP_LEN;
if (ptr->segment == segment)
goto shortcut;
list_for_each_entry(ptr, &head->list, list) {
if (ptr->segment < segment)
continue;
if (ptr->segment == segment) {
head->last_used = ptr;
shortcut:
if (test_bit(offset, ptr->bitmap)) {
__clear_bit(offset, ptr->bitmap);
if (--ptr->bits)
return;
if (head->last_used == ptr)
head->last_used = &null_ulong_list;
list_del(&ptr->list);
kfree(ptr);
}
}
return;
}
}
static void destroy_ulong(struct ulong_list_head *head)
{
struct ulong_list_data *ptr;
struct ulong_list_data *tmp;
list_for_each_entry_safe(ptr, tmp, &head->list, list) {
list_del(&ptr->list);
kfree(ptr);
}
init_ulong(head);
}
static unsigned long get_ulong_set_range(struct ulong_list_head *head,
unsigned long *start)
{
struct ulong_list_data *ptr;
unsigned long segment = *start / BITMAP_LEN;
unsigned int offset = *start % BITMAP_LEN;
unsigned long ret = BITMAP_LEN;
list_for_each_entry(ptr, &head->list, list) {
if (ptr->segment < segment)
continue;
if (ptr->segment > segment)
offset = 0;
ret = find_next_bit(ptr->bitmap, BITMAP_LEN, offset);
if (ret == BITMAP_LEN)
continue;
break;
}
if (ret == BITMAP_LEN)
return 0;
segment = ptr->segment;
*start = segment * BITMAP_LEN + ret;
ret = find_next_zero_bit(ptr->bitmap, BITMAP_LEN, ret);
if (ret < BITMAP_LEN || segment == ULONG_MAX / BITMAP_LEN)
return segment * BITMAP_LEN + ret - *start;
ret = 0;
list_for_each_entry_continue(ptr, &head->list, list) {
if (ptr->segment != ++segment)
break;
if (ptr->bits == BITMAP_LEN)
continue;
ret = find_next_zero_bit(ptr->bitmap, BITMAP_LEN, 0);
break;
}
return segment * BITMAP_LEN + ret - *start;
}
static int __init test_init(void)
{
struct ulong_list_data *ptr;
struct ulong_list_head head;
unsigned long start = 0;
unsigned long len;
init_ulong(&head);
set_ulong(&head, 0);
set_ulong(&head, 10);
set_ulong(&head, 11);
set_ulong(&head, ULONG_MAX);
set_ulong(&head, 1000000);
set_ulong(&head, 12);
clear_ulong(&head, 11);
for (len = 1048576; len < 1048576 + BITMAP_LEN * 3; len++)
set_ulong(&head, len);
clear_ulong(&head, 1048600);
set_ulong(&head, 2000000000);
pr_info("Debug dump start\n");
list_for_each_entry(ptr, &head.list, list)
pr_info("Segment %lu %lu\n", ptr->segment, ptr->bits);
pr_info("Debug dump end\n");
while ((len = get_ulong_set_range(&head, &start)) > 0) {
pr_info("Range %lu@%lu\n", len, start);
start += len;
if (!start)
break;
}
destroy_ulong(&head);
return -EINVAL;
}
module_init(test_init);
MODULE_LICENSE("GPL");
----------------------------------------
On Fri, Dec 15, 2017 at 01:29:45AM +0900, Tetsuo Handa wrote:
> > > Also, one more thing you need to check. Have you checked how long does
> > > xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
> > > If it causes soft lockup warning, should we add cond_resched() ?
> > > If yes, you have to document that this API might sleep. If no, you
> > > have to document that the caller of this API is responsible for
> > > not to pass such a large value range.
> >
> > Yes, that will take too long time. Probably we can document some
> > comments as a reminder for the callers.
>
> Then, I feel that API is poorly implemented. There is no need to brute-force
> when scanning [0, ULONG_MAX] range. If you eliminate exception path and
> redesign the data structure, xbitmap will become as simple as a sample
> implementation shown below. Not tested yet, but I think that this will be
> sufficient for what virtio-baloon wants to do; i.e. find consecutive "1" bits
> quickly. I didn't test whether finding "struct ulong_list_data" using radix
> tree can improve performance.
find_next_set_bit() is just badly implemented. There is no need to
redesign the data structure. It should be a simple matter of:
- look at ->head, see it is NULL, return false.
If bit 100 is set and you call find_next_set_bit(101, ULONG_MAX), it
should look at block 0, see there is a pointer to it, scan the block,
see there are no bits set above 100, then realise we're at the end of
the tree and stop.
If bit 2000 is set, and you call find_next_set_bit(2001, ULONG_MAX)
tit should look at block 1, see there's no bit set after bit 2001, then
look at the other blocks in the node, see that all the pointers are NULL
and stop.
This isn't rocket science, we already do something like this in the radix
tree and it'll be even easier to do in the XArray. Which I'm going back
to working on now.
Hi Wei,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.15-rc3 next-20171214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20171215-100525
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
Note: the linux-review/Wei-Wang/Virtio-balloon-Enhancement/20171215-100525 HEAD 607ddba072bf7f9c9cbacedaccad7c42c5c7149c builds fine.
It only hurts bisectibility.
All errors (new ones prefixed by >>):
>> lib/xbitmap.c:80:6: error: conflicting types for 'xb_clear_bit'
void xb_clear_bit(struct xb *xb, unsigned long bit)
^~~~~~~~~~~~
In file included from lib/xbitmap.c:2:0:
include/linux/xbitmap.h:37:5: note: previous declaration of 'xb_clear_bit' was here
int xb_clear_bit(struct xb *xb, unsigned long bit);
^~~~~~~~~~~~
vim +/xb_clear_bit +80 lib/xbitmap.c
71
72 /**
73 * xb_clear_bit - clear a bit in the xbitmap
74 * @xb: the xbitmap tree used to record the bit
75 * @bit: index of the bit to clear
76 *
77 * This function is used to clear a bit in the xbitmap. If all the bits of the
78 * bitmap are 0, the bitmap will be freed.
79 */
> 80 void xb_clear_bit(struct xb *xb, unsigned long bit)
81 {
82 unsigned long index = bit / IDA_BITMAP_BITS;
83 struct radix_tree_root *root = &xb->xbrt;
84 struct radix_tree_node *node;
85 void **slot;
86 struct ida_bitmap *bitmap;
87 unsigned long ebit;
88
89 bit %= IDA_BITMAP_BITS;
90 ebit = bit + 2;
91
92 bitmap = __radix_tree_lookup(root, index, &node, &slot);
93 if (radix_tree_exception(bitmap)) {
94 unsigned long tmp = (unsigned long)bitmap;
95
96 if (ebit >= BITS_PER_LONG)
97 return;
98 tmp &= ~(1UL << ebit);
99 if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
100 __radix_tree_delete(root, node, slot);
101 else
102 rcu_assign_pointer(*slot, (void *)tmp);
103 return;
104 }
105
106 if (!bitmap)
107 return;
108
109 __clear_bit(bit, bitmap->bitmap);
110 if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
111 kfree(bitmap);
112 __radix_tree_delete(root, node, slot);
113 }
114 }
115 EXPORT_SYMBOL(xb_clear_bit);
116
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Matthew,
I love your patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v4.15-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Wei-Wang/Virtio-balloon-Enhancement/20171215-100525
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
vim +29 lib/xbitmap.c
5
6 /**
7 * xb_set_bit - set a bit in the xbitmap
8 * @xb: the xbitmap tree used to record the bit
9 * @bit: index of the bit to set
10 *
11 * This function is used to set a bit in the xbitmap. If the bitmap that @bit
12 * resides in is not there, the per-cpu ida_bitmap will be taken.
13 *
14 * Returns: 0 on success. %-EAGAIN indicates that @bit was not set.
15 */
16 int xb_set_bit(struct xb *xb, unsigned long bit)
17 {
18 int err;
19 unsigned long index = bit / IDA_BITMAP_BITS;
20 struct radix_tree_root *root = &xb->xbrt;
21 struct radix_tree_node *node;
22 void **slot;
23 struct ida_bitmap *bitmap;
24 unsigned long ebit;
25
26 bit %= IDA_BITMAP_BITS;
27 ebit = bit + 2;
28
> 29 err = __radix_tree_create(root, index, 0, &node, &slot);
30 if (err)
31 return err;
32 bitmap = rcu_dereference_raw(*slot);
33 if (radix_tree_exception(bitmap)) {
34 unsigned long tmp = (unsigned long)bitmap;
35
36 if (ebit < BITS_PER_LONG) {
37 tmp |= 1UL << ebit;
38 rcu_assign_pointer(*slot, (void *)tmp);
39 return 0;
40 }
41 bitmap = this_cpu_xchg(ida_bitmap, NULL);
42 if (!bitmap)
43 return -EAGAIN;
44 memset(bitmap, 0, sizeof(*bitmap));
45 bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
46 rcu_assign_pointer(*slot, bitmap);
47 }
48
49 if (!bitmap) {
50 if (ebit < BITS_PER_LONG) {
51 bitmap = (void *)((1UL << ebit) |
52 RADIX_TREE_EXCEPTIONAL_ENTRY);
> 53 __radix_tree_replace(root, node, slot, bitmap, NULL);
54 return 0;
55 }
56 bitmap = this_cpu_xchg(ida_bitmap, NULL);
57 if (!bitmap)
58 return -EAGAIN;
59 memset(bitmap, 0, sizeof(*bitmap));
60 __radix_tree_replace(root, node, slot, bitmap, NULL);
61 }
62
63 __set_bit(bit, bitmap->bitmap);
64 return 0;
65 }
66 EXPORT_SYMBOL(xb_set_bit);
67
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Fri, Dec 15, 2017 at 07:05:07PM +0800, kbuild test robot wrote:
> 21 struct radix_tree_node *node;
> 22 void **slot;
^^^
missing __rcu annotation here.
Wei, could you fold that change into your next round? Thanks!
Matthew Wilcox wrote:
> On Fri, Dec 15, 2017 at 01:29:45AM +0900, Tetsuo Handa wrote:
> > > > Also, one more thing you need to check. Have you checked how long does
> > > > xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
> > > > If it causes soft lockup warning, should we add cond_resched() ?
> > > > If yes, you have to document that this API might sleep. If no, you
> > > > have to document that the caller of this API is responsible for
> > > > not to pass such a large value range.
> > >
> > > Yes, that will take too long time. Probably we can document some
> > > comments as a reminder for the callers.
> >
> > Then, I feel that API is poorly implemented. There is no need to brute-force
> > when scanning [0, ULONG_MAX] range. If you eliminate exception path and
> > redesign the data structure, xbitmap will become as simple as a sample
> > implementation shown below. Not tested yet, but I think that this will be
> > sufficient for what virtio-baloon wants to do; i.e. find consecutive "1" bits
> > quickly. I didn't test whether finding "struct ulong_list_data" using radix
> > tree can improve performance.
>
> find_next_set_bit() is just badly implemented. There is no need to
> redesign the data structure. It should be a simple matter of:
>
> - look at ->head, see it is NULL, return false.
>
> If bit 100 is set and you call find_next_set_bit(101, ULONG_MAX), it
> should look at block 0, see there is a pointer to it, scan the block,
> see there are no bits set above 100, then realise we're at the end of
> the tree and stop.
>
> If bit 2000 is set, and you call find_next_set_bit(2001, ULONG_MAX)
> tit should look at block 1, see there's no bit set after bit 2001, then
> look at the other blocks in the node, see that all the pointers are NULL
> and stop.
>
> This isn't rocket science, we already do something like this in the radix
> tree and it'll be even easier to do in the XArray. Which I'm going back
> to working on now.
>
My understanding is that virtio-balloon wants to handle sparsely spreaded
unsigned long values (which is PATCH 4/7) and wants to find all chunks of
consecutive "1" bits efficiently. Therefore, I guess that holding the values
in ascending order at store time is faster than sorting the values at read
time. I don't know how to use radix tree API, but I think that B+ tree API
suits for holding the values in ascending order.
We wait for Wei to post radix tree version combined into one patch and then
compare performance between radix tree version and B+ tree version (shown
below)?
----------
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/btree.h>
#define BITMAP_LEN 1024
struct ulong_list_data {
/* Segment for this offset bitmap. */
unsigned long segment;
/* Number of bits set in this offset bitmap. */
unsigned long bits;
/* Offset bitmap of BITMAP_LEN bits. */
unsigned long *bitmap;
};
struct ulong_list_head {
struct btree_headl btree;
struct ulong_list_data *last_used;
};
static int init_ulong(struct ulong_list_head *head)
{
head->last_used = NULL;
return btree_initl(&head->btree);
}
static bool set_ulong(struct ulong_list_head *head, const unsigned long value)
{
struct ulong_list_data *ptr = head->last_used;
const unsigned long segment = value / BITMAP_LEN;
const unsigned long offset = value % BITMAP_LEN;
if (!ptr || ptr->segment != segment)
ptr = btree_lookupl(&head->btree, ~segment);
if (!ptr) {
ptr = kzalloc(sizeof(*ptr), GFP_NOWAIT | __GFP_NOWARN);
if (!ptr)
goto out1;
ptr->bitmap = kzalloc(BITMAP_LEN / 8,
GFP_NOWAIT | __GFP_NOWARN);
if (!ptr->bitmap)
goto out2;
if (btree_insertl(&head->btree, ~segment, ptr,
GFP_NOWAIT | __GFP_NOWARN))
goto out3;
ptr->segment = segment;
}
head->last_used = ptr;
if (!test_bit(offset, ptr->bitmap)) {
__set_bit(offset, ptr->bitmap);
ptr->bits++;
}
return true;
out3:
kfree(ptr->bitmap);
out2:
kfree(ptr);
out1:
return false;
}
static void clear_ulong(struct ulong_list_head *head, const unsigned long value)
{
struct ulong_list_data *ptr = head->last_used;
const unsigned long segment = value / BITMAP_LEN;
const unsigned long offset = value % BITMAP_LEN;
if (!ptr || ptr->segment != segment) {
ptr = btree_lookupl(&head->btree, ~segment);
if (!ptr)
return;
head->last_used = ptr;
}
if (!test_bit(offset, ptr->bitmap))
return;
__clear_bit(offset, ptr->bitmap);
if (--ptr->bits)
return;
btree_removel(&head->btree, ~segment);
if (head->last_used == ptr)
head->last_used = NULL;
kfree(ptr->bitmap);
kfree(ptr);
}
static void destroy_ulong(struct ulong_list_head *head)
{
struct ulong_list_data *ptr;
unsigned long segment;
btree_for_each_safel(&head->btree, segment, ptr) {
btree_removel(&head->btree, segment);
kfree(ptr->bitmap);
kfree(ptr);
}
btree_destroyl(&head->btree);
head->last_used = NULL;
}
/*
* get_ulong_set_range - Get range of "1" bits.
*
* @head: Pointer to "struct ulong_list_head".
* @start: Pointer to "unsigned long" which holds starting bit upon entry and
* holds found bit upon return.
*
* Returns length of consecutive "1" bits which start from @start or higher.
* @start is updated to hold actual location where "1" bit started. The caller
* can call this function again after adding return value of this function to
* @start, but be sure to check for overflow which will happen when the last
* bit (the ULONG_MAX'th bit) is "1".
*
* Returns 0 if no "1" bit was found in @start and afterwords bits. It does not
* make sense to call this function again in that case.
*/
static unsigned long get_ulong_set_range(struct ulong_list_head *head,
unsigned long *start)
{
struct ulong_list_data *ptr;
unsigned long segment = *start / BITMAP_LEN;
unsigned int offset = *start % BITMAP_LEN;
unsigned long ret = BITMAP_LEN;
unsigned long key = 0;
btree_for_each_safel(&head->btree, key, ptr) {
if (ptr->segment < segment)
continue;
if (ptr->segment > segment)
offset = 0;
ret = find_next_bit(ptr->bitmap, BITMAP_LEN, offset);
if (ret < BITMAP_LEN)
break;
}
if (ret == BITMAP_LEN)
return 0;
segment = ptr->segment;
*start = segment * BITMAP_LEN + ret;
ret = find_next_zero_bit(ptr->bitmap, BITMAP_LEN, ret);
if (ret == BITMAP_LEN)
while ((ptr = btree_get_prevl(&head->btree, &key)) != NULL) {
if (ptr->segment != segment + 1)
break;
segment++;
if (ptr->bits == BITMAP_LEN)
continue;
ret = find_next_zero_bit(ptr->bitmap, BITMAP_LEN, 0);
break;
}
return segment * BITMAP_LEN + ret - *start;
}
static int __init test_init(void)
{
struct ulong_list_data *ptr;
struct ulong_list_head head;
unsigned long key;
unsigned long start = 0;
unsigned long len;
if (init_ulong(&head))
return -ENOMEM;
set_ulong(&head, 0);
set_ulong(&head, 10);
set_ulong(&head, 11);
set_ulong(&head, 1000000);
set_ulong(&head, 12);
clear_ulong(&head, 11);
for (len = 1048576; len < 1048576 + BITMAP_LEN * 3; len++)
set_ulong(&head, len);
for (len = 1048576 + BITMAP_LEN; len < 1048576 + BITMAP_LEN * 2; len++)
set_ulong(&head, len);
clear_ulong(&head, 1048600);
set_ulong(&head, 2000000000);
for (len = ULONG_MAX - 1; len >= ULONG_MAX - BITMAP_LEN * 3; len--)
set_ulong(&head, len);
pr_info("Debug dump start\n");
btree_for_each_safel(&head.btree, key, ptr)
pr_info("Segment %lu %lu\n", ptr->segment, ptr->bits);
pr_info("Debug dump end\n");
while ((len = get_ulong_set_range(&head, &start)) > 0) {
pr_info("Range %lu@%lu\n", len, start);
start += len;
if (!start)
break;
}
destroy_ulong(&head);
return -EINVAL;
}
module_init(test_init);
MODULE_LICENSE("GPL");
----------
On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> My understanding is that virtio-balloon wants to handle sparsely spreaded
> unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> consecutive "1" bits efficiently. Therefore, I guess that holding the values
> in ascending order at store time is faster than sorting the values at read
> time.
Are you asking why is a bitmap used here, as opposed to a tree? It's
not just store versus read. There's also the issue that memory can get
highly fragmented, if it is, the number of 1s is potentially very high.
A bitmap can use as little as 1 bit per value, it is hard to beat in
this respect.
--
MST
On Tue, Dec 12, 2017 at 07:55:55PM +0800, Wei Wang wrote:
> +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
I'm struggling to understand when one would use this. The xb_ API
requires you to handle your own locking. But specifying GFP flags
here implies you can sleep. So ... um ... there's no locking?
> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long end);
That's xb_zero() which you deleted with the previous patch ... remember,
keep things as close as possible to the bitmap API.
On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> My understanding is that virtio-balloon wants to handle sparsely spreaded
> unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> consecutive "1" bits efficiently. Therefore, I guess that holding the values
> in ascending order at store time is faster than sorting the values at read
> time. I don't know how to use radix tree API, but I think that B+ tree API
> suits for holding the values in ascending order.
>
> We wait for Wei to post radix tree version combined into one patch and then
> compare performance between radix tree version and B+ tree version (shown
> below)?
Sure. We all benefit from some friendly competition. Even if a
competition between trees might remind one of the Entmoot ;-)
But let's not hold back -- let's figure out some good workloads to use
in our competition. And we should also decide on the API / locking
constraints. And of course we should compete based on not just speed,
but also memory consumption (both as a runtime overhead for a given set
of bits and as code size). If you can replace the IDR, you get to count
that savings against the cost of your implementation.
Here's the API I'm looking at right now. The user need take no lock;
the locking (spinlock) is handled internally to the implementation.
void xbit_init(struct xbitmap *xb);
int xbit_alloc(struct xbitmap *, unsigned long bit, gfp_t);
int xbit_alloc_range(struct xbitmap *, unsigned long start,
unsigned long nbits, gfp_t);
int xbit_set(struct xbitmap *, unsigned long bit, gfp_t);
bool xbit_test(struct xbitmap *, unsigned long bit);
int xbit_clear(struct xbitmap *, unsigned long bit);
int xbit_zero(struct xbitmap *, unsigned long start, unsigned long nbits);
int xbit_fill(struct xbitmap *, unsigned long start, unsigned long nbits,
gfp_t);
unsigned long xbit_find_clear(struct xbitmap *, unsigned long start,
unsigned long max);
unsigned long xbit_find_set(struct xbitmap *, unsigned long start,
unsigned long max);
> static bool set_ulong(struct ulong_list_head *head, const unsigned long value)
> {
> if (!ptr) {
> ptr = kzalloc(sizeof(*ptr), GFP_NOWAIT | __GFP_NOWARN);
> if (!ptr)
> goto out1;
> ptr->bitmap = kzalloc(BITMAP_LEN / 8,
> GFP_NOWAIT | __GFP_NOWARN);
> if (!ptr->bitmap)
> goto out2;
> if (btree_insertl(&head->btree, ~segment, ptr,
> GFP_NOWAIT | __GFP_NOWARN))
> goto out3;
> out3:
> kfree(ptr->bitmap);
> out2:
> kfree(ptr);
> out1:
> return false;
> }
And what is the user supposed to do if this returns false? How do they
make headway? The xb_ API is clear -- you call xb_prealloc and that
ensures forward progress.
On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> Here's the API I'm looking at right now. The user need take no lock;
> the locking (spinlock) is handled internally to the implementation.
I looked at the API some more and found some flaws:
- how does xbit_alloc communicate back which bit it allocated?
- What if xbit_find_set() is called on a completely empty array with
a range of 0, ULONG_MAX -- there's no invalid number to return.
- xbit_clear() can't return an error. Neither can xbit_zero().
- Need to add __must_check to various return values to discourage sloppy
programming
So I modify the proposed API we compete with thusly:
bool xbit_test(struct xbitmap *, unsigned long bit);
int __must_check xbit_set(struct xbitmap *, unsigned long bit, gfp_t);
void xbit_clear(struct xbitmap *, unsigned long bit);
int __must_check xbit_alloc(struct xbitmap *, unsigned long *bit, gfp_t);
int __must_check xbit_fill(struct xbitmap *, unsigned long start,
unsigned long nbits, gfp_t);
void xbit_zero(struct xbitmap *, unsigned long start, unsigned long nbits);
int __must_check xbit_alloc_range(struct xbitmap *, unsigned long *bit,
unsigned long nbits, gfp_t);
bool xbit_find_clear(struct xbitmap *, unsigned long *start, unsigned long max);
bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned long max);
(I'm a little sceptical about the API accepting 'max' for the find
functions and 'nbits' in the fill/zero/alloc_range functions, but I think
that matches how people want to use it, and it matches how bitmap.h works)
Michael S. Tsirkin wrote:
> On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> > My understanding is that virtio-balloon wants to handle sparsely spreaded
> > unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> > consecutive "1" bits efficiently. Therefore, I guess that holding the values
> > in ascending order at store time is faster than sorting the values at read
> > time.
>
> Are you asking why is a bitmap used here, as opposed to a tree?
No. I'm OK with "segments using trees" + "offsets using bitmaps".
> It's
> not just store versus read. There's also the issue that memory can get
> highly fragmented, if it is, the number of 1s is potentially very high.
> A bitmap can use as little as 1 bit per value, it is hard to beat in
> this respect.
>
I'm asking whether we really need to invent a new library module (i.e.
PATCH 1/7 + PATCH 2/7 + PATCH 3/7) for virtio-balloon compared to mine.
What virtio-balloon needs is ability to
(1) record any integer value in [0, ULONG_MAX] range
(2) fetch all recorded values, with consecutive values combined in
min,max (or start,count) form for efficiently
and I wonder whether we need to invent complete API set which
Matthew Wilcox and Wei Wang are planning for generic purpose.
On Sat, Dec 16, 2017 at 01:31:24PM +0900, Tetsuo Handa wrote:
> Michael S. Tsirkin wrote:
> > On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> > > My understanding is that virtio-balloon wants to handle sparsely spreaded
> > > unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> > > consecutive "1" bits efficiently. Therefore, I guess that holding the values
> > > in ascending order at store time is faster than sorting the values at read
> > > time.
What makes you think that the radix tree (also xbitmap, also idr) doesn't
sort the values at store time?
> I'm asking whether we really need to invent a new library module (i.e.
> PATCH 1/7 + PATCH 2/7 + PATCH 3/7) for virtio-balloon compared to mine.
>
> What virtio-balloon needs is ability to
>
> (1) record any integer value in [0, ULONG_MAX] range
>
> (2) fetch all recorded values, with consecutive values combined in
> min,max (or start,count) form for efficiently
>
> and I wonder whether we need to invent complete API set which
> Matthew Wilcox and Wei Wang are planning for generic purpose.
The xbitmap absolutely has that ability. And making it generic code
means more people see it, use it, debug it, optimise it. I originally
wrote the implementation for bcache, when Kent was complaining we didn't
have such a thing. His needs weren't as complex as Wei's, which is why
I hadn't implemented everything that Wei needed.
Matthew Wilcox wrote:
> On Sat, Dec 16, 2017 at 01:31:24PM +0900, Tetsuo Handa wrote:
> > Michael S. Tsirkin wrote:
> > > On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> > > > My understanding is that virtio-balloon wants to handle sparsely spreaded
> > > > unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> > > > consecutive "1" bits efficiently. Therefore, I guess that holding the values
> > > > in ascending order at store time is faster than sorting the values at read
> > > > time.
>
> What makes you think that the radix tree (also xbitmap, also idr) doesn't
> sort the values at store time?
I don't care whether the radix tree sorts the values at store time.
What I care is how to read stored values in ascending order with less overhead.
Existing users are too much optimized and difficult to understand for new
comers. I appreciate if there are simple sample codes which explain how to
use library functions and which can be compiled/tested in userspace.
Your "- look at ->head, see it is NULL, return false." answer did not
give me any useful information.
>
> > I'm asking whether we really need to invent a new library module (i.e.
> > PATCH 1/7 + PATCH 2/7 + PATCH 3/7) for virtio-balloon compared to mine.
> >
> > What virtio-balloon needs is ability to
> >
> > (1) record any integer value in [0, ULONG_MAX] range
> >
> > (2) fetch all recorded values, with consecutive values combined in
> > min,max (or start,count) form for efficiently
> >
> > and I wonder whether we need to invent complete API set which
> > Matthew Wilcox and Wei Wang are planning for generic purpose.
>
> The xbitmap absolutely has that ability.
Current patches are too tricky to review.
When will all corner cases be closed?
> And making it generic code
> means more people see it, use it, debug it, optimise it.
I'm not objecting against generic code. But trying to optimize it can
enbug it, like using exception path keeps me difficult to review whether
the implementation is correct. I'm suggesting to start xbitmap without
exception path, and I haven't seen a version without exception path.
> I originally
> wrote the implementation for bcache, when Kent was complaining we didn't
> have such a thing. His needs weren't as complex as Wei's, which is why
> I hadn't implemented everything that Wei needed.
>
Unless current xbitmap patches become clear, virtio-balloon changes won't
be able to be get merged. We are repeating this series without closing
many bugs. We can start virtio-balloon changes with a stub code which
provides (1) and (2), and that's my version.
On 12/15/2017 09:24 PM, Matthew Wilcox wrote:
> On Fri, Dec 15, 2017 at 07:05:07PM +0800, kbuild test robot wrote:
>> 21 struct radix_tree_node *node;
>> 22 void **slot;
> ^^^
> missing __rcu annotation here.
>
> Wei, could you fold that change into your next round? Thanks!
>
Sure, I'll do. Thanks for your time on this patch series.
Best,
Wei
On 12/16/2017 02:42 AM, Matthew Wilcox wrote:
> On Tue, Dec 12, 2017 at 07:55:55PM +0800, Wei Wang wrote:
>> +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
> I'm struggling to understand when one would use this. The xb_ API
> requires you to handle your own locking. But specifying GFP flags
> here implies you can sleep. So ... um ... there's no locking?
In the regular use cases, people would do xb_preload() before taking the
lock, and the xb_set/clear within the lock.
In the virtio-balloon usage, we have a large number of bits to set with
the balloon_lock being held (we're not unlocking for each bit), so we
used the above wrapper to do preload and set within the balloon_lock,
and passed in GFP_NOWAIT to avoid sleeping. Probably we can change to
put this wrapper implementation to virtio-balloon, since it would not be
useful for the regular cases.
Best,
Wei
On 12/15/2017 12:29 AM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> I used the example of xb_clear_bit_range(), and xb_find_next_bit() is
>> the same fundamentally. Please let me know if anywhere still looks fuzzy.
> I don't think it is the same for xb_find_next_bit() with set == 0.
>
> + if (radix_tree_exception(bmap)) {
> + unsigned long tmp = (unsigned long)bmap;
> + unsigned long ebit = bit + 2;
> +
> + if (ebit >= BITS_PER_LONG)
> + continue;
> + if (set)
> + ret = find_next_bit(&tmp, BITS_PER_LONG, ebit);
> + else
> + ret = find_next_zero_bit(&tmp, BITS_PER_LONG,
> + ebit);
> + if (ret < BITS_PER_LONG)
> + return ret - 2 + IDA_BITMAP_BITS * index;
>
> What I'm saying is that find_next_zero_bit() will not be called if you do
> "if (ebit >= BITS_PER_LONG) continue;" before calling find_next_zero_bit().
>
> When scanning "0000000000000000000000000000000000000000000000000000000000000001",
> "bit < BITS_PER_LONG - 2" case finds "0" in this word but
> "bit >= BITS_PER_LONG - 2" case finds "0" in next word or segment.
>
> I can't understand why this is correct behavior. It is too much puzzling.
>
OK, I'll post out a version without the exceptional path.
Best,
Wei
Wei Wang wrote:
> On 12/16/2017 02:42 AM, Matthew Wilcox wrote:
> > On Tue, Dec 12, 2017 at 07:55:55PM +0800, Wei Wang wrote:
> >> +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
> > I'm struggling to understand when one would use this. The xb_ API
> > requires you to handle your own locking. But specifying GFP flags
> > here implies you can sleep. So ... um ... there's no locking?
>
> In the regular use cases, people would do xb_preload() before taking the
> lock, and the xb_set/clear within the lock.
>
> In the virtio-balloon usage, we have a large number of bits to set with
> the balloon_lock being held (we're not unlocking for each bit), so we
> used the above wrapper to do preload and set within the balloon_lock,
> and passed in GFP_NOWAIT to avoid sleeping. Probably we can change to
> put this wrapper implementation to virtio-balloon, since it would not be
> useful for the regular cases.
GFP_NOWAIT is chosen in order not to try to OOM-kill something, isn't it?
But passing GFP_NOWAIT means that we can handle allocation failure. There is
no need to use preload approach when we can handle allocation failure.
On 12/16/2017 07:28 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
>> On 12/16/2017 02:42 AM, Matthew Wilcox wrote:
>>> On Tue, Dec 12, 2017 at 07:55:55PM +0800, Wei Wang wrote:
>>>> +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);
>>> I'm struggling to understand when one would use this. The xb_ API
>>> requires you to handle your own locking. But specifying GFP flags
>>> here implies you can sleep. So ... um ... there's no locking?
>> In the regular use cases, people would do xb_preload() before taking the
>> lock, and the xb_set/clear within the lock.
>>
>> In the virtio-balloon usage, we have a large number of bits to set with
>> the balloon_lock being held (we're not unlocking for each bit), so we
>> used the above wrapper to do preload and set within the balloon_lock,
>> and passed in GFP_NOWAIT to avoid sleeping. Probably we can change to
>> put this wrapper implementation to virtio-balloon, since it would not be
>> useful for the regular cases.
> GFP_NOWAIT is chosen in order not to try to OOM-kill something, isn't it?
Yes, I think that's right the issue we are discussing here (also
discussed in the deadlock patch before): Suppose we use a sleep-able
flag GFP_KERNEL, which gets the caller (fill_balloon or leak_balloon)
into sleep with balloon_lock being held, and the memory reclaiming from
GFP_KERNEL would fall into the OOM code path which first invokes the
oom_notify-->leak_balloon to release some balloon memory, which needs to
take the balloon_lock that is being held by the task who is sleeping.
So, using GFP_NOWAIT avoids sleeping to get memory through directly
memory reclaiming, which could fall into that OOM code path that needs
to take the balloon_lock.
> But passing GFP_NOWAIT means that we can handle allocation failure. There is
> no need to use preload approach when we can handle allocation failure.
I think the reason we need xb_preload is because radix tree insertion
needs the memory being preallocated already (it couldn't suffer from
memory failure during the process of inserting, probably because
handling the failure there isn't easy, Matthew may know the backstory of
this)
So, I think we can handle the memory failure with xb_preload, which
stops going into the radix tree APIs, but shouldn't call radix tree APIs
without the related memory preallocated.
Best,
Wei
Wei Wang wrote:
> > But passing GFP_NOWAIT means that we can handle allocation failure. There is
> > no need to use preload approach when we can handle allocation failure.
>
> I think the reason we need xb_preload is because radix tree insertion
> needs the memory being preallocated already (it couldn't suffer from
> memory failure during the process of inserting, probably because
> handling the failure there isn't easy, Matthew may know the backstory of
> this)
According to https://lwn.net/Articles/175432/ , I think that preloading is needed
only when failure to insert an item into a radix tree is a significant problem.
That is, when failure to insert an item into a radix tree is not a problem,
I think that we don't need to use preloading.
>
> So, I think we can handle the memory failure with xb_preload, which
> stops going into the radix tree APIs, but shouldn't call radix tree APIs
> without the related memory preallocated.
It seems to me that virtio-ballon case has no problem without using preloading.
> -----Original Message-----
> From: Tetsuo Handa [mailto:[email protected]]
> Sent: Sunday, December 17, 2017 6:22 PM
> To: Wang, Wei W <[email protected]>; [email protected]
> Cc: [email protected]; [email protected]; qemu-
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v19 3/7] xbitmap: add more operations
>
> Wei Wang wrote:
> > > But passing GFP_NOWAIT means that we can handle allocation failure.
> > > There is no need to use preload approach when we can handle allocation
> failure.
> >
> > I think the reason we need xb_preload is because radix tree insertion
> > needs the memory being preallocated already (it couldn't suffer from
> > memory failure during the process of inserting, probably because
> > handling the failure there isn't easy, Matthew may know the backstory
> > of
> > this)
>
> According to https://lwn.net/Articles/175432/ , I think that preloading is
> needed only when failure to insert an item into a radix tree is a significant
> problem.
> That is, when failure to insert an item into a radix tree is not a problem, I
> think that we don't need to use preloading.
It also mentions that the preload attempts to allocate sufficient memory to *guarantee* that the next radix tree insertion cannot fail.
If we check radix_tree_node_alloc(), the comments there says "this assumes that the caller has performed appropriate preallocation".
So, I think we would get a risk of triggering some issue without preload().
> >
> > So, I think we can handle the memory failure with xb_preload, which
> > stops going into the radix tree APIs, but shouldn't call radix tree
> > APIs without the related memory preallocated.
>
> It seems to me that virtio-ballon case has no problem without using
> preloading.
Why is that?
Best,
Wei
On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
> On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> > Here's the API I'm looking at right now. The user need take no lock;
> > the locking (spinlock) is handled internally to the implementation.
Another place I saw your comment " The xb_ API requires you to handle your own locking" which seems conflict with the above "the user need take no lock".
Doesn't the caller need a lock to avoid concurrent accesses to the ida bitmap?
> I looked at the API some more and found some flaws:
> - how does xbit_alloc communicate back which bit it allocated?
> - What if xbit_find_set() is called on a completely empty array with
> a range of 0, ULONG_MAX -- there's no invalid number to return.
We'll change it to "bool xb_find_set(.., unsigned long *result)", returning false indicates no "1" bit is found.
> - xbit_clear() can't return an error. Neither can xbit_zero().
I found the current xbit_clear implementation only returns 0, and there isn't an error to be returned from this function. In this case, is it better to make the function "void"?
> - Need to add __must_check to various return values to discourage sloppy
> programming
>
> So I modify the proposed API we compete with thusly:
>
> bool xbit_test(struct xbitmap *, unsigned long bit); int __must_check
> xbit_set(struct xbitmap *, unsigned long bit, gfp_t); void xbit_clear(struct
> xbitmap *, unsigned long bit); int __must_check xbit_alloc(struct xbitmap *,
> unsigned long *bit, gfp_t);
>
> int __must_check xbit_fill(struct xbitmap *, unsigned long start,
> unsigned long nbits, gfp_t); void xbit_zero(struct xbitmap *,
> unsigned long start, unsigned long nbits); int __must_check
> xbit_alloc_range(struct xbitmap *, unsigned long *bit,
> unsigned long nbits, gfp_t);
>
> bool xbit_find_clear(struct xbitmap *, unsigned long *start, unsigned long
> max); bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned
> long max);
>
> (I'm a little sceptical about the API accepting 'max' for the find functions and
> 'nbits' in the fill/zero/alloc_range functions, but I think that matches how
> people want to use it, and it matches how bitmap.h works)
Are you suggesting to rename the current xb_ APIs to the above xbit_ names (with parameter changes)?
Why would we need xbit_alloc, which looks like ida_get_new, I think set/clear should be adequate to the current usages.
Best,
Wei
Wang, Wei W wrote:
> > Wei Wang wrote:
> > > > But passing GFP_NOWAIT means that we can handle allocation failure.
> > > > There is no need to use preload approach when we can handle allocation failure.
> > >
> > > I think the reason we need xb_preload is because radix tree insertion
> > > needs the memory being preallocated already (it couldn't suffer from
> > > memory failure during the process of inserting, probably because
> > > handling the failure there isn't easy, Matthew may know the backstory
> > > of
> > > this)
> >
> > According to https://lwn.net/Articles/175432/ , I think that preloading is
> > needed only when failure to insert an item into a radix tree is a significant
> > problem.
> > That is, when failure to insert an item into a radix tree is not a problem, I
> > think that we don't need to use preloading.
>
> It also mentions that the preload attempts to allocate sufficient memory to *guarantee* that the next radix tree insertion cannot fail.
>
> If we check radix_tree_node_alloc(), the comments there says "this assumes that the caller has performed appropriate preallocation".
If you read what radix_tree_node_alloc() is doing, you will find that
radix_tree_node_alloc() returns NULL when memory allocation failed.
I think that "this assumes that the caller has performed appropriate preallocation"
means "The caller has to perform appropriate preallocation if the caller does not
want radix_tree_node_alloc() to return NULL".
>
> So, I think we would get a risk of triggering some issue without preload().
>
> > >
> > > So, I think we can handle the memory failure with xb_preload, which
> > > stops going into the radix tree APIs, but shouldn't call radix tree
> > > APIs without the related memory preallocated.
> >
> > It seems to me that virtio-ballon case has no problem without using
> > preloading.
>
> Why is that?
>
Because you are saying in PATCH 4/7 that it is OK to fail xb_set_page()
due to -ENOMEM (apart from lack of ability to fallback to !use_sg path
when all xb_set_page() calls failed (i.e. no page will be handled because
there is no "1" bit in the xbitmap)).
+static inline int xb_set_page(struct virtio_balloon *vb,
+ struct page *page,
+ unsigned long *pfn_min,
+ unsigned long *pfn_max)
+{
+ unsigned long pfn = page_to_pfn(page);
+ int ret;
+
+ *pfn_min = min(pfn, *pfn_min);
+ *pfn_max = max(pfn, *pfn_max);
+
+ do {
+ ret = xb_preload_and_set_bit(&vb->page_xb, pfn,
+ GFP_NOWAIT | __GFP_NOWARN);
+ } while (unlikely(ret == -EAGAIN));
+
+ return ret;
+}
@@ -173,8 +290,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
while ((page = balloon_page_pop(&pages))) {
balloon_page_enqueue(&vb->vb_dev_info, page);
+ if (use_sg) {
+ if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+ __free_page(page);
+ continue;
+ }
+ } else {
+ set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ }
@@ -223,7 +354,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
page = balloon_page_dequeue(vb_dev_info);
if (!page)
break;
- set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ if (use_sg) {
+ if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
+ balloon_page_enqueue(&vb->vb_dev_info, page);
+ break;
+ }
+ } else {
+ set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+ }
list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
On Sun, Dec 17, 2017 at 01:47:21PM +0000, Wang, Wei W wrote:
> On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
> > On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> > > Here's the API I'm looking at right now. The user need take no lock;
> > > the locking (spinlock) is handled internally to the implementation.
>
> Another place I saw your comment " The xb_ API requires you to handle your own locking" which seems conflict with the above "the user need take no lock".
> Doesn't the caller need a lock to avoid concurrent accesses to the ida bitmap?
Yes, the xb_ implementation requires you to handle your own locking.
The xbit_ API that I'm proposing will take care of the locking for you.
There's also no preallocation in the API.
> We'll change it to "bool xb_find_set(.., unsigned long *result)", returning false indicates no "1" bit is found.
I put a replacement proposal in the next paragraph:
bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned long max);
Maybe 'start' is the wrong name for that parameter. Let's call it 'bit'.
It's both "where to start" and "first bit found".
> > - xbit_clear() can't return an error. Neither can xbit_zero().
>
> I found the current xbit_clear implementation only returns 0, and there isn't an error to be returned from this function. In this case, is it better to make the function "void"?
Yes, I think so.
My only qualm is that I've been considering optimising the memory
consumption when an entire 1024-bit chunk is full; instead of keeping a
pointer to a 128-byte entry full of ones, store a special value in the
radix tree which means "every bit is set".
The downside is that we then have to pass GFP flags to xbit_clear() and
xbit_zero(), and they can fail. It's not clear to me whether that's a
good tradeoff.
> Are you suggesting to rename the current xb_ APIs to the above xbit_ names (with parameter changes)?
>
> Why would we need xbit_alloc, which looks like ida_get_new, I think set/clear should be adequate to the current usages.
I'm intending on replacing the xb_ and ida_ implementations with this one.
It removes the preload API which makes it easier to use, and it handles
the locking for you.
But I need to get the XArray (which replaces the radix tree) finished first.
On 12/18/2017 06:18 AM, Matthew Wilcox wrote:
> On Sun, Dec 17, 2017 at 01:47:21PM +0000, Wang, Wei W wrote:
>> On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
>>> On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
>>> - xbit_clear() can't return an error. Neither can xbit_zero().
>> I found the current xbit_clear implementation only returns 0, and there isn't an error to be returned from this function. In this case, is it better to make the function "void"?
> Yes, I think so.
>
> My only qualm is that I've been considering optimising the memory
> consumption when an entire 1024-bit chunk is full; instead of keeping a
> pointer to a 128-byte entry full of ones, store a special value in the
> radix tree which means "every bit is set".
>
> The downside is that we then have to pass GFP flags to xbit_clear() and
> xbit_zero(), and they can fail. It's not clear to me whether that's a
> good tradeoff.
Yes, this will sacrifice performance. In many usages, users may set bits
one by one, and each time when a bit is set, it needs to scan the whole
ida_bitmap to see if all other bits are set, if so, it can free the
ida_bitmap. I think this extra scanning of the ida_bitmap would add a
lot overhead.
>
>> Are you suggesting to rename the current xb_ APIs to the above xbit_ names (with parameter changes)?
>>
>> Why would we need xbit_alloc, which looks like ida_get_new, I think set/clear should be adequate to the current usages.
> I'm intending on replacing the xb_ and ida_ implementations with this one.
> It removes the preload API which makes it easier to use, and it handles
> the locking for you.
>
> But I need to get the XArray (which replaces the radix tree) finished first.
OK. It seems the new implementation wouldn't be done shortly.
Other parts of this patch series are close to the end of review, and we
hope to make some progress soon. Would it be acceptable that we continue
with the basic xb_ implementation (e.g. as xbitmap 1.0) for this patch
series? and xbit_ implementation can come as xbitmap 2.0 in the future?
Best,
Wei
On Mon, Dec 18, 2017 at 10:33:00AM +0800, Wei Wang wrote:
> > My only qualm is that I've been considering optimising the memory
> > consumption when an entire 1024-bit chunk is full; instead of keeping a
> > pointer to a 128-byte entry full of ones, store a special value in the
> > radix tree which means "every bit is set".
> >
> > The downside is that we then have to pass GFP flags to xbit_clear() and
> > xbit_zero(), and they can fail. It's not clear to me whether that's a
> > good tradeoff.
>
> Yes, this will sacrifice performance. In many usages, users may set bits one
> by one, and each time when a bit is set, it needs to scan the whole
> ida_bitmap to see if all other bits are set, if so, it can free the
> ida_bitmap. I think this extra scanning of the ida_bitmap would add a lot
> overhead.
Not a huge amount of overhead. An ida_bitmap is only two cachelines,
and the loop is simply 'check each word against ~0ul', so up to 16
load/test/loop instructions. Plus we have to do that anyway to maintain
the free tag for IDAs.
> > But I need to get the XArray (which replaces the radix tree) finished first.
>
> OK. It seems the new implementation wouldn't be done shortly.
> Other parts of this patch series are close to the end of review, and we hope
> to make some progress soon. Would it be acceptable that we continue with the
> basic xb_ implementation (e.g. as xbitmap 1.0) for this patch series? and
> xbit_ implementation can come as xbitmap 2.0 in the future?
Yes, absolutely, I don't want to hold you up behind the XArray.
On 12/17/2017 11:16 PM, Tetsuo Handa wrote:
> Wang, Wei W wrote:
>>> Wei Wang wrote:
>>>>> But passing GFP_NOWAIT means that we can handle allocation failure.
>>>>> There is no need to use preload approach when we can handle allocation failure.
>>>> I think the reason we need xb_preload is because radix tree insertion
>>>> needs the memory being preallocated already (it couldn't suffer from
>>>> memory failure during the process of inserting, probably because
>>>> handling the failure there isn't easy, Matthew may know the backstory
>>>> of
>>>> this)
>>> According to https://lwn.net/Articles/175432/ , I think that preloading is
>>> needed only when failure to insert an item into a radix tree is a significant
>>> problem.
>>> That is, when failure to insert an item into a radix tree is not a problem, I
>>> think that we don't need to use preloading.
>> It also mentions that the preload attempts to allocate sufficient memory to *guarantee* that the next radix tree insertion cannot fail.
>>
>> If we check radix_tree_node_alloc(), the comments there says "this assumes that the caller has performed appropriate preallocation".
> If you read what radix_tree_node_alloc() is doing, you will find that
> radix_tree_node_alloc() returns NULL when memory allocation failed.
>
> I think that "this assumes that the caller has performed appropriate preallocation"
> means "The caller has to perform appropriate preallocation if the caller does not
> want radix_tree_node_alloc() to return NULL".
For the radix tree, I agree that we may not need preload. But
ida_bitmap, which the xbitmap is based on, is allocated via preload, so
I think we cannot bypass preload, otherwise, we get no ida_bitmap to use.
Best,
Wei