Dynamic random choice data structure

One of the basic uses of random numbers is choosing an item at random. Most of the time we want a uniform distribution, but sometimes we want a weighted one: the probability of selecting an element is proportional to its weight. In that case we would use NumPy's random.choice(), or C++'s std::discrete_distribution.

The usual algorithm picks a random number and performs a binary search over the cumulative weights (this is what the libstdc++ implementation does) to find the associated item. But some randomized algorithms need to update the weights during execution, and with this approach we have to recompute all the cumulative weights each time we make a change.
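For reference, here is a rough sketch of that baseline (the function name weighted_pick is mine, not from any library): cumulative weights plus a binary search, rebuilt from scratch because any change to a weight invalidates the sums.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Baseline: build the cumulative weights, then binary-search a uniform draw.
// The O(n) partial_sum has to be redone whenever a single weight changes.
std::size_t weighted_pick(const std::vector<double> &weights, std::mt19937 &gen) {
  std::vector<double> cumulative(weights.size());
  std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
  std::uniform_real_distribution<double> dist(0.0, cumulative.back());
  auto it = std::upper_bound(cumulative.begin(), cumulative.end(), dist(gen));
  return it - cumulative.begin();
}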

It’s actually possible to do it in O(log n) time by maintaining a binary tree in which each node contains the sum of the weights of its children. This allows us to do a similar binary search (O(log n)), and to touch only a few nodes when a weight is updated (O(log n)). I read about an implementation based on red-black trees on HN.

I wrote a similar data structure based on an implicit binary tree – getting rid of the pointers and the memory allocations, and using about a third of the memory when storing doubles. It is available on GitHub.
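The idea looks roughly like this. This is only a minimal sketch of the technique, not the code from the GitHub repository, and it assumes positive weights and an item count rounded up to a power of two:

#include <cstddef>
#include <random>
#include <vector>

// Implicit sum tree: node i stores the total weight below it, its children
// are nodes 2*i and 2*i+1, and the leaves (the weights themselves) occupy
// the second half of the array, so no pointers are needed.
struct sum_tree {
  std::size_t n;              // number of items, assumed to be a power of two
  std::vector<double> tree;   // 2*n entries, tree[0] unused, tree[1] is the root

  explicit sum_tree(std::size_t size) : n(size), tree(2 * size, 0.0) {}

  // O(log n): change one weight and fix the sums on the path to the root.
  void set(std::size_t i, double w) {
    i += n;
    tree[i] = w;
    for (i /= 2; i >= 1; i /= 2)
      tree[i] = tree[2 * i] + tree[2 * i + 1];
  }

  // O(log n): walk down from the root to find the leaf matching a uniform draw.
  std::size_t sample(std::mt19937 &gen) const {
    std::uniform_real_distribution<double> dist(0.0, tree[1]);
    double x = dist(gen);
    std::size_t i = 1;
    while (i < n) {
      i *= 2;                 // descend to the left child
      if (x >= tree[i]) {     // the draw falls in the right subtree
        x -= tree[i];
        ++i;
      }
    }
    return i - n;             // index of the selected item
  }
};

Compared with a pointer-based tree, the whole structure is a single contiguous array of 2n doubles, which is where the memory saving comes from.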

C++ reminder #3

A surprise with template type deduction and implicit conversion, when summing an array of floats:

vector<float> to_sum = {0.5, 0.5, 0.5, 0.5};
cout << accumulate(to_sum.begin(), to_sum.end(), 0) << endl;
cout << accumulate(to_sum.begin(), to_sum.end(), 0.0) << endl;
cout << accumulate(to_sum.begin(), to_sum.end(), 0.0f) << endl;

This will print:

0
2
2

The template parameter is deduced from the initial value in the last argument: 0, 0.0 or 0.0f (the first two arguments are iterators). The elements are implicitly converted to this type before summation.

  1. int. Obviously not what was intended.
  2. double. OK, but more precise and slower than float.
  3. float. Probably the intent, but we need the 0.0f or an explicit cast, as shown below.
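
For instance, either of these forms forces the accumulator type to float:

// Both deduce T = float: a value-initialized float, or an explicit cast.
cout << accumulate(to_sum.begin(), to_sum.end(), float{}) << endl;
cout << accumulate(to_sum.begin(), to_sum.end(), static_cast<float>(0)) << endl;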


Note: this is just an example; if you care about precision or performance, this is not the right way of summing floating-point numbers.

C++ reminder

C++ is hard. We often see code written in the hope of avoiding a specific class of errors. For example, if we reset our pointers to nullptr once they are deleted, they can never be deleted twice, right?

struct foo {
  char *p;
  ~foo();
};
foo::~foo() {
  delete p;
  p = nullptr;
}

Check that

Right, the reset gets optimized away. Since reading a member of an object whose lifetime has ended is undefined behaviour, the compiler assumes that p can never be read again and removes the write. Only with optimizations disabled do we get the expected behaviour.

Moral: C++ is hard, and workarounds rarely work the way you expect them to.