加權隨機選擇和不替換。 - Weighted random selection with and without replacement -开发者知识库

加權隨機選擇和不替換。 - Weighted random selection with and without replacement -开发者知识库,第1张

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the resevoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the resevoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.

最近,我需要對列表中的元素進行加權隨機選擇,既包括也不替換。雖然對於非加權選擇有很好的算法,但是對於沒有替換的加權選擇(比如resevoir算法的修改),我找不到合適的算法來替代選擇。我還想避免使用resevoir方法,因為我選擇了一個相當大的列表,它足夠小,可以保存在內存中。

Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

在這種情況下,有人對最佳方法有什么建議嗎?我有自己的解決方案,但我希望能找到更高效、更簡單的方法,或者兩者兼而有之。

7 个解决方案

#1


31  

One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.

從一個不變的列表中獲取替換樣本的最快方法之一是別名方法。核心直覺是,我們可以創建一組大小相等的容器,用於通過位操作非常有效地索引的加權列表,以避免進行二分查找。它將證明,正確地完成,我們只需要將兩個條目從原始列表中存儲到每個bin中,這樣就可以用一個百分比來表示分割。

Let's us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)

讓我們以5個同樣權重的選項為例(a:1, b:1, c:1, d:1, e:1)

To create the alias lookup:

創建別名查找:

  1. Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.

    使權重標准化,使其總和為1.0。(a:0.2 b:0.2 c:0.2 d:0.2 e:0.2)這是每個權重的選擇概率。

  2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.

    找出2大於或等於變量個數的最小值,並創建這幾個分區,|p|。每個分區表示1/|p|的概率質量。在本例中,我們創建了8個分區,每個分區能夠容納0.125個分區。

  3. Take the variable with the least remaining weight, and place as much of it's mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)

    取最小剩余重量的變量,在一個空的分區中盡可能多地放置它的質量。在這個例子中,我們看到一個填充了第一個分區。(p1 { |空,1.0 },p2、p3,p4,p5,p6、p7,p8)和(0.075,b:0.075 c:0.2 d:0.2 e:0.2)

  4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.

    如果沒有填充分區,則使用最重的變量,並用該變量填充分區。

Repeat steps 3 and 4, until none of the weight from the original partition need be assigned to the list.

重復步驟3和步驟4,直到原始分區的權重不需要分配到列表中。

For example, if we run another iteration of 3 and 4, we see

例如,如果我們運行另一個3和4的迭代,我們會看到。

(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned

(p1 { |空,1.0 },p2 { 0.6 } | b,p3,p4,p5,p6、p7,p8)與(0,b:0.15 c:0.2 d:0.2 e:0.2)被分配

At runtime:

在運行時:

  1. Get a U(0,1) random number, say binary 0.001100000

    得到一個U(0,1)隨機數,表示二進制0。001100000。

  2. bitshift it lg2(p), finding the index partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.

    bitshift it lg2(p),找到索引分區。因此,我們將它轉換為3,產生001.1,或位置1,從而划分為2。

  3. If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.

    如果分區是分開的,使用移位的隨機數的小數部分來決定分割。在本例中,值為0.5,而0.5 < 0.6,因此返回a。

Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.

這里有一些代碼和另一種解釋,但不幸的是,它沒有使用位移技術,我也沒有驗證過它。

最佳答案:

本文经用户投稿或网站收集转载,如有侵权请联系本站。

发表评论

0条回复