Find non-overlapping bitvectors


17

I give you a list of n bitvectors, each of width k. Your goal is to return two bitvectors from the list that share no 1s, or to report that no such pair exists.

For example, if I give you [00110, 01100, 11000], then the only solution is {00110, 11000}. Alternatively, the input [111, 011, 110, 101] has no solution. And any list containing the all-zeros bitvector 000...0 and another element e has the trivial solution {e, 000...0}.

Here is a slightly harder example with no solution (each row is a bitvector; black squares are 1s and white squares are 0s):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

How can you efficiently find two non-overlapping bitvectors, or show that none exist?

The naive algorithm, where you compare every possible pair, is O(n^2 k). Is it possible to do better?
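For concreteness, here is a minimal sketch of that naive quadratic check, assuming the bitvectors are stored as Python integers (that representation is my own choice, not part of the question):

```python
def find_disjoint_pair_naive(vectors):
    """Return (i, j) such that vectors[i] & vectors[j] == 0, or None.

    vectors: list of k-bit masks stored as Python ints.
    Compares all pairs, i.e. O(n^2) word operations.
    """
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if vectors[i] & vectors[j] == 0:  # no shared 1s
                return i, j
    return None

# Examples from the question:
print(find_disjoint_pair_naive([0b00110, 0b01100, 0b11000]))  # (0, 2)
print(find_disjoint_pair_naive([0b111, 0b011, 0b110, 0b101]))  # None
```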


A possible reduction: build a graph G with one vertex for each vector and an edge between two vertices if the corresponding vectors share a 1. You want to know whether the graph's diameter is 2. But it seems hard to do better than O(n^2 k).
François

@FrançoisGodi Any connected graph component with three nodes and a missing edge has diameter at least two. With an adjacency-list representation, it takes O(V) time to check that.
Craig Gidney

@Strilanc Of course, if there is no solution the graph is complete (more precisely, diameter = 1; you're right), but computing the adjacency-list representation may take long.
François

Is k smaller than the word width of your machine?
กราฟิลส์

1
@TomvanderZanden That sounds like it would violate the invariants the data structure probably relies on. In particular, equality is supposed to be transitive. I have thought about using a trie, and I don't see how to avoid a multiplicative blowup every time the query bitmask has a 0.
Craig Gidney

Answers:


10

Warmup: random bitvectors

As a warm-up, we can start with the case where each bitvector is chosen i.i.d. uniformly at random. Then it turns out that the problem can be solved in O(n^{1.6} min(k, lg n)) time (more precisely, the 1.6 can be replaced by lg 3).

We will consider the following two-set version of the problem:

Given sets S, T ⊆ {0,1}^k of bitvectors, determine whether there exists a non-overlapping pair s ∈ S, t ∈ T.

The basic technique for solving this is divide and conquer. Here is an O(n^{1.6} k)-time algorithm based on divide and conquer (a Python sketch follows the two steps below):

  1. Split S and T according to the first bit position. In other words, form S0 = {s ∈ S : s_0 = 0}, S1 = {s ∈ S : s_0 = 1}, T0 = {t ∈ T : t_0 = 0}, T1 = {t ∈ T : t_0 = 1}.

  2. Now recursively look for a non-overlapping pair from S0, T0, from S0, T1, and from S1, T0. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
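Here is a rough Python sketch of this two-set divide and conquer. The integer-bitmask representation and function name are my own choices, and the union-bound shortcut for large k (discussed below) is omitted:

```python
def find_disjoint_pair(S, T, k, bit=0):
    """Sketch of the warm-up divide-and-conquer for the two-set problem.

    S, T: lists of k-bit masks stored as Python ints.
    Returns some (s, t) with s & t == 0, or None if no such pair exists.
    """
    if not S or not T:
        return None
    # Once one set is down to a single element (or bits are exhausted),
    # finish by brute force; with one set of size 1 this is linear time.
    if bit >= k or min(len(S), len(T)) <= 1:
        for s in S:
            for t in T:
                if s & t == 0:
                    return s, t
        return None
    mask = 1 << bit
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    # The (S1, T1) combination is skipped: those pairs share a 1 at this bit.
    return (find_disjoint_pair(S0, T0, k, bit + 1)
            or find_disjoint_pair(S0, T1, k, bit + 1)
            or find_disjoint_pair(S1, T0, k, bit + 1))
```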

Since all bitvectors are chosen at random, we can expect |S_b| ≈ |S|/2 and |T_b| ≈ |T|/2. Thus, we have three recursive calls, and we've reduced the size of the problem by a factor of two (both sets are reduced in size by a factor of two). After lg min(|S|,|T|) splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of T(n) = 3T(n/2) + O(nk), whose solution is T(n) = O(n^{1.6} k). Accounting for running time more precisely in the two-set case, we see the running time is O(min(|S|,|T|)^{0.6} max(|S|,|T|) k).

This can be further improved, by noting that if k ≥ 5 lg n + 250, then the probability that a non-overlapping pair exists is exponentially small. In particular, if x, y are two random vectors, the probability that they're non-overlapping is (3/4)^k. If |S| = |T| = n, there are n^2 such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most n^2 (3/4)^k. When k ≥ 5 lg n + 250, this is at most 1/2^100. So, as a pre-processing step, if k ≥ 5 lg n + 250, then we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small), otherwise we run the above algorithm.

Thus we achieve a running time of O(n^{1.6} min(k, lg n)) (or O(min(|S|,|T|)^{0.6} max(|S|,|T|) min(k, lg n)) for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1's in that position as possible; the more 0's there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of 1's gets smaller -- if there are very few 1's among the bitvectors (they are mostly 0's), the problem looks quite hard, as each split reduces the size of the subproblems only a little bit. So, define the density Δ to be the fraction of bits that are 1 (i.e., out of all nk bits), and the density Δ(i) of bit position i to be the fraction of bitvectors that are 1 at position i.

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than 1/k, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Let's say that a pair of bitvectors x, y is covered by bit position i if x_i = y_i = 1. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position i, the number of pairs that can be covered by that bit position is at most (nΔ(i))^2 < n^2/k. Summing across all k of the bit positions, we find that the total number of pairs that are covered by some bit position is < n^2. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on n. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position i with the largest number of 1's (i.e., with the highest density), and split on it. In other words (a sketch in code follows the steps below):

  1. Find a bit position i that maximizes Δ(i).

  2. Split S and T based upon bit position i. In other words, form S0 = {s ∈ S : s_i = 0}, S1 = {s ∈ S : s_i = 1}, T0 = {t ∈ T : t_i = 0}, T1 = {t ∈ T : t_i = 1}.

  3. Now recursively look for a non-overlapping pair from S0, T0, from S0, T1, and from S1, T0. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
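A rough Python sketch of steps 1 and 2; the recursion in step 3 has the same shape as the warm-up sketch earlier. The helper names and the integer-bitmask representation are my own, not part of the answer:

```python
def densest_bit(vectors, k):
    """Step 1: return the bit position i that maximizes the density Δ(i)."""
    counts = [0] * k
    for v in vectors:
        for i in range(k):
            if v >> i & 1:
                counts[i] += 1
    return max(range(k), key=lambda i: counts[i])

def split_on_bit(vectors, i):
    """Step 2: partition the vectors into (bit i == 0, bit i == 1)."""
    zeros = [v for v in vectors if not v >> i & 1]
    ones = [v for v in vectors if v >> i & 1]
    return zeros, ones
```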

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if Δ(i) < 1/k for every i, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in O(nk) time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about n/k bitvectors (e.g., from n bitvectors to n − n/k bitvectors). Therefore, the recursion can only go about k levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are a lot less than 3^k leaves), so I'm not sure what running time this should lead to.


ad low density: this seems to be some kind of pigeon-hole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds because the (S1,T1)-case (we don't recurse to) already gets rid of "most" ones?
Raphael

The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than ck ones, we have at least c overlaps.
Raphael

By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column i? We only need to recurse in the 0-case (there is no solution among those that share a one at i). In expectation, that gives via T(n) = T(n/2) + O(nk) a bound of O(nk) (if k is fixed). For a general bound, you have shown that (assuming the lower-bound cutoff you propose) we get rid of at least n/k elements with every split, which seems to imply an O(nk) worst-case bound. Or am I missing something?
Raphael

Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.
Raphael

@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like T(n) = T(n − n/k) + O(nk), (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)
D.W.

8

Faster solution when n ≈ k, using matrix multiplication

Suppose that n = k. Our goal is to do better than an O(n^2 k) = O(n^3) running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bitposition-representing nodes on the other), and has n+k=2n nodes.

Given the adjacency matrix M of a graph, we can tell if there is a two-hop path between two vertices by squaring M and checking if the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an n x n matrix can be done in O(n^ω) time, where ω is known to be under 2.373 and conjectured to be 2.

So the algorithm is:

  • Convert the bitvectors and bit positions into a bipartite graph with n+k nodes and at most nk edges. This takes O(nk) time.
  • Compute the adjacency matrix of the graph. This takes O((n+k)^2) time and space.
  • Square the adjacency matrix. This takes O((n+k)^ω) time.
  • Search the bitvector-versus-bitvector block of the squared matrix for zero entries. This takes O(n^2) time.

The most expensive step is squaring the adjacency matrix. If n = k then the overall algorithm takes O((n+k)^ω) = O(n^ω) time, which is better than the naive O(n^3) time.
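Here is a small numpy sketch of these steps; the names are mine, and numpy's dense matmul is cubic, so this illustrates the structure rather than the O((n+k)^ω) bound:

```python
import numpy as np

def disjoint_pair_via_squaring(vectors, k):
    """Build the bipartite adjacency matrix, square it, and look for a zero
    entry between two bitvector nodes (= a non-overlapping pair)."""
    n = len(vectors)
    A = np.zeros((n + k, n + k), dtype=np.int64)
    for v_idx, v in enumerate(vectors):
        for bit in range(k):
            if v >> bit & 1:
                A[v_idx, n + bit] = 1      # edge: bitvector -> bit position
                A[n + bit, v_idx] = 1      # and back (undirected graph)
    A2 = A @ A                             # counts of 2-hop paths
    for i in range(n):                     # bitvector-vs-bitvector block
        for j in range(i + 1, n):
            if A2[i, j] == 0:              # no shared bit position
                return i, j
    return None

print(disjoint_pair_via_squaring([0b00110, 0b01100, 0b11000], 5))  # (0, 2)
```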

This solution is also faster when k grows not-too-much-slower and not-too-much-faster than n. As long as k ∈ Ω(n^{ω−2}) and k ∈ O(n^{2/(ω−1)}), then (n+k)^ω is better than n^2 k. For ω ≈ 2.373 that translates roughly to n^{0.373} ≲ k ≲ n^{1.457} (asymptotically). If ω limits to 2, then the bounds widen towards n^ε ≲ k ≲ n^{2−ε}.


1. This is also better than the naive solution if k = Ω(n) but k = o(n^{1.457}). 2. If k ≥ n, a heuristic could be: pick a random subset of n bit positions, restrict to those bit positions and use matrix multiplication to enumerate all pairs that don't overlap in those n bit positions; for each such pair, check if it solves the original problem. If there aren't many pairs that don't overlap in those n bit positions, this provides a speedup over the naive algorithm. However I don't know a good upper bound on the number of such pairs.
D.W.

4

This is equivalent to finding a bit vector which is a subset of the complement of another vector; ie its 1's occur only where 0's occur in the other.

If k (or the number of 1's) is small, you can get O(n·2^k) time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we can check each before complement-subset insertion) then we have a non-overlapping pair.

If the number of 1's or 0's is bounded by something even smaller than k, then that smaller bound can replace k in the exponent. The subset-indexing can be done on either each vector or its complement, as long as probing uses the opposite.
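To make the idea concrete, here is a sketch that uses a hash set of complement-submasks as a simple stand-in for the trie; the submask-enumeration loop and all names are my own choices, and the cost is of the same O(n·2^k) flavor:

```python
def find_disjoint_pair_subsets(vectors, k):
    """For each vector, first check whether it equals a stored submask of
    some earlier vector's complement; then store every submask of its own
    complement. A stored hit means the two vectors share no 1s."""
    full = (1 << k) - 1
    seen = {}                      # submask of some complement -> owner index
    for idx, v in enumerate(vectors):
        if v in seen:              # v fits inside the 0s of vectors[seen[v]]
            return seen[v], idx
        comp = full & ~v
        sub = comp                 # standard submask enumeration
        while True:
            seen.setdefault(sub, idx)
            if sub == 0:
                break
            sub = (sub - 1) & comp
    return None

print(find_disjoint_pair_subsets([0b00110, 0b01100, 0b11000], 5))  # (0, 2)
```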

There's also a scheme for superset-finding in a trie that stores each vector only once, but does bit-skipping during probes, for what I believe is a similar aggregate complexity; i.e., it has O(k) insertion but O(2^k) searches.


thanks. The complexity of your solution is n·2^{(1−p)k}, where p is the probability of 1's in the bitvector. A couple of implementation details: though this is a slight improvement, there's no need to compute and store the complements in the trie. Just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0's directly as wildcards, no special wildcard is needed, either.
Mauro Lacy

2

Represent the bit vectors as an n×k matrix M. Take i and j between 1 and n.

(MM^T)_{ij} = Σ_l M_{il} M_{jl}.

(MM^T)_{ij}, the dot product of the i-th and j-th vectors, is non-zero if, and only if, vectors i and j share a common 1. So, to find a solution, compute MM^T and return the position of a zero entry, if such an entry exists.
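A short numpy illustration of this, using the example from the question (numpy's ordinary matrix product stands in for a fast multiplication routine):

```python
import numpy as np

# Rows of M are the bitvectors from the question's first example (n = 3, k = 5).
M = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 0, 0, 0]])

G = M @ M.T                               # G[i, j] = dot product of vectors i and j
pairs = np.argwhere(G == 0)               # zero entries mark non-overlapping pairs
pairs = pairs[pairs[:, 0] < pairs[:, 1]]  # keep i < j only
print(pairs)                              # [[0 2]] -> vectors 0 and 2 are disjoint
```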

Complexity

Using naive multiplication, this requires O(n^2 k) arithmetic operations. If n = k, it takes O(n^{2.37}) operations using the utterly impractical Coppersmith-Winograd algorithm, or O(n^{2.8}) using the Strassen algorithm. If k = O(n^{0.302}), then the problem may be solved using n^{2+o(1)} operations.


How is this different from Strilanc's answer?
D.W.

1
@D.W. Using an n-by-k matrix instead of an (n+k)-by-(n+k) matrix is an improvement. Also it mentions a way to cut off the factor of k when k << n, so that might be useful.
Craig Gidney
Licensed under cc by-sa 3.0 with attribution required.