Warmup: random bitvectors
As a warm-up, we can start with the case where each bitvector is chosen independently and uniformly at random. It turns out that the problem can then be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the exponent $1.6$ can be replaced with $\lg 3$).
We'll consider the following two-set variant of the problem:
Given sets $S, T \subseteq \{0,1\}^k$ of bitvectors, determine whether there exists a non-overlapping pair $s \in S$, $t \in T$.
The basic technique for solving this is divide-and-conquer. Here is an $O(n^{1.6} k)$ time algorithm using divide-and-conquer:
1. Split $S$ and $T$ based upon the first bit position. In other words, form $S_0 = \{s \in S : s_0 = 0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.
2. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. (No pair from $S_1, T_1$ can be non-overlapping, since both vectors have a 1 in the first position.) If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
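The two steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it assumes bitvectors are stored as Python integers with bit `i` read as `(v >> i) & 1`, and the name `find_disjoint_pair` and its `bit` parameter are made up for this example.

```python
from typing import List, Optional, Tuple


def find_disjoint_pair(S: List[int], T: List[int], k: int,
                       bit: int = 0) -> Optional[Tuple[int, int]]:
    """Return some (s, t) with s & t == 0, or None if no such pair exists.

    Invariant: every pair drawn from S x T is already known to be
    non-overlapping on bit positions 0 .. bit-1.
    """
    if not S or not T:
        return None
    if bit == k:
        # Every position has been split on, so any remaining pair works.
        return S[0], T[0]
    if len(S) == 1 or len(T) == 1:
        # Base case from the text: one set has size 1, so scan the other.
        for s in S:
            for t in T:
                if s & t == 0:
                    return s, t
        return None

    mask = 1 << bit
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]

    # (S1, T1) is skipped: every such pair overlaps at this position.
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        pair = find_disjoint_pair(A, B, k, bit + 1)
        if pair is not None:
            return pair
    return None
```

For example, `find_disjoint_pair([0b110, 0b011], [0b101, 0b001], 3)` returns a pair whose bitwise AND is zero, while `find_disjoint_pair([0b111], [0b001, 0b100], 3)` returns `None`.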
Since all bitvectors are chosen at random, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. Thus, we have three recursive calls, and we've reduced the size of the problem by a factor of two (both sets are reduced in size by a factor of two). After $\lg \min(|S|,|T|)$ splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{1.6} k)$. Accounting for running time more precisely in the two-set case, we see the running time is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|)\, k)$.
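To see where the exponent comes from, the recurrence can be unrolled level by level. The following is a sketch that assumes both sets halve exactly at every split and that $n$ is a power of two:

$$T(n) = \sum_{j=0}^{\lg n} 3^j \cdot O\!\left(\frac{n}{2^j}\, k\right) = O\!\left(nk \sum_{j=0}^{\lg n} \left(\tfrac{3}{2}\right)^j\right) = O\!\left(nk \left(\tfrac{3}{2}\right)^{\lg n}\right) = O\!\left(n^{\lg 3}\, k\right),$$

using $(3/2)^{\lg n} = n^{\lg 3 - 1}$, with $\lg 3 \approx 1.585$.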
This can be further improved by noting that if $k \ge 5 \lg n + 250$, then the probability that a non-overlapping pair exists is negligibly small. In particular, if $x, y$ are two random vectors, the probability that they're non-overlapping is $(3/4)^k$. If $|S| = |T| = n$, there are $n^2$ such pairs, so by a union bound, the probability a non-overlapping pair exists is at most $n^2 (3/4)^k$. Since $(3/4)^5 < 1/4$, we have $(3/4)^{5 \lg n} < n^{-2}$ and $(3/4)^{250} < 2^{-100}$, so when $k \ge 5 \lg n + 250$ this bound is $\le 1/2^{100}$. So, as a pre-processing step, if $k \ge 5 \lg n + 250$, then we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small); otherwise we run the above algorithm.
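As a quick sanity check, the union bound $n^2 (3/4)^k$ can be evaluated directly in log space instead of relying on a closed-form threshold for $k$. The function names and the `security_bits` parameter below are made up for this sketch.

```python
import math


def log2_union_bound(n: int, k: int) -> float:
    """log2 of n^2 * (3/4)^k, the union bound on the probability that
    some non-overlapping pair exists among n^2 random pairs."""
    return 2 * math.log2(n) + k * math.log2(3 / 4)


def can_skip_search(n: int, k: int, security_bits: int = 100) -> bool:
    """True if the bound is at most 2^(-security_bits), in which case the
    pre-processing step may answer "no pair" without searching."""
    return log2_union_bound(n, k) <= -security_bits
```

For instance, with $n = 1024$ the bound drops below $2^{-100}$ at $k = 300$ but not at $k = 125$.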
Thus we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.
Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.
Lessons from the warmup
We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1's in that position as possible; the more 0's there are, the less reduction in subproblem size you get.
Third, this suggests that the problem gets harder as the density of 1's gets smaller -- if there are very few 1's among the bitvectors (they are mostly 0's), the problem looks quite hard, as each split reduces the size of the subproblems only a little. So, define the density $\Delta$ to be the fraction of bits that are 1 (i.e., out of all $nk$ bits), and the density $\Delta(i)$ of bit position $i$ to be the fraction of bitvectors that are 1 at position $i$.
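Both densities are easy to compute in $O(nk)$ time. Here is a minimal sketch, assuming bitvectors are stored as Python integers and position $i$ means bit `(v >> i) & 1`; the function names are made up for this example.

```python
from typing import List


def position_densities(vectors: List[int], k: int) -> List[float]:
    """Delta(i) for each i: the fraction of vectors with a 1 at position i."""
    n = len(vectors)
    return [sum((v >> i) & 1 for v in vectors) / n for i in range(k)]


def overall_density(vectors: List[int], k: int) -> float:
    """Delta: the fraction of all n*k bits that are 1, which is the
    average of the per-position densities."""
    return sum(position_densities(vectors, k)) / k
```

For the four vectors `0b101, 0b001, 0b000, 0b100` with `k = 3`, positions 0 and 2 each have density 0.5, position 1 has density 0, and the overall density is 1/3.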
Handling very low density
As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/\sqrt{k}$, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.
Why is this the case? Let's say that a pair of bitvectors $x, y$ is covered by bit position $i$ if $x_i = y_i = 1$. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position $i$, the number of pairs that can be covered by that bit position is at most $(n\Delta(i))^2 < n^2/k$. Summing across all $k$ of the bit positions, we find that the total number of pairs that are covered by some bit position is $< n^2$. Since there are $n^2$ pairs in total, there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.
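The counting argument can be turned into a simple existence test. The sketch below adapts it to the two-set setting, which is an assumption beyond the text: position $i$ covers exactly (ones in $S$ at $i$) times (ones in $T$ at $i$) pairs, and if these counts sum to less than $|S| \cdot |T|$, some pair is uncovered. The name `existence_certificate` is made up for this example. A `False` result is inconclusive; it does not mean that no non-overlapping pair exists.

```python
from typing import List


def existence_certificate(S: List[int], T: List[int], k: int) -> bool:
    """True if counting alone proves a non-overlapping pair exists."""
    covered = 0
    for i in range(k):
        ones_s = sum((s >> i) & 1 for s in S)
        ones_t = sum((t >> i) & 1 for t in T)
        # Pairs (s, t) that both have a 1 at position i.
        covered += ones_s * ones_t
    # Fewer covered pairs (counted with multiplicity) than pairs overall
    # means at least one pair is covered by no position.
    return covered < len(S) * len(T)
```

With `S = T = [0b0001, 0b0010, 0b0100, 0b1000]` and `k = 4`, every position has density $1/4 < 1/\sqrt{4}$, the covered count is 4, and $4 < 16$, so the test succeeds.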
However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on $n$. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.
Towards a general-case algorithm
In the general case, a natural heuristic seems to be: pick the bit position $i$ with the largest number of 1's (i.e., with the highest density), and split on it. In other words:
1. Find a bit position $i$ that maximizes $\Delta(i)$.
2. Split $S$ and $T$ based upon bit position $i$. In other words, form $S_0 = \{s \in S : s_i = 0\}$, $S_1 = \{s \in S : s_i = 1\}$, $T_0 = \{t \in T : t_i = 0\}$, $T_1 = \{t \in T : t_i = 1\}$.
3. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
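The heuristic above can be sketched in Python as follows. Several choices here are assumptions made for illustration rather than part of the procedure as stated: density is measured over $S$ and $T$ combined, counts are recomputed at every node instead of being maintained incrementally, and the name `find_disjoint_pair_greedy` and its `remaining` parameter are made up.

```python
from typing import FrozenSet, List, Optional, Tuple


def find_disjoint_pair_greedy(
    S: List[int], T: List[int], k: int,
    remaining: Optional[FrozenSet[int]] = None,
) -> Optional[Tuple[int, int]]:
    """Return some (s, t) with s & t == 0, or None if none exists.

    Invariant: every pair in S x T is non-overlapping on all positions
    that are no longer in `remaining`.
    """
    if remaining is None:
        remaining = frozenset(range(k))
    if not S or not T:
        return None
    if len(S) == 1 or len(T) == 1:
        for s in S:
            for t in T:
                if s & t == 0:
                    return s, t
        return None
    if not remaining:
        return S[0], T[0]

    def ones(i: int) -> int:
        # Number of 1's at position i across both sets (assumed measure).
        return sum((v >> i) & 1 for v in S) + sum((v >> i) & 1 for v in T)

    best = max(remaining, key=ones)
    if ones(best) == 0:
        # No 1's are left in any remaining position, so any pair works.
        return S[0], T[0]

    mask = 1 << best
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    rest = remaining - {best}
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        pair = find_disjoint_pair_greedy(A, B, k, rest)
        if pair is not None:
            return pair
    return None
```

Recomputing the counts costs $O(nk)$ per node, which keeps the sketch short but is more work than an implementation that updates densities incrementally would do.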
The challenge is to analyze its performance in the worst case.
Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if $\Delta(i) < 1/\sqrt{k}$ for every $i$, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.
What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about $n/\sqrt{k}$ bitvectors (e.g., from $n$ bitvectors to $n - n/\sqrt{k}$ bitvectors). Therefore, the recursion can only go about $\sqrt{k}$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are far fewer than $3^{\sqrt{k}}$ leaves), so I'm not sure what running time this should lead to.