Find non-overlapping bitvectors


17

I give you a list of n bitvectors, each of width k. Your goal is to return two bitvectors from the list that share no 1s, or to report that no such pair exists.

For example, if I give you [00110, 01100, 11000], then the only solution is {00110, 11000}. Alternatively, the input [111, 011, 110, 101] has no solution. And any list containing the all-zeros bitvector 000...0 and another element e has the trivial solution {e, 000...0}.

Here is a slightly harder example with no solution (each row is a bitvector; black squares are 1s and white squares are 0s):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

How can you efficiently find two non-overlapping bitvectors, or show that none exist?

The naive algorithm, where you compare every possible pair, is O(n^2 k). Is it possible to do better?
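For concreteness, here is a minimal sketch of that naive quadratic check, assuming the bitvectors are stored as Python integers (that representation is my own choice, not part of the question):

```python
def find_disjoint_pair_naive(vectors):
    """Return (i, j) such that vectors[i] & vectors[j] == 0, or None.

    vectors: list of k-bit masks stored as Python ints.
    Compares all pairs, i.e. O(n^2) word operations.
    """
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if vectors[i] & vectors[j] == 0:  # no shared 1s
                return i, j
    return None

# Examples from the question:
print(find_disjoint_pair_naive([0b00110, 0b01100, 0b11000]))  # (0, 2)
print(find_disjoint_pair_naive([0b111, 0b011, 0b110, 0b101]))  # None
```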


A possible reduction: build a graph G with one vertex for each vector and an edge between two vertices if the corresponding vectors share a 1. You want to know whether the graph's diameter is 2. But it seems hard to do better than O(n^2 k).
François

@FrançoisGodi Any connected graph component with three nodes and a missing edge has diameter at least two. With an adjacency-list representation, it takes O(V) time to check that.
Craig Gidney

@Strilanc Of course, if there is no solution the graph is complete (more precisely, diameter = 1; you're right), but computing the adjacency-list representation may take long.
François

Is k smaller than the word width of your machine?
กราฟิลส์

1
@TomvanderZanden That sounds like it would violate the invariants the data structure probably relies on. In particular, equality is supposed to be transitive. I have thought about using a trie, and I don't see how to avoid a multiplicative blowup every time the query bitmask has a 0.
Craig Gidney

Answers:


10

Warmup: random bitvectors

As a warm-up, we can start with the case where each bitvector is chosen i.i.d. uniformly at random. Then it turns out that the problem can be solved in O(n^{1.6} min(k, lg n)) time (more precisely, the 1.6 can be replaced by lg 3).

We will consider the following two-set version of the problem:

Given sets S, T ⊆ {0,1}^k of bitvectors, determine whether there exists a non-overlapping pair s ∈ S, t ∈ T.

The basic technique for solving this is divide and conquer. Here is an O(n^{1.6} k)-time algorithm based on divide and conquer (a Python sketch follows the two steps below):

  1. Split S and T according to the first bit position. In other words, form S0 = {s ∈ S : s_0 = 0}, S1 = {s ∈ S : s_0 = 1}, T0 = {t ∈ T : t_0 = 0}, T1 = {t ∈ T : t_0 = 1}.

  2. Now recursively look for a non-overlapping pair from S0, T0, from S0, T1, and from S1, T0. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
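Here is a rough Python sketch of this two-set divide and conquer. The integer-bitmask representation and function name are my own choices, and the union-bound shortcut for large k (discussed below) is omitted:

```python
def find_disjoint_pair(S, T, k, bit=0):
    """Sketch of the warm-up divide-and-conquer for the two-set problem.

    S, T: lists of k-bit masks stored as Python ints.
    Returns some (s, t) with s & t == 0, or None if no such pair exists.
    """
    if not S or not T:
        return None
    # Once one set is down to a single element (or bits are exhausted),
    # finish by brute force; with one set of size 1 this is linear time.
    if bit >= k or min(len(S), len(T)) <= 1:
        for s in S:
            for t in T:
                if s & t == 0:
                    return s, t
        return None
    mask = 1 << bit
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    # The (S1, T1) combination is skipped: those pairs share a 1 at this bit.
    return (find_disjoint_pair(S0, T0, k, bit + 1)
            or find_disjoint_pair(S0, T1, k, bit + 1)
            or find_disjoint_pair(S1, T0, k, bit + 1))
```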

Since all bitvectors are chosen at random, we can expect |S_b| ≈ |S|/2 and |T_b| ≈ |T|/2. Thus, we have three recursive calls, and we've reduced the size of the problem by a factor of two (both sets are reduced in size by a factor of two). After lg min(|S|,|T|) splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of T(n) = 3T(n/2) + O(nk), whose solution is T(n) = O(n^{1.6} k). Accounting for running time more precisely in the two-set case, we see the running time is O(min(|S|,|T|)^{0.6} max(|S|,|T|) k).

This can be further improved, by noting that if k ≥ 5 lg n + 250, then the probability that a non-overlapping pair exists is exponentially small. In particular, if x, y are two random vectors, the probability that they're non-overlapping is (3/4)^k. If |S| = |T| = n, there are n^2 such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most n^2 (3/4)^k. When k ≥ 5 lg n + 250, this is at most 1/2^100. So, as a pre-processing step, if k ≥ 5 lg n + 250, then we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small), otherwise we run the above algorithm.

Thus we achieve a running time of O(n^{1.6} min(k, lg n)) (or O(min(|S|,|T|)^{0.6} max(|S|,|T|) min(k, lg n)) for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1's in that position as possible; the more 0's there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of 1's gets smaller -- if there are very few 1's among the bitvectors (they are mostly 0's), the problem looks quite hard, as each split reduces the size of the subproblems only a little bit. So, define the density Δ to be the fraction of bits that are 1 (i.e., out of all nk bits), and the density Δ(i) of bit position i to be the fraction of bitvectors that are 1 at position i.

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than 1/k, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Let's say that a pair of bitvectors x, y is covered by bit position i if x_i = y_i = 1. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position i, the number of pairs that can be covered by that bit position is at most (nΔ(i))^2 < n^2/k. Summing across all k of the bit positions, we find that the total number of pairs that are covered by some bit position is < n^2. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on n. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position i with the largest number of 1's (i.e., with the highest density), and split on it. In other words (a sketch in code follows the steps below):

  1. Find a bit position i that maximizes Δ(i).

  2. Split S and T based upon bit position i. In other words, form S0 = {s ∈ S : s_i = 0}, S1 = {s ∈ S : s_i = 1}, T0 = {t ∈ T : t_i = 0}, T1 = {t ∈ T : t_i = 1}.

  3. Now recursively look for a non-overlapping pair from S0, T0, from S0, T1, and from S1, T0. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
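A rough Python sketch of steps 1 and 2; the recursion in step 3 has the same shape as the warm-up sketch earlier. The helper names and the integer-bitmask representation are my own, not part of the answer:

```python
def densest_bit(vectors, k):
    """Step 1: return the bit position i that maximizes the density Δ(i)."""
    counts = [0] * k
    for v in vectors:
        for i in range(k):
            if v >> i & 1:
                counts[i] += 1
    return max(range(k), key=lambda i: counts[i])

def split_on_bit(vectors, i):
    """Step 2: partition the vectors into (bit i == 0, bit i == 1)."""
    zeros = [v for v in vectors if not v >> i & 1]
    ones = [v for v in vectors if v >> i & 1]
    return zeros, ones
```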

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if Δ(i) < 1/k for every i, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in O(nk) time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about n/k bitvectors (e.g., from n bitvectors to n − n/k bitvectors). Therefore, the recursion can only go about k levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are a lot less than 3^k leaves), so I'm not sure what running time this should lead to.


ad low density: this seems to be some kind of pigeon-hole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds because the (S1,T1)-case (we don't recurse to) already gets rid of "most" ones?
Raphael

The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than ck ones, we have at least c overlaps.
Raphael

By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column i? We only need to recurse in the 0-case (there is no solution among those that share a one at i). In expectation, that gives via T(n) = T(n/2) + O(nk) a bound of O(nk) (if k is fixed). For a general bound, you have shown that (assuming the lower-bound cutoff you propose) we get rid of at least n/k elements with every split, which seems to imply an O(nk) worst-case bound. Or am I missing something?
Raphael

Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.
Raphael

@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like T(n) = T(n − n/k) + O(nk), (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)
D.W.

8

Faster solution when n ≈ k, using matrix multiplication

Suppose that n = k. Our goal is to do better than an O(n^2 k) = O(n^3) running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bitposition-representing nodes on the other), and has n+k=2n nodes.

Given the adjacency matrix M of a graph, we can tell if there is a two-hop path between two vertices by squaring M and checking if the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an n x n matrix can be done in O(n^ω) time, where ω is known to be under 2.373 and conjectured to be 2.

So the algorithm is:

  • Convert the bitvectors and bit positions into a bipartite graph with n+k nodes and at most nk edges. This takes O(nk) time.
  • Compute the adjacency matrix of the graph. This takes O((n+k)^2) time and space.
  • Square the adjacency matrix. This takes O((n+k)^ω) time.
  • Search the bitvector-versus-bitvector block of the squared matrix for zero entries. This takes O(n^2) time.

The most expensive step is squaring the adjacency matrix. If n = k then the overall algorithm takes O((n+k)^ω) = O(n^ω) time, which is better than the naive O(n^3) time.
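Here is a small numpy sketch of these steps; the names are mine, and numpy's dense matmul is cubic, so this illustrates the structure rather than the O((n+k)^ω) bound:

```python
import numpy as np

def disjoint_pair_via_squaring(vectors, k):
    """Build the bipartite adjacency matrix, square it, and look for a zero
    entry between two bitvector nodes (= a non-overlapping pair)."""
    n = len(vectors)
    A = np.zeros((n + k, n + k), dtype=np.int64)
    for v_idx, v in enumerate(vectors):
        for bit in range(k):
            if v >> bit & 1:
                A[v_idx, n + bit] = 1      # edge: bitvector -> bit position
                A[n + bit, v_idx] = 1      # and back (undirected graph)
    A2 = A @ A                             # counts of 2-hop paths
    for i in range(n):                     # bitvector-vs-bitvector block
        for j in range(i + 1, n):
            if A2[i, j] == 0:              # no shared bit position
                return i, j
    return None

print(disjoint_pair_via_squaring([0b00110, 0b01100, 0b11000], 5))  # (0, 2)
```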

This solution is also faster when k grows not-too-much-slower and not-too-much-faster than n. As long as k ∈ Ω(n^{ω−2}) and k ∈ O(n^{2/(ω−1)}), then (n+k)^ω is better than n^2 k. For ω ≈ 2.373 that translates roughly to n^{0.373} ≲ k ≲ n^{1.457} (asymptotically). If ω limits to 2, then the bounds widen towards n^ε ≲ k ≲ n^{2−ε}.


1. This is also better than the naive solution if k = Ω(n) but k = o(n^{1.457}). 2. If k ≥ n, a heuristic could be: pick a random subset of n bit positions, restrict to those bit positions and use matrix multiplication to enumerate all pairs that don't overlap in those n bit positions; for each such pair, check if it solves the original problem. If there aren't many pairs that don't overlap in those n bit positions, this provides a speedup over the naive algorithm. However I don't know a good upper bound on the number of such pairs.
D.W.

4

This is equivalent to finding a bit vector which is a subset of the complement of another vector; ie its 1's occur only where 0's occur in the other.

If k (or the number of 1's) is small, you can get O(n·2^k) time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we can check each before complement-subset insertion) then we have a non-overlapping pair.

If the number of 1's or 0's is bounded by something even smaller than k, then that smaller bound can replace k in the exponent. The subset-indexing can be done on either each vector or its complement, as long as probing uses the opposite.
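To make the idea concrete, here is a sketch that uses a hash set of complement-submasks as a simple stand-in for the trie; the submask-enumeration loop and all names are my own choices, and the cost is of the same O(n·2^k) flavor:

```python
def find_disjoint_pair_subsets(vectors, k):
    """For each vector, first check whether it equals a stored submask of
    some earlier vector's complement; then store every submask of its own
    complement. A stored hit means the two vectors share no 1s."""
    full = (1 << k) - 1
    seen = {}                      # submask of some complement -> owner index
    for idx, v in enumerate(vectors):
        if v in seen:              # v fits inside the 0s of vectors[seen[v]]
            return seen[v], idx
        comp = full & ~v
        sub = comp                 # standard submask enumeration
        while True:
            seen.setdefault(sub, idx)
            if sub == 0:
                break
            sub = (sub - 1) & comp
    return None

print(find_disjoint_pair_subsets([0b00110, 0b01100, 0b11000], 5))  # (0, 2)
```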

There's also a scheme for superset-finding in a trie that stores each vector only once, but does bit-skipping during probes, for what I believe is a similar aggregate complexity; i.e., it has O(k) insertion but O(2^k) searches.


thanks. The complexity of your solution is n·2^{(1−p)k}, where p is the probability of 1's in the bitvector. A couple of implementation details: though this is a slight improvement, there's no need to compute and store the complements in the trie. Just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0's directly as wildcards, no special wildcard is needed, either.
Mauro Lacy

2

Represent the bit vectors as an n×k matrix M. Take i and j between 1 and n.

(MM^T)_{ij} = Σ_l M_{il} M_{jl}.

(MM^T)_{ij}, the dot product of the i-th and j-th vectors, is non-zero if, and only if, vectors i and j share a common 1. So, to find a solution, compute MM^T and return the position of a zero entry, if such an entry exists.
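A short numpy illustration of this, using the example from the question (numpy's ordinary matrix product stands in for a fast multiplication routine):

```python
import numpy as np

# Rows of M are the bitvectors from the question's first example (n = 3, k = 5).
M = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 0, 0, 0]])

G = M @ M.T                               # G[i, j] = dot product of vectors i and j
pairs = np.argwhere(G == 0)               # zero entries mark non-overlapping pairs
pairs = pairs[pairs[:, 0] < pairs[:, 1]]  # keep i < j only
print(pairs)                              # [[0 2]] -> vectors 0 and 2 are disjoint
```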

Complexity

Using naive multiplication, this requires O(n^2 k) arithmetic operations. If n = k, it takes O(n^{2.37}) operations using the utterly impractical Coppersmith-Winograd algorithm, or O(n^{2.8}) using the Strassen algorithm. If k = O(n^{0.302}), then the problem may be solved using n^{2+o(1)} operations.


How is this different from Strilanc's answer?
D.W.

1
@D.W. Using an n-by-k matrix instead of an (n+k)-by-(n+k) matrix is an improvement. Also it mentions a way to cut off the factor of k when k << n, so that might be useful.
Craig Gidney
Licensed under cc by-sa 3.0 with attribution required.