Estimating n in the coupon collector's problem

14

In a variation of the coupon collector's problem, you do not know the number of coupons and must determine it from data. I will refer to this as the fortune cookie problem:

Given an unknown number of distinct fortune cookie messages $n$, estimate $n$ by sampling cookies one at a time and counting how many times each fortune appears. Also determine the number of samples needed to get a desired confidence interval on this estimate.

Basically, I need an algorithm that samples just enough data to reach a given confidence interval, say $n \pm 5$ with $95\%$ confidence. For simplicity, we can assume that all fortunes appear with equal probability/frequency, but this is not true for the more general problem.

This seems similar to the German tank problem, but in this case the fortune cookies are not labeled sequentially and so have no ordering.
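To make the setup concrete, here is a minimal simulation sketch of the sampling process under the equal-probability assumption. The true number of fortunes (`n_true`) and the sample size of 120 are made-up values for illustration only; in the real problem `n_true` is the unknown.

```r
# Simulate drawing fortune cookies one at a time, with replacement,
# from a hypothetical population of equally likely fortunes.
set.seed(1)
n_true <- 50                                   # hypothetical true number of fortunes
draws  <- sample.int(n_true, size = 120, replace = TRUE)

counts       <- table(draws)    # how often each observed fortune appeared
freq_of_freq <- table(counts)   # how many fortunes were seen once, twice, ...

length(draws)    # total sample size
length(counts)   # number of distinct fortunes observed
freq_of_freq
```

The number of distinct fortunes seen and the frequency-of-frequencies table are the only summaries the estimators discussed in the answers rely on.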

estimation coupon-collector-problem

— goweon

1

Do we know that the messages are equally frequent?

— Glen_b -Reinstate Monica

Edited the question: yes

— goweon

2

Can you write down the likelihood function?

— Zen

2

People who study wildlife capture, tag, and release animals. Later they infer the size of the population from how often they recapture already-tagged animals. It sounds like your problem is mathematically equivalent to theirs.

— Emil Friedman

6

For the equal probability/frequency case, this approach may work for you.

Let $K$ be the total sample size, $N$ the number of different items observed, $N_1$ the number of items seen exactly once, $N_2$ the number of items seen exactly twice, $A=N_1(1- {N_1 \over K} )+2N_2,$ and $\hat Q = {N_1 \over K}.$

Then an approximate 95% confidence interval on the total population size $n$ is given by

$\hat n_{Lower}={1 \over {1-\hat Q+{1.96 \sqrt{A} \over K} }}$

$\hat n_{Upper}={1 \over {1-\hat Q-{1.96 \sqrt{A} \over K} }}$

When implementing, you may need to adjust these depending on your data.
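A minimal R sketch of the quantities above, assuming the data arrive as a vector of observed fortune labels. The function name `esty_interval` and the `*_scaled` outputs are hypothetical, not from the source. The `lower` and `upper` values follow the two formulas exactly as written. Under equal probabilities the sample coverage equals $N/n$, so the sketch also reports those bounds multiplied by $N$; that scaling is my own reading and should be checked against Esty's paper before use.

```r
# Coverage-based interval from a vector of observed labels `x`.
esty_interval <- function(x, z = 1.96) {
  counts <- table(x)
  K  <- length(x)          # total sample size
  N  <- length(counts)     # number of different items observed
  N1 <- sum(counts == 1)   # items seen exactly once
  N2 <- sum(counts == 2)   # items seen exactly twice
  Q  <- N1 / K
  A  <- N1 * (1 - N1 / K) + 2 * N2
  half <- z * sqrt(A) / K
  # Bounds exactly as in the formulas above; the upper bound is only
  # meaningful while its denominator 1 - Q - half stays positive.
  lower <- 1 / (1 - Q + half)
  upper <- 1 / (1 - Q - half)
  list(K = K, N = N, N1 = N1, N2 = N2, coverage = 1 - Q,
       lower = lower, upper = upper,
       # Assumption (not in the source): scale by N to get item counts.
       lower_scaled = N * lower, upper_scaled = N * upper)
}

# Usage on simulated data with a hypothetical true population of 100.
set.seed(42)
x   <- sample.int(100, size = 300, replace = TRUE)
res <- esty_interval(x)
res
```

If many items are still being seen for the first time, `1 - Q - half` can reach zero or go negative, which is one situation where the "adjust depending on your data" caveat applies.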

This method is due to Good and Turing. A reference with the confidence interval is Esty, Warren W. (1983), "A Normal Limit Law for a Nonparametric Estimator of the Coverage of a Random Sample", Ann. Statist., Volume 11, Number 3, 905-912.

For the more general problem, Bunge has produced free software that produces several estimates. Search with his name and the word CatchAll.

— soakley

1

I took the liberty of adding the Esty reference. Please double-check that it is the one you meant.

— Glen_b -Reinstate Monica

@soakley Is it possible to get bounds (probably less precise bounds) if you only know $K$ (the sample size) and $N$ (the number of unique items seen)? That is, we have no information about $N_1$ and $N_2$.

— Basj

I don't know of a way to do it with only $K$ and $N.$

— soakley

2

I don't know if it helps, but this is the problem of drawing $k$ different balls during $n$ trials from an urn with $m$ differently labeled balls, with replacement. According to this page (in French), if $X_n$ is the random variable counting the number of different balls, the probability function is given by: $P(X_n = k) = {m \choose k} \sum_{i=0}^k (-1)^{k-i}{k \choose i}\left(\frac{i}{m}\right)^n$

Then you can use a maximum likelihood estimator.

Another formula, with proof, is given here in the solution of the occupancy problem.
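As a rough illustration, the probability function above can be evaluated directly and maximised over $m$ on a grid. This is a sketch only: the function name `p_distinct` and the example values `n = 10`, `k = 7`, and the grid limit of 200 are my own choices. The alternating sum loses precision for large $k$, so this direct form is only suitable for small examples.

```r
# P(X_n = k): probability of seeing exactly k distinct balls in n draws
# with replacement from an urn holding m equally likely balls.
p_distinct <- function(k, n, m) {
  i <- 0:k
  choose(m, k) * sum((-1)^(k - i) * choose(k, i) * (i / m)^n)
}

# Maximum likelihood by brute force: evaluate the probability of the
# observed k for each candidate m and keep the largest.
n <- 10
k <- 7
m_grid <- k:200
lik    <- sapply(m_grid, function(m) p_distinct(k, n, m))
m_hat  <- m_grid[which.max(lik)]
m_hat
```

A quick sanity check is that, for a fixed $m$, the probabilities over all possible $k$ sum to one.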

— Sylvain

2

Likelihood function and probability

In an answer to a question about the reverse birthday problem, a solution for the likelihood function has been given by Cody Maughan.

The likelihood function for the number of fortune cookie types $m$ when we draw $k$ different fortune cookies in $n$ draws (where every fortune cookie type has equal probability of appearing in a draw) can be expressed as:

$\begin{array}{rcl} \mathcal{L}(m \, \vert \, k,n ) = m^{-n} \frac{m!}{(m-k)!} \propto P(k \, \vert \, m,n) &=& m^{-n}\frac{m!}{(m-k)!} \cdot \underbrace{S(n,k)}_{\begin{subarray}{l}\text{Stirling number }\\ \text{of the 2nd kind}\end{subarray}}\\ &=& m^{-n}\frac{m!}{(m-k)!} \cdot \frac{1}{k!} \sum_{i=0}^k (-1)^{i}{k \choose i}(k-i)^n \\ &=& {m \choose k} \sum_{i=0}^k (-1)^{i}{k \choose i}\left(\frac{k-i}{m}\right)^n \end{array}$

For a derivation of the probability on the right-hand side, see the occupancy problem. This has been described before on this site by Ben. The expression is similar to the one in the answer by Sylvain.

Maximum likelihood estimate

We can compute first-order and second-order approximations of the maximum of the likelihood function at

$m_1 \approx \frac{ {{n}\choose{2}}}{n-k}$

$m_2 \approx \frac{ {{n}\choose{2}} + \sqrt{{{n}\choose{2}}^2 - 4(n-k) {{n}\choose{3}}}}{2(n-k)}$
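To see how the two approximations compare with a brute-force maximum, here is a small sketch that evaluates the log of $m^{-n} \frac{m!}{(m-k)!}$ on a grid. The example values `n = 200`, `k = 180` and the grid limit of 20000 are arbitrary choices of mine.

```r
# Log-likelihood of m, up to a constant that does not depend on m.
loglik <- function(m, n, k) -n * log(m) + lfactorial(m) - lfactorial(m - k)

n <- 200
k <- 180

# Brute-force maximiser over integer m >= k.
m_grid  <- k:20000
m_exact <- m_grid[which.max(loglik(m_grid, n, k))]

# First- and second-order approximations from the formulas above.
m1 <- choose(n, 2) / (n - k)
m2 <- (choose(n, 2) +
       sqrt(choose(n, 2)^2 - 4 * (n - k) * choose(n, 3))) / (2 * (n - k))

c(m_exact = m_exact, m1 = m1, m2 = m2)
```

Both formulas break down when $k = n$ (division by zero), which matches the intuition that a sample with no repeats carries no upper bound on $m$.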

Likelihood interval

(Note: this is not the same as a confidence interval; see: The basic logic of constructing a confidence interval.)

This remains an open problem for me. I am not sure yet how to deal with the expression $m^{-n} \frac{m!}{(m-k)!}$ (of course one can compute all values and select the boundaries based on that, but it would be nicer to have some explicit exact formula or estimate). I cannot seem to relate it to any other distribution, which would greatly help in evaluating it. But I feel that a nice (simple) expression could be possible from this likelihood interval approach.
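The "compute all values and select the boundaries" route can be sketched as follows. The function name `likelihood_interval`, the grid limit `m_max`, and the cutoff `exp(-1.92)` on the relative likelihood are all assumptions of mine; that cutoff is a common convention borrowed from the chi-squared approximation, not something specified here.

```r
# Log-likelihood of m, up to a constant that does not depend on m.
loglik <- function(m, n, k) -n * log(m) + lfactorial(m) - lfactorial(m - k)

# All integer m whose likelihood, relative to the maximum, is at least `cutoff`.
likelihood_interval <- function(n, k, m_max = 1e5, cutoff = exp(-1.92)) {
  m    <- k:m_max
  ll   <- loglik(m, n, k)
  keep <- m[ll - max(ll) >= log(cutoff)]
  c(lower = min(keep), mle = m[which.max(ll)], upper = max(keep))
}

li <- likelihood_interval(n = 200, k = 180)
li
```

If the reported upper limit equals `m_max`, the grid was too short and should be extended.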

Confidence interval

For the confidence interval we can use a normal approximation. In Ben's answer the following mean and variance are given:

$\mathbb{E}[K] = m \left(1-\left(1 - \frac{1}{m}\right)^n\right)$

$\mathbb{V}[K] = m \left(\left(m-1\right)\left(1-\frac{2}{m}\right)^n + \left(1 - \frac{1}{m}\right)^n - m \left(1 - \frac{1}{m}\right)^{2n} \right)$

Say, for a given sample size $n=200$ and observed number of unique cookies $k$, the 95% boundaries $\mathbb{E}[K] \pm 1.96 \sqrt{\mathbb{V}[K]}$ look like:

In the image above the curves for the interval have been drawn by expressing the lines as a function of the population size $m$ and sample size $n$ (so the x-axis is the dependent variable in drawing these curves).

The difficulty is to invert this and obtain the interval values for a given observed value $k$. It can be done computationally, but possibly there is some more direct function.
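One way to do the inversion computationally is a plain grid search: keep every $m$ for which the observed $k$ falls inside $\mathbb{E}[K] \pm 1.96 \sqrt{\mathbb{V}[K]}$. This is a sketch under my own choices: the function name `invert_normal_ci`, the grid limit `m_max`, and the example `k = 150` are not from the answer.

```r
# Mean and variance of the number of distinct cookies K for population size m.
m4 <- function(m, n) m * (1 - (1 - 1/m)^n)
v4 <- function(m, n) {
  m * ((m - 1) * (1 - 2/m)^n + (1 - 1/m)^n - m * (1 - 1/m)^(2 * n))
}

# Range of m values whose normal-approximation band contains the observed k.
invert_normal_ci <- function(k, n, m_max = 5000, z = 1.96) {
  m    <- 2:m_max
  sd_k <- sqrt(pmax(v4(m, n), 0))   # guard against tiny negative rounding
  inside <- abs(k - m4(m, n)) <= z * sd_k
  if (!any(inside)) return(c(lower = NA, upper = NA))
  setNames(range(m[inside]), c("lower", "upper"))
}

ci <- invert_normal_ci(k = 150, n = 200)
ci
```

An upper limit equal to `m_max` means the interval was cut off by the grid, which is what happens when $k$ is close to $n$.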

In the image I have also added Clopper-Pearson confidence intervals based on a direct computation of the cumulative distribution from all the probabilities $P(k \, \vert \, m,n)$ (I did this in R, where I needed to use the Strlng2 function from the CryptRndTest package, which is an asymptotic approximation of the logarithm of the Stirling number of the second kind). You can see that the boundaries coincide reasonably well, so the normal approximation is performing well in this case.

# function to compute Probability
library("CryptRndTest")
P5 <- function(m,n,k) {
  exp(-n*log(m)+lfactorial(m)-lfactorial(m-k)+Strlng2(n,k))
}
P5 <- Vectorize(P5)

# function for expected value 
m4 <- function(m,n) {
  m*(1-(1-1/m)^n)
}

# function for variance
v4 <- function(m,n) {
  m*((m-1)*(1-2/m)^n+(1-1/m)^n-m*(1-1/m)^(2*n))
}


# compute 95% boundaries based on Clopper-Pearson intervals
# first a distribution is computed
# then the 2.5% and 97.5% boundaries of the cumulative values are located
simDist <- function(m,n,p=0.05) {
  k <- 1:min(n,m)
  dist <- P5(m,n,k)
  dist[is.na(dist)] <- 0
  dist[dist == Inf] <- 0
  c(max(which(cumsum(dist)<p/2))+1,
       min(which(cumsum(dist)>1-p/2))-1)
}


# some values for the example
n <- 200
m <- 1:5000
k <- 1:n

# compute the Clopper-Pearson intervals
res <- sapply(m, FUN = function(x) {simDist(x,n)})


# plot the maximum likelihood estimate
plot(m4(m,n),m,
     log="", ylab="estimated population size m", xlab = "observed uniques k",
     xlim =c(1,200),ylim =c(1,5000),
     pch=21,col=1,bg=1,cex=0.7, type = "l", yaxt = "n")
axis(2, at = c(0,2500,5000))

# add lines for confidence intervals based on normal approximation
lines(m4(m,n)+1.96*sqrt(v4(m,n)),m, lty=2)
lines(m4(m,n)-1.96*sqrt(v4(m,n)),m, lty=2)
# add lines for confidence intervals based on Clopper-Pearson
lines(res[1,],m,col=3,lty=2)
lines(res[2,],m,col=3,lty=2)

# add legend
legend(0,5100,
       c("MLE","95% interval\n(Normal Approximation)\n","95% interval\n(Clopper-Pearson)\n")
       , lty=c(1,2,2), col=c(1,1,3),cex=0.7,
       box.col = rgb(0,0,0,0))

— Sextus Empiricus

For the case of unequal probabilities, you can approximate the number of cookies of a particular type as independent Binomial/Poisson distributed variables and describe whether each type has been seen or not as a Bernoulli variable. Then add together the variances and means of those variables. I guess that this is also how Ben derived/approximated the expectation value and variance. A problem is how you describe these different probabilities. You cannot do this explicitly since you do not know the number of cookies.
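A minimal sketch of that approximation, assuming a made-up probability vector `p` purely for illustration (in the real problem `p` is unknown, which is exactly the difficulty mentioned above). Each type's count is treated as Poisson with mean $n p_i$, so the type is seen at least once with probability $1 - e^{-n p_i}$, and the Bernoulli means and variances are summed.

```r
# Hypothetical unequal fortune probabilities (an assumption; they sum to 1).
p <- c(rep(0.02, 20), rep(0.01, 40), rep(0.005, 40))
n <- 200

# Poisson approximation: probability that each type is seen at least once.
seen   <- 1 - exp(-n * p)
mean_K <- sum(seen)                # approximate E[K]
var_K  <- sum(seen * (1 - seen))   # approximate V[K], treating types as independent

# Rough check against a direct simulation of n draws.
set.seed(7)
sim <- replicate(2000, length(unique(
  sample.int(length(p), size = n, replace = TRUE, prob = p))))

c(mean_approx = mean_K, mean_sim = mean(sim),
  var_approx  = var_K,  var_sim  = var(sim))
```

Because the total number of draws is fixed, the indicators are not truly independent, so the summed variance should be read as an approximation.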

— Sextus Empiricus