Maximal and closed frequent itemsets - answers included


10

My dataset:
1:A,B,C,E
2:A,C,D,E
3:     B,C,E
4:A,C,D,E
5:    C,D,E
6:    A,D,E

I want to find the maximal frequent itemsets and the closed frequent itemsets.

  • A frequent itemset X ∈ F is maximal if it does not have any frequent supersets.
  • A frequent itemset X ∈ F is closed if it has no superset with the same frequency.

So I counted the occurrences of each itemset:

{A} = 4 ;  {B} = 2  ; {C} = 5  ; {D} = 4  ; {E} = 6

{A,B} = 1; {A,C} = 3; {A,D} = 3; {A,E} = 4; {B,C} = 2; 
{B,D} = 0; {B,E} = 2; {C,D} = 3; {C,E} = 5; {D,E} = 4

{A,B,C} = 1; {A,B,D} = 0; {A,B,E} = 1; {A,C,D} = 2; {A,C,E} = 3; 
{A,D,E} = 3; {B,C,D} = 0; {B,C,E} = 2; {C,D,E} = 3

{A,B,C,D} = 0; {A,B,C,E} = 1; {B,C,D,E} = 0
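
Hand counts like these are easy to get wrong, so it can be worth recomputing them. The following is a minimal brute-force sketch in Python, assuming the six transactions listed above; the function names `support_count` and `all_support_counts` are illustrative and not taken from any library.

```python
from itertools import combinations

# Transactions copied from the dataset in the question.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C", "D", "E"},
    {"B", "C", "E"},
    {"A", "C", "D", "E"},
    {"C", "D", "E"},
    {"A", "D", "E"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def all_support_counts(transactions):
    """Support count of every non-empty itemset over the item universe."""
    universe = sorted(set().union(*transactions))
    counts = {}
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            counts[combo] = support_count(combo, transactions)
    return counts


counts = all_support_counts(transactions)
for itemset, count in counts.items():
    print("{" + ",".join(itemset) + "} =", count)
```

This enumerates all 2^5 - 1 itemsets, which is fine for five items but does not scale to larger item universes.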

Min_Support is set to 50% // very important. Thanks, steffen, for the reminder.

Is maximal = {A,B,C,E}?

Is closed = {A,B,C,D} and {B,C,D,E}?

Answers:


5

I found a slightly extended definition in this source (which includes a good explanation). Here is a more reliable (published) source: CHARM: An efficient algorithm for closed itemset mining by Mohammed J. Zaki and Ching-jui Hsiao.

According to this source:

  • An itemset is closed if none of its immediate supersets has the same support as the itemset.
  • An itemset is maximal frequent if none of its immediate supersets is frequent.


Some remarks:

  • It is necessary to set a min_support, which defines which itemsets are frequent (support = the number of transactions containing the itemset of interest divided by the number of all transactions). An itemset is frequent if its support >= min_support.
  • In regards to the algorithm, only frequent itemsets (support >= min_support) are considered when one tries to find the maximal frequent and closed itemsets.
  • The important aspect in the definition of closed is that it does not matter whether immediate supersets with lower support exist (a superset can never have higher support); only immediate supersets with exactly the same support matter.
  • maximal frequent => closed => frequent, but not vice versa.
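
The two definitions can be written down as small predicates. This is only a sketch, assuming the six transactions from the question and a minimum support count of 3 (that is, min_support = 0.5 of 6 transactions); the helper names are made up for illustration.

```python
# Transactions copied from the question (assumed to be the full dataset).
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C", "D", "E"},
    {"B", "C", "E"},
    {"A", "C", "D", "E"},
    {"C", "D", "E"},
    {"A", "D", "E"},
]
ITEMS = sorted(set().union(*transactions))
MIN_COUNT = 3  # assumption: min_support = 0.5 of 6 transactions


def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one more item."""
    base = frozenset(itemset)
    return [base | {item} for item in ITEMS if item not in base]


def is_frequent(itemset, min_count):
    return support_count(itemset) >= min_count


def is_closed(itemset):
    """Closed: no immediate superset has the same support."""
    own = support_count(itemset)
    return all(support_count(sup) != own for sup in immediate_supersets(itemset))


def is_maximal_frequent(itemset, min_count):
    """Maximal frequent: frequent, and no immediate superset is frequent."""
    return is_frequent(itemset, min_count) and not any(
        is_frequent(sup, min_count) for sup in immediate_supersets(itemset)
    )


for itemset in (["E"], ["A"], ["A", "C", "E"]):
    print(
        sorted(itemset),
        "closed:", is_closed(itemset),
        "maximal frequent:", is_maximal_frequent(itemset, MIN_COUNT),
    )
```

Checking only immediate supersets is enough here because support can only stay the same or drop when an item is added.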

Application to the example of the OP

Note:

  • Did not check the support counts
  • Let's say min_support = 0.5. With 6 transactions, an itemset is then frequent if its support count is >= 3 (0.5 × 6 = 3).
{A} = 4  ; not closed due to {A,E}
{B} = 2  ; not frequent => ignore
{C} = 5  ; not closed due to {C,E}
{D} = 4  ; not closed due to {D,E}, and not maximal due to e.g. {A,D}
{E} = 6  ; closed, but not maximal due to e.g. {D,E}

{A,B} = 1; not frequent => ignore
{A,C} = 3; not closed due to {A,C,E}
{A,D} = 3; not closed due to {A,D,E}
{A,E} = 4; closed, but not maximal due to {A,D,E}
{B,C} = 2; not frequent => ignore
{B,D} = 0; not frequent => ignore
{B,E} = 2; not frequent => ignore
{C,D} = 3; not closed due to {C,D,E}
{C,E} = 5; closed, but not maximal due to {C,D,E}
{D,E} = 4; closed, but not maximal due to {A,D,E}

{A,B,C} = 1; not frequent => ignore
{A,B,D} = 0; not frequent => ignore
{A,B,E} = 1; not frequent => ignore
{A,C,D} = 2; not frequent => ignore
{A,C,E} = 3; maximal frequent
{A,D,E} = 3; maximal frequent
{B,C,D} = 0; not frequent => ignore
{B,C,E} = 2; not frequent => ignore
{C,D,E} = 3; maximal frequent

{A,B,C,D} = 0; not frequent => ignore
{A,B,C,E} = 1; not frequent => ignore
{B,C,D,E} = 0; not frequent => ignore
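
As a cross-check of the listing, here is a sketch that uses a different but equivalent test for closedness: an itemset is closed exactly when it equals the intersection of all transactions that contain it (its closure). This closure test is not the step-by-step method used above; it is swapped in only as an independent check. It assumes the six transactions from the question and min_support = 0.5, and all names are illustrative.

```python
from itertools import combinations
from math import ceil

# Transactions copied from the question.
transactions = [
    frozenset("ABCE"),
    frozenset("ACDE"),
    frozenset("BCE"),
    frozenset("ACDE"),
    frozenset("CDE"),
    frozenset("ADE"),
]
MIN_SUPPORT = 0.5  # assumption taken from the discussion
MIN_COUNT = ceil(MIN_SUPPORT * len(transactions))  # 0.5 * 6 = 3


def cover(itemset):
    """Transactions that contain every item of `itemset`."""
    return [t for t in transactions if itemset <= t]


def closure(itemset):
    """Items shared by all transactions that contain `itemset`."""
    covering = cover(itemset)
    return frozenset.intersection(*covering) if covering else itemset


# Brute-force enumeration of the frequent itemsets with their counts.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        candidate = frozenset(combo)
        n = len(cover(candidate))
        if n >= MIN_COUNT:
            frequent[candidate] = n

closed = {s for s in frequent if closure(s) == s}
# Maximal: no frequent proper superset exists.
maximal = {s for s in frequent if not any(s < other for other in frequent)}

for s in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    if s in maximal:
        label = "maximal frequent"
    elif s in closed:
        label = "closed, but not maximal"
    else:
        label = "not closed"
    print("{" + ",".join(sorted(s)) + "} =", frequent[s], ";", label)
```

On this dataset every maximal frequent itemset also ends up in the closed set, in line with the remark that maximal frequent implies closed.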

The source link is broken, just letting you know. And yes min_support is very important, I am using .50
Mike John

1
Sorry for that, fixed.
steffen

1
changed min_support=0.5 <=> min_support_count=3 and changed application to example accordingly.
steffen

Use APRIORI, and you can save a lot of counting and constructing itemsets...
Has QUIT--Anony-Mousse

@Anony-Mousse I know APRIORI ... I stepped over the itemsets manually to explain the concept of closed and maximal frequent itemsets as detailed as possible, since this was the source of confusion of the OP (IMHO).
steffen

1

You may want to read up on the APRIORI algorithm. It avoids unnecessary itemsets by clever pruning.

{A} = 4 ;  {B} = 2  ; {C} = 5  ; {D} = 4  ; {E} = 6

B is not frequent, remove.

Construct and count two-itemsets (no magic yet, except that B is already out)

{A,C} = 3; {A,D} = 3; {A,E} = 4; 
{C,D} = 3; {C,E} = 5; {D,E} = 4

All of these are frequent (notice that all that had B cannot be frequent!)

Now use the prefix rule: ONLY combine itemsets that start with the same n-1 items. Remove every candidate that has a subset which is not frequent. Count the remaining itemsets.

{A,C,D} = 2; {A,C,E} = 3; {A,D,E} = 3; 
{C,D,E} = 3

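Note that {A,C,D} is not frequent. The remaining frequent 3-itemsets {A,C,E}, {A,D,E} and {C,D,E} share no 2-item prefix, so no 4-itemset candidate can be built and there cannot be a larger frequent itemset!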

Notice how much less work I did!
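
The walk-through above can be sketched in a few lines of Python. This is a minimal, untuned illustration of the prefix join and subset pruning, assuming the six transactions from the question and a minimum support count of 3; the function names are made up for this example.

```python
from itertools import combinations

# Transactions copied from the question.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C", "D", "E"},
    {"B", "C", "E"},
    {"A", "C", "D", "E"},
    {"C", "D", "E"},
    {"A", "D", "E"},
]


def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def apriori(min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # Level 1: frequent single items.
    level = {}
    for item in sorted(set().union(*transactions)):
        n = count((item,))
        if n >= min_count:
            level[(item,)] = n
    frequent = dict(level)

    while level:
        candidates = []
        for a, b in combinations(sorted(level), 2):
            # Prefix rule: join only itemsets that agree on all but the last item.
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune: every subset one item smaller must itself be frequent.
                if all(sub in level for sub in combinations(cand, len(a))):
                    candidates.append(cand)
        # Count only the surviving candidates.
        level = {}
        for cand in candidates:
            n = count(cand)
            if n >= min_count:
                level[cand] = n
        frequent.update(level)
    return frequent


result = apriori(min_count=3)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print("{" + ",".join(itemset) + "} =", result[itemset])
```

With these inputs the sketch counts four 3-item candidates, finds {A,C,D} infrequent, and generates no 4-item candidate at all, matching the manual steps.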

For maximal / closed itemsets, check subsets / supersets.

Note that e.g. {E}=6, and {A,E}=4. {E} is a subset, but has higher support, i.e. it is closed but not maximal. {A} is neither, as it does not have higher support than {A,E}, i.e. it is redundant.

Licensed under cc by-sa 3.0 with attribution required.