Fast way to find symmetric pairs in numpy

15

from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
#     c1  c2
# 0    0   0
# 1    0   1
# 2    0   2
# 3    0   3
# 4    0   4
# ..  ..  ..
# 85   9   4
# 86   9   5
# 87   9   7
# 88   9   8
# 89   9   9
# 
# [90 rows x 2 columns]

How can I quickly find and remove the last duplicate of each symmetric pair in this dataframe?

An example of a symmetric pair: '(0, 1)' is equal to '(1, 0)'. The latter should be removed.

The algorithm must be fast, so using numpy is recommended. Converting to Python objects is not allowed.

python pandas numpy

— The Unfun Cat

1

Could you give an example of what you mean by symmetric pairs?

— yatu

(0, 1) == (1, 0) is True

— The Unfun Cat

1

Is (0, 1) == (0, 1) also True?

— wundermahn

@JerryM Yes, but that case is trivial to remove with df.drop_duplicates()

— The Unfun Cat

2

@molybdenum42 I used itertools product only to create the example; the real data is not created with itertools product.

— The Unfun Cat

13

You can sort the values and then groupby:

import numpy as np

a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()

Option 2: If you have a large number of c1, c2 pairs, groupby can be slow. In that case, we can assign the sorted values as new columns and filter with drop_duplicates:

a = np.sort(df.to_numpy(), axis=1)

(df.assign(one=a[:,0], two=a[:,1])   # one and two can be changed
   .drop_duplicates(['one','two'])   # taken from above
   .reindex(df.columns, axis=1)
)
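
As a quick check of Option 2, here is a minimal sketch on a small hypothetical frame (`small` and its values are made up for illustration; the helper columns `one` and `two` follow the answer above). Rows 1 and 3 mirror rows 0 and 2, so only the first occurrence of each pair should survive:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: (1, 0) mirrors (0, 1), and (0, 2) mirrors (2, 0).
small = pd.DataFrame({"c1": [0, 1, 2, 0, 2], "c2": [1, 0, 0, 2, 2]})

# Sort each row so that mirrored pairs get identical (min, max) keys.
a = np.sort(small.to_numpy(), axis=1)

kept = (small.assign(one=a[:, 0], two=a[:, 1])   # temporary key columns
             .drop_duplicates(["one", "two"])    # keeps the first occurrence
             .reindex(small.columns, axis=1))    # drop the key columns again
print(kept)
```

Because drop_duplicates keeps the first occurrence by default, the later mirrored row is the one that gets dropped, and the original index labels are preserved.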

— Quang Hoang

7

One way is to use np.unique with return_index=True and use the result to index the dataframe:

a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)

print(df.iloc[ix, :])

    c1  c2
0    0   0
1    0   1
20   2   0
3    0   3
40   4   0
50   5   0
6    0   6
70   7   0
8    0   8
9    0   9
11   1   1
21   2   1
13   1   3
41   4   1
51   5   1
16   1   6
71   7   1
...
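
Note that np.unique returns the indices in the order of the sorted unique pairs, not in the original row order, which is why the index above jumps around. A minimal sketch, assuming the original order should be kept, sorts the indices before indexing (the small frame and the name `out` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1, (0, 2), mirrors row 0, (2, 0).
df_small = pd.DataFrame({"c1": [2, 0, 1, 0], "c2": [0, 2, 1, 1]})

a = np.sort(df_small.to_numpy(), axis=1)          # order each pair as (min, max)
_, ix = np.unique(a, axis=0, return_index=True)   # index of the first occurrence of each pair

# Sorting ix restores the original row order before indexing.
out = df_small.iloc[np.sort(ix)]
print(out)
```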

— yatu

1

Yes, otherwise unique cannot detect the symmetric pairs @DanielMesejo

— yatu

OK, I see that you are sorting the pairs.

— Dani Mesejo

Yes, but I mean you are changing [1, 0] to [0, 1], right?

— Dani Mesejo

6

`frozenset`

mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()

df[~mask]

— piRSquared

1

Aren't you iterating slowly over the tuples in each column here? Still, upvoted.

— The Unfun Cat

Yes, I am iterating. No, it is not as slow as you think.

— piRSquared

5

I would do

df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]

With pandas and numpy triu:

s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()  # bool instead of np.bool, which is unavailable in some NumPy versions

— YOBEN_S

5

Here's a NumPy-based one for integers -

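def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)                         # order each pair as (min, max)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))  # one integer key per pair (needs non-negative ints)
    sidx = idx.argsort(kind='mergesort')          # stable sort: earliest row comes first within equal keys
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]                 # True at the first row of each run of equal keys
    a_out = a[np.sort(sidx[m])]                   # restore the original row order
    df_out = pd.DataFrame(a_out)
    return df_out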

If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
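
A sketch of that index-preserving variant, assuming non-negative integer columns as in the question (the name `remove_symm_pairs_keep_index` and the sample frame are hypothetical):

```python
import numpy as np
import pandas as pd

def remove_symm_pairs_keep_index(df):
    # Same steps as remove_symm_pairs, but the rows are selected with iloc
    # so the original index labels and column names are preserved.
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)                           # order each pair as (min, max)
    idx = np.ravel_multi_index(b.T, (b.max(0) + 1))  # one integer key per pair
    sidx = idx.argsort(kind='mergesort')             # stable sort keeps earlier rows first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]                 # first row of each run of equal keys
    return df.iloc[np.sort(sidx[m])]

# Hypothetical sample with a non-default index.
sample = pd.DataFrame({"c1": [0, 1, 2, 0, 2], "c2": [1, 0, 0, 2, 2]},
                      index=[10, 11, 12, 13, 14])
result = remove_symm_pairs_keep_index(sample)
print(result)
```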

For generic numbers (ints/floats, etc.), we will use a view-based one -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

And simply replace the step that gets idx with idx = view1D(b) in remove_symm_pairs.

— Divakar

1

If this needs to be fast, and if your variables are integers, the following trick may help: let v, w be the columns of your vector. Construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back with [v, w] = [(x+y), (x-y)]/2.
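
A minimal sketch of how this trick could look in NumPy, assuming integer columns named c1 and c2 as in the question (the function name `dedupe_by_sum_diff` and the sample frame are hypothetical; np.unique with axis=0 is used here for the "sort lexicographically and remove duplicates" step):

```python
import numpy as np
import pandas as pd

def dedupe_by_sum_diff(df):
    v = df["c1"].to_numpy()
    w = df["c2"].to_numpy()
    x = v + w            # identical for (v, w) and (w, v)
    y = np.abs(v - w)    # identical for (v, w) and (w, v)
    # np.unique with axis=0 sorts the rows lexicographically and drops duplicates.
    xy = np.unique(np.column_stack([x, y]), axis=0)
    # Map back: (x + y) / 2 is the larger value, (x - y) / 2 the smaller one.
    # Both sums are even, so integer division is exact.
    first = (xy[:, 0] + xy[:, 1]) // 2
    second = (xy[:, 0] - xy[:, 1]) // 2
    return pd.DataFrame({"c1": first, "c2": second})

# Hypothetical sample: three rows describe the pair {0, 1}.
sample = pd.DataFrame({"c1": [0, 1, 2, 1, 0], "c2": [1, 0, 2, 0, 3]})
deduped = dedupe_by_sum_diff(sample)
print(deduped)
```

The result holds each pair once, in a normalized orientation (larger value first) and in sorted order, so it does not record which of the two mirrored rows appeared first in the input.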

— Federico Poloni