Add IDs found in a list to a new column in a pandas dataframe

11

Say I have the following dataframe (a column of integers and a column holding lists of integers)...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

and also a separate list of IDs...

bad_ids = [15533, 876544, 36789, 11111]

Given that, and ignoring the df['ID'] column and any index, I want to see whether any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is:

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

This works, but only if the bad_ids list is longer than the dataframe, and for the real dataset the bad_ids list is going to be a lot shorter than the dataframe. If I set the bad_ids list to only two elements...

bad_ids = [15533, 876544]

I get a very common error (I have read many questions about the same error)...

ValueError: Length of values does not match length of index

I have tried converting the list to a Series (no change in the error). I have tried adding the new column and setting all values to False before running the comprehension line (again, no change in the error).

Two questions:

How do I get my code (below) to work for a list that is shorter than the dataframe?
How would I get the code to write the actual ID found back to the df['bad_id'] column (more useful than True/False)?

Expected output for bad_ids = [15533, 876544]:

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

Ideal output for bad_ids = [15533, 876544] (the ID(s) are written to a new column or columns):

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    876544

Code:

import pandas as pd

result_list = [[12345,[15443,15533,3433]],
        [15533,[2234,16608,12002,7654]],
        [6789,[43322,876544,36789]]]

df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (fewer elements than the dataframe)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)

— MDR

7

Use np.intersect1d to get the intersection of the two lists:

import numpy as np

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

Or use plain vanilla Python, with the intersection of sets:

bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
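
For reference, here is a self-contained sketch of the set-based variant, assuming the sample data from the question. The call to sorted() is an addition made here for illustration: set iteration order is not guaranteed, so sorting keeps the order of the matches predictable when a row has more than one.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [12345, 15533, 6789],
    "Found_IDs": [[15443, 15533, 3433],
                  [2234, 16608, 12002, 7654],
                  [43322, 876544, 36789]],
})
bad_ids = [15533, 876544]

# Build the set once so each row only pays for the intersection itself
bad_ids_set = set(bad_ids)

# sorted() (an assumption, not in the answer above) returns a list in ascending order
df["bad_id"] = df["Found_IDs"].apply(lambda x: sorted(bad_ids_set.intersection(x)))
print(df)
```

Because the result has one entry per row, the length of bad_ids no longer has to match the length of the dataframe.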

— erfan

3

To test all values of the lists in the Found_IDs column against all values of bad_ids:

bad_ids = [15533, 876544]

df['bad_id'] = [any(c in l for c in bad_ids) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True
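
As background on why the line in the question raised the error, here is a small sketch in plain Python. The name rows is introduced here for illustration and stands in for df['Found_IDs']. zip pairs items by position and stops at the shorter input, so two bad IDs produce only two values for three rows, while the any(...) form produces one value per row.

```python
# Row lists copied from the question; `rows` is a stand-in for df['Found_IDs']
rows = [[15443, 15533, 3433],
        [2234, 16608, 12002, 7654],
        [43322, 876544, 36789]]
bad_ids = [15533, 876544]

# zip pairs bad_ids[i] with rows[i] only, and stops after the shorter input
zipped = [c in l for c, l in zip(bad_ids, rows)]

# any() checks every bad ID against each row, giving one result per row
per_row = [any(c in l for c in bad_ids) for l in rows]

print(len(zipped), len(per_row))
```

The zipped result is shorter than the number of rows, which is what triggers the "Length of values does not match length of index" error on assignment.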

If you want all matches:

df['bad_id'] = [[c for c in bad_ids if c in l] for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

And for the first match, with False where the list of matches is empty. This is a possible solution, but mixing booleans and numbers is not recommended:

df['bad_id'] = [next(iter([c for c in bad_ids if c in l]), False) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544
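
If mixing booleans and numbers is a concern, one possible variation is to use None as the placeholder and pass a generator to next, so no intermediate list is built. This is a sketch and an assumption made here, not part of the solution above; rows is a hypothetical stand-in for df['Found_IDs'].

```python
# Row lists copied from the question; `rows` is a stand-in for df['Found_IDs']
rows = [[15443, 15533, 3433],
        [2234, 16608, 12002, 7654],
        [43322, 876544, 36789]]
bad_ids = [15533, 876544]

# next() returns the first bad ID found in the row, or None if there is none
first_match = [next((c for c in bad_ids if c in l), None) for l in rows]
print(first_match)
```

The resulting list has one entry per row, so it can be assigned to a column in the same way as the comprehensions above.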

Solution with sets:

df['bad_id'] = df['Found_IDs'].map(set(bad_ids).intersection)
print (df)

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   {15533}
1  15533  [2234, 16608, 12002, 7654]        {}
2   6789      [43322, 876544, 36789]  {876544}

And similarly with a list comprehension:

df['bad_id'] = [list(set(bad_ids).intersection(l)) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

— jezrael

1

You can use apply together with np.any:

import numpy as np

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.any([c in x for c in bad_ids]))

This returns a boolean indicating whether any bad_id is in Found_IDs. If you want to retrieve those bad_ids:

df['bad_id'] = df['Found_IDs'].apply(lambda x: [*filter(lambda c: c in x, bad_ids)])

This returns a list of the bad_ids found in Found_IDs. If there are none, it returns [].

— Bruno Mello

1

Use merge and concat, grouping by your index, to return all matches.

bad_ids = [15533, 876544, 36789, 11111]

df2 = pd.concat(
    [
        df,
        pd.merge(
            df["Found_IDs"].explode().reset_index(),
            pd.Series(bad_ids, name="bad_ids"),
            left_on="Found_IDs",
            right_on="bad_ids",
            how="inner",
        )
        .groupby("index")
        .agg(bad_ids=("bad_ids", list)),
    ],
    axis=1,
).fillna(False)
print(df2)


      ID                   Found_IDs          bad_ids
0  12345        [15443, 15533, 3433]          [15533]
1  15533  [2234, 16608, 12002, 7654]            False
2   6789      [43322, 876544, 36789]  [876544, 36789]

— Datanovice

0

Use explode and groupby:

s = df['Found_IDs'].explode()
df['bad_ids'] = s.isin(bad_ids).groupby(s.index).any()

For bad_ids = [15533, 876544]:

>>> df
      ID                   Found_IDs  bad_ids
0  12345        [15443, 15533, 3433]     True
1  15533  [2234, 16608, 12002, 7654]    False
2   6789      [43322, 876544, 36789]     True

Or, to get the matching values:

s = df['Found_IDs'].explode()
df['bad_ids'] = s.where(s.isin(bad_ids)).groupby(s.index).agg(lambda x: list(x.dropna()))

For bad_ids = [15533, 876544]:

      ID                   Found_IDs   bad_ids
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

— Vishnudev