FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison

Question 1

I am using Pandas 0.19.1 on Python 3. I am getting a warning on these lines of code. I'm trying to get a list of all the row numbers where the string Peter is present in the column Unnamed: 5.

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

It produces a warning:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

What is this FutureWarning, and should I ignore it, since the code seems to work?

Question 2

This FutureWarning isn't from Pandas; it is from numpy, and the bug also affects matplotlib and others. Here's how to reproduce the warning nearer to the source of the trouble:

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

Another way to reproduce this bug, using the double equals operator:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

An example of Matplotlib being affected by this FutureWarning in its quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

What's going on here?

There is a disagreement between Numpy and native python on what should happen when you compare a string to numpy's numeric types. Notice that the left operand is python's turf, a primitive string, and the operation in the middle is python's turf, but the right operand is numpy's turf. Should the result be a Python style scalar or a Numpy style ndarray of booleans? Numpy says ndarray of bool; Pythonic developers disagree. Classic standoff.

Should it be an elementwise comparison, or a scalar saying whether the item exists in the array?

If your code or library uses the in or == operators to compare a python string to numpy ndarrays, the two aren't compatible, so when you try it, it returns a scalar, but only for now. The warning indicates that this behavior might change in the future, so your code will puke all over the carpet if python/numpy decide to adopt the Numpy style.

Submitted bug reports:

Numpy and Python are in a standoff: for now the operation returns a scalar, but in the future it may change.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

Two workaround solutions:

Either lock down your versions of python and numpy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in so that they are both numpy types or both primitive python numeric types.
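A rough sketch of the second workaround, assuming it is acceptable to compare the array as text. The variable names are made up for illustration, and whether the original mixed-type comparison still warns depends on the NumPy version installed:

```python
import numpy as np

numbers = np.arange(5)

# Membership test: cast the array to str so both sides are strings.
has_x = "x" in numbers.astype(str)
has_3 = "3" in numbers.astype(str)

# Equality: cast so both arrays share a string dtype, then compare elementwise.
as_text = np.arange(5).astype(str)
equal_mask = numbers.astype(str) == as_text

print(has_x, has_3, equal_mask)
```

Because both operands now have the same kind of dtype, NumPy can do a real elementwise comparison and has no reason to fall back to a scalar.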

Suppress the warning globally:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

Suppress the warning on a line by line basis:

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

Just suppress the warning by name, then put a loud comment next to it mentioning the current versions of python and numpy, saying that this code is brittle and requires these versions, and put a link to here. Kick the can down the road.
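The examples above silence every FutureWarning. If you only want to hide this one, a possible sketch is to filter on the message text with warnings.filterwarnings. The fragile_compare function below is a hypothetical stand-in that emits the same warning text, so the sketch does not depend on which NumPy version you have:

```python
import warnings

# Start of the warning text quoted in this thread.
MESSAGE = "elementwise comparison failed"

def fragile_compare():
    # Hypothetical stand-in for code such as `'x' in np.arange(5)`.
    warnings.warn(
        "elementwise comparison failed; returning scalar instead, "
        "but in the future will perform elementwise comparison",
        FutureWarning,
    )
    return False

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # `message` is a regular expression matched against the start of the warning text.
    warnings.filterwarnings("ignore", message=MESSAGE, category=FutureWarning)
    result = fragile_compare()                                  # suppressed
    warnings.warn("some other future change", FutureWarning)   # still reported

remaining = [str(w.message) for w in caught]
print(result, remaining)
```

Other FutureWarnings stay visible this way, so you still hear about unrelated upcoming changes.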

TLDR: pandas are Jedi; numpy are the hutts; and python is the galactic empire. https://youtu.be/OZczsiCfQQk?t=3

Question 3

I get the same error when I try to set index_col while reading a file into a Pandas data-frame:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

I have never encountered this error before. I am still trying to figure out the reason behind it (using @Eric Leschinski's explanation and others).

Anyhow, the following approach solves the problem for now, until I figure out the reason:

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

I will update this as soon as I figure out the reason for this behavior.

Question 4

In my experience, the same warning message was caused by a TypeError.

TypeError: invalid type comparison

So you may want to check the data types of the values in the Unnamed: 5 column:

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?
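For a long column, printing every type gets noisy. A small sketch that tallies the types instead (the DataFrame contents here are hypothetical):

```python
import pandas as pd

# Hypothetical column that mixes strings and numbers.
df = pd.DataFrame({"Unnamed: 5": ["Peter", 3, 2.5, "Joe"]})

# Count how many values of each Python type the column holds.
type_counts = df["Unnamed: 5"].map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

If anything other than str shows up, comparing the column to a string is mixing types.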

Here is how I can reproduce the warning message:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error

Hope it helps.

Question 5

Can't beat Eric Leschinski's awesomely detailed answer, but here's a quick workaround for the original question that I don't think has been mentioned yet: put the string in a list and use .isin instead of ==

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

Question 6

A quick workaround for this is to use numpy.core.defchararray. I also faced the same warning message and was able to resolve it using the above module.

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)
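A self-contained sketch of the same idea with made-up data. It reaches the function through the public np.char namespace, which exposes the same string routines; that choice is mine, not part of the answer above:

```python
import numpy as np

# Hypothetical string arrays to compare element by element.
dataset1 = np.array(["Peter", "Joe", "Ann"])
dataset2 = np.array(["Peter", "Bob", "Ann"])

# np.char.equal compares two string arrays elementwise and returns a bool array.
resultdataset = np.char.equal(dataset1, dataset2)
print(resultdataset)
```

Both inputs need to be string arrays, so numeric data would have to be cast with astype(str) first.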

Question 7

Eric's answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, his two workarounds both just suppress the warning.

To write code that doesn't cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each one. For example, you could use map and an anonymous function.

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()
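To try this without an Excel file, here is a minimal sketch with invented data, once for a mixed-type column and once for a purely numeric one:

```python
import pandas as pd

# Hypothetical mixed-type column, similar to what read_excel can produce.
df = pd.DataFrame({"Unnamed: 5": ["Peter", 1, "Peter", 2.0]})
mask = df["Unnamed: 5"].map(lambda x: x == "Peter")
myRows = df[mask].index.tolist()

# A purely numeric column: each element is compared in plain Python,
# so the result is simply all False.
numeric = pd.DataFrame({"Unnamed: 5": [1, 2, 3]})
noRows = numeric[numeric["Unnamed: 5"].map(lambda x: x == "Peter")].index.tolist()

print(myRows, noRows)
```

Each comparison runs in plain Python, one element at a time, so NumPy never has to compare a whole numeric array against a string.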

Question 8

If your arrays aren't too big, or you don't have too many of them, you might be able to get away with forcing the left hand side of == to be a string:

# astype(str) converts each element; str(series) would stringify the whole Series at once
myRows = df[df['Unnamed: 5'].astype(str) == 'Peter'].index.tolist()

But this is about 1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it's a numpy array of length 100 (times averaged over 500 trials).

import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

Result:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178

Question 9

I got this warning because I thought my column contained empty strings, but on checking, it contained np.nan!

if df['column'] == '':

Changing my column to empty strings helped :)
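One possible way to do that conversion, sketched with hypothetical data. Note that a Series comparison gives one bool per row, so an if statement needs .any() or .all() to reduce it to a single value:

```python
import numpy as np
import pandas as pd

# Hypothetical column where the "empty" cells are really np.nan.
df = pd.DataFrame({"column": [np.nan, "a", np.nan]})

# isna() shows which cells are missing rather than empty strings.
is_missing = df["column"].isna()

# Replace the missing values with empty strings, then compare elementwise.
filled = df["column"].fillna("")
is_empty = filled == ""

if is_empty.any():
    print("rows with empty values:", is_empty[is_empty].index.tolist())
```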

Question 10

I've compared a few of the possible methods for doing this, including pandas, several numpy methods, and a list comprehension method.

First, let's start with a baseline:

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

So our baseline is that the count should be 2, and it should take about 50 us.

Now, we try the naive method:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

And here we get the wrong answer (NotImplemented != 2), it takes a long time, and it throws the warning.

So we'll try another naive method:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

Again, the wrong answer (0 != 2). This one is even more insidious because there is no warning at all (0 can be passed around just like 2).
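The reason for the silent 0 is that x is a plain Python list here, so x == 's' is an ordinary list-to-string comparison that evaluates to a single False. A short sketch of the difference, assuming the same four-element list:

```python
import numpy as np

x = ['s', 'b', 's', 'b']

# A list compared to a string is one Python comparison: a single False.
plain = (x == 's')
silent_count = int(np.sum(plain))

# Converting to a NumPy string array first gives an elementwise comparison.
mask = np.array(x) == 's'
count = int(np.sum(mask))

print(plain, silent_count, count)
```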

Now, let's try a list comprehension:

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

We get the right answer here, and it's pretty fast!

Another possibility, pandas:

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

Slow, but correct!

And finally, the option I'm going to use: casting the numpy array to the object type:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

Fast and correct!

Question 11

I had this code, which was causing the error:

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

I changed it to this:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

This avoids the comparison that throws the warning, as stated above. I only had to catch the exception because of dfObj.loc inside the for loop; maybe there is a way to tell it not to check the rows it has already changed.
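One possible way to skip rows that were already converted is to build a boolean mask of the string rows once and convert only those, with no loop and no t == dfObj.time comparison. This is only a sketch with invented data, not a drop-in replacement for the code above:

```python
import dateutil.parser
import pandas as pd

# Hypothetical data: one unparsed string and one value that is already an integer.
dfObj = pd.DataFrame({"time": ["2020-01-01T00:00:00+00:00", 5]})

# Select the string rows once, instead of comparing each value against the whole column.
is_text = dfObj["time"].map(lambda t: isinstance(t, str))

# Convert only those rows to integer Unix timestamps.
dfObj.loc[is_text, "time"] = dfObj.loc[is_text, "time"].map(
    lambda t: int(dateutil.parser.parse(t).timestamp())
)

print(dfObj)
```

The example string carries an explicit UTC offset, so the timestamp does not depend on the local timezone.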

Question 12

In my case, the warning came from perfectly ordinary boolean indexing, because the series contained only np.nan. Demonstration (pandas 1.0.3):

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

I think that with pandas 1.0 they really want you to use the new 'string' datatype, which allows for pd.NA values:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

I don't love that they have tinkered with everyday functionality such as boolean indexing.