EDIT: I've fixed a typo in the regex. It needed '\x80', not '\80'.
The regex to filter out invalid UTF-8 forms, for strict adherence to UTF-8, is as follows:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print'
Output (key lines from Test 1):
Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)
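For a quick spot check, the same byte ranges can be applied to a handful of hand-picked sequences. What follows is only a minimal sketch and is not part of the test script: the helper name strict_utf8_check is hypothetical, it takes one byte per argument as two hex digits, it assumes perl is on the PATH, and it uses non-capturing groups with \A and \z in place of the capture groups, ^ and $ of the regex above.

```shell
# Hypothetical helper (not from the original script).
# Usage: strict_utf8_check HH [HH ...]   (one byte per argument, two hex digits)
strict_utf8_check() {
  for h in "$@"; do
    # turn each hex pair into the raw byte via an octal escape
    printf '%b' "\\0$(printf '%03o' "0x$h")"
  done |
  perl -0777 -ne 'exit(/\A(?: [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | (?: \xE0[\xA0-\xBF] | \xED[\x80-\x9F] | [\xE1-\xEC\xEE\xEF][\x80-\xBF] ) [\x80-\xBF]
      | (?: \xF0[\x90-\xBF] | [\xF1-\xF3][\x80-\xBF] | \xF4[\x80-\x8F] ) [\x80-\xBF]{2}
      )*\z/x ? 0 : 1)' && echo valid || echo invalid
}

strict_utf8_check C3 A9       # U+00E9, an ordinary 2-byte form
strict_utf8_check ED A0 80    # an encoded surrogate (U+D800)
```

The first call should report valid and the second invalid, because the ED lead byte only admits 80-9F as its second byte.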
Q. How does one create test data to test a regex which filters invalid Unicode?
A. Create your own UTF-8 test algorithm, and break its rules...
Catch-22: but then, how do you test your test algorithm?
The regex above has been tested (using iconv as the reference) for every integer value from 0x000000 to 0x10FFFF, the latter being the maximum integer value of a Unicode Codepoint.
According to the Wikipedia UTF-8 page:
- UTF-8 encodes each of the 1,112,064 code points in the Unicode character set, using one to four 8-bit bytes.
This number (1,112,064) equates to a range 0x000000 to 0x10F7FF, which is 0x0800 shy of the actual maximum integer value for the highest Unicode Codepoint: 0x10FFFF.
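The arithmetic can be checked directly in the shell. This is just a small sketch; the variable names are illustrative and do not appear in the test script.

```shell
all=$(( 0x10FFFF + 1 ))     # every integer from 0x000000 to 0x10FFFF
reserved=$(( 0x0800 ))      # the block that is not available as code points
usable=$(( all - reserved ))
printf '%d integers - %d reserved = %d code points (0x%X)\n' \
  "$all" "$reserved" "$usable" "$usable"
# prints: 1114112 integers - 2048 reserved = 1112064 code points (0x10F800)
```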
This block of integers is missing from the Unicode Codepoints spectrum because the UTF-16 encoding needed to step beyond its original design intent, via a system called surrogate pairs. A block of 0x0800 integers has been reserved for use by UTF-16. This block spans the range 0x00D800 to 0x00DFFF. None of these integers are legal Unicode values, and they are therefore invalid UTF-8 values.
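As a rough illustration of that rule, an integer can be classified with plain shell arithmetic. The function name classify_codepoint is hypothetical and is not used by the test script; it only restates the two boundaries described above.

```shell
# Hypothetical helper: classify an integer against the Unicode code point rules.
classify_codepoint() (
  cp=$(( $1 ))
  if [ "$cp" -lt 0 ] || [ "$cp" -gt $(( 0x10FFFF )) ]; then
    echo out-of-range          # beyond the highest Unicode Codepoint
  elif [ "$cp" -ge $(( 0xD800 )) ] && [ "$cp" -le $(( 0xDFFF )) ]; then
    echo surrogate             # reserved for UTF-16 surrogate pairs
  else
    echo valid
  fi
)

# Probe both edges of the reserved block and the top of the range.
for cp in 0xD7FF 0xD800 0xDFFF 0xE000 0x10FFFF 0x110000; do
  printf '%s %s\n' "$cp" "$(classify_codepoint "$cp")"
done
```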
In Test 1, the regex has been tested against every number in the range of Unicode Codepoints, and it matches exactly the results of iconv, i.e. 0x10F800 valid values and 0x000800 invalid values.
However, the issue now arises: how does the regex handle out-of-range UTF-8 values, i.e. those above 0x10FFFF? (As originally designed, UTF-8 can extend to 6 bytes, with a maximum integer value of 0x7FFFFFFF.)
To generate the necessary non-Unicode UTF-8 byte values, I've used the following command:
perl -C -e 'print chr 0x'$hexUTF32BE
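To see which bytes to expect for a given integer, the original UTF-8 layout of up to six bytes (as in RFC 2279, before RFC 3629 restricted it) can be computed with shell arithmetic alone. This is a sketch under that assumption: the function name utf8_lax_hex is hypothetical, and it deliberately does not reject surrogates or values above 0x10FFFF.

```shell
# Hypothetical helper: print the "lax" (1 to 6 byte) UTF-8 form of an integer as hex.
utf8_lax_hex() (
  cp=$(( $1 ))
  # number of bytes needed for this value
  if   [ "$cp" -lt $(( 0x80 )) ];      then n=1
  elif [ "$cp" -lt $(( 0x800 )) ];     then n=2
  elif [ "$cp" -lt $(( 0x10000 )) ];   then n=3
  elif [ "$cp" -lt $(( 0x200000 )) ];  then n=4
  elif [ "$cp" -lt $(( 0x4000000 )) ]; then n=5
  else                                      n=6
  fi
  if [ "$n" -eq 1 ]; then
    printf '%02X\n' "$cp"
    return
  fi
  out=''
  i=1
  while [ "$i" -lt "$n" ]; do
    # each continuation byte is 10xxxxxx and carries the low 6 bits
    out=$(printf '%02X' $(( 0x80 | ($cp & 0x3F) )))$out
    cp=$(( $cp >> 6 ))
    i=$(( $i + 1 ))
  done
  # lead byte: n high bits set, then the remaining payload bits
  lead=$(( ((0xFF00 >> $n) & 0xFF) | $cp ))
  printf '%02X%s\n' "$lead" "$out"
)

utf8_lax_hex 0x10FFFF    # F48FBFBF, the last strict-UTF-8 value
utf8_lax_hex 0x110000    # F4908080, the first out-of-range value
utf8_lax_hex 0x7FFFFFFF  # FDBFBFBFBFBF, the 6-byte maximum
```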
To test their validity (in some fashion), I've used Gilles' UTF-8 regex:
perl -l -ne '/
^( [\000-\177] # 1-byte pattern
|[\300-\337][\200-\277] # 2-byte pattern
|[\340-\357][\200-\277]{2} # 3-byte pattern
|[\360-\367][\200-\277]{3} # 4-byte pattern
|[\370-\373][\200-\277]{4} # 5-byte pattern
|[\374-\375][\200-\277]{5} # 6-byte pattern
)*$ /x or print'
The output of perl's 'print chr' matches the filtering of Gilles' regex; one reinforces the validity of the other. I can't use iconv because it only handles the valid Unicode-standard subset of the broader (original) UTF-8 standard.
The numbers involved are rather large, so I've tested top-of-range, bottom-of-range, and several scans stepping by increments such as 11111, 13579, 33333, 53441... The results all match, so all that remains is to test the regex against these out-of-range UTF-8-style values (invalid for Unicode, and therefore also invalid for strict UTF-8).
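A strided scan only samples the range, which is why the top and bottom of the range are worth testing separately. The sketch below estimates how many values one scan visits and where it stops; the helper names are hypothetical, while the bounds and the step of 13247 are the ones set in the script's main loop.

```shell
# Hypothetical helpers: size and last value of a scan from beg to max in steps.
sample_count() (
  beg=$(( $1 )); max=$(( $2 )); step=$(( $3 ))
  echo $(( ($max - $beg) / $step + 1 ))
)
last_sample() (
  beg=$(( $1 )); max=$(( $2 )); step=$(( $3 ))
  echo $(( $beg + ($max - $beg) / $step * $step ))
)

n=$(sample_count 0x110000 0x7FFFFFFF 13247)
last=$(last_sample 0x110000 0x7FFFFFFF 13247)
printf 'samples=%d last=0x%X max=0x%X\n' "$n" "$last" $(( 0x7FFFFFFF ))
```

With these values the last sample falls short of 0x7FFFFFFF, so the maximum itself is never visited by this particular stride.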
Here is the test module:
[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }
# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0
: regex; Rx=1 # run-test=1 do-not-test=0
((strict=16)); mode[$strict]=strict # (iconv -f UTF-16BE, then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # (iconv -f UTF-32BE only)
# modebits=$strict
# UTF-8, in relation to UTF-16 has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictions
# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset of Big-Endian integers
# ( based on 'iconv' -f UTF-32BE -t UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range
Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # maximum chars * 4 ( '4' is for UTF-32BE )
#########################################################################
# 1 byte UTF-8 values: Nothing to do; no complexities.
#########################################################################
# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time \
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time \
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi
#########################################################################
# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time \
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo
# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s \
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s \
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time \
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(echo {0..160..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo
fi
#########################################################################
# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s \
# user 56m44.317s |
# sys 27m29.867s /
time \
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## f1 80 80 80 to f3 bf bf bf
# bin oct hex dec 040000 -- 0FFFFF
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi
########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to get the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the entire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################
################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################
# BYTE 1 BYTE 2 BYTE 3 BYTE 4
# 1: [\x00-\x7F]
# ===========
# ([\x00-\x7F])
#
# 2: [\xC2-\xDF] [\x80-\xBF]
# =================================
# ([\xC2-\xDF][\x80-\xBF])
#
# 3: [\xE0] [\xA0-\xBF] [\x80-\xBF]
# [\xED] [\x80-\x9F] [\x80-\xBF]
# [\xE1-\xEC\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]
# ==============================================
# ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4 [\xF0] [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
# [\xF4] [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]
# ===========================================================
# ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1: (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))
#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 3. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # start of scan; incremented as a decimal integer
decMax=$((0x7FFFFFFF)) # end of scan; compared as a decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 8-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec) # hexUTF32BE
# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [\000-\177]
| [\300-\337][\200-\277]
| [\340-\357][\200-\277]{2}
| [\360-\367][\200-\277]{3}
| [\370-\373][\200-\277]{4}
| [\374-\375][\200-\277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] \
&& rcIco32=0 || rcIco32=1
rcIco16=
elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)" \
&& rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed to produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'strict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
&& rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitrary bytes as-is
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
sort -R |
tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test==2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes $${rcIco16}${rcIco32}=,$rcRegex\t: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit
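The byte-range table that the script builds up in its comments can also be cross-checked without a regex engine or iconv, which gives a third, independent opinion on small cases. The following is a minimal sketch under that idea: the function name strict_utf8_table is hypothetical, it takes one byte per argument as two hex digits, and it walks the same lead-byte and second-byte ranges as the final strict regex.

```shell
# Hypothetical helper: table-driven strict UTF-8 check on hex byte arguments.
strict_utf8_table() (
  need=0           # continuation bytes still expected
  lo=128; hi=191   # allowed range (80-BF) for the next continuation byte
  for h in "$@"; do
    b=$(( 0x$h ))
    if [ "$need" -gt 0 ]; then
      if [ "$b" -lt "$lo" ] || [ "$b" -gt "$hi" ]; then
        echo invalid; return
      fi
      need=$(( $need - 1 )); lo=128; hi=191
    elif [ "$b" -le 127 ]; then :                              # 00-7F
    elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then need=1     # C2-DF
    elif [ "$b" -eq 224 ]; then need=2; lo=160                 # E0: A0-BF
    elif [ "$b" -eq 237 ]; then need=2; hi=159                 # ED: 80-9F
    elif [ "$b" -ge 225 ] && [ "$b" -le 239 ]; then need=2     # E1-EC, EE-EF
    elif [ "$b" -eq 240 ]; then need=3; lo=144                 # F0: 90-BF
    elif [ "$b" -ge 241 ] && [ "$b" -le 243 ]; then need=3     # F1-F3
    elif [ "$b" -eq 244 ]; then need=3; hi=143                 # F4: 80-8F
    else
      echo invalid; return                                     # 80-C1, F5-FF
    fi
  done
  # a sequence that ends mid-character is also invalid
  if [ "$need" -eq 0 ]; then echo valid; else echo invalid; fi
)

strict_utf8_table E0 A0 80    # lowest 3-byte form
strict_utf8_table E0 80 80    # overlong form, rejected by the E0 row
```

Any sequence on which this table, the regex and iconv disagree would point at a mistake in one of the three.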