What are VectorSource and VCorpus in the 'tm' (text mining) package in R?

9

I am not sure what exactly VectorSource and VCorpus are in the 'tm' package.

The documentation is unclear about these. Can anyone help me understand them in simple terms?

r text-mining

12

A "corpus" is a collection of text documents.

VCorpus in tm refers to a "volatile" corpus, which means the corpus is held in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, or Permanent Corpus, which is stored outside of memory, for example in a database.

To create a VCorpus using tm, we need to pass a "Source" object as a parameter to the VCorpus function. You can list the available sources with:
getSources()

[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

A source abstracts the input location, such as a directory or a URI. VectorSource is only for character vectors, with each element of the vector treated as one document.
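To show that a source only describes where the input comes from, here is a minimal sketch that builds a corpus from a directory with DirSource instead of a character vector. The temporary directory and the file names `a.txt` and `b.txt` are made up for illustration; they are not part of the original answer.

```r
library(tm)

# Hypothetical input: a temporary directory holding two small text files
tmp <- tempfile("docs")
dir.create(tmp)
writeLines("first file", file.path(tmp, "a.txt"))
writeLines("second file", file.path(tmp, "b.txt"))

# DirSource describes the directory; VCorpus reads one document per file
dirCorpus <- VCorpus(DirSource(tmp))
n_docs <- length(dirCorpus)
n_docs

# Clean up the temporary files
unlink(tmp, recursive = TRUE)
```

The only thing that changes compared with a VectorSource is the source object; the call to VCorpus stays the same.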

A simple example:

Suppose you have a character vector:

input <- c('This is line one', 'And this is the second one')

Create the source: vecSource <- VectorSource(input)

Then create the corpus: VCorpus(vecSource)

Hope this helps. You can read more here: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
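Putting the steps above together, a minimal runnable sketch might look like this. The variable name `corpus` and the two inspection calls at the end are additions for illustration, not part of the original answer.

```r
library(tm)

# A character vector: each element becomes one document
input <- c("This is line one", "And this is the second one")

# Wrap the vector in a source, then build the volatile corpus from it
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

# Number of documents in the corpus
length(corpus)

# Text of the first document
content(corpus[[1]])
```

Because the corpus is volatile, it exists only as long as the `corpus` object does; removing the object with `rm(corpus)` discards it.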

— Indi

5

In practical terms, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus as its default, which means some features of VCorpus will not be available. One that is immediately evident is that SimpleCorpus will not let you keep dashes, underscores, or other punctuation marks; SimpleCorpus (or Corpus) removes them automatically, while VCorpus does not. There are other limitations of Corpus that you will find in the help with ?SimpleCorpus.

Here is an example:

library(tm)

# Read a text file from the internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# Load the data as a corpus, once with Corpus and once with VCorpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will notice that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

While VCorpus keeps it together within the object.

Now suppose you do the matrix conversion for both:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187

Finally, let's look at the content. This is from Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And from VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Take a look at the words with punctuation. That is a huge difference, isn't it?
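If you want to keep using VCorpus but do not want punctuation attached to the terms, one option is to strip it explicitly with tm_map and removePunctuation before building the matrix. The sketch below uses a small made-up character vector (`txt`) rather than the downloaded speech so that it runs offline; the variable names are illustrative only.

```r
library(tm)

# Hypothetical two-document input standing in for the speech text
txt <- c("I have a dream.", "Let freedom ring, let freedom ring!")
v <- VCorpus(VectorSource(txt))

# Terms straight from the VCorpus keep their trailing punctuation
terms_raw <- Terms(DocumentTermMatrix(v))
punct_raw <- grep("[[:punct:]]", terms_raw, value = TRUE)

# Remove punctuation from every document, then rebuild the matrix
v_clean <- tm_map(v, removePunctuation)
terms_clean <- Terms(DocumentTermMatrix(v_clean))
punct_clean <- grep("[[:punct:]]", terms_clean, value = TRUE)

punct_raw
punct_clean
```

With VCorpus the cleaning step is your decision, so you can also choose to keep hyphenated terms such as "self-evident" by skipping it or by writing a narrower transformation.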

— f0nzie