Extract numbers from a vector of strings

108

I have strings like this:

years<-c("20 years old", "1 years old")

I want to extract only the numbers from this vector. The expected result is the vector:

c(20, 1)

How can I do this?

regex r

— user1471980

89

How about:

# pattern is by finding a set of numbers in the start and capturing them
as.numeric(gsub("([0-9]+).*$", "\\1", years))

or

# pattern is to just remove _years_old
as.numeric(gsub(" years old", "", years))

or

# split by space, get the element in first index
as.numeric(sapply(strsplit(years, " "), "[[", 1))

— Arun

1

Why is the .* necessary? If you want the digits at the start, why not use ^[[:digit:]]+?

— sebastian-c

2

The .* is necessary because you need to match the entire string. Without it, nothing is removed. Also note that sub can be used here instead of gsub.

— Matthew Lundberg
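To make the point about .* concrete, here is a minimal sketch comparing the pattern with and without it. The variable names digits_only_pattern and whole_string_pattern are illustrative and do not come from the thread.

```r
years <- c("20 years old", "1 years old")

# Without a trailing .*, the pattern matches only the leading digits,
# which are then replaced by themselves, so the strings come back unchanged.
digits_only_pattern <- sub("^([0-9]+)", "\\1", years)

# With .*, the pattern consumes the whole string and the replacement
# keeps just the captured digits.
whole_string_pattern <- sub("^([0-9]+).*$", "\\1", years)

print(digits_only_pattern)
print(as.numeric(whole_string_pattern))
```

Because each string has only one match anchored at the start, sub and gsub behave the same here.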

15

If the number is not necessarily at the start of the string, use this: gsub(".*?([0-9]+).*", "\\1", years)

— TMS

I want to get 27. I don't understand why adding a condition (such as an escaped "-") makes the result longer... gsub(".*?([0-9]+).*?", "\\1", "Jun. 27–30") gives [1] "2730", while gsub(".*?([0-9]+)\\-.*?", "\\1", "Jun. 27–30") gives [1] "Jun. 27–30"

— Lionel Trebuchon
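A possible explanation, sketched under two assumptions: that the separator in "Jun. 27–30" is an en dash (U+2013) rather than an ASCII hyphen, and that perl = TRUE is acceptable (the comment used the default engine). The variable names are illustrative.

```r
x <- "Jun. 27\u201330"  # \u2013 is the en dash used in the comment's example

# gsub replaces every match. The trailing lazy .*? matches nothing, so the
# pattern matches twice ("Jun. 27" and "\u201330") and both digit runs survive.
all_runs <- gsub(".*?([0-9]+).*?", "\\1", x, perl = TRUE)

# A greedy tail consumes the rest of the string, leaving only the first run.
first_run <- sub(".*?([0-9]+).*", "\\1", x, perl = TRUE)

# An ASCII hyphen in the pattern never matches the en dash, so there is
# no match at all and the input comes back unchanged.
no_match <- sub(".*?([0-9]+)-.*", "\\1", x, perl = TRUE)

print(all_runs)   # "2730"
print(first_run)  # "27"
print(no_match)
```

So the "longer" result in the second call is simply the original string, returned because the pattern did not match.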

67

Update: since extract_numeric is deprecated, we can use parse_number from the readr package.

library(readr)
parse_number(years)

Here is another option, using extract_numeric from tidyr:

library(tidyr)
extract_numeric(years)
#[1] 20  1

— Akrun

2

This works well for this application, but keep in mind that parse_number does not handle negative numbers. Try parse_number("–27,633").

— Nettle

@Nettle Yes, that's right, and it also won't work if there are multiple instances.

— akrun

3

The bug with parsing negative numbers has been fixed: github.com/tidyverse/readr/issues/308 readr::parse_number("-12,345") # [1] -12345

— Russ Hyde

66

I think substitution is an indirect way of getting to the solution. If you want to retrieve all the numbers, I recommend gregexpr:

matches <- regmatches(years, gregexpr("[[:digit:]]+", years))
as.numeric(unlist(matches))

If a string contains multiple matches, this will get all of them. If you are only interested in the first match, use regexpr instead of gregexpr, and you can skip the unlist.

— sebastian-c
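The difference between the two functions can be sketched as follows. The input vector, including the third element without digits, is made up for illustration.

```r
years <- c("20 years old and 21", "1 years old", "no digits here")
pattern <- "[[:digit:]]+"

# gregexpr finds every match; regmatches returns a list with one
# character vector per input string (empty when nothing matched).
all_matches <- regmatches(years, gregexpr(pattern, years))

# regexpr finds only the first match per string; regmatches returns a
# plain character vector and silently drops strings with no match.
first_matches <- regmatches(years, regexpr(pattern, years))

print(all_matches)
print(as.numeric(first_matches))
```

Note that the regexpr result can be shorter than the input, so it may not line up with the original vector when some strings contain no digits.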

1

I didn't expect it, but this solution is slower than any of the others by an order of magnitude.

— Matthew Lundberg

@MatthewLundberg gregexpr, regexpr, or both?

— sebastian-c

1

gregexpr. I hadn't tried regexpr until just now. Huge difference. Using regexpr puts it between Andrew's and Arun's solutions (second fastest) on a set of 1e6. Perhaps also interesting: using sub in Andrew's solution does not improve the speed.

— Matthew Lundberg
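Timings like these depend on the machine and R version, so it may be worth measuring locally. The sketch below is one way to do that with base R's system.time. The test data, the size of 1e5, and the names in approaches are assumptions, not the setup used in the comments.

```r
set.seed(42)
big <- paste(sample(1:99, 1e5, replace = TRUE), "years old")

# The approaches from the answers, wrapped as functions.
approaches <- list(
  gsub_capture = function(x) as.numeric(gsub("([0-9]+).*$", "\\1", x)),
  gsub_perl    = function(x) as.numeric(gsub("[^\\d]+", "", x, perl = TRUE)),
  gregexpr     = function(x) as.numeric(unlist(regmatches(x, gregexpr("[[:digit:]]+", x)))),
  regexpr      = function(x) as.numeric(regmatches(x, regexpr("[[:digit:]]+", x)))
)

# Elapsed seconds per approach.
timings <- sapply(approaches, function(f) system.time(f(big))[["elapsed"]])

# Run each approach again to confirm they all return the same numbers.
results <- lapply(approaches, function(f) f(big))

print(timings)
```

A single system.time call is noisy; repeating the measurement gives a steadier picture.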

This splits at the decimal point. For example, 2.5 becomes c('2', '5').

— MBorg
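One way around the decimal-point issue is to allow an optional fractional part in the pattern. This is a sketch with made-up input, not something proposed in the thread.

```r
x <- c("weighs 2.5 kg", "20 years old")

# Digits only: the decimal point separates "2" and "5" into two matches.
digits_only <- unlist(regmatches(x, gregexpr("[[:digit:]]+", x)))

# Digits optionally followed by a dot and more digits.
decimal_pattern <- "[0-9]+(\\.[0-9]+)?"
with_decimals <- as.numeric(unlist(regmatches(x, gregexpr(decimal_pattern, x))))

print(digits_only)    # "2" "5" "20"
print(with_decimals)  # 2.5 20.0
```

This pattern does not handle signs, exponents, or thousands separators.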

35

Here is an alternative to Arun's first solution, with a simpler Perl-like regular expression:

as.numeric(gsub("[^\\d]+", "", years, perl=TRUE))

— Andrew

as.numeric(sub("\\D+","",years)). If there are letters before and/or after the number, use gsub.

— Onyambu
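The difference between sub and gsub here shows up as soon as non-digits appear on both sides of the number. A small sketch with an invented second string:

```r
x <- c("20 years old", "aged 20 years")

# sub removes only the first run of non-digits.
sub_result <- sub("\\D+", "", x)

# gsub removes every run of non-digits.
gsub_result <- gsub("\\D+", "", x)

print(sub_result)               # "20" "20 years"
print(as.numeric(gsub_result))  # 20 20
```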

21

Or simply:

as.numeric(gsub("\\D", "", years))
# [1] 20  1

— 989
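One caveat worth keeping in mind with the delete-all-non-digits approach: when a string holds more than one number, the digits are glued together. A short illustration with a made-up input:

```r
x <- c("20 years old and 21", "1 years old")

# Removing every non-digit joins "20" and "21" into a single number.
merged <- as.numeric(gsub("\\D", "", x))

print(merged)  # 2021 1
```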

19

A pipelined stringr solution:

library(stringr)
years %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

— Joe

Thanks Joe, but this answer doesn't pick up a negative sign in front of a number in the string.

— Miao Cai
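If negative numbers matter, the pattern can allow an optional leading minus sign. The sketch below uses base R rather than stringr, and the input is invented for illustration.

```r
temps <- c("low of -5 degrees", "high of 12 degrees")

# An optional ASCII hyphen directly before the digits is kept as the sign.
signed <- as.numeric(regmatches(temps, regexpr("-?[0-9]+", temps)))

print(signed)  # -5 12
```

Be aware that this also treats the hyphen in a range such as "10-20" as a minus sign when all matches are extracted.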

18

You could get rid of all the letters too:

as.numeric(gsub("[[:alpha:]]", "", years))

This is likely less generalizable, though.

— Tyler Rinker

3

Oddly enough, Andrew's solution beats this by a factor of 5 on my machine.

— Matthew Lundberg

6

We can also use str_extract from stringr:

years<-c("20 years old", "1 years old")
as.integer(stringr::str_extract(years, "\\d+"))
#[1] 20  1

If a string contains several numbers and we want to extract all of them, we can use str_extract_all, which, unlike str_extract, returns all the matches.

years<-c("20 years old and 21", "1 years old")
stringr::str_extract(years, "\\d+")
#[1] "20"  "1"

stringr::str_extract_all(years, "\\d+")

#[[1]]
#[1] "20" "21"

#[[2]]
#[1] "1"

— Ronak Shah

5

Extract numbers from any string at the start position:

x <- gregexpr("^[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

Extract numbers from any string INDEPENDENT of position:

x <- gregexpr("[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

— sbaniwal

2

Following a post by Gabor Grothendieck on the r-help mailing list:

years<-c("20 years old", "1 years old")

library(gsubfn)
pat <- "[-+.e0-9]*\\d"
sapply(years, function(x) strapply(x, pat, as.numeric)[[1]])

— Juanbretti
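To see what the pattern "[-+.e0-9]*\\d" actually picks up, here is a sketch that applies it with base R's regexpr instead of gsubfn's strapply. The second input string is invented to exercise the sign and exponent characters.

```r
x <- c("20 years old", "rate -1.5e3 units")
pat <- "[-+.e0-9]*\\d"

# First match per string: a run of sign, dot, "e" and digit characters
# that ends in a digit.
tokens <- regmatches(x, regexpr(pat, x))
values <- as.numeric(tokens)

print(tokens)  # "20" "-1.5e3"
print(values)  # 20 -1500
```

The pattern is permissive: it can also match fragments such as "e5" that as.numeric cannot parse, which then become NA.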

2

Using the unglue package, we can do:

# install.packages("unglue")
library(unglue)

years<-c("20 years old", "1 years old")
unglue_vec(years, "{x} years old", convert = TRUE)
#> [1] 20  1

Created on 2019-11-06 by the reprex package (v0.3.0)

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md

— Moody_Mudskipper