
Instrument: 201301240005447
Recorded: 01/24/2013
Consideration: 0,125.00
Document Type: MORTGAGES
Pages: 17
Grantor: BYRES,CONNIE R / BYRES,SCOTT
Grantee: MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC
Legal Description: * St:5495 MCNAMARA LN City:FLINT PrpID:1135532002 CC:11 T:8 R:7 S:35 Ext:PT OF NE4
 *
---------------------------------/---------------------------------
Instrument: 201301240005408
Recorded: 01/24/2013
Consideration: ,124.00
Document Type: MORTGAGES
Pages: 17
Grantor: SANNE,BETTY LOU / SANNE,KENNETH D
Grantee: JPMORGAN CHASE BANK NA
Legal Description: Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpID:4024106003 CC:54
 *
---------------------------------/---------------------------------
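Some character strings recur throughout the file, such as "Instrument", "Grantor", and "PrpID". How can I import this into R? Would it involve some kind of parsing or scraping?

Needless to say, I tried importing this file into Excel and could not get it to work. I think R will handle it better; I just need to figure out how. Thanks.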
Solution

I wrote a very general parsing function that can handle any pattern of divider lines and field-value separators, both specified as regular-expression parameters. It can optionally strip trailing whitespace from the field values, and it passes its variadic arguments through to the single data.frame() call that builds the resulting data.frame.

    sectionedFIEldlinesToFrame <- function(lines,divRE,sepRE,select,rtw=TRUE,...) {
        ## indexes of the divider lines that separate sections
        divLineIndexes <- grep(perl=TRUE,divRE,lines);
        ## remove possible leading and trailing divs, for robustness
        if (length(divLineIndexes)>0L && divLineIndexes[1L]==1L) {
            leadDivCount <- match(TRUE,c(diff(divLineIndexes)!=1L,TRUE));
            lines <- lines[-seq_len(leadDivCount)];
            divLineIndexes <- divLineIndexes[-seq_len(leadDivCount)]-leadDivCount;
        }; ## end if
        if (length(divLineIndexes)>0L && divLineIndexes[length(divLineIndexes)]==length(lines)) {
            trailDivCount <- match(TRUE,c(rev(diff(divLineIndexes)!=1L),TRUE));
            lines <- lines[-seq(to=length(lines),length.out=trailDivCount)];
            divLineIndexes <- divLineIndexes[-seq(to=length(divLineIndexes),length.out=trailDivCount)];
        }; ## end if
        ## get fields to extract
        if (missing(select)) {
            allFieldLineIndexes <- grep(perl=TRUE,sepRE,lines);
            fields <- unique(sub(perl=TRUE,paste0(sepRE,'.*'),'',lines[allFieldLineIndexes]));
        } else {
            fields <- select;
        }; ## end if
        ## extract each field vector and build the data.frame
        do.call(data.frame,c(setNames(lapply(fields,function(field) {
            fieldLineIndexes <- grep(perl=TRUE,paste0('^\\Q',field,'\\E',sepRE),lines);
            sectionIndexes <- findInterval(fieldLineIndexes,divLineIndexes); ## 0-based
            values <- sub(perl=TRUE,paste0('^.*?',sepRE),'',lines[fieldLineIndexes]);
            if (rtw) values <- sub(perl=TRUE,'\\s+$','',values);
            ## one element per section; NA where the field is absent
            values[match(seq(0L,length(divLineIndexes)),sectionIndexes)];
        }),fields),...));
    }; ## end sectionedFIEldlinesToFrame()

Here is how to use it:
    filename <- 'data.txt';
    divRE <- '^-+/-+$';
    sepRE <- ':\\s*';
    df <- sectionedFIEldlinesToFrame(readLines(filename),divRE,sepRE,stringsAsFactors=FALSE);
    str(df);
    ## 'data.frame': 2 obs. of  8 variables:
    ##  $ Instrument       : chr  "201301240005447" "201301240005408"
    ##  $ Recorded         : chr  "01/24/2013" "01/24/2013"
    ##  $ Consideration    : chr  "0,125.00" ",124.00"
    ##  $ Document.Type    : chr  "MORTGAGES" "MORTGAGES"
    ##  $ Pages            : chr  "17" "17"
    ##  $ Grantor          : chr  "BYRES,CONNIE R / BYRES,SCOTT" "SANNE,BETTY LOU / SANNE,KENNETH D"
    ##  $ Grantee          : chr  "MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC" "JPMORGAN CHASE BANK NA"
    ##  $ Legal.Description: chr  "* St:5495 MCNAMARA LN City:FLINT PrpID:1135532002 CC:11 T:8 R:7 S:35 Ext:PT OF NE4" "Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpID:4024106003 CC:54"
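You can also pass the select argument to choose exactly which fields to extract:

    select <- c('Instrument','Pages','Grantor');
    df <- sectionedFIEldlinesToFrame(readLines(filename),divRE,sepRE,select,stringsAsFactors=FALSE);
    df;
    ##        Instrument Pages                           Grantor
    ## 1 201301240005447    17      BYRES,CONNIE R / BYRES,SCOTT
    ## 2 201301240005408    17 SANNE,BETTY LOU / SANNE,KENNETH D

I have tried to make it as robust as possible. It carefully handles possible redundant leading and trailing divider lines, and it correctly handles the case where fields are inconsistent between sections.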
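That last point is worth emphasizing. All of the other solutions offered make very fragile assumptions about the input data: either that every section has exactly 8 fields, always in the same order, or that every (possibly hard-coded) field name occurs in every section. If that assumption is violated, those solutions become useless. My function makes no assumptions about the number, names, or consistency of the fields. It dynamically retrieves every field name present in any section and builds the correct vector for each field, producing an NA element wherever a field is absent from a given section.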
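The NA handling described here comes down to two base R calls, findInterval() and match(). The following is a minimal sketch of that idea on a toy input, with the field name hard-coded for brevity; the variable names are illustrative and are not taken from the answer's function.

```r
# Toy input: three sections separated by '-' lines.
# Field A is missing from the middle section.
lines <- c("A:a", "B:b", "-", "B:c", "-", "A:d")

# Positions of the divider lines.
div_idx <- grep("^-$", lines)

# Find every "A:" line and the 0-based section it falls in:
# findInterval() counts how many dividers precede each line.
a_idx <- grep("^A:", lines)
a_section <- findInterval(a_idx, div_idx)
a_values <- sub("^A:", "", lines[a_idx])

# One slot per section. match() returns NA for sections that have
# no "A:" line, and indexing by NA yields an NA element.
n_sections <- length(div_idx) + 1L
a_column <- a_values[match(seq_len(n_sections) - 1L, a_section)]
print(a_column)
```

Repeating this for each field name gives columns of equal length, which is what lets them be combined into one data.frame even when sections disagree about which fields they contain.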
Here are some examples:

    sectionedFIEldlinesToFrame(character(),'^-$',':');
    ## data frame with 0 columns and 0 rows
    sectionedFIEldlinesToFrame(rep('-',2L),'^-$',':');
    ## data frame with 0 columns and 0 rows
    sectionedFIEldlinesToFrame(c('A:a','-'),'^-$',':');
    ##   A
    ## 1 a
    sectionedFIEldlinesToFrame(c('A:a','-','B:b'),'^-$',':');
    ##      A    B
    ## 1    a <NA>
    ## 2 <NA>    b
    sectionedFIEldlinesToFrame(c('A:a','B:b','-','B:c'),'^-$',':');
    ##      A B
    ## 1    a b
    ## 2 <NA> c
    sectionedFIEldlinesToFrame(c('A:a','B:b','-','B:c','-','A:d'),'^-$',':');
    ##      A    B
    ## 1    a    b
    ## 2 <NA>    c
    ## 3    d <NA>
    sectionedFIEldlinesToFrame(c('-','A:a','B:b','-','B:c','-','A:d','C:e'),'^-$',':');
    ##      A    B    C
    ## 1    a    b <NA>
    ## 2 <NA>    c <NA>
    ## 3    d <NA>    e
    sectionedFIEldlinesToFrame(c('-','A:a','B:b','-','-','B:c','-','A:d','C:e'),'^-$',':');
    ##      A    B    C
    ## 1    a    b <NA>
    ## 2 <NA> <NA> <NA>
    ## 3 <NA>    c <NA>
    ## 4    d <NA>    e
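The question also mentions PrpID, which sits inside the Legal Description value rather than on a line of its own, so the function leaves it embedded in that column. One possible follow-up step, sketched here under assumptions, is to pull single-token sub-fields out with a regular expression. extract_key() is a hypothetical helper, not part of the answer; it assumes keys are plain alphanumeric strings and that each value is one run of non-space characters, so multi-word values such as St would be cut at the first space.

```r
# Sample values adapted from the Legal Description lines in the question.
legal <- c(
  "* St:5495 MCNAMARA LN City:FLINT PrpID:1135532002 CC:11 T:8 R:7 S:35 Ext:PT OF NE4",
  "Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpID:4024106003 CC:54"
)

# Hypothetical helper: returns the run of non-space characters that
# follows "key:", or NA when the key does not occur in the string.
extract_key <- function(x, key) {
  pattern <- paste0("(^|\\s)", key, ":(\\S+)")
  hits <- regmatches(x, regexec(pattern, x))
  vapply(
    hits,
    function(hit) if (length(hit) == 3L) hit[3L] else NA_character_,
    character(1)
  )
}

prp_id <- extract_key(legal, "PrpID")
cc <- extract_key(legal, "CC")
lot <- extract_key(legal, "Lt")  # absent from the first record, so NA there
print(data.frame(PrpID = prp_id, CC = cc, Lt = lot, stringsAsFactors = FALSE))
```

The results could be added as extra columns of the parsed data.frame, for example by calling the helper on its Legal.Description column.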