For one of my projects I am using Python scripts to scrape web pages. Unfortunately, the dates on the pages are not in a consistent format. Some look like Jun 19 2014, others like 28-Mar-14, and yet others like 2010-Sep-20. The trickiest are the ones like Jun-10. Thanks to plannapus on Stack Overflow, I have a nifty R solution to this problem.
The solution is posted here.
The as.Date function returns NA when a format does not match the data. You run as.Date with multiple date formats and combine the various outputs to arrive at a definitive list. The function he wrote for this purpose is below:
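multidate <- function(data, formats) {
  a <- list()
  for (i in seq_along(formats)) {
    # Parse everything with the i-th format; non-matching entries become NA
    a[[i]] <- as.Date(data, format = formats[i])
    # Copy each successful parse into the first result
    # (a later format overrides an earlier one)
    a[[1]][!is.na(a[[i]])] <- a[[i]][!is.na(a[[i]])]
  }
  a[[1]]
}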
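To see the function at work, here is a small sketch that runs it on the three sample dates from the top of the post. The format strings and their order are my own assumptions, not part of the original answer, and the month names assume an English locale. The function is repeated so that the snippet runs on its own.

```r
# multidate() from the post, repeated so this snippet is self-contained
multidate <- function(data, formats) {
  a <- list()
  for (i in seq_along(formats)) {
    a[[i]] <- as.Date(data, format = formats[i])
    a[[1]][!is.na(a[[i]])] <- a[[i]][!is.na(a[[i]])]
  }
  a[[1]]
}

# Sample strings in the three formats mentioned at the top of the post
dates <- c("Jun 19 2014", "28-Mar-14", "2010-Sep-20")

# Assumed format strings. Later formats override earlier ones,
# so the two-digit-year format is deliberately placed last.
formats <- c("%b %d %Y", "%Y-%b-%d", "%d-%b-%y")

parsed <- multidate(dates, formats)
print(parsed)
```

The order matters here because %Y also accepts years with fewer than four digits, so 28-Mar-14 matches %Y-%b-%d as 14 March of the year 28. Putting %d-%b-%y last lets the correct reading overwrite that one.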
The only problem I faced was with the %B-%y format (dates like Jun-10). On its own it is ambiguous, because there is no day of the month, so as.Date returned NA instead of matching it. My workaround was to run the function once, pick out the entries still marked NA, and prefix 01- to them. The %B-%y strings then became %d-%B-%y strings, which the function parsed successfully.
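The workaround can be sketched roughly as follows. This is a minimal illustration with made-up variable names (first_pass, patched, second_pass), and it uses plain as.Date with a single format to keep the snippet short; the multi-format function would slot in the same way. It assumes an English locale and that every leftover NA really is a month-year string.

```r
dates <- c("28-Mar-14", "Jun-10")

# A month-year string has no day, so as.Date cannot build a date from it
month_only <- as.Date("Jun-10", format = "%b-%y")
print(is.na(month_only))

# First pass: entries that do not match the format come back as NA
first_pass <- as.Date(dates, format = "%d-%b-%y")

# Prefix "01-" to the unparsed entries so they gain a day of the month
unparsed <- is.na(first_pass)
patched <- dates
patched[unparsed] <- paste0("01-", dates[unparsed])

# Second pass: the patched strings now fit the day-month-year format
# (the post writes %B; on input R accepts abbreviated names for either)
second_pass <- as.Date(patched, format = "%d-%b-%y")
print(second_pass)
```

Note that every date recovered this way is pinned to the first of the month, which is fine only if the day does not matter for the analysis.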