R and Messy Date Formats in Data

For one of my projects I am using Python scripts to scrape web pages. Unfortunately, the dates on the pages are not in a consistent format. Some look like Jun 19 2014, others like 28-Mar-14, and still others like 2010-Sep-20. The trickiest are month-year values such as Jun-10. Thanks to plannapus on Stack Overflow, I have a nifty R solution to this problem.

The solution is posted here.

The as.Date function returns NA for any value that does not match the given format. So you run as.Date once per candidate format and combine the outputs into a single vector of dates. The function he wrote for this purpose is below:


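multidate <- function(data, formats){
    a <- list()
    # Parse the data once per format; entries that do not match come back as NA.
    for(i in seq_along(formats)){
        a[[i]] <- as.Date(data, format=formats[i])
        # Copy every successful parse into the first result vector.
        # Note that a later format overwrites an earlier match.
        a[[1]][!is.na(a[[i]])] <- a[[i]][!is.na(a[[i]])]
    }
    # Entries that matched none of the formats are still NA here.
    a[[1]]
}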
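To see the idea in isolation, here is a minimal sketch that uses as.Date directly on two of the sample strings from above. The variable names are mine, and the sketch assumes English month names, so it switches LC_TIME to the C locale first.

```r
# Assumption: month names are English, so force the C locale for date parsing.
invisible(Sys.setlocale("LC_TIME", "C"))

# Two sample strings in two of the formats mentioned in this post.
dates <- c("Jun 19 2014", "28-Mar-14")

# Each call parses only the entries that match its format; the rest become NA.
a <- as.Date(dates, format = "%b %d %Y")   # matches only the first entry
b <- as.Date(dates, format = "%d-%b-%y")   # matches only the second entry

# Fill the gaps in the first result with the matches from the second.
gap <- is.na(a)
combined <- a
combined[gap] <- b[gap]
print(combined)
```

This version fills only the gaps, so the first matching format wins. The multidate function instead lets later formats overwrite earlier ones, so the order of the formats can matter when a string matches more than one.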

The only problem I faced was with the %B-%y format (values like Jun-10). It has no day of the month, and as.Date needs the day whenever a month is given, so these values were never matched. What I did was run the function once, take the values still marked NA, and prefix 01- to them. That turned the %B-%y format into %d-%B-%y, which the function parsed successfully.
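The workaround can be sketched roughly as follows. This is not the exact code I ran: it calls as.Date directly instead of the function above, the variable names are made up for illustration, and it again assumes English month names.

```r
# Assumption: month names are English, so force the C locale for date parsing.
invisible(Sys.setlocale("LC_TIME", "C"))

dates <- c("Jun-10", "28-Mar-14")

# First pass: the month-year entry has no day, so it comes back as NA.
# On input, %B also accepts abbreviated month names such as "Jun".
parsed <- as.Date(dates, format = "%d-%B-%y")

# Second pass: prefix "01-" to the entries that failed and parse them again.
unparsed <- is.na(parsed)
parsed[unparsed] <- as.Date(paste0("01-", dates[unparsed]), format = "%d-%B-%y")
print(parsed)
```

Every month-year value ends up dated to the first of its month, which is worth keeping in mind if the day matters later on.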
