Advanced Data Cleaning and Processing Algorithms Using R Language

Introduction

With the massive explosion of data generated and published in the past few decades comes new challenges. Data analysts and scientists now have to discover patterns in this unprecedented amount of raw and often redundant data. In order to do that, data must go through a very critical and sensitive process of cleaning to produce high quality data that can later be used to estimate and draw generalizable conclusions.

In this report datasets provided by Credit Union Cherry Blossom Ten Mile Run were used. Cherry Blossom started in 1973 with the primary objective of training elite runners who were aiming to participate in the Boston Marathon. Since then, it has gained more popularity. In 2012 nearly 17,000 runners ranging in age from 9 to 89 participated. The organizers publish the results after each year’s race at http://www.cherryblossom.org/. The datasets include the runners’ names, ages, and hometowns which act as a valuable resource for forming a better understanding of the relationship between age and performance.

Loading the data

13 text files for each gender containing data from 1999 to 2012 was rearranged in one text files with 150000 observations. This process would be very tedious and time consuming if done manually considering how both the format and the manner the information was reported is different from year to year. These changes can be seen as differences in the format of the table header and the use of footnotes. The tables had many errors including values that begin in the wrong column, missing headers, as well as other mistakes

After Fix

A customized programming approach was used to transform the text files to tables, the following codes shows the two main functions for finding the columns locations and seperating the rows to each columns

findLocs = function(spaceRow) {
  
  rowLength = nchar(spaceRow)
  searchLocs <- c(0,gregexpr(" ",spaceRow)[[1]])
  if (substring(spaceRow, rowLength, rowLength) != " ")
   return(c(searchLocs,rowLength +1))
  else return(searchLocs)
    
}

MakeColumns <- function(rowData,year){
  
  indexEq <-grep('^==',rowData)
  headerRow <- rowData[indexEq-1]
  body <- rowData[indexEq+1:length(rowData)]
  spaceRow <- rowData[indexEq]
  headerRow <- tolower(headerRow)
  searchLocs <- findLocs(spaceRow)
  df <- mapply(substr,list(body),start=searchLocs[-length(searchLocs)]+1,
        stop = searchLocs[-1]-1)
  colnames(df)<-mapply(substr,list(headerRow),start=searchLocs[-length(searchLocs)]+1,
                       stop = searchLocs[-1]-1)
  
  df <- select(df,c('place','ag','name','hometown','time'))
  return(df)
}

Inspecting/Cleaning The Data

After loading the data we need to do initial inspecting to find any out-of-the-ordinary patterns. by the first inspection, we see the age values for 2003 and 2006 look off, a box plot can be a great tool to detect errors like this, going back to the original text files, and manipulate align the columns better we were able to fix it.

Table 1: An example table caption.
place	ag	name	hometown	time	year	gender
1	28	Worku Bikila	Ethiopia	46:59	1999	Male
2	24	Lazarus Nyakeraka	Kenya	47:01	1999	Male
3	27	James Kariuki	Kenya	47:03	1999	Male
4	28	William Kiptum	Kenya	47:07	1999	Male
5	26	Joseph Kimani	Kenya	47:31	1999	Male
6	25	Josphat Machuka	Kenya	47:33	1999	Male

Before Fix

After Fix

Formating Columns

The next step is to formate the columns correctly. In order to be able to manipulate the time, a new formate should be made. from h:m:s to number of minutes

This can easily be made with a bit of math.

toMinutes <- function(x){
  
  
  if(length(x)==2){
    return(x[1]+x[2]/60)}
  else{return(x[1]*60 + x[2] +x[3]/60)
  }
}

Further Inspection

We once again used boxplot to summarize the number of minutes across both genders across the years, no outliers are found, although 505 missing data exist out of 15000. scatter plot is a great tool to quickly visualize the data and find any obvious trends which can tell us if we have cleaned data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   45.25   84.72   96.47   97.36  108.92  177.52     505

plotting scatter plot scatterPlot

Initial Analysis

This sequence of boxplots shows the quartiles of time for men grouped into 10-year age intervals. As age increases, all the quartiles increase. However, the box becomes asymmetrical with age, which indicates that the upper quartile increases faster than the median and lower quartile.

Conclusion

In this project, we demonstrated how to load data from multiple unsynchronized text files into one data frame. In a corporate scenario where such a task is needed, a multidisciplinary team usually handles the cleaning and organizing of the text files and loads them into a database to be furthered analyzed inside a GUI application. Programming language can speed up the cleaning process while simultaneously allow us to discover and analyze the data.