# Introduction to R for Data Science :: Session 1

This article was first published on **The Exactness of Mind** and kindly contributed to R-bloggers.

Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

**Lecturers**

- dipl. ing Branko Kovač, Data Analyst at CUBE, Data Science Mentor at Springboard, Institut savremenih nauka, Data Science Serbia
- Goran S. Milovanović, PhD, Data Science Serbia

**Summary of Session 1, 28 April 2016 :: Introduction to R**

*Elementary data structures, data.frames + an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot(), abline(), and fancy with ggplot().*
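As a warm-up for the topics listed above, here is a minimal sketch of vectors, lists, and data.frames on toy values. The object names `v`, `l`, `df`, and `mixed`, and all the values, are made up for illustration and are not part of the course data.

```r
# Toy objects; names and values are illustrative assumptions
v  <- c(1.5, 2.5, 3.5)                      # atomic vector: all elements share one type
l  <- list(id = 1L, tag = "a", ok = TRUE)   # list: elements may have different types
df <- data.frame(x = 1:3,
                 y = c("a", "b", "c"),
                 stringsAsFactors = FALSE)  # data.frame: a list of equal-length columns

typeof(v)          # "double"
typeof(df)         # "list"
class(df)          # "data.frame"

# subsetting: [ ] keeps the container, [[ ]] extracts the element itself
class(l["tag"])    # "list"
class(l[["tag"]])  # "character"
class(df["y"])     # "data.frame"
class(df[["y"]])   # "character"

# coercion: mixing types in c() converts everything to the most general type
mixed <- c(1, "two", TRUE)
typeof(mixed)      # "character"
```

The same `[ ]` versus `[[ ]]` distinction is what the session script explores on the real data set.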

**Intro to R for Data Science SlideShare :: Session 1**

**R script + Data Set :: Session 1**

```r
########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################

# This is an R comment: it begins with "#" and runs to the end of the line

# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida

# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures

# First question: where are we?
getwd(); # this will tell you the path to the R working directory

# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # now filesDir is of character type; there are classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)

# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check

# Read some data in csv (comma separated values
# - it might turn out that you will be using these very often)
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=TRUE,
                    check.names=FALSE,
                    stringsAsFactors=FALSE,
                    row.names=NULL);
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one

# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!

# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []

# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names

# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame

# NOTE: the definition of testWord was lost from this listing. The comments below only
# tell us that its first element is 'Ana' and that it has at least three elements;
# the other two values here are placeholders (an assumption).
testWord <- c("Ana", "word2", "word3");
testWord
testWord[[1]];
testWord[[1:2]]; # error
testWord[1:2]; # similar
dataSet[1:2]; # first two columns of a dataSet

# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R

# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!

# finding elements of vectors
# NOTE: the right-hand side was lost from this listing; which() is an assumed
# reconstruction, based on the mention of which() a few lines below
w <- which(testWord == "Ana");
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes in handy...
head(dataSet,5);
# ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()

# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
# NOTE: the assignment was lost from this listing; as.logical() is an assumed reconstruction
reRelease <- as.logical(dataSet$reRelease);
reRelease
is.logical(reRelease);

# vectors, sequences...
# type conversion (coercion) in R: from real to integer
x <- 2:10;
# is the same as...
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0)
# check documentation
?as.integer
# enjoy...

# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)

# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates

# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?

# What is the size of the data set?
# NOTE: the assignment was lost from this listing; nrow() is an assumed reconstruction
n <- nrow(dataSet);
n
# any missing data? (these count the non-missing values; compare them with n)
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));

# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);

# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue, method="pearson");
cPearson
```
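The "carefully!" comparison of `as.integer()` with `round()` in the script above is worth unpacking: `as.integer()` drops the fractional part, while `round()` goes to the nearest whole number. A small sketch on made-up values (the vector `z` is an illustrative assumption, not course data):

```r
z <- c(2.2, 2.7, -2.7)           # toy values, chosen to show the difference

as.integer(z)                    # 2  2 -2   truncation toward zero
round(z, 0)                      # 2  3 -3   nearest whole number
floor(z)                         # 2  2 -3   always down
ceiling(z)                       # 3  3 -2   always up

as.integer(z) == round(z, 0)     # TRUE FALSE FALSE

# round() still returns a double, even when the value is whole
is.integer(round(z, 0))          # FALSE

# halves are rounded to the even neighbour
round(c(0.5, 1.5, 2.5))          # 0 2 2
```

So the element-wise comparison in the script returns `FALSE` wherever the fractional part is 0.5 or more.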

```r
# hm, maybe I should use non-parametric correlation instead
cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue, method="spearman");
cSpearman

# log-transform will not help much in this case...
hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
hist(log(dataSet$totalRevenue),20);
```
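To see why the two coefficients can disagree, it may help to remember that Spearman's coefficient is Pearson's coefficient computed on ranks. A minimal sketch on synthetic data (the variables `x`, `y`, and `rhoByHand`, the seed, and the exponential relationship are all assumptions chosen for illustration):

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- runif(50, 1, 10)             # synthetic "cost"
y <- exp(x / 2)                   # monotone but strongly non-linear "revenue"

cor(x, y, method = "pearson")     # below 1: the relationship is not a straight line
cor(x, y, method = "spearman")    # 1 (up to rounding): y is ordered exactly like x

# Spearman's rho is Pearson's r applied to the ranks
rhoByHand <- cor(rank(x), rank(y), method = "pearson")
all.equal(rhoByHand, cor(x, y, method = "spearman"))
```

A Spearman value clearly above the Pearson value hints at a monotone but non-linear relationship, or at a few extreme points dominating the Pearson estimate.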

```r
# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients

# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table
vcov(reg) # covariance matrix for model parameters

# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() is a generic function, check it out
?abline
```
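The script above writes the formula with `dataSet$...` on both sides. A common alternative idiom is to name the columns in the formula and pass the data.frame through the `data` argument, which also lets `predict()` work on new data. The sketch below uses synthetic data; the data.frame `toy`, its columns `cost` and `revenue`, and the numbers are assumptions for illustration only.

```r
set.seed(42)                                   # arbitrary seed
toy <- data.frame(cost = seq(100, 1000, by = 100))
toy$revenue <- 50 + 1.8 * toy$cost + rnorm(nrow(toy), sd = 20)

fit <- lm(revenue ~ cost, data = toy)
coef(fit)

# in simple regression the slope is cov(x, y) / var(x),
# and the intercept follows from the two means
slopeByHand     <- cov(toy$cost, toy$revenue) / var(toy$cost)
interceptByHand <- mean(toy$revenue) - slopeByHand * mean(toy$cost)
all.equal(unname(coef(fit)), c(interceptByHand, slopeByHand))

# because the formula refers to column names, predict() accepts new data
newCosts <- data.frame(cost = c(250, 1200))
predict(fit, newdata = newCosts)
```

With the `dataSet$...` form, `predict()` cannot match new data to the model terms, so the `data =` form is usually the more convenient one once prediction is needed.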

```r
# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet, aes(x = productionCost, y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n");
print(g);
```
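The shaded band drawn by `geom_smooth(method=lm, se=TRUE)` is, to my understanding, a 95% confidence band for the fitted line. A rough base-R sketch of the same idea uses `predict()` with `interval = "confidence"`. The data here are synthetic, and `toy`, `grid`, and `band` are hypothetical names introduced for this example.

```r
set.seed(7)                                    # arbitrary seed
toy <- data.frame(cost = runif(40, 100, 1000))
toy$revenue <- 50 + 1.8 * toy$cost + rnorm(40, sd = 150)

fit <- lm(revenue ~ cost, data = toy)

# evaluate the fitted line and its confidence limits on an even grid of costs
grid <- data.frame(cost = seq(min(toy$cost), max(toy$cost), length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
head(band)                                     # columns: fit, lwr, upr

plot(toy$cost, toy$revenue, xlab = "cost", ylab = "revenue")
lines(grid$cost, band[, "fit"])                # fitted line
lines(grid$cost, band[, "lwr"], lty = 2)       # lower confidence limit
lines(grid$cost, band[, "upr"], lty = 2)       # upper confidence limit
```

The band is narrowest near the mean of the predictor and widens toward the edges of the data, which is the same shape the ggplot2 ribbon shows.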

```r
# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?

# print is also a generic function in R: for example,
print("Doviđenja i uživajte u praznicima uz gomilu materijala za čitanje i vežbu!")
# (Serbian for: "Goodbye, and enjoy the holidays with a pile of material for reading and practice!")

# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg)
# etc.
```
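Q1 and Q2 are left open in the script. One possible starting point, sketched here on synthetic data rather than on the RKO data set, is to look at R-squared for overall fit and at Cook's distance for influential observations. The data.frame `toy`, the planted outlier, and the `4 / n` cutoff (a common rule of thumb, not a fixed standard) are all assumptions for illustration.

```r
set.seed(3)                                             # arbitrary seed
toy <- data.frame(cost = 1:20)
toy$revenue <- 2 * toy$cost + rnorm(20, sd = 1)
toy <- rbind(toy, data.frame(cost = 40, revenue = 0))   # planted outlier, row 21

fit <- lm(revenue ~ cost, data = toy)

# Q1: how much of the variance does the model explain?
summary(fit)$r.squared

# Q2: which observations move the fitted line the most?
d <- cooks.distance(fit)
which.max(d)                        # the planted point, row 21
which(d > 4 / nrow(toy))            # rule-of-thumb cutoff, an assumption

plot(fit, which = 4)                # Cook's distance plot from plot.lm
```

Refitting without the flagged rows and comparing the coefficients is a simple way to judge how "dangerous" an outlier really is.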

**Readings :: Session 2 [5 May 2016, @Startit.rs, 19h CET]**

Chapters 1 - 5, **The Art of R Programming, Norman Matloff**

- Intro to R
- Vectors and Matrices
- Lists
