Our guest author today is Joachim Schork from Statistics Globe. I asked Joachim to give us an intro to a data language that we haven’t covered so far here on Data36. **That’s R…** Or using its official name: the **R Programming Language.** In this article, Joachim will guide us through the fundamentals, and **you’ll learn what you can achieve with some basic R.** Feel free to copy-paste the code from this article and follow along on your computer. And don’t worry if you don’t get everything 100% at first. This article is meant to demonstrate R. If you want to actually learn it from scratch, I encourage you to check out Joachim’s blog and his YouTube channel, which are linked at the end of this article.

The R programming language is a powerful tool that is getting more and more common in the fields of statistics and social sciences.

In this article, I’ll introduce some basic features of the R programming language, and I’ll explain how to use R to manipulate, analyze, and visualize your data.

Let’s first dive into some background information about the R programming language!

## Background Information about the R Programming Language

R was created by Ross Ihaka and Robert Gentleman in the early 1990s, was released as free software in 1995, and is based on the S programming language. The R software is open source, i.e. freely available for use, development, and distribution.

Even though the basic installation of the R programming language provides many powerful functions, its main strength comes from the many add-on packages that have been developed by the R programming community. As of June 2021, more than 17,000 packages are available on the Comprehensive R Archive Network (CRAN).

The popularity of R keeps increasing, and especially in the fields of statistics and social sciences, R is becoming a major software tool.

In the following sections of this article, I’ll show some useful ways to use R in practice, including example data and example R code that you can run yourself on your own machine. So keep on reading!

## Data Manipulation in R

A major strength of the R programming language is its easy-to-use and comprehensive set of features for handling and manipulating data sets.

For illustration, we will create our own data set. We will also use this data set in the later sections of this article, so make sure to understand the structure of our data!

If you don’t have R and RStudio installed yet, just follow Tomi’s installation guide here.

*Note: for now, just follow my lead, generate the data — if you don’t get everything 100% at first, don’t worry, it’ll all fall into place later!*

### Let’s generate some data!

As the first step, we have to specify the sample size of our data. The following R code creates a new data object called N, which contains the value 1000 (i.e. our sample size).

```r
N <- 1000 # Specify sample size
```

Since we want to create some randomly distributed data later on, we also should set a random seed:

*(Note: a random seed initializes a pseudorandom number generator and hence ensures the reproducibility of our code)*

```r
set.seed(1357531) # Set random seed for reproducibility
```
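
To see what the seed does in practice, here is a minimal sketch that draws the same random numbers twice. The object names `draw_a` and `draw_b` and the seed value `123` are my own choices for this illustration and are not used elsewhere in the article.

```r
set.seed(123)               # Fix the state of the random number generator
draw_a <- rnorm(3)          # First draw of three random numbers

set.seed(123)               # Reset the generator to the same state
draw_b <- rnorm(3)          # Second draw repeats the first one

identical(draw_a, draw_b)   # Check that both draws are the same
```

The last line should return `TRUE`, because both draws start from the same seed.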

Next, we generate some randomly distributed variables. In the following R code, we use the `rnorm` function to create normally distributed variables, the `runif` function to create a uniformly distributed variable, the `rpois` function to create a Poisson-distributed variable, and the `rbinom` function to create a binomial group indicator:

```r
x1 <- rnorm(N) # Create random variables
x2 <- runif(N) + 0.25 * x1
x3 <- rpois(N, 3) - 0.3 * x1 + 0.5 * x2
y <- rnorm(N, 2, 3) + 5 * x1 + 2 * x2 + 0.5 * x3
group <- LETTERS[rbinom(N, 1, prob = rank(rank(y) * rank(- x1)) / N) + 1]
```
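
The last line of this code may look cryptic at first. As a small sketch of the idea: `rbinom(N, 1, ...)` returns zeros and ones, and adding 1 turns them into the positions 1 and 2 of the built-in `LETTERS` vector. The objects `indicator` and `group_labels` below are hypothetical and only serve this illustration.

```r
indicator <- c(0, 1, 1, 0)                # Hypothetical 0/1 values, like rbinom(N, 1, ...) returns
group_labels <- LETTERS[indicator + 1]    # Position 1 selects "A", position 2 selects "B"
group_labels
```

This is why our group indicator consists of the letters A and B.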

### Put the data in a data frame!

In the next step, we combine all these variables in a data frame object by using the `data.frame` function:

```r
data <- data.frame(x1, x2, x3, y, group) # Store all variables in data frame
```

Then, we can use the `head` function to print the first six rows of our example data:

```r
head(data) # Print head of data frame
```

The previous output of the RStudio console shows the structure of our example data. As you can see, our data contains three numerical predictor variables called `x1`, `x2`, and `x3`, as well as a target variable `y` and a group indicator.
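
Besides `head`, base R offers a few more functions to inspect the structure of a data frame. The following sketch applies them to a small hypothetical data frame called `toy` (my own example, not the data set of this article), but the same calls should work on our example data as well.

```r
toy <- data.frame(x = c(1.5, 2.5, 3.5),      # Small hypothetical data frame
                  group = c("A", "B", "A"))

dim(toy)      # Number of rows and columns
names(toy)    # Column names
str(toy)      # Compact overview of the column types
```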

### Clean up the RStudio workspace!

Finally, we can clear our RStudio workspace of everything but our example data frame by using the `rm`, `setdiff`, and `ls` functions:

```r
rm(list = setdiff(ls(), "data")) # Clear unnecessary objects from workspace
```

After running the previous R code, only our example data frame is kept in the workspace.
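
If you want to see how the three functions work together without touching your real workspace, you can try them in a separate environment. This is only a sketch: the environment `demo_env` and the objects stored in it are made up for this illustration.

```r
demo_env <- new.env()                                   # Separate environment for the demo
assign("data", data.frame(a = 1:3), envir = demo_env)   # Object we want to keep
assign("N", 1000, envir = demo_env)                     # Objects we want to remove
assign("x1", c(0.1, 0.2), envir = demo_env)

ls(demo_env)                                            # All object names in demo_env
to_remove <- setdiff(ls(demo_env), "data")              # Every name except "data"
rm(list = to_remove, envir = demo_env)                  # Remove those objects

ls(demo_env)                                            # Only "data" is left
```

`ls` lists the object names, `setdiff` drops the name we want to keep from that list, and `rm` removes everything that remains on the list.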

### Ready for data analysis with R!

As you have seen in the previous section, the R programming language provides many useful functions to create and manipulate data objects. Of course, the functions shown in this section can only give a brief overview. However, at this point you should have a basic idea of how to deal with data sets in R.

We are ready to analyze our data!

## Descriptive Statistics in R

### The summary function

The following R syntax demonstrates how to calculate some basic summary statistics using the R programming language.

The easiest way to do this is by applying the `summary` function to a data set:

```r
summary(data) # Summary statistics of data frame
```

The previous output shows some of the most important metrics when analyzing a data set, i.e. the minimum, the 1st quartile, the median, the mean, the 3rd quartile, and the maximum of each numeric column in our data set. This should already give a good idea of what the data look like.
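
To make clear where these numbers come from, the following sketch computes the same six statistics one by one for a single numeric vector. The vector `values` is a hypothetical example, and `manual` is just a name I chose for the result.

```r
values <- c(2, 4, 4, 5, 7, 9, 11)    # Hypothetical numeric vector

summary(values)                      # Six summary statistics at once

manual <- c(min(values),             # The same statistics computed one by one
            quantile(values, 0.25),  # 1st quartile
            median(values),
            mean(values),
            quantile(values, 0.75),  # 3rd quartile
            max(values))
manual
```

Both calls should show the same values, so `summary` is mainly a convenient shortcut.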

### The cor function

Next, we can use the `cor` function to return the correlation matrix of our data, i.e. a table showing the correlation coefficients between all numeric variables of a data set. Note that we have to subset the first four columns of our data, since our group indicator (i.e. the fifth column) is not numeric.

```r
cor(data[ , 1:4]) # Correlation matrix
```

As you can see based on the previous output, the variables contained in our data set are heavily correlated.
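
In case you wonder what a single entry of such a matrix means: by default, `cor` returns the Pearson correlation, i.e. the covariance of two variables divided by the product of their standard deviations. The following sketch checks this for two hypothetical vectors `a` and `b`.

```r
a <- c(1, 2, 3, 4, 5)                      # Hypothetical variables
b <- c(2, 1, 4, 3, 6)

r_builtin <- cor(a, b)                     # Pearson correlation via cor()
r_manual <- cov(a, b) / (sd(a) * sd(b))    # Same value computed from the definition

r_builtin
r_manual
cor(cbind(a, b))                           # Correlation matrix with ones on the diagonal
```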

### The aggregate function

We can also analyze our data by group using the `aggregate` function. The following R code computes the mean by group for the target variable `y`:

```r
aggregate(y ~ group, data, mean) # Descriptive statistics by group
```

The mean of the variable `y` in group A is `4.531951` and the mean in group B is `5.419287`.
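
As a side note, there is usually more than one way to do things in R. The following sketch compares `aggregate` with the base R function `tapply` on a small hypothetical data frame called `toy_groups` (not our example data).

```r
toy_groups <- data.frame(y = c(1, 3, 5, 7),               # Hypothetical toy data
                         group = c("A", "A", "B", "B"))

agg_means <- aggregate(y ~ group, toy_groups, mean)       # Data frame with one row per group
tap_means <- tapply(toy_groups$y, toy_groups$group, mean) # Named vector with one entry per group

agg_means
tap_means
```

Both approaches return the same group means; they differ only in the structure of the output.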

As you have seen so far, we can use the R programming language to create basic descriptive statistics of a data set. However, we can also use R to perform more complex analyses.

### The lm function

The following R code shows how to use the `lm` function and the `summary` function (note that we have already used the `summary` function before) to estimate a linear regression model:

```r
summary(lm(y ~ ., data)) # Estimate linear regression model
```

The previous output shows that all predictor variables are significantly related to our outcome variable `y`. Furthermore, you can see additional statistical metrics such as the residual standard error, the degrees of freedom, multiple and adjusted R-squared, the F-statistic, as well as the p-value of our model.
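
Two remarks on the syntax: in the formula `y ~ .`, the dot stands for all other columns of the data frame, and the fitted model can also be stored in an object to extract single results from it. The following minimal sketch uses hypothetical noise-free data (`toy_x`, `toy_y`), so the estimated coefficients should simply reproduce the intercept and slope that were used to create the data.

```r
toy_x <- c(1, 2, 3, 4, 5)          # Hypothetical predictor
toy_y <- 2 + 3 * toy_x             # Outcome with intercept 2 and slope 3, no noise

toy_model <- lm(toy_y ~ toy_x)     # Fit and store the linear model
coef(toy_model)                    # Extract the estimated intercept and slope
```

With real data such as our example data frame, the estimates will of course not match the true values exactly.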

I don’t want to go into too much detail here, since this is supposed to be an introductory guide. However, in case you want to read more about statistical methods using the R programming language, you may have a look here.

In the next section, I’ll show you another strength of the R programming language – data visualization!

## Data Visualization in R

The following section gives an introduction to drawing graphics using the R programming language.

The basic installation of R is already quite powerful when it comes to data visualization. However, the most popular and important framework for graphics in R is provided by the ggplot2 package.

If we want to create graphics with the `ggplot2` package, it is preferable to have our data frame in long format, i.e. we first have to do another data manipulation step.

*Note:* Our previously created wide data set consists of one row for each observation. In contrast, a long data set contains an ID variable (or variables) that groups the values corresponding to an observation.
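
To make the difference more tangible, here is a small sketch that builds the same hypothetical data once in wide and once in long format by hand, using base R only. The objects `wide` and `long` and their values are made up for this illustration; I named the columns `name` and `value` because these are the names that also appear in the long data used later in this article.

```r
wide <- data.frame(id = 1:2,                             # Wide data: one row per observation
                   x1 = c(0.5, 1.5),
                   y  = c(10, 20))

long <- data.frame(id    = rep(wide$id, times = 2),      # Long data: one row per value
                   name  = rep(c("x1", "y"), each = 2),  # Variable the value belongs to
                   value = c(wide$x1, wide$y))           # The value itself

wide
long
```

The two rows of the wide data become four rows in the long data, one for each combination of observation and variable. Note that a conversion function may order these rows differently than this hand-made example.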

To convert our wide data frame to long format, we can use the functions of the tidyverse, a powerful collection of add-on packages for the R programming language.

### Install tidyverse and ggplot2!

In order to use the functions of the `tidyverse`, we first have to install and load it in R:

```r
install.packages("tidyverse") # Install & load tidyverse package
library("tidyverse")
```

In the next step, we can apply the `pivot_longer` function (provided by the `tidyr` package, which is part of the `tidyverse`) to transform our data to long format:

```r
data_long <- as.data.frame( # Convert data frame to long format
  pivot_longer(data = data,
               cols = c("x1", "x2", "x3", "y")))
head(data_long) # Print head of long data frame
```

The previous output illustrates the structure of our new data set.

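Now, we can continue with the visualization of our data. For this, we have to install and load the `ggplot2` package (it is also part of the `tidyverse`, so this step is only needed if you have not loaded the `tidyverse` already):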

```r
install.packages("ggplot2") # Install & load ggplot2 package
library("ggplot2")
```

### Visualize with ggplot2!

The `ggplot2` package provides a different function for each type of plot. For instance, the `geom_density` function is used to create density plots, the `geom_boxplot` function is used to create boxplots, and the `geom_point` function is used to create scatterplots. However, the basic structure of the `ggplot2` code is always the same.
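
As a rough sketch of this basic structure: a plot starts with a call to `ggplot` that specifies the data and the aesthetic mapping (`aes`), and a geom layer is then added with `+`. The example below assumes that `ggplot2` is installed and uses a small hypothetical data frame `toy_plot_data`; swapping the geom function changes the plot type while the rest of the code stays the same.

```r
library("ggplot2")                                      # Assumes ggplot2 is installed

toy_plot_data <- data.frame(a = c(1, 2, 3, 4),          # Hypothetical toy data
                            b = c(2, 4, 3, 5))

base_plot <- ggplot(toy_plot_data, aes(x = a, y = b))   # Data and aesthetic mapping

plot_points <- base_plot + geom_point()                 # Add a scatterplot layer
plot_line <- base_plot + geom_line()                    # Use a line layer instead

plot_points                                             # Draw the scatterplot
```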

The following R code creates a graphic containing a transparent density for each of our numerical variables:

```r
ggp_density <- ggplot(data_long, # Create density plot
                      aes(x = value, fill = name)) +
  geom_density(alpha = 0.5)
ggp_density
```

The following syntax draws a grouped boxplot of our four numerical variables:

```r
ggp_boxplot <- ggplot(data_long, # Create boxplot
                      aes(x = name, y = value, color = group)) +
  geom_boxplot()
ggp_boxplot
```

And the following code visualizes the predictor variable `x1` and the target variable `y` in a scatterplot, including regression lines by group:

```r
ggp_scatterplot <- ggplot(data, # Create scatterplot
                          aes(x = x1, y = y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
ggp_scatterplot
```

Again, the previous visualizations are only a tiny overview of the possible graphs that you can create by using the R programming language. You can believe me – R is just awesome when it comes to data visualization!

## Summary

In this article, I have demonstrated some basic features of the R programming language. As you have seen, the R programming language provides easy-to-apply functions for the handling of data sets, for the analysis of these data sets, and for the graphical visualization of the data sets.

However, there is much more to explore! In case you are interested in more R programming content, you may check out my website and my YouTube channel. On these platforms, I provide R programming tutorials on a large range of topics.

Last but not least, I want to thank Tomi Mester for giving me this opportunity to share my joy for R on his website. I hope I managed to convince you that R is a very useful programming language that is definitely worth adding to a data scientist’s and statistician’s skill repertoire!

Cheers,

Joachim