  • how to simulate simple data sets
  • set up a template for plots
  • create a line plot
  • add a jitter plot to the base plot
  • increase the dataset dimension to create a scatter plot
#  loading required libraries for this notebook

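#loading libraries
library(ggplot2)     # plotting
library(data.table)  # data.table() and melt()
library(patchwork)   # combine several plots with +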


Which plot to choose?

The answer depends on your data. Different plots are useful depending on the kind of relation you would like to highlight. In my workflow, the first things I generally need to check are:

  1. whether one (or more) variables change over time
  2. whether the data follow a known distribution
  3. whether there is a linear correlation between the variables

For this purpose we can use

  1. time series plots
  2. distribution plots
  3. correlation plots

First of all we generate some data

#	creating a very simple dataframe

# Parameters

x_min   <- 0
x_max   <- 10   
x_step  <- 0.01

y_mean  <- 0.5
y_sd    <- 0.25
y_min   <- -1
y_max   <- 1     

x       <- seq(x_min,x_max,x_step)

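# Variables (one value per element of x)
var_random  <- runif(length(x),y_min,y_max)   # uniform between y_min and y_max
var_norm    <- rnorm(length(x),y_mean,y_sd)   # normal with mean y_mean and sd y_sd
var_sin     <- sin(x)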

# Data.frame 

df  <- data.frame (x,var_random,var_norm,var_sin)
dt  <- data.table(df)
# Melt to long format: one row per (x, variable) pair
dtm <- melt(dt, id.vars="x")
head(dtm)
A data.table: 6 × 3
x      variable    value
<dbl>  <fct>       <dbl>
0.00   var_random  -0.541030637
0.01   var_random  -0.485715914
0.02   var_random   0.588714651
0.03   var_random   0.002162422
0.04   var_random  -0.721160703
0.05   var_random   0.154144957

Notes about the code. First we set the minimum, maximum and step of x, together with the parameters used to generate the y values (mean, standard deviation, minimum and maximum). We then use the functions runif, rnorm and sin to create a uniformly distributed random variable, a normally distributed random variable and a sinusoid. The function data.frame puts x and the three variables together, and we convert the result to a data.table because we want to use the melt function from the data.table library. melt reshapes the table from wide to long format: x identifies each observation, so we pass id.vars="x", and the remaining columns are stacked into the variable and value columns.
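To make the reshaping step concrete, here is a minimal sketch of melt on a toy table. The table `wide` and its columns `a` and `b` are hypothetical and only serve the illustration; the call itself mirrors the one used above.

```r
library(data.table)

# Hypothetical toy table: one id column (x) and two measured columns (a, b)
wide <- data.table(x = c(0, 1), a = c(10, 20), b = c(30, 40))

# id.vars names the column(s) kept as identifiers; every other column is
# stacked into the default 'variable' / 'value' pair
long <- melt(wide, id.vars = "x")
print(long)
```

Each measured column contributes one block of rows, which is why the first rows of dtm all belong to var_random.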

The theme code creates a personalised theme by replacing some of the settings of theme_minimal from ggplot2, using the %+replace% operator. We only changed a few text options, but here you can insert any customization you want (see the ggplot2 theme reference). Now that the theme is set, we plot the three variables. Since we only have three variables, we can create a line plot for each of them and use the patchwork library to put all the plots together.
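The theme definition itself is not shown in this section, so the following is only a sketch of what such a theme could look like. The name `theme_custom`, its `base_size` default and the specific text settings are assumptions made for illustration, not the original code.

```r
library(ggplot2)

# Hypothetical custom theme: start from theme_minimal() and replace a few
# text elements. %+replace% swaps each listed element as a whole.
theme_custom <- function(base_size = 14) {
  theme_minimal(base_size = base_size) %+replace%
    theme(
      plot.title = element_text(size = rel(1.2), face = "bold", hjust = 0),
      axis.title = element_text(size = rel(1.0)),
      axis.text  = element_text(size = rel(0.8))
    )
}

# The theme is added to a plot like any built-in theme
demo   <- data.frame(x = seq(0, 10, 0.1))
demo$y <- sin(demo$x)
p_demo <- ggplot(demo, aes(x = x, y = y)) +
  geom_line() +
  ggtitle("sin(x)") +
  theme_custom()
```

Wrapping the theme in a function keeps the base font size adjustable, in the same way as theme_light(base_size=20).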

#	Line plot

options(repr.plot.width=8.9, repr.plot.height=8.9,units="cm")

p <- ggplot(dtm[variable=="var_sin"], aes(x = x, y = value, group=variable)) +
     geom_line(aes(linetype=variable,color=variable),size=3) + theme_light(base_size=20) + theme(legend.position = "none")

p1 <- ggplot(dtm[variable=="var_norm"], aes(x = x, y = value, group=variable)) +
     geom_line(aes(linetype=variable,color=variable)) + theme_light(base_size=20) + theme(legend.position = "none")

p2 <- ggplot(dtm[variable=="var_random"], aes(x = x, y = value, group=variable)) +
     geom_line(aes(linetype=variable,color=variable)) + theme_light(base_size=20) + theme(legend.position = "none")

# put the three line plots together with patchwork
p + p1 + p2


So how do the previous lines of code work? First of all we create an object p. For ggplot every plot is just an object that we can recall later. This is very important: we can put plots in a list, we can write functions that generate plots in a few lines, and we can take advantage of how R deals with objects (I'm using the term object loosely here, not in a strict language sense).

We invoke ggplot and tell it to use the data dtm as the source for the plot. Since we do not want to plot all the variables, we select only var_sin. Then we specify x and y and, if we want one, a grouping variable: everything inside the parentheses of aes() takes care of that. If you stop here and try to get a plot, you will only get an empty canvas with the x and y axes and their labels. Now the important part: to add a line plot we use geom_line, which draws the line. Finally we apply a theme (theme_light) and hide the legend.

What do our plots tell us? We can spot the sinusoid without any problem, while the other data look noisy and random. Is there any kind of distribution in the values of our variables? Let's find out by creating histograms of the values of the variables under examination.
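Because plots are ordinary objects, they can be produced by a function and stored in a list. The sketch below illustrates the idea on a small long-format table. The helper `line_plot`, the table `long` and the variable `var_cos` are hypothetical names introduced only for this example.

```r
library(ggplot2)

# Hypothetical long-format data, shaped like a melted table (x, variable, value)
x <- seq(0, 10, 0.5)
long <- rbind(
  data.frame(x = x, variable = "var_sin", value = sin(x)),
  data.frame(x = x, variable = "var_cos", value = cos(x))
)

# Hypothetical helper: build the line plot for a single variable
line_plot <- function(data, var_name) {
  ggplot(data[data$variable == var_name, ], aes(x = x, y = value)) +
    geom_line() +
    ggtitle(var_name) +
    theme_light()
}

# One plot object per variable, kept in a named list; nothing is drawn yet
vars  <- unique(long$variable)
plots <- lapply(vars, function(v) line_plot(long, v))
names(plots) <- vars
```

Printing an element, for example plots[["var_sin"]], draws it. With patchwork loaded, wrap_plots(plots) arranges the whole list in one figure.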

# Histogram plot

p3 <- ggplot(dtm[variable=="var_sin"], aes(y = value, group=variable)) +
     geom_histogram(bins=20) + theme_light(base_size=20)    
p4 <- ggplot(dtm[variable=="var_norm"], aes(y = value, group=variable)) +
     geom_histogram(bins=20) + theme_light(base_size=20)      

p5 <- ggplot(dtm[variable=="var_random"], aes(y = value, group=variable)) +
      geom_histogram(bins=20) + theme_light(base_size=20)

p3 + p4 + p5

Notes on the code. Since dtm is a data.table, we can use the syntax dtm[variable=="var_sin"] to select only the variable we would like to plot. We add a histogram, and with the option bins=20 ggplot2 takes care of splitting the range of the values into 20 bins. What do the plots tell us? It is easy to spot at a glance that one of the variables has a normal distribution while the others do not. sin(x) looks as expected, with higher frequencies of values near -1 and 1, and the uniform random variable does not show any preferred value: its histogram is roughly flat.
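To see what splitting into 20 bins means, the counts can be computed by hand. This is a rough sketch using equal-width bins over the range of the data; ggplot2 places its bin edges slightly differently, so the counts need not match geom_histogram exactly. The seed and the variable names are assumptions for the example.

```r
set.seed(1)  # hypothetical seed, only to make the example reproducible
v <- rnorm(1001, mean = 0.5, sd = 0.25)

n_bins <- 20
# Equal-width bins between the smallest and the largest value
width <- (max(v) - min(v)) / n_bins
# Bin index of each value; pmin() keeps the maximum inside the last bin
idx    <- pmin(floor((v - min(v)) / width) + 1, n_bins)
counts <- tabulate(idx, nbins = n_bins)
counts
```

For normally distributed data the counts rise towards the middle bins and fall off on both sides, which is the shape seen in the histogram of var_norm.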

Now we will use another kind of plot to see how the data are distributed: what is called a jitter plot.

pj1 <- ggplot(dtm, aes(x=variable,y = value, group=variable))            +
       geom_jitter(position = position_jitter(0.1),alpha=0.1, size = 3)  +
       theme_light(base_size=20)
pj1

Notes on the code: in this case we used the whole data table, with variable on the x axis and value on the y axis. Since we have lots of points, we used an alpha value of 0.1 so that overlapping points build up into darker regions. About the results: the concentration of points (absent for the uniform variable, around the mean for the normal variable, at the borders for the sinusoid) gives us an immediate view of how the values are distributed. Finally, to explore the data further, let's add a second set of "measurements" for each variable to the dataset previously created and plot them.

# new variables

var_random2  <- runif(length(x),y_min,y_max)
var_norm2    <- rnorm(length(x),y_mean,y_sd)
var_sin2     <- sin(x) + rnorm(length(x),0,0.1)   # sinusoid plus small normal noise

First we plot them on top of the previous plots.

# color is set outside aes() so that the new layer is actually drawn in blue
p7 <- p  + geom_line(aes(y=var_sin2), color="blue", size=1)
p8 <- p1 + geom_line(aes(y=var_norm2), color="blue")
p9 <- p2 + geom_line(aes(y=var_random2), color="blue")

p7 + theme(legend.position = "none")
p8 + theme(legend.position = "none")
p9 + theme(legend.position = "none")

We could have changed the dataframe by adding the new columns, but the versatility of ggplot lets us add a new plot layer and specify the color we would like to use for it. Are these "second measurements" correlated with the previous ones?
We can check with a scatter plot. ggplot helps us with geom_point, but this time, for the sake of clarity, we first merge the new data into the dataframe.

df2<- data.frame(df,var_sin2,var_norm2,var_random2)
dt2 <- data.table(df2)
p10 <- ggplot(dt2) + geom_point(aes(x=var_sin,y=var_sin2),size=5,alpha=0.5)       + theme_light(base_size=20)
p11 <- ggplot(dt2) + geom_point(aes(x=var_norm,y=var_norm2),size=5,alpha=0.5)     + theme_light(base_size=20)
p12 <- ggplot(dt2) + geom_point(aes(x=var_random,y=var_random2),size=5,alpha=0.5) + theme_light(base_size=20)

p10 + p11 + p12


A few notes. We did not use melt since we only needed to select columns from our newly created dataframe. (If needed, a melted data.table can be reshaped back to wide format with dcast.) We plotted the variables in pairs because we wanted to see whether the "first measurement" was in some way correlated with the "second one".

In the first plot we see the pair of sinusoidal variables. We created them as correlated, and in fact if we plot one against the other the points lie along the bisector of the I and III quadrants: they are positively linearly correlated. Then we have the norm variables: both are created by drawing random numbers independently from the same normal distribution, so the points form a round cloud around the mean with no correlation. Finally the random variables: totally random, with no correlation between them, as expected. The aspect of the plot was changed in order to give more space to the plot.
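The visual impression from the scatter plots can be backed by a number. As a sketch, the Pearson correlation coefficient of each pair can be computed with cor. The data are regenerated here with the same parameters as above so that the block runs on its own; the seed is an assumption added for reproducibility.

```r
set.seed(42)  # hypothetical seed, not part of the original notebook
x <- seq(0, 10, 0.01)
n <- length(x)

# Same construction as in the notebook: "first" and "second" measurements
var_sin     <- sin(x)
var_sin2    <- sin(x) + rnorm(n, 0, 0.1)
var_norm    <- rnorm(n, 0.5, 0.25)
var_norm2   <- rnorm(n, 0.5, 0.25)
var_random  <- runif(n, -1, 1)
var_random2 <- runif(n, -1, 1)

# Pearson correlation of each pair
r <- c(
  sin    = cor(var_sin, var_sin2),
  norm   = cor(var_norm, var_norm2),
  random = cor(var_random, var_random2)
)
round(r, 2)
```

A coefficient close to 1 for the sinusoidal pair and values close to 0 for the other two pairs would agree with what the scatter plots show.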