****************
Data structure
****************
.. role:: bblue
.. role:: bred
.. role:: bmagenta
.. role:: bgreen
.. role:: green
.. role:: red
.. role:: blue
.. role:: tt

.. highlight:: r

.. _obj:

R is an object-oriented language: an **object** in R is anything (constants, data structures, functions, graphs) that can be assigned to a variable:

* Data Objects: used to store real or complex numerical values, logical values or characters. These objects are always vectors: there are no scalars in R.

* Language Objects: functions, expressions

.. _datatypes:

Data structure types 
=====================

* :bblue:`Vectors`: one-dimensional arrays used to store collection data of the same mode

  * Numeric Vectors (mode: *numeric*)
  * Complex Vectors (mode: *complex*)
  * Logical Vectors (model: *logical*)
  * Character Vector or text strings (mode: character)

* :bblue:`Matrices`: two-dimensional arrays to store collections of data of the same mode. They are accessed by two integer indices.

* :bblue:`Arrays`: similar to matrices but they can be multi-dimensional (more than two dimensions)

* :bred:`Factors`: vectors of categorical variables designed to group the components of another vector with the same size
* **Lists**: ordered collection of objects, where the elements can be of different types
* :bmagenta:`Data Frames`: generalization of matrices where different columns can store different mode data. 
* :bgreen:`Functions`: objects created by the user and reused to make specific operations.

.. image:: images/dataStructuresNew.png
    :scale: 50%
    :align: center

.. _numVectors:

:bblue:`Vectors`
-----------------

Numeric Vectors
****************

There are several ways to assign values to a variable:

::

  > a <- 1.7               # assign a value to a vector with only one element (~ scalar)
  > 1.7 -> a		   # assign a value to a vector with only one element (~ scalar)
  > a = 1.7                # assign a value to a vector with only one element (~ scalar)
  > assign("a", 1.7)	   # assign a value to a vector with only one element (~ scalar)
  
To show the values:

::

  > a			   # show the value in the screen (not valid in scripts)
  [1] 1.7
  > print(a)		   # show the value in the screen (valid in scripts)
  [1] 1.7

To generate a vector with several numeric values:

::

  > a <- c(10, 11, 15, 19) # assign four values to a vector using the concatenate command c()
  > a 			   # show the value in the screen 
  [1] 10 11 15 19
  
The operations are always done over all the elements of the numeric array:

::

  > a*a			   # evaluate the square value of every element in the vector
  [1] 100 121 225 361
  > 1/a			   # evaluate the inverse value of every element in the vector
  [1] 0.10000000 0.09090909 0.06666667 0.05263158
  > b <- a-1		   # subtract 1 from every element and assign the result to b
  > b
  [1]  9 10 14 18
  
To generate a *sequence*:

::

  > 2:10			        # generate a sequence from n1=2 to n2=10 using n1:n2
  [1]  2  3  4  5  6  7  8  9 10
  > 5:1				        # generate an inverse sequence if n2 < n1
  [1] 5 4 3 2 1
  
  > seq(from=n1, to=n2, by=n3) 		# generate sequence from n1 to n2 using n3 step
                                        #  (parameters names can be avoided if order is kept)
  > seq(from=1, to=10, by=3)
  [1]  1  4  7 10
  > seq(1, 10, 3)
  [1]  1  4  7 10
  
  > seq(length=10, from=1, by=3) 	# generate a fixed length sequence
  [1]  1  4  7 10 13 16 19 22 25 28
  
  > help(seq)				# for help about this command
  ...


To generate *repetitions*:

::

  > a <- 1:3; b <- rep(a, times=3); c <- rep(a, each=3)		# command rep()
  
In the previous example we have run three commands in the same line. They have been separated by a ';'. 

The content of the three variables is now:

::

  > a
  [1] 1 2 3
  > b
  [1] 1 2 3 1 2 3 1 2 3
  > c
  [1] 1 1 1 2 2 2 3 3 3
  
**The recycling rule:** vectors of different sizes can be combined, as far as
the length of the longer vector is a multiple of the shorter vector’s length
(otherwise a warning is issued, although the operation is carried out):

::

  > a+c                                         # proper dimensions
  [1] 2 3 4 3 4 5 4 5 6                         # (operation equivalent to b+c)

  > d <- c(10,100) 
  > b+d                                         # incorrect dimensions
  [1]  11 102  13 101  12 103  11 102  13
  Warning message:
  In b + d : longer object length is not a multiple of shorter object length

If we need to know which are the objects that are currently defined, we can *list* them:

::

  > ls()
  [1] "a" "b" "c" "d"
  
Undesired objects can be deleted using ``rm()`` function:

::

  > rm(a,c)				        # remove objects 'a' and 'b'
  > ls()				        # list current objects
  [1] "b" "d"
  

In order to remove everything in the working environment:

::

  > rm(list=ls())                               # Use this with caution
  > ls()                                        # (you'll receive no warning!)
  character(0)


.. _logicalVectors:
  
Logical Vectors
****************

::

  > a <- seq(1:10)				# generate a sequence
  > a
  [1]  1  2  3  4  5  6  7  8  9 10             # show values in screen
  > b <- (a>5)					# assign values from an inequality
  > b						# show values in screen
  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
  > a[b]					# show values that fulfil the condition
  [1]  6  7  8  9 10
  > a[a>5]					# the same, but avoiding intermediate variable
  [1]  6  7  8  9 10

.. _charVectors:  
  
Character Vectors
******************

::
  
  > a <- "This is an example"			# generate a character vector
  > a						# show vector content
  [1] "This is an example"
  
We can concatenate vectors after converting them into character vectors:

::

  > x <- 1.5
  > y <- -2.7
  > paste("Point is (",x,",",y,")", sep="")	# concatenate x, y and a string using 'paste' 
  [1] "Point is (1.5,-2.7)"

.. _matrices:

:bblue:`Matrices`
------------------
A matrix is a **bi-dimensional** collection of data:

::

  > a <- matrix(1:12, nrow=3, ncol=4)		# define a matrix with 3 rows and 4 columns
  > a                                                                                               
     [,1] [,2] [,3] [,4]                                                                          
  [1,]    1    4    7   10                                                                          
  [2,]    2    5    8   11                                                                          
  [3,]    3    6    9   12  
  
  > dim(a)					# return matrix dimensions (rows,columns)
  [1] 3 4    
  
 
The elements of vectors and matrices are **recycled** when it is required by the involved dimensions:

::

  > a <- matrix(1:8, nrow=4, ncol=4)		# create a matrix with 4 rows and 4 columns
  > a
       [,1] [,2] [,3] [,4]
  [1,]    1    5    1    5
  [2,]    2    6    2    6
  [3,]    3    7    3    7
  [4,]    4    8    4    8
  
.. _arrays:

:bblue:`Arrays`
---------------

They are similar to the matrices although they can have 2 o more dimensions.

::

  > z <- array(1:24, dim=c(2,3,4))
  > z
  , , 1

        [,1] [,2] [,3]
  [1,]    1    3    5
  [2,]    2    4    6

  , , 2

        [,1] [,2] [,3]
  [1,]    7    9   11
  [2,]    8   10   12

  , , 3

        [,1] [,2] [,3]
  [1,]   13   15   17
  [2,]   14   16   18

  , , 4

        [,1] [,2] [,3]
  [1,]   19   21   23
  [2,]   20   22   24
  
  
.. _factors:

:bred:`Factors`
---------------

Factors are vectors that contain categorical information useful to group the values of other vectors of the same size.
Let's see an example:

::

  > bv <- c(0.92,0.97,0.87,0.91,0.92,1.04,0.91,0.94,0.96,
  +         0.90,0.96,0.86,0.85)   			# (B-V) colours from 13 galaxies
  
If additional information is available (for instance, the morphological type of the galaxies) we 
can create a **factor** containing the galaxy types:

::

  > morfo <- c("Sab","E","Sab","S0","E","E","S0","S0","E",
  +            "Sab","E","Sab","S0") 			# morphological info (same size)
  > length(morfo)					# ensure vector is the same size
  [1] 13
  > fmorfo <- factor(morfo)				# create factor with 'factor()'
  > fmorfo
  [1] Sab E   Sab S0  E   E   S0  S0  E   Sab E   Sab S0 	# show factor content
  Levels: E S0 Sab					# factor different values (levels)
  > levels(fmorfo)					# show factor levels
  [1] "E"   "S0"  "Sab"						

We could use this additional information to perform an statistical analysis segregating the data according 
to these types. This will be covered lately in the  :ref:`functions` section.

  
.. Now, we can calculate the mean values for each morphological type of the galaxies in the sample. 
.. For this purpose, we use the special function ``tapply()`` (more on this function in the :ref:`functions` section) 
.. which, according to *R Documentation*, "*Apply a function to each non-empty group 
.. of values given by a unique combination of the levels of certain factors*". The ``tapply()`` function requires 
.. the vector from which we want to calculate the colors in ``bv``, the associated factor ``fmorfo`` and the 
.. function that we want to evaluate (the mean, ``mean()``):

.. ::

..  > meanbv <- tapply(bv, fmorfo, mean)
..  > meanbv
..      E     S0    Sab 
..  0.9700 0.9025 0.8875 
  
.. Similarly it is possible to evaluate any other function (intrinsic from R or user-defined) segregating the 
.. data using the factor information. For example, the standard deviation can be calculated:

.. ::

..  > stbv <- tapply(bv,fmorfo,sd)
..  > stbv
.. 	  E         S0        Sab 
..  0.04358899 0.03774917 0.02753785 

  
.. _lists:

Lists
------
Lists are ordered collections of objects, where the elements can be of a different type (a list can 
be a combination of matrices, vectors, other lists, etc.) They are created using the ``list()`` function:

::

  > gal <- list(name="NGC3379", morf="E", T.RC3=-5, colours=c(0.53,0.96))
  > gal
  $name                                                                                
  [1] "NGC3379"                                                                        
                                                                                     
  $morf                                                                                
  [1] "E"                                                                              
                                                                                     
  $T.RC3                                                                               
  [1] -5                                                                               
                                                                                     
  $colours                                                                             
  [1] 0.53 0.96                                                                        

  > gal$<Tab>			# pressing Tab key after '$', the elements of 'gal' are shown
  gal$name     gal$morf     gal$T.RC3    gal$colours
  
  > length(gal)			# check how many elements 'gal' has
  [1] 4
  
  > names(gal)          	# return element names
  [1] "name"    "morf"    "T.RC3"   "colours"

New elements can be added in a simple way, just defining them:

::

  > gal$radio <- TRUE				# add a boolean element
  > gal$redshift <- 0.002922			# add a numeric element
  
  > names(gal)					# return element names
  [1] "name"     "morf"     "T.RC3"    "colours"  "radio"    "redshift"

Lists can be concatenated to generate bigger lists. If we have ``list1``, ``list2``, ``list3``, we can create
a unique list which is the result of the union of these three lists:

::

  > list123 <- c(list1, list2, list3)
  
As the elements in a list can be R objects of a different type:

* :green:`Lists are extremely versatile since they can store every type of information (good)`
* :red:`Lists can be converted in objects with a rather complex structure (bad).` A list can contain several elements which are vectors of different length, which is similar to having a table where the columns have a different number of rows.
  
:blue:`The ideal situation is to take advantage of the list versatility but preventing them from growing with a very
complex structure. This is why R has defined a new type of data which fulfils both requirements: a` :bblue:`Data Frame.`
  
.. _dataframes:

:bmagenta:`Data Frames (Tables)`
--------------------------------

A *Data Frame* is an special type of list very useful for the statistical work. There are some restrictions to guarantee 
that they can be used for this statistical purpose.

Among other restrictions, a *Data Frame* must verify that:

* List components must be vectors (numeric, character or logical vectors), factors, numeric matrices or other data frames.
* Vectors, which are the variables in the data frame, must be of the same length.

.. warning:: 
  In a data frame, character vectors are automatically converted into factors, and the number of levels can be 
  determined as the number of different values in such a vector. This default behaviour can be modified with the 
  ``options(stringsAsFactors = FALSE)`` command.

Basically, in a *Data Frame* all the information is displayed as a **table** where the columns have the 
same number of rows and can contain different type objects (numbers, characters, ...).
  
*Data Frames* can be created using the ``data.frame()`` function. Let's see how to define a *data frame* with 
two elements, a numeric vector and a character vector (note that both must be same length vectors):

::
 
  > options(stringsAsFactors = FALSE)
  > df <- data.frame(numbers=c(10,20,30,40),text=c("a","b","c","a"))
  > df
    numbers text
  1      10    a
  2      20    b
  3      30    c
  4      40    a
  > df$text					# character vector not converted to a factor
  [1] "a" "b" "c" "a"


  > options(stringsAsFactors = TRUE)		# default
  > df <- data.frame(numbers=c(10,20,30,40),text=c("a","b","c","a"))
  > df$text						
  [1] a b c a					# character vector of length = 4
  Levels: a b c					#  converted to a three levels factor!!  
  > df$numbers
  [1] 10 20 30 40				# numeric vector of length = 4
  
  > mode(df)					# storage mode of the object
  [1] "list"
  > typeof(df)					# (internal) storage mode of the object
  [1] "list"
  > class(df)					# object class
  [1] "data.frame"

However the most common way of defining a *data frame* is reading the data stored in a file. We will
see later how to do it using ``read.table()`` function.
  
  
.. _factorsTables:

Factors and Tables
*******************
It is frequently useful (for instance, for table creation) to be able to generate factors from a numeric 
continuum variable. To do so, we can use the ``cut`` command. Its parameter ``breaks`` defines how the data 
are divided. **If** ``breaks`` **is a number**, this is used as the number of (same length) intervals:

::

  > bv <- c(0.92,0.97,0.87,0.91,0.92,1.04,0.91,0.94,0.96,
  +         0.90,0.96,0.86,0.85)  		# (B-V) colors from 13 galaxies
  > fbv <- cut(bv,breaks=3)			# divide 'bv' in 3 equal-length intervals
  > fbv 					# show in which interval every galaxy is
  [1] (0.913,0.977] (0.913,0.977] (0.85,0.913]  (0.85,0.913]  (0.913,0.977]
  [6] (0.977,1.04]  (0.85,0.913]  (0.913,0.977] (0.913,0.977] (0.85,0.913] 
  [11] (0.913,0.977] (0.85,0.913]  (0.85,0.913] 
  Levels: (0.85,0.913] (0.913,0.977] (0.977,1.04]       # the 3 intervals
  > table(fbv)					# generate a table with the 3 intervals
  fbv
    (0.85,0.913] (0.913,0.977]  (0.977,1.04] 
            6             6             1 

**If** ``breaks`` **is a vector**, its values are used as the limits of the intervals:

::

  > ffbv <- cut(bv,breaks=c(0.80,0.90,1.00,1.10))
  > table(ffbv)
  ffbv
    (0.8,0.9]   (0.9,1]   (1,1.1] 
        4         8         1 

If we want just an approximate number of intervals, but with equally spaced *round* values, we can use the 
``pretty()`` function (that not always returns the specified number of intervals!):

::

  > fffbv <- cut(bv,pretty(bv,3))		# ask for 3 'pretty' intervals
  > table(fffbv)				# return 4 intervals
  fffbv
    (0.85,0.9] (0.9,0.95]   (0.95,1]   (1,1.05] 
         3          5          3          1 

We can also use a quantile division:

::

  > ffffbv <- cut(bv,quantile(bv,(0:4)/4))	# ask for the 4 quantiles
  > table(ffffbv)
  ffffbv
    (0.85,0.9]  (0.9,0.92] (0.92,0.96] (0.96,1.04] 
          3           4           3           2 

.. Warning:: The last two groupings exclude the value 0.85 which is one of our data values.


Factors can be used to build multi-dimensional tables. Let's see how.
First of all, we will define the data (that in a real case would be read from a data file):

::

  > heights <- c(1.64,1.76,1.79,1.65,1.68,1.65,1.86,1.82,1.73,
  +              1.75,1.59,1.87,1.73,1.57,1.63,1.71,1.68,1.73,1.53,1.82)
  > weights <- c(64,77,82,62,71,72,85,68,72,75,81,88,72,
  +              71,74,69,81,67,65,73)
  > ages <- c(12,34,23,53,23,12,53,38,83,28,28,58,38,
  +           63,72,44,33,27,32,38)
  
For each one of these variables we can generate factors:

::

  > fheights <- cut(heights,c(1.50,1.60,1.70,1.80,1.90))	# factor for 'heights'
  > fweights <- cut(weights,c(60,70,80,90))			# factor for 'weights'
  > fages    <- cut(ages,seq(10,90,10))				# factor for 'ages'
  
Table generation is now straightforward using these factors. We can, for instance, generate  bi-dimensional tables:

::

  > ta <- table(fheights, fweights)			# table for 'heights' vs. 'weights'
  > ta
             fweights
  fheights    (60,70] (70,80] (80,90]
    (1.5,1.6]       1       1       1
    (1.6,1.7]       2       3       1
    (1.7,1.8]       2       4       1
    (1.8,1.9]       1       1       2

Marginal frequencies can also be included:

::

  > addmargins(ta)
             fweights
  fheights    (60,70] (70,80] (80,90] Sum
    (1.5,1.6]       1       1       1   3
    (1.6,1.7]       2       3       1   6
    (1.7,1.8]       2       4       1   7
    (1.8,1.9]       1       1       2   4
    Sum             6       9       5  20

Or we can work with the relative frequencies;

::

  > tta <- prop.table(ta)
  > addmargins(tta)
           fweights
  fheights    (60,70] (70,80] (80,90]  Sum
    (1.5,1.6]    0.05    0.05    0.05 0.15
    (1.6,1.7]    0.10    0.15    0.05 0.30
    (1.7,1.8]    0.10    0.20    0.05 0.35
    (1.8,1.9]    0.05    0.05    0.10 0.20
    Sum          0.30    0.45    0.25 1.00

We can also generate tridimensional tables. Following the previous example, we can examine the same 
bi-dimensional table for each age interval:

::

  > table(fheights, fweights, fages)
  , , fages = (10,20]					# first age interval

           fweights
  fheights    (60,70] (70,80] (80,90]
    (1.5,1.6]       0       0       0
    (1.6,1.7]       1       1       0
    (1.7,1.8]       0       0       0
    (1.8,1.9]       0       0       0

  , , fages = (20,30]					# second age interval

           fweights
  fheights    (60,70] (70,80] (80,90]
    (1.5,1.6]       0       0       1
    (1.6,1.7]       0       1       0
    (1.7,1.8]       1       1       1
    (1.8,1.9]       0       0       0
    
  ........
  
  , , fages = (70,80]					# next-to-the-last age interval

           fweights
  fheights    (60,70] (70,80] (80,90]
    (1.5,1.6]       0       0       0
    (1.6,1.7]       0       1       0
    (1.7,1.8]       0       0       0
    (1.8,1.9]       0       0       0

  , , fages = (80,90]					# last age interval
  
           fweights
  fheights    (60,70] (70,80] (80,90]
    (1.5,1.6]       0       0       0
    (1.6,1.7]       0       0       0
    (1.7,1.8]       0       1       0
    (1.8,1.9]       0       0       0

  > sum(table(fheights, fweights, fages))		# check total number of entries
  [1] 20                
  

.. _matricesTables: 

Matrices and Tables
*******************
We can easily generate 2D tables from matrices:

::

  > mtab <- matrix(c(30,12,47,58,25,32), ncol=2, byrow=TRUE)  	# create a matrix filled by rows
  > colnames(mtab) <- c("ellipticals","spirals")		# set matrix column names 
  > rownames(mtab) <- c("sample1","sample2","new sample")	# set matrix row names
  > mtab
             ellipticals spirals
  sample1             30      12
  sample2             47      58
  new sample          25      32

However, ``mtab`` is not a true R table. To transform it into a true table we can use:

::
  
  > rtab <- as.table(mtab)
  
  > mode(mtab);mode(rtab)				# indistinguishable in 'mode'
  [1] "numeric"
  [1] "numeric"
  
  > typeof(mtab);typeof(rtab)				# indistinguishable in 'typeof'
  [1] "double"
  [1] "double"
  
  > class(mtab);class(rtab)				# but different in 'class' !
  [1] "matrix"		
  [1] "table"

In addition to the functions to calculate *marginal distributions* (``margin.table``), 
*frequencies* (``prop.table``), etc., the command ``summary`` returns the 
:math:`\chi^2` test for the independence of the factors:

::

  > summary(rtab)
  Number of cases in table: 204 
  Number of factors: 2 
  Test for independence of all factors:
	  Chisq = 9.726, df = 2, p-value = 0.007726
  
The same command returns a different result when it is applied to a matrix type object:

::

  > summary(mtab)
          V1             V2    
    Min.   :25.0   Min.   :12  
    1st Qu.:27.5   1st Qu.:22  
    Median :30.0   Median :32  
    Mean   :34.0   Mean   :34  
    3rd Qu.:38.5   3rd Qu.:45  
    Max.   :47.0   Max.   :58
  

.. _functions:

:bgreen:`Functions`
-------------------

These are objects that can be created by the user and then re-used to make specific operations.

For example, we can define a **function** to calculate the standard deviation:

::

  > stddev <- function(x) {		        # user-defined function 'stddev'
  +   res = sqrt(sum((x-mean(x))^2) / (length(x)-1))
  +   return(res)
  + }

Functions can be defined inside other functions (nested) and can also be passed as arguments to other functions.
The value returned by a function is the result of the last expression evaluated in the body of the function or the 
value grabbed by the ``return`` command.

R functions arguments can have *default values* or can be *missing*. Arguments can be matched by name or position:

::

  > mynumbers <- c(1, 2, 3, 4, 5)
  > stddev(mynumbers)                           # equivalent calls to 'stddev'
  [1] 1.581139
  > stddev(x = mynumbers)
  [1] 1.581139
  
  > sd(x=mynumbers)				# R function using 'missing argument' with
  [1] 1.581139					#       default value (FALSE)
  > sd(x=mynumbers, na.rm=TRUE)			# Specify all arguments by name
  [1] 1.581139
  > sd(mynumbers, na.rm=TRUE)			# Mixing positional and by name matching
  [1] 1.581139  
  > sd(na.rm=TRUE, x=mynumbers)			# legal but not recommended (keep order)
  [1] 1.581139  

Looping Functions
******************
There are special R functions that can be used to repeat instructions in the command line and facilitate 
the programming process:
  
  * **lapply**: evaluate a function for each element of a list
  * **sapply**: evaluate a function for each element of a list *simplifying* the result
  * **apply**: Apply a function over the margins of an array (usually to apply a function to the rows/columns in a matrix)
  * **tapply**: Apply a function over subsets of a vector (for example defined with a factor)
  * **mapply**: Multivariate version of lapply
  
  
Let's see how to apply these functions to the previous example with the galaxy colours:

::

  > bv.vec <- c(0.92,0.97,0.87, 0.91,0.92,1.04,0.91,0.94,0.96,
  +             0.90,0.96,0.86,0.85)   				# (B-V) colours from 13 galaxies
  > morfo <- c("Sab","E","Sab","S0","E",  "E","S0","S0","E",	# ordered morph. information
  +            "Sab","E","Sab","S0")				#    for the galaxies
  
**lapply**  

::

  > bv.list <- list(colsSab=c(0.92,0.87,0.90,0.86), 
  +                 colsE=c(0.97,0.92,1.04,0.96,0.96), 
  +                 colsSO=c(0.91,0.91,0.94,0.85))
  
  > lapply(bv.list, mean)			# calculate mean for each galaxy type
  $colsSab                                      #    (returns a list)
  [1] 0.8875

  $colsE
  [1] 0.97

  $colsSO
  [1] 0.9025

**sapply**

::

  > sapply(bv.list, mean)			# simplified version of 'lapply'
  colsSab   colsE  colsSO                       #    (returns a vector)
  0.8875  0.9700  0.9025 

  
**tapply**  

::
  
  > fmorfo <- factor(morfo)			# create factor
  > tapply(bv,fmorfo,mean)			# apply mean function to the galaxy colours
       E     S0    Sab 				#    segregating by morphological type
  0.9700 0.9025 0.8875 

**apply**

::

  > a <- matrix(1:12, nrow=3, ncol=4)		# define a matrix with 3 rows and 4 columns
  > a                                                                                               
       [,1] [,2] [,3] [,4]
  [1,]    1    4    7   10
  [2,]    2    5    8   11
  [3,]    3    6    9   12
  
  > apply(a,1,mean)				# calculate rows ("1") mean == rowMeans
  [1] 5.5 6.5 7.5
  > rowMeans(a)
  [1] 5.5 6.5 7.5
  
  > apply(a,1,sum)				# calculate rows ("1") sum == rowSums
  [1] 22 26 30
  > rowSums(a)
  [1] 22 26 30

  > apply(a,2,mean)				# calculate columns ("2") mean == colMeans
  [1]  2  5  8 11
  > apply(a,2,sum)				# calculate columns ("2") sum == colSums
  [1]  6 15 24 33
  

.. _specialValues:

Special Values
==============
It is useful to define some values as * Not Available* (*NA*):

::

  > a <- c(0:2, NA, NA, 5:7)			# define vector with NA values
  > a						# show values in screen
  [1]  0  1  2 NA NA  5  6  7

We can carry out mathematical operations:

::

  > a*a						# calculate the square of 'a'
  [1]  0  1  4 NA NA 25 36 49
  
We can check whether there is any undefined value:

::

  > unavail <- is.na(a)				# use of is.na() function
  > unavail
  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
  
  
Sometimes calculations end up in values with no mathematical sense:

::

  > a <- log(-1)
  > a 
  [1] NaN					# Result is Not-a-Number (NaN)
  > a <- 1/0; b <- 0/0; c <- log(0); d <- c(a,b,c)
  > d
  [1]  Inf  NaN -Inf				# Infinities and Not-a-Number
  > 1/Inf 					# Possible to operate with Infinite
  [1] 0                                         #   (if it makes sense!)

To check whether we have *Infinite* values or *Not-a-Number* values:

::
  
  > is.infinite(d)				# is there any Infinite value?
  [1]  TRUE FALSE  TRUE
  > is.nan(d)					# is there any Not-a-Number value?
  [1] FALSE  TRUE FALSE

Main R functions (``mean``, ``var``, ``sum``, ``min``, ``max``,...)  accept an argument called ``na.rm`` 
that can be set as ``TRUE`` or ``FALSE`` to remove (or not) the unavailable data.

::

  > a <- c(0:2, NA, NA, 5:7)			# define vector 'a' with Not-Available data
  > a 
  [1]  0  1  2 NA NA  5  6  7
  > mean(a)					# since there are Not-Available data
  [1] NA
  
  > mean(a, na.rm=TRUE)				# calculate mean, ignoring Not-Available values
  [1] 3.5


.. subsetting:

Subsetting
===========

Several R operators can be used to extract subsets (slices) from R objects:

* **[** can be used to extract **one or more elements** of an R object. It always returns an object of the same class
* **[[** can be used to extract a **single** element from a data frame or a list. The class of the extracted element can be different from the original object.
* **$** can be used to extract **named** elements from a data frame or a list.


For *Numeric Vectors*:

::

  > a <- 1:15				# generate a sequence
  > a <- a*a				# calculate the square of 'a'
  > a					# show in screen
  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225
  > a[3]				# access to the third value in the vector
  [1] 9					#  (numeric index)	
  > a[3:5]				# access to a continuum slice of values
  [1]  9 16 25				#  (numeric index)
  > a[c(1,3,10)]			# access to a given sequence of values
  [1]   1   9 100			#  (numeric index)
  > a[-1]                               # negative index remove values from vector
  [1]   4   9  16  25  36  49  64  81 100 121 144 169 196 225
  > a[c(-1,-3,-5,-7)]                   # remove several values (it is not possible
  [1]   4  16  36  64  81 100 121 144 169 196 225   to mix positive and negative indices!)
  > a[a>100]				# access to a sequence based on a condition
  [1] 121 144 169 196 225		#  (logical index)


For *Character Vectors*:
 
::

  > a <- c("A", "B", "C", "C", "D", "E")
  > a[1]				# first element of "a" (also a character vector)
  [1] "A"				#  (numeric index)
  > a[1:4]				# sequence of the first 4 elements
  [1] "A" "B" "C" "C"
  > a[a>"C"]				# select elements "greater" than letter "C"
 [1] "D" "E"				#  (logical index)
  > gtC <- a > "C"			# the same but using an intermediate logical vector
  > gtC
  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE
  > a[gtC]
  [1] "D" "E"
  
For *Matrices*, elements are accessed through two integer indices:

.. note:: The agreement to establish the indices order ``a[i,j]`` is the same than the one used in Math for the matrix coefficients a :sub:`ij`

::

  > a <- matrix(1:12, nrow=3, ncol=4)	# define a matrix with 3 rows and 4 columns
  > a                                  		                                        
       [,1] [,2] [,3] [,4]                                     
  [1,]    1    4    7   10
  [2,]    2    5    8   11
  [3,]    3    6    9   12
   
  
  > a[2,3]				# return the value in the 2nd row and 3th column
  [1] 8
  > a[[2,3]]				# return the value in the 2nd row and 3th column
  [1] 8
  > a[2,]				# return the values for the second row
  [1]  2  5  8 11
  > a[,3]				# return the values for the third column
  [1] 7 8 9 

.. note:: By default, subsetting a single element or a single row or a single column returns a vector, not a matrix (this can be changed using ``drop=FALSE``)

::
    
  > a[2,3, drop=FALSE]			# so as not to 'drop' the dimension
        [,1]				#     (returns a 1x1 matrix)
  [1,]    8
  > a[2, , drop=FALSE]			# return a 1x4 matrix
        [,1] [,2] [,3] [,4]
  [1,]    2    5    8   11
  
  
The access to the matrix elements can be done with the indices stored in other auxiliary matrices:

::

  > ind <- matrix(c(1:3,3:1), nrow=3, ncol=2)	# auxiliary matrix for the indices i,j
  > ind
       [,1] [,2]                                                                                    
  [1,]    1    3
  [2,]    2    2
  [3,]    3    1
  
  > a[ind] <- 0				# set to 0 the matrix values in the indices
  > a					#  specified in 'ind' (1,3), (2,2), (3,1)
       [,1] [,2] [,3] [,4]
  [1,]    1    4    0   10
  [2,]    2    0    8   11
  [3,]    0    6    9   12
  
  
For *lists*:  
  

The list components can be accessed using the three operators mentioned above (*[*, *[[* and *$*):

::

  > gal <- list(name="NGC3379", morf="E", colours=c(0.53,0.96))
  
  > gal[3]				# access to the third element of the list
  $colours                              #  (get back a list with one element called 'colours' 
  [1] 0.53 0.96                         #   with the sequence '0.53,0.96')
  > gal["colours"]			# single bracket + name (same as above)
  $colours				
  [1] 0.53 0.96

  > gal[[3]]				# access to the third element of the list
  [1] 0.53 0.96				#  (get back just the sequence)
  > gal[["colours"]]			# double bracket + name (same as above)
  [1] 0.53 0.96				
  
  
  > gal$colours				# element associated with the name 'colours'
  [1] 0.53 0.96				#  (same as double bracket)
  > gal$colours[1]			# first element of the sequence in the third element
  [1] 0.53
  > gal$colours[2]			# second element of the sequence in the third element
  [1] 0.96

To extract **multiple elements** of a list, single bracket is mandatory:

::

  > gal <- list(name="NGC3379", morf="E", colours=c(0.53,0.96))
  
  > gal[c(1,2)]				# return a list with the elements 'name' and 'morf'
  $name
  [1] "NGC3379"

  $morf
  [1] "E"

For **computed** indices the *[[* and *[* operators can be used. The *$* operator can only be used with *literal* names:

::

  > gal <- list(name="NGC3379", morf="E", colours=c(0.53,0.96))
  
  > info <- "morf"			# variable containing the name of one of the list elements
  
  > gal[["morf"]
  [1] "E"  
  > gal[[info]]				# computed index for 'morf' with double bracket
  [1] "E"
  
  > gal["morf"]
  $morf
  [1] "E"
  > gal[info]				# computed index for 'morf' with single bracket
  $morf
  [1] "E"
  
  > gal$morf
  [1] "E"
  > gal$info				# element 'info' unknown
  NULL

To **recursively** extract an element:

::

  > gal <- list(name="NGC3379", morf="E", colours=c(0.53,0.96))
  
  > gal[[c(3,1)]]			# extract the 1st element of the 3rd element ('0.53')
  [1] 0.53
  > gal[[3]][[1]]			# equivalent double subsetting
  [1] 0.53
  
  > gal[c(3,1)]				# not recursive!
  $colours
  [1] 0.53 0.96

  $name
  [1] "NGC3379"

Elements can be extracted using **partial matching** with the *[[* and *$* operators:

::

    > gal <- list(name="NGC3379", morf="E", colours=c(0.53,0.96))
     
    > gal$na				# get element by partial matching the name
    [1] "NGC3379"
    > gal[["na"]]			# expect exact element name
    NULL
    > gal[["na", exact=FALSE]]		# partial matching as with '$'
    [1] "NGC3379"


For *Data Frames (Tables)*, the operators used for slicing are the same than those used for *lists*:

::

  > airquality 					# data frame in R library
  > airquality[1:7, ]				# display first 7 rows of data frame
    Ozone Solar.R Wind Temp Month Day		# there are missing values in rows 5 and 6
  1    41     190  7.4   67     5   1
  2    36     118  8.0   72     5   2
  3    12     149 12.6   74     5   3
  4    18     313 11.5   62     5   4
  5    NA      NA 14.3   56     5   5
  6    28      NA 14.9   66     5   6
  7    23     299  8.6   65     5   7    
  > class(airquality[1:7, ])
  [1] "data.frame"
  
  > airquality[1,1]				# get element in row=1, col=1
  [1] 41
  > airquality[[1,1]]				# get element in row=1, col=1
  [1] 41
  
  > airquality[1,]				# get row=1 (all columns)
    Ozone Solar.R Wind Temp Month Day
  1    41     190  7.4   67     5   1
  > class(airquality[1,])
  [1] "data.frame"
  > as.numeric(airquality[1,])			# get row=1 into a numeric vector
  [1]  41.0 190.0   7.4  67.0   5.0   1.0
  

  > airquality$Ozone				# get "Ozone" column into a vector
    [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
   [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
   [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
   [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
   [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
   [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
  [109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
  [127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
  [145]  23  36   7  14  30  NA  14  18  20

  > class(airquality$Ozone)
  [1] "integer"

  
For *Character Strings* the access to their elements is done in a different way:

::

  > a <- "This is an example of a text string"	# define a character string
  > substr(a,5,10)				# show a string subset
  [1] " is an"

  
.. rmNA:

Removing NA values
------------------
  
We can remove *Not Available* values in a simple way using subsetting:

::

  > a <- c(0:2, NA, NA, 5:7)			# define vector with NA values

  > aa <- a[!is.na(a)]				# the condition uses the negation
  > aa 						#    of is.na() function
  [1] 0 1 2 5 6 7                               # new vector with no NA values

To take the subset of multiple vectors avoiding the missing values:

::

  > a <- c( 1,  2, 3, NA, 5, NA, 7)
  > b <- c("A","B",NA,"D",NA,"E","F")
  > valsok <- complete.cases(a,b)		# return positions in which both vectors have 
  > valsok					#    no-missing values
  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
  > a[valsok]					# subsetting 'a' gets good elements in 'a'
  [1] 1 2 7
  > b[valsok]					# subsetting 'b' gets good elements in 'b'
  [1] "A" "B" "F"

We can also use the function ``complete.cases`` to remove missing values from data frames:

::

  > airquality 					# data frame in R library
  > airquality[1:7, ]				# display first 7 rows of data frame
    Ozone Solar.R Wind Temp Month Day		# there are missing values in rows 5 and 6
  1    41     190  7.4   67     5   1
  2    36     118  8.0   72     5   2
  3    12     149 12.6   74     5   3
  4    18     313 11.5   62     5   4
  5    NA      NA 14.3   56     5   5
  6    28      NA 14.9   66     5   6
  7    23     299  8.6   65     5   7
  
  > valsok <- complete.cases(airquality)	# rows in which all the values are ok
  > airquality[valsok, ][1:7,]			# subset original dataframe and show first 7 rows
    Ozone Solar.R Wind Temp Month Day
  1    41     190  7.4   67     5   1
  2    36     118  8.0   72     5   2
  3    12     149 12.6   74     5   3
  4    18     313 11.5   62     5   4
  7    23     299  8.6   65     5   7
  8    19      99 13.8   59     5   8
  9     8      19 20.1   61     5   9