vignettes/not-lost-in-translation-importing-and-exporting-graphs.Rmd
diffnet objects are the core of netdiffuseR, but they are not the only way to use it. Most of the functions can also be used with matrices and arrays.
##
## Attaching package: 'netdiffuseR'
## The following object is masked from 'package:base':
##
## %*%
By raw network data we mean datasets in a somewhat raw form (for example edgelists, adjacency matrices, or survey nomination data) that need to be read into R.
Usually these datasets are accompanied by vertex attribute data.
The issue is how to read everything into R and handle it together.
Before starting, we recommend that the user take a look at the data input functions included in the utils package (see ?read.table), at the functions included in the foreign package (useful for reading from Stata, SPSS, etc.), and at the read_excel function in the readxl package[1] for reading Excel files into R.
edgelist_to_adjmat
edgelist_to_adjmat supports both weights and spells.
as_diffnet
For this example we will use the fakesurvey and fakeEdgelist datasets. The latter was generated from the fakesurvey dataset, which holds survey information collected from 9 individuals in two different groups. Ties in the fakeEdgelist dataset are valued: each tie's value is the number of times the ego nominated the alter in the survey.
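As a rough illustration of how repeated nominations turn into tie values, the sketch below counts duplicated ego-alter pairs with base R. The `noms` data frame is a hypothetical toy built by hand, not part of the package, and this is not necessarily how fakeEdgelist was produced.

```r
# Hypothetical long-format nominations: one row per nomination made
noms <- data.frame(
  ego   = c(105, 105, 105, 102),
  alter = c(104, 104, 103, 101)
)

# Each nomination counts once; summing by (ego, alter) gives the tie value
noms$value <- 1
el <- aggregate(value ~ ego + alter, data = noms, FUN = sum)

# The pair nominated twice ends up with value 2
el[el$ego == 105 & el$alter == 104, ]
```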
Looking at fakesurvey's group and id columns and fakeEdgelist's ego and alter columns, the user can tell that the latter have been generated by adding group*100 to id.
# id group
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 1 2
head(fakeEdgelist)
# ego alter value
# 1 102 101 1
# 2 103 102 1
# 3 102 103 1
# 4 105 103 1
# 5 105 104 2
# 6 104 105 1
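The composite ids can be checked by hand. The following minimal sketch, which uses only base R and the six ego values printed above, splits each id back into its group and within-group id with integer division and modulo. It assumes within-group ids stay below 100.

```r
# ego values copied from the head(fakeEdgelist) output above
ego <- c(102, 103, 102, 105, 105, 104)

# Undo the group*100 + id encoding (assumes id < 100)
group <- ego %/% 100
id    <- ego %% 100

# Rebuilding the composite id recovers the original values
stopifnot(all(group * 100 + id == ego))
data.frame(ego, group, id)
```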
We will use this information later on to verify the way the data is
sorted in the resulting diffnet
objects.
To use the as_diffnet function we need at least two objects: a dynamic graph represented as either an array or a list of adjacency matrices, each of size $n \times n$, which in our case will be $9 \times 9$, and an integer vector of size $n$ which holds each vertex's time of adoption. Let's start by generating the dynamic graph using the edgelist_to_adjmat function:
# Coercing the edgelist to an adjacency matrix
adjmat <- edgelist_to_adjmat(
edgelist = fakeEdgelist[,1:2], # Should be a two column matrix/data.frame
w = fakeEdgelist$value, # An optional vector with weights
undirected = FALSE, # In this case, the edgelist is directed
t = 5) # We use this option to make 5 replicas of it
# Warning in edgelist_to_adjmat.matrix(as.matrix(edgelist), w, t0, t1, t, : Some edges a had NA/NULL value on either -times- or -w-:
# 11
# These won't be included in the adjacency matrix. The complete list will be stored as an attribute of the resulting adjacency matrix, namely, -incomplete-.
As the function warns, one edge (edge 11) had incomplete information and was therefore not used to create the adjacency matrix. If we take a look at that edge, we will see that it indeed had incomplete information on the weight attribute:
fakeEdgelist[11,,drop=FALSE]
# ego alter value
# 11 202 <NA> NA
To address this, if we want to keep vertex 202, an isolated vertex in the data, we need to fill in that value. Otherwise, when creating the diffnet object, we would run into the problem of having more attributes or times of adoption than vertices in the graph.
# Filling the empty data, and checking the outcome
fakeEdgelist[11,"value"] <- 1
fakeEdgelist[11,,drop=FALSE]
# ego alter value
# 11 202 <NA> 1
# Coercing the edgelist to an adjacency matrix (again)
adjmat <- edgelist_to_adjmat(
edgelist = fakeEdgelist[,1:2], # Should be a two column matrix/data.frame
w = fakeEdgelist$value, # An optional vector with weights
undirected = FALSE, # In this case, the edgelist is directed
keep.isolates = TRUE, # NOTICE THIS NEW ARGUMENT!
t = 5) # We use this option to make 5 replicas of it
As expected, there is no warning. Furthermore, we have told the function to keep isolated vertices, as is the case for edge #11, which contains vertex 202. Since we asked the function to create 5 copies of the adjacency matrix, we have a list of 5 adjacency matrices. Let's take a look at the first element of this list:
adjmat[[1]]
# 9 x 9 sparse Matrix of class "dgCMatrix"
# 101 102 103 104 105 201 202 205 210
# 101 . . . . . . . . .
# 102 1 . 1 . . . . . .
# 103 . 1 . . . . . . .
# 104 . . . . 1 . . . .
# 105 . . 1 2 . . . . .
# 201 . . . . . . . . .
# 202 . . . . . . . . .
# 205 . . . . . 1 . . 1
# 210 . . . . . 1 . 1 .
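To make the conversion less of a black box, here is a minimal base R sketch of the same idea on a hypothetical toy edgelist: incomplete edges add no cell, their known endpoint is still kept as an isolate, and the static matrix is repeated once per period. It illustrates the concept only; it is not the package's implementation and it uses a dense matrix for simplicity.

```r
# Toy weighted, directed edgelist (hypothetical values); the last row has
# no alter and no weight, like edge 11 in the example above
el <- data.frame(
  ego   = c("102", "103", "105", "202"),
  alter = c("101", "102", "104", NA),
  value = c(1, 1, 2, NA),
  stringsAsFactors = FALSE
)

# Vertex labels: every id seen as ego or alter, so "202" is kept as an isolate
ids <- sort(unique(na.omit(c(el$ego, el$alter))))

# Empty n x n matrix with the vertex labels as dimnames
adj <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(ids, ids))

# Only edges with both endpoints observed fill a cell (row = ego, column = alter)
ok <- !is.na(el$ego) & !is.na(el$alter)
adj[cbind(el$ego[ok], el$alter[ok])] <- el$value[ok]

# A static graph repeated over 5 periods, stored as a list of matrices
dyn <- replicate(5, adj, simplify = FALSE)
```

Indexing with a two-column character matrix matches row and column names, which is why the labels never need to be converted to positions.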
As you can see, the edgelist_to_adjmat function kept the vertex labels and included them as dimnames in the matrix.[2] Now that our adjacency matrix has the number of elements that we expected, which coincides with the number of rows in the fakesurvey dataset, we can create a diffnet object:
# Coercing the adjacency matrix and edgelist into a diffnet object
diffnet <- as_diffnet(
graph = adjmat, # Passing a dynamic graph
toa = fakesurvey$toa, # This is required
vertex.static.attrs = fakesurvey # This is optional
)
# Taking a look at the diffnet object
diffnet
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 9 (101, 102, 103, 104, 105, 201, 202, 205, ...)
# # of time periods : 5 (1 - 5)
# Type : directed
# Final prevalence : 0.89
# Static attributes : id, toa, group, net1, net2, net3, age, gender, not... (9)
# Dynamic attributes : -
edgelist_to_diffnet
Following the previous example, instead of “manually” generating the adjacency matrix and calling the as_diffnet function, we will use the edgelist_to_diffnet function. The most important issue when calling this routine is to have matching ids between the edgelist and the attributes dataset. So before calling the edgelist_to_diffnet function we need to fix the id column in the fakesurvey dataset:[3]
# Before
fakesurvey$id
# [1] 1 2 3 4 5 1 2 5 10
# Changing the id
fakesurvey$id <- with(fakesurvey, group*100 + id)
# After
fakesurvey$id
# [1] 101 102 103 104 105 201 202 205 210
Now that it is fixed, we can call the edgelist_to_diffnet function:
diffnet2 <- edgelist_to_diffnet(
edgelist = fakeEdgelist[,1:2], # Passed to edgelist_to_adjmat
w = fakeEdgelist$value, # Passed to edgelist_to_adjmat
dat = fakesurvey, # Data frame with -idvar- and -toavar-
idvar = "id", # Name of the -idvar- in -dat-
toavar = "toa", # Name of the -toavar- in -dat-
keep.isolates = TRUE # Passed to edgelist_to_adjmat
)
# Warning in check_var_class_and_coerce(x, edgelist, c("factor", "integer", :
# Coercing -ego- into character.
# Warning in check_var_class_and_coerce(x, edgelist, c("factor", "integer", :
# Coercing -alter- into character.
diffnet2
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 9 (101, 102, 103, 104, 105, 201, 202, 205, ...)
# # of time periods : 5 (1 - 5)
# Type : directed
# Final prevalence : 0.89
# Static attributes : group, net1, net2, net3, age, gender, note (7)
# Dynamic attributes : -
Unlike the previous example, here the algorithm makes sure that the ordering of the dataset and of the vertices in the adjacency matrix coincide. The previous example did give us a correctly sorted diffnet object, but that may not always be the case. Nevertheless, the id.and.per.vars option lets the user provide the names of the variables in the vertex attribute datasets that hold the ids and time period ids of each observation, so that the function sorts the data before coercing it into a diffnet object. More on this in the following examples.
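When the graph and the attributes are assembled by hand, the ordering can also be checked before creating the object. The sketch below, in base R with hypothetical toy values, reorders an attribute data frame so that its rows follow the vertex labels of an adjacency matrix. It shows the general idea and is not the sorting code used inside the package.

```r
# Vertex order as it would appear in the dimnames of the adjacency matrix
vertex_ids <- c("101", "102", "201")

# Attribute data stored in a different order (hypothetical values)
attrs <- data.frame(id = c(201, 101, 102), toa = c(3, 1, 5))

# Position of each vertex in the attribute data
ord <- match(vertex_ids, as.character(attrs$id))

# Fail early if some vertex has no attribute row
stopifnot(!anyNA(ord))

# Rows now follow the vertex order of the graph
attrs_sorted <- attrs[ord, ]
attrs_sorted
```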
survey_to_diffnet
fakesurvey holds cross-section data, and fakesurveyDyn holds longitudinal data. We start by taking a look at the data:
# Loading the data
data("fakesurvey")
fakesurvey
# id toa group net1 net2 net3 age gender note
# 1 1 1 1 NA NA NA 30 M No nominations
# 2 2 5 1 3 1 NA 35 F Nothing weird
# 3 3 5 1 NA 2 NA 31 F Only nominates in net2
# 4 4 3 1 6 5 NA 30 M Nominates someone who wasn't interview
# 5 5 2 1 4 4 3 40 F Nominates 4 two times
# 6 1 4 2 3 4 8 29 F Only nominates outsiders
# 7 2 3 2 3 NA NA 35 M Isolated
# 8 5 3 2 10 1 NA 50 M Nothing weird
# 9 10 NA 2 5 1 NA 19 F Non-adopter
A couple of important remarks about this dataset. First, the individuals in this dataset belong to different groups. While this is not always the case, survey_to_diffnet allows accounting for it through the groupvar argument. Also, besides having an isolated vertex, two individuals in the survey nominate people who were neither surveyed nor appear in their groups:
fakesurvey[c(4,6),]
# id toa group net1 net2 net3 age gender note
# 4 4 3 1 6 5 NA 30 M Nominates someone who wasn't interview
# 6 1 4 2 3 4 8 29 F Only nominates outsiders
So in group one, individual 4 nominates id 6, who does not appear in the data, and in group two, the individual in row 6 (id 1) nominates ids 3, 4, and 8, who also do not appear in the survey.
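The same check can be done for the whole survey at once. The following sketch rebuilds the id, group and nomination columns from the printed fakesurvey output above (typed in by hand, so the package is not needed) and lists every nominated id that was not surveyed. It is a by-hand illustration, not what survey_to_diffnet does internally.

```r
# id, group and nomination columns copied from the fakesurvey printout
survey <- data.frame(
  id    = c(1, 2, 3, 4, 5, 1, 2, 5, 10),
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  net1  = c(NA, 3, NA, 6, 4, 3, 3, 10, 5),
  net2  = c(NA, 1, 2, 5, 4, 4, NA, 1, 1),
  net3  = c(NA, NA, NA, NA, 3, 8, NA, NA, NA)
)

# Composite ids of the people who answered the survey
surveyed <- with(survey, group * 100 + id)

# Composite ids of everyone nominated, within the nominator's own group
nets      <- as.matrix(survey[, c("net1", "net2", "net3")])
nominated <- as.vector(survey$group * 100 + nets)

# Nominated but never surveyed
unsurveyed <- sort(setdiff(na.omit(nominated), surveyed))
unsurveyed
```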
While nominations of unsurveyed individuals may not matter to some researchers, they may matter to others. For such cases, the function can either keep unsurveyed individuals (so you get a bigger adjacency matrix) or ignore them and keep only those who were surveyed. For example, if we wanted to keep unsurveyed individuals in the network we would need to set no.unsurveyed = FALSE:
# Coercing the survey data into a diffnet object
diffnet_w_unsurveyed <- survey_to_diffnet(
dat = fakesurvey, # The dataset
idvar = "id", # Name of the idvar (must be integer)
netvars = c("net1", "net2", "net3"), # Vector of names of nomination vars
toavar = "toa", # Name of the time of adoption var
groupvar = "group", # Name of the group var (OPTIONAL)
no.unsurveyed = FALSE # KEEP OR NOT UNSURVEYED
)
diffnet_w_unsurveyed
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 13 (101, 102, 103, 104, 105, 106, 201, 202, ...)
# # of time periods : 5 (1 - 5)
# Type : directed
# Final prevalence : 0.62
# Static attributes : group, net1, net2, net3, age, gender, note (7)
# Dynamic attributes : -
# Retrieving nodes ids
nodes(diffnet_w_unsurveyed)
# [1] "101" "102" "103" "104" "105" "106" "201" "202" "203" "204" "205" "208"
# [13] "210"
The result is a network spanning 5 time periods with 13 vertices (9 surveyed individuals + 4 unsurveyed individuals). This differs from what we get with the default behavior of the function, no.unsurveyed = TRUE:
# Coercing the survey data into a diffnet object
diffnet_wo_unsurveyed <- survey_to_diffnet(
dat = fakesurvey, # The dataset
idvar = "id", # Name of the idvar (must be integer)
netvars = c("net1", "net2", "net3"), # Vector of names of nomination vars
toavar = "toa", # Name of the time of adoption var
groupvar = "group" # Name of the group var (OPTIONAL)
)
diffnet_wo_unsurveyed
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 9 (101, 102, 103, 104, 105, 201, 202, 205, ...)
# # of time periods : 5 (1 - 5)
# Type : directed
# Final prevalence : 0.89
# Static attributes : group, net1, net2, net3, age, gender, note (7)
# Dynamic attributes : -
# Retrieving nodes ids
nodes(diffnet_wo_unsurveyed)
# [1] "101" "102" "103" "104" "105" "201" "202" "205" "210"
Furthermore, we can compare the two diffusion networks by subtracting one from the other:
difference <- diffnet_w_unsurveyed - diffnet_wo_unsurveyed
difference
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 4 (106, 203, 204, 208)
# # of time periods : 5 (1 - 5)
# Type : directed
# Final prevalence : 0.00
# Static attributes : group, net1, net2, net3, age, gender, note (7)
# Dynamic attributes : -
In this example we will use dynamic network data, that is, an edgelist with spells and dynamic attributes:
# ego alter value time
# 1 102 101 1 1990
# 2 103 102 1 1990
# 3 102 103 1 1990
# 4 105 103 1 1990
# 5 105 104 2 1990
# 6 104 105 1 1990
# id toa group net1 net2 net3 age gender
# 1 1 1991 1 NA NA NA 30 M
# 2 2 1990 1 3 1 NA 35 F
# 3 3 1991 1 NA 2 NA 31 F
# 4 4 1990 1 6 5 NA 30 M
# 5 5 1991 1 4 4 3 40 F
# 6 1 1991 2 3 4 8 29 F
# note time
# 1 First wave: No nominations 1990
# 2 First wave: Nothing weird 1990
# 3 First wave: Only nominates in net2 1990
# 4 First wave: Nominates someone who wasn't interview 1990
# 5 First wave: Nominates 4 two times 1990
# 6 First wave: Only nominates outsiders 1990
As before, we have to make sure the ids are right:
# Fixing ids
fakesurveyDyn$id <- with(fakesurveyDyn, group*100 + id)
# An individual who is alone
fakeDynEdgelist[11,"value"] <- 1
diffnet <- edgelist_to_diffnet(
edgelist = fakeDynEdgelist[,1:2], # As usual, a two column dataset
w = fakeDynEdgelist$value, # Here we are using weights
t0 = fakeDynEdgelist$time, # An integer vector with starting point of spell
t1 = fakeDynEdgelist$time, # An integer vector with the endpoint of spell
dat = fakesurveyDyn, # Attributes dataset
idvar = "id",
toavar = "toa",
timevar = "time",
keep.isolates = TRUE # Keeping isolates (if there are any)
)
# Warning in check_var_class_and_coerce(x, dat, c("numeric", "integer"),
# "integer", : Coercing -time- into integer.
# Warning in check_var_class_and_coerce(x, edgelist, c("factor", "integer", :
# Coercing -ego- into character.
# Warning in check_var_class_and_coerce(x, edgelist, c("factor", "integer", :
# Coercing -alter- into character.
diffnet
# Dynamic network of class -diffnet-
# Name : Diffusion Network
# Behavior : Unspecified
# # of nodes : 9 (101, 102, 103, 104, 105, 201, 202, 205, ...)
# # of time periods : 2 (1990 - 1991)
# Type : directed
# Final prevalence : 1.00
# Static attributes : -
# Dynamic attributes : group, net1, net2, net3, age, gender, note (7)
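To see what spells mean for the resulting graph, the sketch below expands a hypothetical toy edgelist with start and end times into one adjacency matrix per period, using base R only. It assumes that a tie is active in every period from t0 to t1, both included. That convention is an assumption made for this illustration, not a statement about the package's internals.

```r
# Toy edgelist with spells (hypothetical values)
el <- data.frame(
  ego   = c("a", "b", "a"),
  alter = c("b", "c", "c"),
  t0    = c(1990, 1990, 1991),
  t1    = c(1990, 1991, 1991),
  stringsAsFactors = FALSE
)

ids     <- sort(unique(c(el$ego, el$alter)))
periods <- seq(min(el$t0), max(el$t1))

# One adjacency matrix per period, holding the ties active in that period
dyn <- lapply(periods, function(p) {
  adj <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
  active <- el$t0 <= p & p <= el$t1
  adj[cbind(el$ego[active], el$alter[active])] <- 1
  adj
})
names(dyn) <- periods

dyn[["1991"]]
```

Passing the same column as both t0 and t1, as in the call above, corresponds to ties that last exactly one period.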
[1] While there are other candidates such as the openxlsx package, the readxl package has the nice feature of correctly processing the encoding of Excel files. This is especially important if you are dealing with non-ASCII or UTF-8 datasets.
[2] Note also that the matrices stored in adjmat are of class dgCMatrix from the Matrix package. These are compressed sparse column matrices, which save memory when a matrix has many zeros. netdiffuseR routines are based on this class of matrices. To give an idea of how much memory sparse matrices save: a large square matrix that would need close to 18GB of memory as a regular R matrix takes around 6MB as a dgCMatrix of the same size.
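The memory difference is easy to reproduce on a small scale. The sketch below assumes the Matrix package is installed (it ships with standard R distributions) and compares a dense matrix with its sparse counterpart. The size n = 2000 and the ring structure are arbitrary choices for illustration, unrelated to the figures quoted above.

```r
library(Matrix)

# A directed ring on n vertices: only n of the n^2 cells are nonzero
n <- 2000
dense <- matrix(0, n, n)
dense[cbind(1:n, c(2:n, 1))] <- 1

# Same graph stored in compressed sparse column form
sparse <- Matrix(dense, sparse = TRUE)
class(sparse)

# Memory footprint of each representation
print(object.size(dense), units = "MB")
print(object.size(sparse), units = "MB")
```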
[3] The with function simplifies data management in R by letting you reference the columns of a data.frame without having to name the data.frame itself (see ?with).