In this example we are going to implement a custom
matrix-like type that stores data in form of characters as a string in
the _data
attribute of an object. This is not going to be
particularly useful, but serves as an easy to understand starting point
that can be adapted for other storage mechanisms, e.g., databases,
shared memory, and so on.
First we need to clarify what we mean by matrix-like type: a
matrix-like type is a class (i.e., implemented using any of R’s
object-oriented mechanisms) that implements at least
the dim()
, length()
, dimnames()
,
and the [
extraction method.
Let’s call our type StringMatrix
and implement it as an
S3 class. If you need a refresher on S3 classes, please read the OO field guide
chapter in the Advanced R book by Hadley Wickham first.
Let’s start with implementing the dim()
,
length()
, and dimnames()
methods:
dim.StringMatrix <- function(x) {
attr(x, "_dim") # store dimensions in `_dim` attribute
}
length.StringMatrix <- function(x) {
prod(dim(x)) # rely on `dim()` method above
}
dimnames.StringMatrix <- function(x) {
attr(x, "_dimnames") # store dim names in `_dimnames` attribute
}
Subsetting in R is very powerful and can therefore be difficult to implement depending on how many indexing mechanisms you want to support. For example, given a 5x5 matrix, all of the following cases (non-exhaustive) are equivalent:
X[seq(1, 25, by = 5)] # subsetting by positive integers
X[1, ] # simplifying subsetting by positive integers
X[-(2:5), ] # simplifying subsetting by negative integers
X[c(TRUE, FALSE, FALSE, FALSE, FALSE), ] # simplifying subsetting by booleans
X["row_1", ] # simplifying subsetting by row names (only if dimnames exist)
This complexity motivated the development of the crochet package. The
extract()
function of the package takes care of converting
all those indexing mechanisms to positive integers, which is typically
the easiest mechanism to implement. extract()
returns a
function that can be used as a method for [
for a custom
type. Matrices can be subsetted using one-dimensional (i
only) and two-dimensional indices (i
and j
)
and both have very different behaviors. Therefore, two functions need to
be provided to extract()
as extract_vector
and
extract_matrix
. extract_vector
has to be a
function of the form function(x, i, ...)
and
extract_matrix
a function of the form
function(x, i, j, ...)
. Both functions return a subset of
x
.
The following snippets gives a simple way to extract characters from
a string one by one. In R, we can extract the nth character
from a string using the substr()
function:
substr(x, n, n)
. Conversely, the nth character can
be replaced as follows: substr(x, n, n) <- value
. Note
that the [
character is not allowed in a variable name, so
it needs to be escaped with backticks when establishing the return value
of extract()
as a method of StringMatrix
.
`[.StringMatrix` <- crochet::extract(
extract_vector = function(x, i, ...) { # i are positive integers
# Reserve output vector
subset <- vector(mode = "character", length = length(i))
# Populate output vector
for (singleIdx in 1:length(i)) {
subset[singleIdx] <- substr(attr(x, "_data"), i[singleIdx], i[singleIdx])
}
# Return output vector
return(subset)
},
extract_matrix = function(x, i, j, ...) { # i and j are positive integers
# Reserve output matrix
subset <- matrix(
data = vector(mode = "character", length = length(i) * length(j)),
nrow = length(i),
ncol = length(j)
)
# Populate output matrix
for (colIdx in 1:length(j)) {
for (rowIdx in 1:length(i)) {
# two-dimensional index needs to be converted to one-dimensional index
singleIdx <- crochet:::ijtok(x, i[rowIdx], j[colIdx])
subset[rowIdx, colIdx] <- substr(attr(x, "_data"), singleIdx, singleIdx)
}
}
# Return output matrix
return(subset)
}
)
We can now create an object of the StringMatrix
class
and provide it with some data:
# Generate data
n <- 5
p <- 5
alphabet <- c(0:9, letters)
data <- sample(alphabet, replace = TRUE, size = n * p)
# Create object
obj <- list()
class(obj) <- "StringMatrix"
attr(obj, "_dim") <- c(n, p)
attr(obj, "_dimnames") <- list(paste0("row_", 1:n), paste0("col_", 1:p))
attr(obj, "_data") <- paste(data, collapse = "")
# Call some methods
dim(obj)
## [1] 5 5
## [1] 5
## [1] 5
## [1] 25
## [[1]]
## [1] "row_1" "row_2" "row_3" "row_4" "row_5"
##
## [[2]]
## [1] "col_1" "col_2" "col_3" "col_4" "col_5"
## [1] "row_1" "row_2" "row_3" "row_4" "row_5"
## [1] "col_1" "col_2" "col_3" "col_4" "col_5"
## [1] "n" "q" "y" "6" "e"
## [1] "n" "q" "y" "6" "e"
## [1] "n" "q" "y" "6" "e"
## [1] "n" "q" "y" "6" "e"
## [1] "n" "q" "y" "6" "e"
Different from [
for atomic vectors (where both named
and unnamed arguments are interpreted as indices), optional named
arguments can be passed to extract_vector
and
extact_matrix
as ...
. This can be useful for
some optimization strategies (e.g., fadvise or madvise). Let’s add an
option to capitalize subsets as a demonstration:
`[.StringMatrix` <- crochet::extract(
extract_vector = function(x, i, ...) { # i are positive integers
dotdotdot <- list(...)
# Reserve output vector
subset <- vector(mode = "character", length = length(i))
# Populate output vector
for (singleIdx in 1:length(i)) {
subset[singleIdx] <- substr(attr(x, "_data"), i[singleIdx], i[singleIdx])
}
# Capitalize output
if (!is.null(dotdotdot$capitalize) && dotdotdot$capitalize) {
subset <- toupper(subset)
}
# Return output vector
return(subset)
},
extract_matrix = function(x, i, j, ...) { # i and j are positive integers
dotdotdot <- list(...)
# Reserve output matrix
subset <- matrix(
data = vector(mode = "character", length = length(i) * length(j)),
nrow = length(i),
ncol = length(j)
)
# Populate output matrix
for (colIdx in 1:length(j)) {
for (rowIdx in 1:length(i)) {
# two-dimensional index needs to be converted to one-dimensional index
singleIdx <- crochet:::ijtok(x, i[rowIdx], j[colIdx])
subset[rowIdx, colIdx] <- substr(attr(x, "_data"), singleIdx, singleIdx)
}
}
# Capitalize output
if (!is.null(dotdotdot$capitalize) && dotdotdot$capitalize) {
subset <- toupper(subset)
}
# Return output matrix
return(subset)
}
)
Now we can capitalize the output as follows:
## [1] "n" "q" "y" "6" "e"
## [1] "N" "Q" "Y" "6" "E"
## [1] "n" "q" "y" "6" "e"
To support replacement, replace()
returns a function
that can be used as a method for [<-
for a custom type.
Analogous to the extract()
method, two parameters are
required by replace()
: replace_vector
has to
be a function of the form function(x, i, ..., value)
and
replace_matrix
a function of the form
function(x, i, j, ..., value)
. Both functions return a
likely modified version of x
.
`[<-.StringMatrix` <- crochet::replace(
replace_vector = function(x, i, ..., value) { # i are positive integers
# Perform replacement
for (singleIdx in 1:length(i)) {
substr(attr(x, "_data"), i[singleIdx], i[singleIdx]) <- value[singleIdx]
}
# Do not forget to return x
return(x)
},
replace_matrix = function(x, i, j, ..., value) { # i and j are positive integers
# Convert value to matrix for easier indexing
dim(value) <- c(length(i), length(j))
# Perform replacement
for (colIdx in 1:length(j)) {
for (rowIdx in 1:length(i)) { # two-dimensional index needs to be converted to one-dimensional index
singleIdx <- crochet:::ijtok(x, i[rowIdx], j[colIdx])
substr(attr(x, "_data"), singleIdx, singleIdx) <- value[rowIdx, colIdx]
}
}
# Do not forget to return x
return(x)
}
)
Now we can replace some data:
## [,1] [,2] [,3] [,4] [,5]
## [1,] "z" "z" "y" "6" "e"
## [2,] "z" "z" "i" "h" "l"
## [3,] "z" "0" "w" "m" "f"
## [4,] "z" "3" "a" "j" "5"
## [5,] "z" "q" "a" "x" "5"
As you can see the simple extraction and replacement functions above
cover a lot of scenarios. There are some edge cases not mentioned here
that can’t be handled by crochet automatically (e.g., x[FALSE],
combinations with NA
s, and so on), so if you want full
coverage, you should run the crochet test suite on your custom type.
Examples of this can be found in
inst/tinytest/test-stringmatrix.R
for the
StringMatrix
type, and in the BEDMatrix or LinkedMatrix
packages.