request: could stringdistmatrix return a sparse matrix of logicals #74

@dan-reznik

Let s_vec be a vector of N distinct strings. When N is large, the N x N matrix returned by stringdistmatrix becomes unwieldy, as does the "dist" object returned when stringdistmatrix is called with a single argument.

I would like to request a new function, similar to stringdistmatrix, that returns this information sparsely, for cases where one is only interested in the lower-triangular (i,j) pairs that satisfy a condition:

s_vec <- c("string1","string2", ...)
N <- length(s_vec)
df <- stringdist_thresh(s_vec, method="lv", thresh=2, op="<=") 

The above would return a tibble with columns "i", "j", and "dist". Each row records a pair (i,j) and its distance dist for which the predicate (in this case, dist <= 2) holds, with i in 1 to N-1 and j in i+1 to N.

"op" can be one of: "<", "<=", ">", ">=", "==", "!=".

As an example, consider:

s_vec <- c("aaa","aab", "aac","abc")
df <- stringdist_thresh(s_vec, method="lv", thresh=2L, op="<") 
#> tibble:
#> i, j, dist
#> 1 2 1
#> 1 3 1
#> 2 3 1
#> 3 4 1

Notice how the data frame lists only those pairs whose lv distance satisfies dist < 2.
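
To make the request concrete, here is a minimal sketch of how the behaviour could be approximated today in plain R. It is not part of stringdist: stringdist_thresh is the hypothetical name from this request, and the sketch assumes a base data.frame is an acceptable stand-in for the tibble. It calls stringdist::stringdist one row at a time, so the full N x N result is never held in memory, though all N(N-1)/2 distances are still computed.

```r
library(stringdist)

# Hypothetical helper sketching the requested behaviour; not a stringdist function.
stringdist_thresh <- function(s_vec, method = "lv", thresh = 2, op = "<=") {
  ops <- c("<", "<=", ">", ">=", "==", "!=")
  if (!(op %in% ops)) stop("op must be one of: ", paste(ops, collapse = ", "))
  cmp <- match.fun(op)  # turn the operator name into a comparison function
  n <- length(s_vec)
  if (n < 2L) {
    return(data.frame(i = integer(0), j = integer(0), dist = numeric(0)))
  }
  rows <- lapply(seq_len(n - 1L), function(i) {
    j <- seq.int(i + 1L, n)
    # Distances from string i to every later string (one row of the triangle).
    d <- stringdist(s_vec[i], s_vec[j], method = method)
    keep <- which(cmp(d, thresh))  # which() also drops NA comparisons
    data.frame(i = rep.int(i, length(keep)), j = j[keep], dist = d[keep])
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

df <- stringdist_thresh(c("aaa", "aab", "aac", "abc"),
                        method = "lv", thresh = 2, op = "<")
```

With the four-string example above, df should contain the same four (i, j, dist) rows. A built-in version could presumably avoid the per-row R overhead by filtering inside the compiled distance loop, which is the point of the request.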
