Fuzzy matching of names to the WCVP, using phonetic matching and edit distance. The WCVP names table can be loaded for matching from rWCVPdata::wcvp_names.
Usage
wcvp_match_fuzzy(names_df, wcvp_names, name_col, progress_bar = TRUE)
phonetic_match(names_df, wcvp_names, name_col)
edit_match(names_df, wcvp_names, name_col)
Arguments
- names_df
Data frame of names for matching.
- wcvp_names
Data frame of taxonomic names from WCVP version 7 or later. If NULL (the default), names will be loaded from rWCVPdata::wcvp_names.
- name_col
Character. The column in names_df that has the taxon name for matching.
- progress_bar
Logical. Show progress bar when matching? Defaults to TRUE; should be changed to FALSE if used in a markdown report.
Details
The wcvp_match_fuzzy function uses phonetic matching first and then finds the closest match based on edit distance for any remaining names.
Phonetic matching uses phonics::metaphone encoding with a maximum code length of 20.
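As a rough illustration of matching on phonetic codes, the sketch below encodes a few single epithets with phonics::metaphone and pairs each submitted string with the first reference string that shares its code. The vectors `submitted` and `reference` are made up for this example, single words are used to avoid any assumption about how spaces are handled, and the lookup with match() is only an assumed stand-in for whatever join phonetic_match performs internally.

```r
# Minimal sketch, assuming the 'phonics' package is installed.
# 'submitted' and 'reference' are illustrative vectors, not package data.
if (requireNamespace("phonics", quietly = TRUE)) {
  submitted <- c("hibrida", "pyriformis")
  reference <- c("hybrida", "pyriformis", "sativa")

  # Encode both sides with the same maximum code length as stated above.
  key_sub <- phonics::metaphone(submitted, maxCodeLen = 20)
  key_ref <- phonics::metaphone(reference, maxCodeLen = 20)

  # Pair each submitted string with the first reference string that has
  # an identical code; NA means no phonetic match was found.
  matched <- reference[match(key_sub, key_ref)]
  print(data.frame(submitted, key_sub, matched))
}
```

Strings left with NA here are the kind of leftovers that the edit distance step would then handle.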
Edit distance matching finds the closest match based on Levenshtein similarity, calculated by RecordLinkage::levenshteinSim.
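To make the similarity score concrete, here is a base R sketch that computes a Levenshtein similarity as 1 minus the edit distance divided by the length of the longer string, which is the formula given in the RecordLinkage documentation for levenshteinSim. The helper `lev_sim` and the candidate list are hypothetical and exist only for this example; utils::adist stands in for the distance calculation, so this is not the package's own code.

```r
# Hypothetical helper: Levenshtein similarity of one query string against
# a vector of candidates, using base R's adist() for the edit distance.
lev_sim <- function(query, candidates) {
  d <- as.vector(utils::adist(query, candidates))
  1 - d / pmax(nchar(query), nchar(candidates))
}

# Illustrative candidate list and a misspelled query.
candidates <- c("Avena hybrida", "Avena sativa", "Arenaria hybrida")
query <- "Avena hibrida"

sims <- lev_sim(query, candidates)
best <- candidates[which.max(sims)]  # candidate with the highest similarity
print(data.frame(candidates, similarity = round(sims, 3)))
print(best)
```

A similarity of 1 means the strings are identical, and lower values mean more edits relative to the length of the longer name.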
See also
Other name matching functions: wcvp_match_exact(), wcvp_match_names()
Examples
# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  wcvp_match_fuzzy(redlist_example, wcvp_names, "scientificName")
}
#> Matching ■■■■■■■ 20% ETA 9s
#> Matching ■■■■■■■■■■■■■■■■■■■ 60% ETA 7s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■ 80% ETA 3s
#> # A tibble: 23 × 16
#> assessmentId scientificName redlistCategory authority match_type
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 19395021 Avena hybrida Data Deficient Peterm. Fuzzy (ph…
#> 2 64135503 Citrus garrawayi Least Concern F.M.Bailey Fuzzy (ph…
#> 3 189601563 Croton campanulatus Endangered Caruzo & Cor… Fuzzy (ph…
#> 4 115968141 Cynanchum freemani Endangered (N.E.Br.) Wo… Fuzzy (ph…
#> 5 11047751 Echinacanthus longipes Vulnerable H.S.Lo & D. … Fuzzy (ph…
#> 6 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ph…
#> 7 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ph…
#> 8 198678856 Leichhardtia variifolia Vulnerable (Guillaumin)… Fuzzy (ph…
#> 9 135836392 Mouriri myrtilloides Least Concern (Sw.) Poir. Fuzzy (ph…
#> 10 146459149 Neocussonia umbellifera Least Concern (Sond.) Hutc… Fuzzy (ph…
#> # ℹ 13 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> # match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> # wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> # wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>
# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  phonetic_match(redlist_example, wcvp_names, "scientificName")
}
#> # A tibble: 24 × 16
#> assessmentId scientificName redlistCategory authority match_type
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 11081542 Antimima quartzitica Least Concern (Dinter) H.… NA
#> 2 19395021 Avena hybrida Data Deficient Peterm. Fuzzy (ph…
#> 3 64135503 Citrus garrawayi Least Concern F.M.Bailey Fuzzy (ph…
#> 4 189601563 Croton campanulatus Endangered Caruzo & Co… Fuzzy (ph…
#> 5 115968141 Cynanchum freemani Endangered (N.E.Br.) W… Fuzzy (ph…
#> 6 11047751 Echinacanthus longipes Vulnerable H.S.Lo & D.… Fuzzy (ph…
#> 7 11001316 Geissanthus pinchinchana Endangered (Lundell) P… NA
#> 8 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ph…
#> 9 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ph…
#> 10 198678856 Leichhardtia variifolia Vulnerable (Guillaumin… Fuzzy (ph…
#> # ℹ 14 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> # match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> # wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> # wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>
# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  edit_match(redlist_example, wcvp_names, "scientificName")
}
#> Matching ■■■ 5% ETA 48s
#> Matching ■■■■■ 15% ETA 39s
#> Matching ■■■■■■■ 20% ETA 36s
#> Matching ■■■■■■■■■■ 30% ETA 32s
#> Matching ■■■■■■■■■■■ 35% ETA 31s
#> Matching ■■■■■■■■■■■■■ 40% ETA 27s
#> Matching ■■■■■■■■■■■■■■■ 45% ETA 25s
#> Matching ■■■■■■■■■■■■■■■■■ 55% ETA 20s
#> Matching ■■■■■■■■■■■■■■■■■■■■■ 65% ETA 16s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■ 70% ETA 14s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■ 80% ETA 9s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 90% ETA 4s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 95% ETA 2s
#> # A tibble: 23 × 16
#> assessmentId scientificName redlistCategory authority match_type
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 11081542 Antimima quartzitica Least Concern (Dinter) H.… Fuzzy (ed…
#> 2 19395021 Avena hybrida Data Deficient Peterm. Fuzzy (ed…
#> 3 64135503 Citrus garrawayi Least Concern F.M.Bailey Fuzzy (ed…
#> 4 189601563 Croton campanulatus Endangered Caruzo & Co… Fuzzy (ed…
#> 5 115968141 Cynanchum freemani Endangered (N.E.Br.) W… Fuzzy (ed…
#> 6 11047751 Echinacanthus longipes Vulnerable H.S.Lo & D.… Fuzzy (ed…
#> 7 11001316 Geissanthus pinchinchana Endangered (Lundell) P… Fuzzy (ed…
#> 8 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ed…
#> 9 126598076 Juglans pyriformis Endangered Liebm. Fuzzy (ed…
#> 10 198678856 Leichhardtia variifolia Vulnerable (Guillaumin… Fuzzy (ed…
#> # ℹ 13 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> # match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> # wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> # wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>