
Fuzzy-match names to the WCVP using phonetic matching and edit distance. The WCVP names table can be loaded for matching from rWCVPdata::wcvp_names.

Usage

wcvp_match_fuzzy(names_df, wcvp_names, name_col, progress_bar = TRUE)

phonetic_match(names_df, wcvp_names, name_col)

edit_match(names_df, wcvp_names, name_col)

Arguments

names_df

Data frame of names for matching.

wcvp_names

Data frame of taxonomic names from WCVP version 7 or later. If NULL (the default), names will be loaded from rWCVPdata::wcvp_names.

name_col

Character. Name of the column in names_df that holds the taxon names to match.

progress_bar

Logical. Whether to show a progress bar while matching. Defaults to TRUE; set it to FALSE when the function is called in a markdown report.

Value

Match results from WCVP bound to the original data from names_df.

Details

wcvp_match_fuzzy() tries phonetic matching first, then finds the closest match by edit distance for any names that remain unmatched.
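
The two-stage flow can be sketched in base R. This is an illustration only, not the package's implementation: toy_key() is a hypothetical and deliberately crude stand-in for a phonetic encoder, two_stage_match() and lev_sim() are made-up helpers, and the misspelled query names are invented for the example.

```r
# Hypothetical stand-in for a phonetic encoder (NOT Metaphone):
# lower-case, treat "y" as "i", and collapse repeated letters.
toy_key <- function(x) {
  x <- gsub("y", "i", tolower(x), fixed = TRUE)
  gsub("(.)\\1+", "\\1", x, perl = TRUE)
}

# Levenshtein similarity: 1 - distance / length of the longer string.
lev_sim <- function(a, b) {
  1 - as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

two_stage_match <- function(queries, reference) {
  out <- data.frame(
    query = queries,
    match = NA_character_,
    match_type = NA_character_,
    stringsAsFactors = FALSE
  )

  # Stage 1: names whose keys coincide count as phonetic matches.
  hit <- match(toy_key(queries), toy_key(reference))
  found <- !is.na(hit)
  out$match[found] <- reference[hit[found]]
  out$match_type[found] <- "phonetic (toy)"

  # Stage 2: each remaining name takes the most similar reference name.
  for (i in which(!found)) {
    out$match[i] <- reference[which.max(lev_sim(queries[i], reference))]
    out$match_type[i] <- "edit distance"
  }
  out
}

reference <- c("Avena hybrida", "Citrus garrawayi", "Juglans pyriformis")
queries <- c("Avena hibrida", "Citrus garawayi", "Juglans pyrformis")  # invented typos
result <- two_stage_match(queries, reference)
print(result)
```

The sketch keeps a single candidate per name. The real output includes a multiple_matches column, so it may report more than one candidate for a name.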

Phonetic matching uses phonics::metaphone encoding with a maximum code length of 20.
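
As a rough illustration of the encoding step, the snippet below calls phonics::metaphone directly with the code length stated above. It assumes the phonics package is installed. The single-word epithets are illustrative input chosen to keep the strings purely alphabetic, and any further arguments or preprocessing the package may apply are not shown.

```r
# Assumes the 'phonics' package is installed; skipped otherwise.
if (requireNamespace("phonics", quietly = TRUE)) {
  epithets <- c("pyriformis", "piriformis", "umbellifera")  # illustrative input

  # maxCodeLen = 20 mirrors the maximum code length described above.
  codes <- phonics::metaphone(epithets, maxCodeLen = 20)

  # Names that share a code are phonetic candidates for one another.
  print(split(epithets, codes))
}
```

Spelling variants that sound alike would be expected to fall into the same group, which is what makes the encoded value usable as a join key.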

Edit distance matching finds the closest match based on Levenshtein similarity, calculated by RecordLinkage::levenshteinSim.
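
To make the similarity measure concrete, here is a minimal base R sketch that uses utils::adist as a stand-in for RecordLinkage::levenshteinSim. The definition used, one minus the edit distance divided by the length of the longer string, is the usual one and is assumed to agree with that function. lev_sim() is a hypothetical helper and the misspelled query is invented.

```r
# Hypothetical helper: Levenshtein similarity of one query against candidates.
lev_sim <- function(query, candidates) {
  d <- as.vector(utils::adist(query, candidates))  # Levenshtein distances
  1 - d / pmax(nchar(query), nchar(candidates))    # scale by the longer string
}

candidates <- c("Avena hybrida", "Avena sativa", "Citrus garrawayi")
query <- "Avena hybrda"  # invented typo: one missing letter

sims <- lev_sim(query, candidates)
best <- candidates[which.max(sims)]

print(data.frame(candidate = candidates, similarity = round(sims, 3)))
print(best)
```

One missing letter against a 13-character name gives a similarity of 1 - 1/13, so the intended name ranks first.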

See also

Other name matching functions: wcvp_match_exact(), wcvp_match_names()

Examples

# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  wcvp_match_fuzzy(redlist_example, wcvp_names, "scientificName")
}
#> Matching ■■■■■■■                           20% ETA  9s
#> Matching ■■■■■■■■■■■■■■■■■■■               60% ETA  7s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■         80% ETA  3s
#> # A tibble: 23 × 16
#>    assessmentId scientificName          redlistCategory authority     match_type
#>           <dbl> <chr>                   <chr>           <chr>         <chr>     
#>  1     19395021 Avena hybrida           Data Deficient  Peterm.       Fuzzy (ph…
#>  2     64135503 Citrus garrawayi        Least Concern   F.M.Bailey    Fuzzy (ph…
#>  3    189601563 Croton campanulatus     Endangered      Caruzo & Cor… Fuzzy (ph…
#>  4    115968141 Cynanchum freemani      Endangered      (N.E.Br.) Wo… Fuzzy (ph…
#>  5     11047751 Echinacanthus longipes  Vulnerable      H.S.Lo & D. … Fuzzy (ph…
#>  6    126598076 Juglans pyriformis      Endangered      Liebm.        Fuzzy (ph…
#>  7    126598076 Juglans pyriformis      Endangered      Liebm.        Fuzzy (ph…
#>  8    198678856 Leichhardtia variifolia Vulnerable      (Guillaumin)… Fuzzy (ph…
#>  9    135836392 Mouriri myrtilloides    Least Concern   (Sw.) Poir.   Fuzzy (ph…
#> 10    146459149 Neocussonia umbellifera Least Concern   (Sond.) Hutc… Fuzzy (ph…
#> # ℹ 13 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> #   match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> #   wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> #   wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>


# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  phonetic_match(redlist_example, wcvp_names, "scientificName")
}
#> # A tibble: 24 × 16
#>    assessmentId scientificName           redlistCategory authority    match_type
#>           <dbl> <chr>                    <chr>           <chr>        <chr>     
#>  1     11081542 Antimima quartzitica     Least Concern   (Dinter) H.… NA        
#>  2     19395021 Avena hybrida            Data Deficient  Peterm.      Fuzzy (ph…
#>  3     64135503 Citrus garrawayi         Least Concern   F.M.Bailey   Fuzzy (ph…
#>  4    189601563 Croton campanulatus      Endangered      Caruzo & Co… Fuzzy (ph…
#>  5    115968141 Cynanchum freemani       Endangered      (N.E.Br.) W… Fuzzy (ph…
#>  6     11047751 Echinacanthus longipes   Vulnerable      H.S.Lo & D.… Fuzzy (ph…
#>  7     11001316 Geissanthus pinchinchana Endangered      (Lundell) P… NA        
#>  8    126598076 Juglans pyriformis       Endangered      Liebm.       Fuzzy (ph…
#>  9    126598076 Juglans pyriformis       Endangered      Liebm.       Fuzzy (ph…
#> 10    198678856 Leichhardtia variifolia  Vulnerable      (Guillaumin… Fuzzy (ph…
#> # ℹ 14 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> #   match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> #   wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> #   wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>


# this example requires 'rWCVPdata'
if (requireNamespace("rWCVPdata")) {
  wcvp_names <- rWCVPdata::wcvp_names
  edit_match(redlist_example, wcvp_names, "scientificName")
}
#> Matching ■■■                                5% ETA 48s
#> Matching ■■■■■                             15% ETA 39s
#> Matching ■■■■■■■                           20% ETA 36s
#> Matching ■■■■■■■■■■                        30% ETA 32s
#> Matching ■■■■■■■■■■■                       35% ETA 31s
#> Matching ■■■■■■■■■■■■■                     40% ETA 27s
#> Matching ■■■■■■■■■■■■■■■                   45% ETA 25s
#> Matching ■■■■■■■■■■■■■■■■■                 55% ETA 20s
#> Matching ■■■■■■■■■■■■■■■■■■■■■             65% ETA 16s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■            70% ETA 14s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■         80% ETA  9s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■■■■      90% ETA  4s
#> Matching ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■     95% ETA  2s
#> # A tibble: 23 × 16
#>    assessmentId scientificName           redlistCategory authority    match_type
#>           <dbl> <chr>                    <chr>           <chr>        <chr>     
#>  1     11081542 Antimima quartzitica     Least Concern   (Dinter) H.… Fuzzy (ed…
#>  2     19395021 Avena hybrida            Data Deficient  Peterm.      Fuzzy (ed…
#>  3     64135503 Citrus garrawayi         Least Concern   F.M.Bailey   Fuzzy (ed…
#>  4    189601563 Croton campanulatus      Endangered      Caruzo & Co… Fuzzy (ed…
#>  5    115968141 Cynanchum freemani       Endangered      (N.E.Br.) W… Fuzzy (ed…
#>  6     11047751 Echinacanthus longipes   Vulnerable      H.S.Lo & D.… Fuzzy (ed…
#>  7     11001316 Geissanthus pinchinchana Endangered      (Lundell) P… Fuzzy (ed…
#>  8    126598076 Juglans pyriformis       Endangered      Liebm.       Fuzzy (ed…
#>  9    126598076 Juglans pyriformis       Endangered      Liebm.       Fuzzy (ed…
#> 10    198678856 Leichhardtia variifolia  Vulnerable      (Guillaumin… Fuzzy (ed…
#> # ℹ 13 more rows
#> # ℹ 11 more variables: multiple_matches <lgl>, match_similarity <dbl>,
#> #   match_edit_distance <dbl>, wcvp_id <dbl>, wcvp_name <chr>,
#> #   wcvp_authors <chr>, wcvp_rank <chr>, wcvp_status <chr>,
#> #   wcvp_homotypic <lgl>, wcvp_ipni_id <chr>, wcvp_accepted_id <dbl>