Title: | Turn Clean Data into Messy Data |
---|---|
Description: | Take real or simulated data and salt it with errors commonly found in the wild, such as pseudo-OCR errors, Unicode problems, numeric fields with nonsensical punctuation, bad dates, etc. |
Authors: | Matthew Lincoln [aut, cre] |
Maintainer: | Matthew Lincoln <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2024-11-21 06:06:19 UTC |
Source: | https://github.com/mdlincoln/salty |
Access the original source vector for a given shaker function
inspect_shaker(f)
f | A shaker function |
A character vector
inspect_shaker(shaker$punctuation)
Sample a proportion of indices of a vector
p_indices(x, p)
x | A vector |
p | A numeric probability between 0 and 1 |
An integer vector of indices.
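To make the idea of sampling a proportion of indices concrete, here is a minimal base R sketch. The helper name `sample_proportion` and its rounding rule are assumptions made for illustration; this is not the package's implementation of p_indices.

```r
# Hypothetical helper (not part of salty): pick roughly a proportion p
# of the positions in x, without replacement.
sample_proportion <- function(x, p) {
  stopifnot(is.numeric(p), length(p) == 1, p >= 0, p <= 1)
  n_pick <- round(length(x) * p)  # assumed rounding rule
  sort(sample.int(length(x), size = n_pick))
}

set.seed(42)  # sampling is random; fix the seed for a repeatable run
idx <- sample_proportion(letters, p = 0.5)

# The selected positions can then be used to edit only part of the vector
edited <- letters
edited[idx] <- toupper(edited[idx])
```

The salting functions in this manual follow the same general pattern: choose a subset of values according to p, then modify only those.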
These are easy-to-use wrapper functions that call either salt_insert (for including new characters) or salt_replace (for salting that requires replacement of specific characters) with sane defaults.
salt_punctuation(x, p = 0.2, n = 1)
salt_letters(x, p = 0.2, n = 1)
salt_whitespace(x, p = 0.2, n = 1)
salt_digits(x, p = 0.2, n = 1)
salt_ocr(x, p = 0.2, rep_p = 0.1)
salt_capitalization(x, p = 0.1, rep_p = 0.1)
salt_decimal_commas(x, p = 0.1, rep_p = 0.1)
x | A vector. This will always be coerced to character during salting. |
p | A number between 0 and 1. Proportion of values in x to salt. |
n | A positive integer. Number of times to add new values to each selected value in x. |
rep_p | A number between 0 and 1. Probability that a given match should be replaced in one of the selected values. |
For more fine-grained control over how characters are added, see the documentation for salt_insert, salt_substitute, salt_replace, and salt_delete.
salt_punctuation(): Punctuation characters
salt_letters(): Upper- and lower-case letters
salt_whitespace(): Spaces
salt_digits(): 0-9
salt_ocr(): Replace some substrings with common OCR problems
salt_capitalization(): Flip capitalization of letters
salt_decimal_commas(): Flip decimals to commas and vice versa
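As a rough illustration of the wrappers listed above, the following sketch salts a small vector of price strings. It assumes the salty package is installed; the `prices` vector is made up for the example, and because salting is random no particular output is shown.

```r
library(salty)

set.seed(1)  # salting is random; fix the seed for a repeatable run
prices <- c("1,234.50", "19.99", "250.00", "3.75")

# Insert one punctuation character into some of the values
with_punct <- salt_punctuation(prices, p = 0.5, n = 1)

# Flip decimal points and commas in some of the values
with_commas <- salt_decimal_commas(prices, p = 0.5, rep_p = 0.5)

# Each wrapper returns a character vector, so calls can be nested
messy <- salt_whitespace(salt_digits(prices, p = 0.5), p = 0.5)

print(data.frame(prices, with_punct, with_commas, messy))
```

Printing the original and salted columns side by side is a quick way to see which values were selected.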
Delete some characters from some values
salt_delete(x, p = 0.2, n = 1)
x | A vector. This will always be coerced to character during salting. |
p | A number between 0 and 1. Proportion of values in x to salt. |
n | A positive integer. Number of characters to delete from each selected value in x. |
A character vector the same length as x
x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")
salt_delete(x, p = 0.5, n = 5)
salt_empty(x, p = 0.5)
salt_na(x, p = 0.5)
Inserts a selection of characters into a percentage of values in the supplied vector.
salt_insert(x, insertions, p = 0.2, n = 1)
x | A vector. This will always be coerced to character during salting. |
insertions | A shaker function, or a character vector. |
p | A number between 0 and 1. Proportion of values in x to salt. |
n | A positive integer. Number of times to add new values from insertions to each selected value in x. |
A character vector the same length as x
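Since insertions may be either a shaker function or a plain character vector, a short sketch of both forms may help. It assumes the salty package is installed; the `ids` and `stray` vectors are invented for illustration.

```r
library(salty)

set.seed(2)  # salting is random; fix the seed for a repeatable run
ids <- c("A-1001", "B-2002", "C-3003", "D-4004")

# Built-in shaker: insert two punctuation characters into some values
from_shaker <- salt_insert(ids, shaker$punctuation, p = 0.5, n = 2)

# Custom character vector: insert one of these characters into some values
stray <- c("~", "#", "?")
from_vector <- salt_insert(ids, stray, p = 0.5, n = 1)

print(from_shaker)
print(from_vector)
```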
Remove entire values from a vector
salt_na(x, p = 0.2)
salt_empty(x, p = 0.2)
x | A vector |
p | A number between 0 and 1. Proportion of values to edit. |
A vector the same length as x
Replaces matched patterns in some values of x. Pair salt_replace with the named vectors in replacement_shaker, or supply your own named vector of replacements. The convenience functions salt_ocr and salt_capitalization are light wrappers around salt_replace.
salt_replace(x, replacements, p = 0.1, rep_p = 0.5)
x | A vector. This will always be coerced to character during salting. |
replacements | A replacement_shaker function, or a named character vector of patterns and replacements. |
p | A number between 0 and 1. Proportion of values in x to salt. |
rep_p | A number between 0 and 1. Probability that a given match should be replaced in one of the selected values. |
A character vector the same length as x
x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")
salt_replace(x, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)
salt_ocr(x, p = 1, rep_p = 0.5)
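The description mentions supplying your own named vector of replacements. The sketch below tries that with a made-up vector, `my_replacements`. It assumes the salty package is installed and, as a further assumption not confirmed by this manual, that the names act as the patterns to match and the values as their replacements.

```r
library(salty)

set.seed(3)  # salting is random; fix the seed for a repeatable run
words <- c("hello world", "illegal retrieval", "minimum billing")

# Hypothetical replacement set.
# Assumption: names are the patterns, values are the replacements.
my_replacements <- c("l" = "1", "o" = "0")

# p = 1 selects every value; rep_p controls how likely each match is replaced
salted_words <- salt_replace(words, my_replacements, p = 1, rep_p = 0.5)

print(salted_words)
```

If the output shows no changes, the name and value roles may be the other way around; inspecting an entry of replacement_shaker should settle which convention applies.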
Substitute certain characters in a vector
salt_substitute(x, substitutions, p = 0.2, n = 1)
x | A vector. This will always be coerced to character during salting. |
substitutions | Values to be substituted in |
p | A number between 0 and 1. Proportion of values in x to salt. |
n | A positive integer. Number of times to add new values from substitutions to each selected value in x. |
A character vector the same length as x
x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")
salt_substitute(x, shaker$digits, p = 0.5, n = 5)
Because swaps can be provided by either a character vector or a function that returns a character vector, salt_swap can be fruitfully used in conjunction with the charlatan package to intersperse real data with simulated data.
salt_swap(x, swaps, p = 0.2)
x | A vector. This will always be coerced to character during salting. |
swaps | Values to be swapped out |
p | A number between 0 and 1. Proportion of values in x to salt. |
A character vector the same length as x
x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")
new_values <- c("foo", "bar", "baz")
salt_swap(x, swaps = new_values, p = 0.5)
Insert, delete, replace, and substitute bits of your data with messy values.
Convenient wrappers such as salt_punctuation are provided for quick access
to this package's functionality with simple defaults. For more fine-grained
control, use one of the underlying salt_ functions:
salt_insert will insert new characters into some of the values of x. All the original characters of the original values will be maintained.
salt_substitute will substitute some characters in some of the values of x in place of some of the original characters.
salt_replace will replace some characters in some of the values of x. Unlike salt_substitute, salt_replace does conditional replacement dependent on the original values of x, such as changing capitalization or simulating OCR errors based on certain character combinations.
salt_delete will remove some characters in the values of x.
salt_na and salt_empty will replace some values of x with NA or with empty strings.
salt_swap replaces entire values of x with new strings.
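The differences between the underlying functions are easiest to see side by side. The sketch below applies several of them to the same small vector and compares character counts. It assumes the salty package is installed; the `callsigns` vector is invented for the example, and no specific output is claimed because the salting is random.

```r
library(salty)

set.seed(4)  # salting is random; fix the seed for a repeatable run
callsigns <- c("alpha", "bravo", "charlie", "delta")

inserted    <- salt_insert(callsigns, shaker$digits, p = 0.5, n = 1)
substituted <- salt_substitute(callsigns, shaker$digits, p = 0.5, n = 1)
deleted     <- salt_delete(callsigns, p = 0.5, n = 1)
blanked     <- salt_empty(callsigns, p = 0.5)
missing     <- salt_na(callsigns, p = 0.5)

# Compare character counts to see how each operation changed the values
print(data.frame(
  original    = nchar(callsigns),
  inserted    = nchar(inserted),
  substituted = nchar(substituted),
  deleted     = nchar(deleted)
))

# Whole-value operations leave either an empty string or NA behind
print(blanked)
print(missing)
```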
shaker contains various character sets to be added to your data using salt_insert and salt_substitute. replacement_shaker is for salt_replace, and contains pairlists that replace matched patterns in your data.
shaker
replacement_shaker
available_shakers()
An object of class list of length 6.
An object of class list of length 3.
A sampling function that will be called by salt_insert, salt_substitute, or salt_replace.
salt_insert(letters, shaker$punctuation)
available_shakers()