Title: | Data Augmentation for Private Posterior Estimation |
---|---|
Description: | A data augmentation based sampler for conducting privacy-aware Bayesian inference. The dapper_sample() function takes an existing sampler as input and automatically constructs a privacy-aware sampler. The process of constructing a sampler is simplified through the specification of four independent modules, allowing for easy comparison between different privacy mechanisms by only swapping out the relevant modules. Probability mass functions for the discrete Gaussian and discrete Laplacian are provided to facilitate analyses dealing with privatized count data. The output of dapper_sample() can be analyzed using many of the same tools from the 'rstan' ecosystem. For methodological details on the sampler see Ju et al. (2022) <doi:10.48550/arXiv.2206.00710>, and for details on the discrete Gaussian and discrete Laplacian distributions see Canonne et al. (2020) <doi:10.48550/arXiv.2004.00010>. |
Authors: | Kevin Eng [aut, cre, cph] |
Maintainer: | Kevin Eng <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2024-10-29 04:17:40 UTC |
Source: | https://github.com/mango-empire/dapper |
Generates samples from the private posterior using a data augmentation framework.
dapper_sample( data_model = NULL, sdp = NULL, init_par = NULL, seed = NULL, niter = 2000, warmup = floor(niter/2), chains = 1 )
data_model |
a data model represented by a privacy class object. |
sdp |
the observed privatized data. Must be a vector or matrix. |
init_par |
initial starting point of the chain. |
seed |
set random seed. |
niter |
number of draws. |
warmup |
number of iterations to discard as warmup. Default is half of niter. |
chains |
number of MCMC chains to run. Can be done in parallel or sequentially. |
Generates samples from the private posterior implied by data_model. The data_model input must be an object of class privacy, which is created using the new_privacy() constructor. MCMC chains can be run in parallel using furrr::future_map(). See the furrr package documentation for specifics. Long computations can be monitored with the progressr package.
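To make the data augmentation idea concrete, the following base R sketch runs a Metropolis-within-Gibbs loop of the kind described by Ju et al. (2022) for a normal mean with a privatized sample mean. It is an illustration only, not the package's internal code: the names log_priv, draw_theta and the loop structure are assumptions made for this example.

```r
# Illustrative data augmentation sampler (not dapper's implementation).
set.seed(1)
n <- 50
eps <- 3
y <- rnorm(n, mean = -2, sd = 1)
sdp <- mean(y) + rnorm(1, 0, 1 / eps)  # privatized sample mean

# log density of the privacy mechanism given the sum of the latent records
log_priv <- function(sdp, sx) dnorm(sdp, mean = sx / n, sd = 1 / eps, log = TRUE)

# conjugate draw of theta given latent data (N(0, 4) prior, unit variance)
draw_theta <- function(x) {
  ps_s2 <- 1 / (1 / 4 + length(x))
  rnorm(1, mean = ps_s2 * sum(x), sd = sqrt(ps_s2))
}

niter <- 300
theta <- numeric(niter)
theta_cur <- 0
x <- rnorm(n, mean = theta_cur, sd = 1)  # initial latent database
sx <- sum(x)
for (t in seq_len(niter)) {
  for (i in seq_len(n)) {
    # propose a replacement record from the data model
    prop <- rnorm(1, mean = theta_cur, sd = 1)
    sx_prop <- sx - x[i] + prop
    # the acceptance ratio involves only the privacy mechanism density
    if (log(runif(1)) < log_priv(sdp, sx_prop) - log_priv(sdp, sx)) {
      x[i] <- prop
      sx <- sx_prop
    }
  }
  theta_cur <- draw_theta(x)
  theta[t] <- theta_cur
}
post_draws <- theta[-seq_len(niter / 2)]  # discard the first half as warmup
mean(post_draws)
```

Because each record is proposed from the data model itself, the model density cancels in the acceptance ratio, which is why only the privacy mechanism needs to be evaluated in the inner loop.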
A dpout object which contains:
* chain: a draw_matrix object containing niter - warmup draws from the private posterior.
* accept_prob: a matrix with (niter - warmup) rows containing acceptance probabilities. Each column corresponds to a parameter.
Ju, N., Awan, J. A., Gong, R., & Rao, V. A. (2022). Data Augmentation MCMC for Bayesian Inference from Privatized Data. arXiv. doi:10.48550/ARXIV.2206.00710
# simulate confidential data
# privacy mechanism adds Gaussian noise to the sample mean
set.seed(1)
n <- 100
eps <- 3
y <- rnorm(n, mean = -2, sd = 1)
sdp <- mean(y) + rnorm(1, 0, 1/eps)

post_f <- function(dmat, theta) {
  x <- c(dmat)
  xbar <- mean(x)
  n <- length(x)
  pr_m <- 0
  pr_s2 <- 4
  ps_s2 <- 1/(1/pr_s2 + n)
  ps_m <- ps_s2 * ((1/pr_s2)*pr_m + n * xbar)
  rnorm(1, mean = ps_m, sd = sqrt(ps_s2))
}
latent_f <- function(theta) {
  matrix(rnorm(100, mean = theta, sd = 1), ncol = 1)
}
st_f <- function(xi, sdp, i) {
  xi
}
priv_f <- function(sdp, sx) {
  sum(dnorm(sdp - sx/n, 0, 1/eps, TRUE))
}
dmod <- new_privacy(post_f = post_f,
                    latent_f = latent_f,
                    priv_f = priv_f,
                    st_f = st_f,
                    npar = 1)

out <- dapper_sample(dmod, sdp = sdp, init_par = -2, niter = 500)
summary(out)

# for parallel computing we 'plan' a session
# the code below uses 2 CPU cores for parallel computing
library(furrr)
plan(multisession, workers = 2)
out <- dapper_sample(dmod, sdp = sdp, init_par = -2, niter = 500, chains = 2)

# to go back to sequential computing we use
plan(sequential)
The probability mass function and random number generator for the discrete Laplacian distribution.
ddlaplace(x, scale = 1, log = FALSE) rdlaplace(n, scale = 1)
x |
a vector of quantiles. |
scale |
the scale parameter. |
log |
logical; if TRUE, probabilities are returned on the log scale. |
n |
number of random deviates. |
Probability mass function
ddlaplace() returns a numeric vector representing the probability mass function of the discrete Laplace distribution.
rdlaplace() returns a numeric vector of random samples from the discrete Laplace distribution.
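For reference, the discrete Laplace mass function with scale t, as defined in Canonne et al. (2020), is P(X = x) = (exp(1/t) - 1) / (exp(1/t) + 1) * exp(-|x| / t) for integer x. The sketch below evaluates this formula in base R. The helper dlaplace_sketch is hypothetical and is not the package's ddlaplace(); it assumes integer inputs and assumes the scale argument plays the role of t.

```r
# Hypothetical helper (not dapper::ddlaplace): discrete Laplace pmf on integers.
dlaplace_sketch <- function(x, scale = 1, log = FALSE) {
  # log of the normalizing constant (exp(1/t) - 1) / (exp(1/t) + 1)
  log_const <- log(expm1(1 / scale)) - log(exp(1 / scale) + 1)
  log_p <- log_const - abs(x) / scale
  if (log) log_p else exp(log_p)
}

# the mass over a wide integer range should sum to (numerically) one
support <- -200:200
total_mass <- sum(dlaplace_sketch(support, scale = 5))
total_mass
```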
Canonne, C. L., Kamath, G., & Steinke, T. (2020). The Discrete Gaussian for Differential Privacy. arXiv. doi:10.48550/ARXIV.2004.00010
# mass function
ddlaplace(0)

# mass function is vectorized
ddlaplace(0:10, scale = 5)

# generate random samples
rdlaplace(10)
The probability mass function and random number generator for the discrete Gaussian distribution with mean mu and scale parameter sigma.
ddnorm(x, mu = 0, sigma = 1, log = FALSE) rdnorm(n, mu = 0, sigma = 1)
x |
vector of quantiles. |
mu |
location parameter. |
sigma |
scale parameter. |
log |
logical; if TRUE, probabilities are returned on the log scale. |
n |
number of random deviates. |
Probability mass function
ddnorm() returns a numeric vector representing the probability mass function of the discrete Gaussian distribution.
rdnorm() returns a numeric vector of random samples from the discrete Gaussian distribution.
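The discrete Gaussian of Canonne et al. (2020) places mass proportional to exp(-(x - mu)^2 / (2 sigma^2)) on each integer x. The normalizing constant has no closed form, so the base R sketch below approximates it by summing over a truncated integer grid. The helper dgauss_sketch and its half_width argument are hypothetical, chosen for illustration; this is not how ddnorm() is necessarily implemented.

```r
# Hypothetical helper (not dapper::ddnorm): discrete Gaussian pmf on integers.
dgauss_sketch <- function(x, mu = 0, sigma = 1, half_width = 10) {
  # approximate the normalizing sum using integers within half_width * sigma of mu
  grid <- seq(floor(mu - half_width * sigma), ceiling(mu + half_width * sigma))
  norm_const <- sum(exp(-(grid - mu)^2 / (2 * sigma^2)))
  exp(-(x - mu)^2 / (2 * sigma^2)) / norm_const
}

probs <- dgauss_sketch(-60:60, mu = 0, sigma = 5)
sum(probs)  # close to one
```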
Canonne, C. L., Kamath, G., & Steinke, T. (2020). The Discrete Gaussian for Differential Privacy. arXiv. doi:10.48550/ARXIV.2004.00010
# mass function
ddnorm(0)

# mass function is also vectorized
ddnorm(0:10, mu = 0, sigma = 5)

# generate random samples
rdnorm(10)
privacy Object Constructor. Creates a privacy object to be used as input into dapper_sample().
new_privacy( post_f = NULL, latent_f = NULL, priv_f = NULL, st_f = NULL, npar = NULL, varnames = NULL )
post_f |
a function that draws posterior samples given the confidential data. |
latent_f |
a function that represents the latent data sampling model. |
priv_f |
a function that represents the log likelihood of the privacy mechanism. |
st_f |
a function that calculates the statistic to be released. |
npar |
dimension of the parameter being estimated. |
varnames |
an optional character vector of parameter names. Used to label summary outputs. |
post_f() is a function which makes draws from the posterior sampler. It has the syntax post_f(dmat, theta). Here dmat is a numeric matrix representing the confidential database and theta is a numeric vector which serves as the initialization point for a one-sample draw. The easiest and least error-prone way to construct post_f() is to use a conjugate prior. However, this function can also be constructed by wrapping an MCMC sampler generated from other R packages (e.g. rstan, fmcmc, adaptMCMC).
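As an illustration of the conjugate-prior route, here is a minimal base R sketch of a post_f() for binary records with a Beta(1, 1) prior on the success probability. The function name post_f_beta and the prior are assumptions made for this example. Because the conjugate draw is exact, theta is accepted to match the required signature but is not used.

```r
# Hypothetical post_f for Bernoulli records with a Beta(1, 1) prior.
post_f_beta <- function(dmat, theta) {
  x <- c(dmat)
  # Beta-Bernoulli conjugate update: successes and failures added to the prior
  rbeta(1, shape1 = 1 + sum(x), shape2 = 1 + length(x) - sum(x))
}

set.seed(42)
dmat <- matrix(rbinom(200, size = 1, prob = 0.3), ncol = 1)
draws <- replicate(1000, post_f_beta(dmat, theta = 0.5))
mean(draws)  # close to the sample proportion mean(dmat)
```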
priv_f() is a function that represents the log of the privacy mechanism density. This function has the form priv_f(sdp, sx), where sdp and sx are both either a numeric vector or a matrix. The arguments must appear in the exact stated order with the same variable names as mentioned. Finally, the return value of priv_f() must be a numeric vector of length one.
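A minimal sketch of a priv_f() for a Laplace mechanism is shown below. The privacy budget eps, the sensitivity sens, and the function name priv_f_laplace are hypothetical values chosen for illustration. The function sums the log Laplace densities over all released statistics so that it returns a single number, as required.

```r
# Hypothetical privacy budget and sensitivity for the released statistics.
eps <- 1
sens <- 1

# log density of sdp given sx under independent Laplace noise with scale sens / eps
priv_f_laplace <- function(sdp, sx) {
  b <- sens / eps
  sum(-log(2 * b) - abs(sdp - sx) / b)
}

val <- priv_f_laplace(sdp = c(10, 4), sx = c(9, 6))
val
```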
st_f() is a function which calculates a summary statistic. It has the syntax st_f(xi, sdp, i), where the three arguments must appear in the stated order. The role of this function is to represent terms in the definition of record additivity. Here i is an integer, while xi is a numeric vector and sdp is a numeric vector or matrix.
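Under record additivity, the privacy mechanism depends on the confidential data only through a sum of per-record terms. The base R sketch below shows one way such terms could combine into the sx argument of priv_f(). It is an assumption about the aggregation for illustration purposes, not a description of the package internals; st_f_clamp, the clamping bounds, and the placeholder sdp value are hypothetical.

```r
# Hypothetical st_f: clamp each record to [0, 1] before it enters the sum.
st_f_clamp <- function(xi, sdp, i) {
  pmin(pmax(xi, 0), 1)
}

dmat <- matrix(c(-0.5, 0.2, 0.7, 1.4), ncol = 1)  # toy confidential database
sdp <- 2.1                                        # placeholder released value

# one term per record (row of dmat), then add the terms together
terms <- lapply(seq_len(nrow(dmat)), function(i) st_f_clamp(dmat[i, ], sdp, i))
sx <- Reduce(`+`, terms)
sx
```

Here the four clamped records are 0, 0.2, 0.7 and 1, so sx is their sum.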
npar is an integer equal to the dimension of theta.
An S3 object of class privacy.
Plot dpout object.
## S3 method for class 'dpout' plot(x, ...)
x |
dpout object. |
... |
optional arguments to |
trace plots.
Summarise dpout object.
## S3 method for class 'dpout' summary(object, ...)
object |
dpout object. |
... |
optional arguments to |
a summary table of MCMC statistics.