| Title: | Distributed Online Goodness-of-Fit Tests for Distributed Datasets |
|---|---|
| Description: | Distributed online goodness-of-fit tests for processing distributed datasets. The philosophy of the package is described in Guo G. (2024) <doi:10.1016/j.apm.2024.115709>. |
| Authors: | Guangbao Guo [aut, cre] (ORCID: <https://orcid.org/0000-0002-4115-6218>), Di Chang [aut] |
| Maintainer: | Guangbao Guo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3 |
| Built: | 2026-05-25 08:40:14 UTC |
| Source: | https://github.com/cran/Dogoftest |
Performs a two-sample Anderson-Darling (AD) goodness-of-fit test using bootstrap resampling to compare whether two samples come from the same distribution. This test is sensitive to differences in both location and shape between the two distributions.
    AD2gof(x, y, alternative = c("two.sided", "less", "greater"), nboots = 2000, keep.boots = FALSE)
| Argument | Description |
|---|---|
| x | A numeric vector of data values from the first sample. |
| y | A numeric vector of data values from the second sample. |
| alternative | Character string specifying the alternative hypothesis. One of "two.sided" (default), "less", or "greater". |
| nboots | Integer. Number of bootstrap replicates to compute the null distribution (default: 2000). |
| keep.boots | Logical. If TRUE, returns the full vector of bootstrap statistics (default: FALSE). |
The test computes the Anderson-Darling statistic using the pooled empirical distribution functions (ECDFs) of the two samples. A bootstrap procedure resamples the group labels to approximate the null distribution and compute a p-value. If p.value = 0, it is adjusted to 1 / (2 * nboots) for stability.
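The label-resampling idea can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names `ad2_stat` and `perm_test` are hypothetical, and the statistic uses one common form of the two-sample Anderson-Darling statistic (squared ECDF differences weighted by the pooled ECDF).

```r
# Two-sample Anderson-Darling statistic from ECDFs evaluated on the pooled sample.
# (hypothetical helper; one common form of the statistic)
ad2_stat <- function(x, y) {
  n <- length(x); m <- length(y); N <- n + m
  z <- sort(c(x, y))
  Fx <- ecdf(x)(z)
  Gy <- ecdf(y)(z)
  H <- ecdf(z)(z)                 # pooled ECDF
  keep <- H < 1                   # the weight 1 / (H * (1 - H)) is undefined at H = 1
  (n * m / N^2) * sum((Fx[keep] - Gy[keep])^2 / (H[keep] * (1 - H[keep])))
}

# Permutation p-value: reshuffle the group labels and recompute the statistic.
perm_test <- function(x, y, stat_fun, nboots = 500) {
  observed <- stat_fun(x, y)
  pooled <- c(x, y)
  n <- length(x)
  boots <- replicate(nboots, {
    idx <- sample(length(pooled), n)
    stat_fun(pooled[idx], pooled[-idx])
  })
  p <- mean(boots >= observed)
  if (p == 0) p <- 1 / (2 * nboots)   # floor described in the text
  list(statistic = observed, p.value = p)
}

set.seed(123)
x <- rnorm(100, mean = 0, sd = 4)
y <- rnorm(100, mean = 2, sd = 4)
res <- perm_test(x, y, ad2_stat)
res$p.value
```

Passing the statistic as a function keeps the resampling loop reusable for other two-sample statistics.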
A list of class "htest" containing:
The observed Anderson-Darling test statistic.
The estimated bootstrap p-value.
The alternative hypothesis used.
A character string describing the test.
(Optional) A numeric vector of bootstrap statistics if keep.boots = TRUE.
    set.seed(123)
    x <- rnorm(100, mean = 0, sd = 4)
    y <- rnorm(100, mean = 2, sd = 4)
    AD2gof(x, y)
Performs the Anderson-Darling (AD) goodness-of-fit test for a given univariate distribution. The function computes the AD statistic and returns an approximate p-value based on adjusted formulas.
    ADgof(x, dist = c("norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", "chisq"), ..., eps = 1e-15)
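| Argument | Description |
|---|---|
| x | A numeric vector of sample observations. |
| dist | A character string specifying the null distribution. Options are "norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", and "chisq". |
| ... | Additional named parameters passed to the corresponding distribution functions (e.g., mean = 5, sd = 2 for dist = "norm"). |
| eps | A small positive constant to avoid log(0) during computation (default: 1e-15). |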
This implementation supports several common distributions. Parameters of the null distribution
must be supplied via the ... argument. The p-value is calculated using the approximations suggested
by Stephens (1986) and other refinements. For small samples or custom distributions, a bootstrap
version may be preferred.
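To make the role of eps concrete, here is a minimal base R sketch of the standard one-sample Anderson-Darling statistic for a fully specified null CDF. The helper name `ad_stat` is hypothetical and the p-value approximation is not reproduced; the sketch only shows how clamping the fitted probabilities to [eps, 1 - eps] keeps the logarithms finite.

```r
# One-sample Anderson-Darling statistic (hypothetical helper, statistic only):
# A^2 = -n - (1/n) * sum_i (2i - 1) * [log(u_(i)) + log(1 - u_(n+1-i))]
ad_stat <- function(x, cdf, ..., eps = 1e-15) {
  n <- length(x)
  u <- cdf(sort(x), ...)              # null CDF at the ordered sample
  u <- pmin(pmax(u, eps), 1 - eps)    # clamp so that log() stays finite
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
}

set.seed(123)
x <- rnorm(500, mean = 5, sd = 2)
ad_stat(x, pnorm, mean = 5, sd = 2)   # small when the null is correct
ad_stat(x, pnorm, mean = 0, sd = 1)   # much larger under a wrong null
```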
A list of class "htest" with components:
The value of the Anderson-Darling test statistic.
The approximate p-value computed using adjustment formulas.
A description of the test performed.
The name of the input data.
    set.seed(123)
    x1 <- rnorm(500, mean = 5, sd = 2)
    ADgof(x1, dist = "norm", mean = 5, sd = 2)
    x2 <- rexp(400, rate = 1.5)
    ADgof(x2, dist = "exp")
    ADgof(x2, dist = "exp", rate = 1.5)
    x3 <- runif(300, min = -2, max = 4)
    ADgof(x3, dist = "unif", min = -2, max = 4)
psi21k, psi26k, and psi31k are from Birnbaum and Saunders (1969).
They give the fatigue lifetimes of aluminum specimens exposed to a maximum stress of 21,000 psi, 26,000 psi, and 31,000 psi,
respectively.
bearings is from McCool (1974). The fatigue lifetimes (in hours) of ten bearings.
fatigue is from Brown and Miller (1978). The fatigue lifetimes of cylindrical specimens
subjected to combined torsional and axial loads over constant-amplitude cycles until failure.
repair is from Hsieh (1990). This is a maintenance data set
on active repair times (in hours) for an airborne communications transceiver.
    data(BSdata)
Birnbaum, Z. W. and Saunders, S. C. (1969). A new family of life distributions. J. Appl. Probab. 6(2): 637-652.
McCool, J. I. (1974). Inferential techniques for Weibull populations. Aerospace Research Laboratories Report ARL TR74-0180, Wright-Patterson Air Force Base, Dayton, OH.
Rieck, J. R. and Nedelman, J. (1991). A Log-Linear Model for the Birnbaum-Saunders Distribution. Technometrics. 33, 51-60.
Brown, M. W. and Miller, K. J. (1978). Biaxial Fatigue Data. Report CEMR1/78. University of Sheffield, Dept. of Mechanical Engineering.
Hsieh, H. K. (1990). Estimating the Critical Time of Inverse Gaussian Hazard Rate. IEEE Transactions on Reliability, 39(10): 342-345.
    # Attach data sets
    data(BSdata)
Performs a nonparametric two-sample Cramér–von Mises test using a permutation-based bootstrap method to assess whether two samples come from the same distribution.
    CVM2gof(x, y, alternative = c("two.sided", "less", "greater"), nboots = 2000, keep.boots = FALSE)
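| Argument | Description |
|---|---|
| x | Numeric vector of observations from the first sample. |
| y | Numeric vector of observations from the second sample. |
| alternative | Character string specifying the alternative hypothesis. Must be one of "two.sided" (default), "less", or "greater". |
| nboots | Number of bootstrap replicates to approximate the null distribution (default: 2000). |
| keep.boots | Logical. If TRUE, the vector of bootstrap test statistics is returned (default: FALSE). |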
The test compares the empirical cumulative distribution functions (ECDFs) of the two samples. The bootstrap procedure permutes group labels to generate the null distribution. The one-sided alternatives use one-sided squared differences of the ECDFs.
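As a rough illustration of "one-sided squared differences", the sketch below keeps only the positive or only the negative part of the ECDF difference before squaring. The helper name `cvm2_stat`, the scaling factor, and the mapping of "greater" to the positive part are assumptions made for this example; the package may use a different convention.

```r
# Two-sample Cramer-von Mises type statistic with one-sided variants
# (hypothetical helper; sign convention is an assumption).
cvm2_stat <- function(x, y, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n <- length(x); m <- length(y)
  z <- c(x, y)                          # evaluate both ECDFs on the pooled sample
  d <- ecdf(x)(z) - ecdf(y)(z)
  d <- switch(alternative,
    two.sided = d,
    greater   = pmax(d, 0),             # keep only positive differences
    less      = pmin(d, 0))             # keep only negative differences
  n * m / (n + m)^2 * sum(d^2)
}

set.seed(123)
x <- rnorm(100, mean = 0, sd = 4)
y <- rnorm(100, mean = 2, sd = 4)
cvm2_stat(x, y)
cvm2_stat(x, y, alternative = "greater")
```

With this construction the two one-sided statistics add up to the two-sided one, since each squared difference is counted in exactly one of them.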
An object of class "htest" with elements:
Observed Cramér–von Mises test statistic.
Bootstrap-based p-value.
The alternative hypothesis used.
A description of the test.
(Optional) Vector of bootstrap test statistics if keep.boots = TRUE.
    set.seed(123)
    x <- rnorm(100, mean = 0, sd = 4)
    y <- rnorm(100, mean = 2, sd = 4)
    CVM2gof(x, y)
    # One-sided test
    CVM2gof(x, y, alternative = "greater")
    # Store bootstrap replicates
    res <- CVM2gof(x, y, keep.boots = TRUE)
    hist(res$bootstraps, main = "Bootstrap Distribution", xlab = "Test Statistic")
Perform the Cramer-von Mises Goodness-of-Fit Test for Normality
    cvmgof(x)
| Argument | Description |
|---|---|
| x | A numeric vector containing the sample data. |

| Value | Description |
|---|---|
| statistic | The value of the Cramer-von Mises test statistic. |
| p.value | The p-value for the test. |
| method | A character string describing the test. |
    # Example usage:
    set.seed(123)
    x <- rnorm(100)  # Generate a sample from a normal distribution
    result <- cvmgof(x)
    print(result)
    # Example with non-normal data:
    y <- rexp(100)  # Generate a sample from an exponential distribution
    result <- cvmgof(y)
    print(result)
Performs the one-sample Cramér–von Mises goodness-of-fit (GoF) test to assess whether a sample comes from a specified distribution using asymptotic p-value approximations.
    CVMgof2(x, dist = c("norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", "chisq"), ..., eps = 1e-15)
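| Argument | Description |
|---|---|
| x | A numeric vector of observations. |
| dist | A character string specifying the theoretical distribution. Must be one of "norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", or "chisq". |
| ... | Distribution parameters passed to the corresponding distribution function. |
| eps | A small value to truncate extreme p-values (default is 1e-15). |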
The test uses the Cramér–von Mises statistic to assess how well the empirical distribution function (EDF) of the sample agrees with the cumulative distribution function (CDF) of the specified theoretical distribution. The p-value is computed using approximation formulas derived from the asymptotic distribution of the test statistic.
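The comparison of the EDF with the theoretical CDF can be written out with the standard computing formula W² = 1/(12n) + Σ (u_(i) − (2i − 1)/(2n))², where u_(i) is the null CDF at the i-th order statistic. The base R sketch below computes only the statistic; the helper name `cvm_stat` is hypothetical and the package's p-value approximation is not reproduced.

```r
# One-sample Cramer-von Mises statistic (hypothetical helper, statistic only).
cvm_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                # null CDF at the ordered sample
  i <- seq_len(n)
  1 / (12 * n) + sum((u - (2 * i - 1) / (2 * n))^2)
}

set.seed(123)
x <- rexp(500, rate = 2)
cvm_stat(x, pexp, rate = 2)             # null matches the data
cvm_stat(x, pexp, rate = 0.2)           # misspecified rate gives a larger value
```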
An object of class "htest" with the following components:
The computed Cramér–von Mises test statistic.
The asymptotic p-value.
A description of the test and distribution.
The name of the data vector.
    set.seed(123)
    x1 <- rnorm(500, mean = 0, sd = 1)
    CVMgof2(x1, dist = "norm", mean = 0, sd = 1)
    x2 <- rexp(500, rate = 2)
    CVMgof2(x2, dist = "exp", rate = 2)
    x3 <- runif(200, min = -1, max = 3)
    CVMgof2(x3, dist = "unif", min = -1, max = 3)
Zoometric measurements of 27-week-old creole goats collected by Dorantes-Coronado (2013).
    data(goats)
A data frame with 52 rows and 7 columns containing measurements (in kilograms and centimeters) on the following variables.
body.weight, body.length, trunk.length, withers.height, thoracic.perimeter, hip.length, ear.length
Dorantes-Coronado (2013).
Dorantes-Coronado, E.J. (2013). Estudio preliminar para el establecimiento de un programa de mejoramiento genetico de cabras en el Estado de Mexico. Ph.D. Thesis. Colegio de Postgraduados, Mexico.
    data(goats)
    plot(goats)
Performs a two-sample Kolmogorov–Smirnov (KS) test using a bootstrap method to assess whether two independent samples come from the same distribution.
    KS2gof(x, y, alternative = c("two.sided", "less", "greater"), nboots = 5000, keep.boots = FALSE)
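| Argument | Description |
|---|---|
| x, y | Numeric vectors of data values for the two independent samples. |
| alternative | Character string specifying the alternative hypothesis, must be one of "two.sided" (default), "less", or "greater". |
| nboots | Number of bootstrap resamples used to approximate the null distribution (default: 5000). |
| keep.boots | Logical; if TRUE, all bootstrap statistics are returned (default: FALSE). |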
This implementation performs a nonparametric KS test for equality of distributions by resampling under the null hypothesis. It supports one-sided and two-sided alternatives.
If keep.boots = TRUE, the function returns all bootstrap statistics,
which can be used for further analysis (e.g., plotting).
If the p-value is zero due to no bootstrap statistic exceeding the observed value,
it is adjusted to 1 / (2 * nboots) to avoid a zero p-value.
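The one-sided and two-sided statistics can be sketched from the ECDF difference on the pooled sample. The helper name `ks2_stat` is hypothetical, and the meaning of "greater" and "less" follows the convention of stats::ks.test() (ECDF of x above or below that of y), which is an assumption about this package.

```r
# Two-sample KS-type statistics (hypothetical helper, statistic only).
ks2_stat <- function(x, y, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- c(x, y)
  d <- ecdf(x)(z) - ecdf(y)(z)
  switch(alternative,
    two.sided = max(abs(d)),
    greater   = max(d),                 # ECDF of x above ECDF of y
    less      = max(-d))                # ECDF of x below ECDF of y
}

set.seed(123)
x <- rnorm(100, mean = 0, sd = 4)
y <- rnorm(100, mean = 2, sd = 4)
ks2_stat(x, y)
# Cross-check the two-sided value against base R
all.equal(ks2_stat(x, y), unname(ks.test(x, y)$statistic))
```

A bootstrap p-value would then compare the observed value with the same statistic recomputed on relabelled samples.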
An object of class "htest" with the following components:
The observed KS statistic.
The p-value based on the bootstrap distribution.
The alternative hypothesis.
Description of the test used.
    set.seed(123)
    x <- rnorm(100, mean = 0, sd = 4)
    y <- rnorm(100, mean = 2, sd = 4)
    KS2gof(x, y)
Perform the Lilliefors (Kolmogorov-Smirnov) Goodness-of-Fit Test for Normality
    ksgof(x)
| Argument | Description |
|---|---|
| x | A numeric vector containing the sample data. |

| Value | Description |
|---|---|
| statistic | The value of the Lilliefors (Kolmogorov-Smirnov) test statistic. |
| p.value | The p-value for the test. |
| method | A character string describing the test. |
    # Example usage:
    set.seed(123)
    x <- rnorm(100)  # Generate a sample from a normal distribution
    result <- ksgof(x)
    print(result)
    # Example with non-normal data:
    y <- rexp(100)  # Generate a sample from an exponential distribution
    result <- ksgof(y)
    print(result)
Performs the one-sample Kolmogorov-Smirnov test for a specified theoretical distribution.
    KSgof2(x, dist = c("norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", "chisq"), ..., eps = 1e-15)
| Argument | Description |
|---|---|
| x | Numeric vector of observations. |
| dist | Character string specifying the distribution to test against. One of "norm", "exp", "unif", "lnorm", "weibull", "gamma", "t", or "chisq". |
| ... | Additional parameters passed to the distribution's cumulative distribution function (CDF). For example, mean = 5 and sd = 2 for dist = "norm". |
| eps | Numeric lower and upper bound for tail probabilities to avoid numerical issues (default: 1e-15). |
The test compares the empirical distribution function of x with the cumulative distribution function
of a specified theoretical distribution using the Kolmogorov-Smirnov statistic.
For large sample sizes, a p-value approximation based on the asymptotic distribution is used.
A correction is applied when sample size exceeds 100, adjusting the test statistic to approximate a fixed sample size. For very small or very large statistics, piecewise polynomial approximations are used to compute the p-value.
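The statistic itself is the largest gap between the EDF and the theoretical CDF, checked just before and just after each jump of the EDF. A minimal base R sketch follows; the helper name `ks_stat` is hypothetical, and the sample-size correction and polynomial p-value approximations mentioned above are not reproduced.

```r
# One-sample Kolmogorov-Smirnov statistic (hypothetical helper, statistic only).
ks_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                # null CDF at the ordered sample
  i <- seq_len(n)
  max(i / n - u, u - (i - 1) / n)       # gaps above and below the EDF steps
}

set.seed(123)
x <- rnorm(1000, mean = 5, sd = 2)
ks_stat(x, pnorm, mean = 5, sd = 2)
# Cross-check against base R
all.equal(ks_stat(x, pnorm, mean = 5, sd = 2),
          unname(ks.test(x, "pnorm", mean = 5, sd = 2)$statistic))
```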
An object of class "htest" containing the test statistic, p-value, method description, and data name.
    set.seed(123)
    x <- rnorm(1000, mean = 5, sd = 2)
    KSgof2(x, dist = "norm", mean = 5, sd = 2)
    y <- rexp(500, rate = 0.5)
    KSgof2(y, dist = "exp", rate = 0.5)
    u <- runif(300, min = 0, max = 10)
    KSgof2(u, dist = "unif", min = 0, max = 10)
Performs a two-sample Kuiper test using bootstrap resampling to test whether two independent samples come from the same distribution.
    Kuiper2gof(x, y, alternative = c("two.sided", "less", "greater"), nboots = 2000, keep.boots = FALSE)
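| Argument | Description |
|---|---|
| x, y | Numeric vectors of data values for the two samples. |
| alternative | Character string indicating the alternative hypothesis. Must be one of "two.sided" (default), "less", or "greater". |
| nboots | Integer. Number of bootstrap resamples to compute the empirical null distribution (default: 2000). |
| keep.boots | Logical. If TRUE, the bootstrap statistics are returned (default: FALSE). |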
The Kuiper test is a nonparametric test similar to the Kolmogorov–Smirnov test, but sensitive to discrepancies in both location and shape between two distributions. This implementation uses bootstrap resampling to estimate the p-value.
The two.sided test uses the sum of maximum positive and negative ECDF differences.
The greater and less options use one-sided variations.
If the observed test statistic exceeds all bootstrap values, the p-value is set to 1 / (2 * nboots) to avoid zero.
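For the two-sided case, the "sum of maximum positive and negative ECDF differences" can be sketched as follows. The helper name `kuiper2_stat` is hypothetical; the one-sided variants and the bootstrap step are left out because their exact definitions are not given here.

```r
# Two-sample Kuiper statistic V = D+ + D- (hypothetical helper, two-sided only).
kuiper2_stat <- function(x, y) {
  z <- c(x, y)                          # evaluate both ECDFs on the pooled sample
  d <- ecdf(x)(z) - ecdf(y)(z)
  max(d) + max(-d)                      # largest positive plus largest negative gap
}

set.seed(123)
x <- rnorm(100, 0, 4)
y <- rnorm(100, 2, 4)
kuiper2_stat(x, y)
```

Because both directions contribute, the Kuiper statistic is never smaller than the two-sided Kolmogorov-Smirnov statistic computed on the same samples.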
An object of class "htest" containing:
The observed Kuiper statistic.
The p-value computed from the bootstrap distribution.
The specified alternative hypothesis.
A character string describing the test.
(If requested) A numeric vector of bootstrap statistics.
    set.seed(123)
    x <- rnorm(100, 0, 4)
    y <- rnorm(100, 2, 4)
    Kuiper2gof(x, y)
This function calculates the quantile of the Cramer-von Mises goodness-of-fit statistic using the 'uniroot' function to find the root of the given function.
    qCvMgof(X, p)
| Argument | Description |
|---|---|
| X | A numeric vector containing the sample data. |
| p | A numeric value representing the desired quantile probability. |

| Value | Description |
|---|---|
| root | The quantile value corresponding to the given probability. |
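The general pattern of obtaining a quantile by root finding can be sketched with uniroot: solve F(q) − p = 0 for q. The function the package actually passes to uniroot is not shown in this manual, so the example below uses the exponential CDF purely as a stand-in, and the helper name `quantile_by_root` is hypothetical.

```r
# Quantile by root finding: find q such that cdf(q) = p (hypothetical helper).
quantile_by_root <- function(cdf, p, lower, upper, ...) {
  f <- function(q) cdf(q, ...) - p      # changes sign at the p-quantile
  uniroot(f, lower = lower, upper = upper, tol = 1e-9)$root
}

# Stand-in example: 95% quantile of an exponential distribution with rate 2
q95 <- quantile_by_root(pexp, 0.95, lower = 0, upper = 50, rate = 2)
q95
qexp(0.95, rate = 2)                    # closed-form value for comparison
```

The search interval must bracket the root, that is, the function must take opposite signs at lower and upper.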
    # Example usage:
    set.seed(123)
    X <- rnorm(100)  # Generate a sample from a normal distribution
    p <- 0.95        # Desired quantile probability
    result <- qCvMgof(X, p)
    print(result)
This function performs a simple Cramer-von Mises goodness-of-fit test to assess whether a given sample comes from a uniform distribution. The test statistic and p-value are calculated based on the sorted sample data.
    simpleCvMgof(X)
| Argument | Description |
|---|---|
| X | A numeric vector containing the sample data. |

| Value | Description |
|---|---|
| statistic | The value of the Cramer-von Mises test statistic. |
| pvalue | The p-value for the test. |
| statname | The name of the test statistic. |
    # Example usage:
    set.seed(123)
    X <- runif(100)  # Generate a sample from a uniform distribution
    result <- simpleCvMgof(X)
    print(result)
    # Example with non-uniform data:
    Y <- rnorm(100)  # Generate a sample from a normal distribution
    result <- simpleCvMgof(Y)
    print(result)
Snowfall dataset
vector of values
This dataset contains 63 observations of the annual snowfall amounts in Buffalo, New York, recorded from 1910/11 to 1972/73, as listed in Carmichael, Jean-Pierre (1976). The autoregressive method: a method of approximating and estimating positive functions. DTIC Document.
Compressive strength and strain of maize seeds.
    data("strength")
A data frame with 90 observations on the following 2 variables.
| Variable | Description |
|---|---|
| strain | A numeric vector giving the relative change in length under compression stress in millimeters. |
| cstrength | A numeric vector giving the compressive strength in Newtons. |
These data correspond to maize seeds with floury endosperm and 8% of moisture.
Mancera-Rico, A. (2014).
Mancera-Rico, A. (2014). Contenido de humedad y tipo de endospermo en la resistencia a compresion en semillas de maiz. Ph.D. Thesis. Colegio de Postgraduados, Mexico.
    data(strength)
    plot(strength)  # plot of "strain" versus "cstrength"
Watson goodness-of-fit test
Performs the Watson test for goodness-of-fit to a specified distribution.
    Wgof(x, dist = c("norm", "exp", "unif", "lnorm", "gamma"), ..., eps = 1e-15)
| Argument | Description |
|---|---|
| x | Numeric vector of observations. |
| dist | Character string specifying the distribution to test against. One of "norm", "exp", "unif", "lnorm", or "gamma". |
| ... | Additional parameters passed to the distribution's cumulative distribution function (CDF). For example, mean = 5 and sd = 2 for dist = "norm". |
| eps | Numeric tolerance for probability bounds to avoid extremes (default: 1e-15). |
The Watson test is a modification of the Cramér–von Mises test, adjusting for mean deviations. It measures the squared distance between the empirical distribution function of the data and the specified theoretical cumulative distribution function, with a correction for location.
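The location correction can be made explicit with the standard relation U² = W² − n(ū − 1/2)², where W² is the Cramér-von Mises statistic and ū is the mean of the fitted probabilities. The base R sketch below computes only the statistic; the helper name `watson_stat` is hypothetical and the package's p-value calculation is not reproduced.

```r
# Watson U^2 statistic: Cramer-von Mises W^2 minus a mean-deviation correction
# (hypothetical helper, statistic only).
watson_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                # null CDF at the ordered sample
  i <- seq_len(n)
  w2 <- 1 / (12 * n) + sum((u - (2 * i - 1) / (2 * n))^2)
  w2 - n * (mean(u) - 0.5)^2            # correction for location
}

set.seed(123)
x_norm <- rnorm(1000, mean = 5, sd = 2)
watson_stat(x_norm, pnorm, mean = 5, sd = 2)
```

When the fitted probabilities average exactly 1/2, the correction vanishes and U² equals W².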
An object of class "htest" containing the test statistic, p-value, method description, data name,
and any distribution parameters used.
    set.seed(123)
    x_norm <- rnorm(1000, mean = 5, sd = 2)
    Wgof(x_norm, dist = "norm", mean = 5, sd = 2)
    x_exp <- rexp(500, rate = 0.5)
    Wgof(x_exp, dist = "exp", rate = 0.5)
    x_unif <- runif(300, min = 0, max = 10)
    Wgof(x_unif, dist = "unif", min = 0, max = 10)
    x_lnorm <- rlnorm(200, meanlog = 0, sdlog = 1)
    Wgof(x_lnorm, dist = "lnorm", meanlog = 0, sdlog = 1)
    x_gamma <- rgamma(400, shape = 1, rate = 1)
    Wgof(x_gamma, dist = "gamma", shape = 1, rate = 1)
White wine tasting preference data used in the study of Cortez, Cerdeira, Almeida, Matos, and Reis (2009). The dataset contains 4898 white vinho verde wine samples and 12 variables: the tasting preference score of each wine and its physicochemical characteristics.
    data(WhiteWine)
A data frame with 4898 rows and 12 columns: a quality score and 11 physicochemical properties of the wines.
quality Tasting preference, a rating score provided by a minimum of three sensory assessors, with ordinal values from
0 (very bad) to 10 (excellent). The final sensory score is the median of these evaluations.
fixed.acidity The fixed acidity is the physicochemical property in unit (g(tartaric acid)/dm^3).
volatile.acidity The volatile acidity is in unit g(acetic acid)/dm^3.
citric.acid The citric acidity is in unit g/dm^3.
residual.sugar The residual sugar is in unit g/dm^3.
chlorides The chlorides is in unit g(sodium chloride)/dm^3.
free.sulfur.dioxide The free sulfur dioxide is in unit mg/dm^3.
total.sulfur.dioxide The total sulfur dioxide is in unit mg/dm^3.
density The density is in unit g/cm^3.
pH The wine's pH value.
sulphates The sulphates is in unit g(potassium sulphates)/dm^3.
alcohol The alcohol is in unit % vol.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009), “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, 47, 547–553. doi:10.1016/j.dss.2009.05.016
    head(WhiteWine)