| Title: | Distributed Skew Factor Model Estimation Methods |
|---|---|
| Description: | Provides a distributed framework for simulating and estimating skew factor models under various skewed and heavy-tailed distributions. The methods support distributed data generation, aggregation of local estimators, and evaluation of estimation performance via mean squared error, relative error, and sparsity measures. The distributed principal component (PC) estimators implemented in the package include 'IPC' (Independent Principal Component),'PPC' (Project Principal Component), 'SPC' (Sparse Principal Component), and other related distributed PC methods. The methodological background follows Guo G. (2023) <doi:10.1007/s00180-022-01270-z>. |
| Authors: | Guangbao Guo [aut, cre], Yu Jin [aut] |
| Maintainer: | Guangbao Guo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1 |
| Built: | 2026-06-01 10:21:32 UTC |
| Source: | https://github.com/cran/DSFM |
Measurements of air quality variables in an Italian city collected over several months in 2004–2005. The data includes hourly averaged responses from chemical sensors embedded in an air quality chemical multi-sensor device.
data(AirQuality)data(AirQuality)
A data frame with 9358 observations on the following 15 variables. Some variable names use parentheses, which may need to be quoted with backticks in R.
Date: Date (in DD/MM/YYYY format)
Time: Time (in HH.MM.SS format)
CO.GT.: Carbon Monoxide concentration (mg/m³)
PT08.S1.CO.: Sensor 1 response
NMHC.GT.: Non-methane hydrocarbons (µg/m³)
C6H6.GT.: Benzene concentration (µg/m³)
PT08.S2.NMHC.: Sensor 2 response
NOx.GT.: Nitric oxide concentration (ppb)
PT08.S3.NOx.: Sensor 3 response
NO2.GT.: Nitrogen dioxide concentration (µg/m³)
PT08.S4.NO2.: Sensor 4 response
PT08.S5.O3.: Sensor 5 response
T: Temperature (°C)
RH: Relative Humidity (%)
AH: Absolute Humidity
Some variables contain missing values coded as -200.
The dataset contains air quality data recorded in a densely populated area of an Italian city between March 2004 and February 2005. The data were collected using an array of chemical sensors and meteorological instruments.
This dataset is frequently used for tasks such as missing value imputation, time series analysis, regression, and machine learning model evaluation.
De Vito, S., Massera, E., Piga, M., Martinotto, L., & Di Francia, G. (2008).\ UCI Machine Learning Repository: Air Quality Data Set.\ Available at: https://archive.ics.uci.edu/ml/datasets/Air+Quality
De Vito, S., Massera, E., Piga, M., Martinotto, L., & Di Francia, G. (2008).\ Semi-Supervised Learning Techniques in Artificial Olfaction: A Novel Approach to Classification Problems and Drift Counteraction.\ IEEE Sensors Journal, 8(12), 2030–2038.
data(AirQuality) # Replace missing values (-200) with NA AirQuality[AirQuality == -200] <- NA # Check if there are non-NA values before plotting if (sum(!is.na(AirQuality$CO.GT.)) > 0) { plot(AirQuality$CO.GT., type = "l", ylab = "CO (mg/m³)", main = "Hourly CO Concentration") } else { message("No non-NA values in CO.GT. column to plot") }data(AirQuality) # Replace missing values (-200) with NA AirQuality[AirQuality == -200] <- NA # Check if there are non-NA values before plotting if (sum(!is.na(AirQuality$CO.GT.)) > 0) { plot(AirQuality$CO.GT., type = "l", ylab = "CO (mg/m³)", main = "Hourly CO Concentration") } else { message("No non-NA values in CO.GT. column to plot") }
This function calculates the Mean Squared Error (MSE) and relative error for factor loadings and uniqueness estimates obtained from factor analysis.
calculate_errors(data, A, D)calculate_errors(data, A, D)
data |
Matrix of SFM data. |
A |
Matrix of true factor loadings. |
D |
Matrix of true uniquenesses. |
A named vector containing:
MSEA |
Mean Squared Error for factor loadings. |
MSED |
Mean Squared Error for uniqueness estimates. |
LSA |
Relative error for factor loadings. |
LSD |
Relative error for uniqueness estimates. |
set.seed(123) # For reproducibility # Define dimensions n <- 10 # Number of samples p <- 5 # Number of factors # Generate matrices with compatible dimensions A <- matrix(runif(p * p, -1, 1), nrow = p) # Factor loadings matrix (p x p) D <- diag(runif(p, 1, 2)) # Uniquenesses matrix (p x p) data <- matrix(runif(n * p), nrow = n) # Data matrix (n x p) # Calculate errors (only if SOPC is installed) if (requireNamespace("SOPC", quietly = TRUE)) { errors <- calculate_errors(data, A, D) print(errors) }set.seed(123) # For reproducibility # Define dimensions n <- 10 # Number of samples p <- 5 # Number of factors # Generate matrices with compatible dimensions A <- matrix(runif(p * p, -1, 1), nrow = p) # Factor loadings matrix (p x p) D <- diag(runif(p, 1, 2)) # Uniquenesses matrix (p x p) data <- matrix(runif(n * p), nrow = n) # Data matrix (n x p) # Calculate errors (only if SOPC is installed) if (requireNamespace("SOPC", quietly = TRUE)) { errors <- calculate_errors(data, A, D) print(errors) }
This function performs distributed Fan-type principal component analysis on a numeric dataset split across multiple nodes.
DFanPC(data, m, n1, K)DFanPC(data, m, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of estimated loading matrices for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DFanPC(data = data, m = 3, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DFanPC(data = data, m = 3, n1 = 20, K = 5)
Performs distributed Gao-type principal component analysis on a numeric dataset split across multiple nodes.
DGaoPC(data, m, n1, K)DGaoPC(data, m, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components for the first stage. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of estimated loading matrices for the first-stage components for each node.
List of estimated loading matrices for the second-stage components for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices of reconstructed data for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DGaoPC(data = data, m = 3, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DGaoPC(data = data, m = 3, n1 = 20, K = 5)
Performs distributed Gul-type principal component analysis on a numeric dataset split across multiple nodes.
DGulPC(data, m, n1, K)DGulPC(data, m, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components for the first stage. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of estimated first-stage loading matrices for each node.
List of estimated second-stage loading matrices for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices of reconstructed data for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DGulPC(data = data, m = 3, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DGulPC(data = data, m = 3, n1 = 20, K = 5)
Performs distributed principal component analysis on a numeric dataset split across multiple nodes. Estimates loading matrices, residual variances, and covariance matrices for each node.
DPC(data, m, n1, K)DPC(data, m, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of estimated loading matrices for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DPC(data = data, m = 3, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DPC(data = data, m = 3, n1 = 20, K = 5)
Performs distributed probabilistic principal component analysis (PPC) on a numeric dataset split across multiple nodes. Estimates loading matrices, residual variances, and covariance matrices for each node using a probabilistic approach.
DPPC(data, m, n1, K)DPPC(data, m, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of estimated loading matrices for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DPPC(data = data, m = 3, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DPPC(data = data, m = 3, n1 = 20, K = 5)
Performs distributed sparse principal component analysis (DSPC) on a numeric dataset split across multiple nodes. Estimates sparse loading matrices, residual variances, and covariance matrices for each node.
DSPC(data, m, gamma, n1, K)DSPC(data, m, gamma, n1, K)
data |
A numeric matrix containing the total dataset. |
m |
An integer specifying the number of principal components. |
gamma |
A numeric value specifying the sparsity parameter for SPC. |
n1 |
An integer specifying the length of each data subset. |
K |
An integer specifying the number of nodes. |
A list with the following components:
List of sparse loading matrices for each node.
List of diagonal residual variance matrices for each node.
List of covariance matrices for each node.
set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DSPC(data = data, m = 3, gamma = 0.03, n1 = 20, K = 5)set.seed(123) data <- matrix(rnorm(500), nrow = 100, ncol = 5) DSPC(data = data, m = 3, gamma = 0.03, n1 = 20, K = 5)
Performs comprehensive factor model testing including joint tests (Wald, GRS, PY), individual asset t-tests, and False Discovery Rate control.
factor.tests(ret, fac, q.fdr = 0.05)factor.tests(ret, fac, q.fdr = 0.05)
ret |
A T × N matrix representing the excess returns of N assets at T time points. |
fac |
A T × K matrix representing the returns of K factors at T time points. |
q.fdr |
The significance level for FDR (False Discovery Rate) testing, defaulting to 5%. |
A list containing the following components:
alpha |
N-vector of estimated alphas for each asset |
tstat |
N-vector of t-statistics for testing individual alphas |
pval |
N-vector of p-values for individual alpha tests |
Wald |
Wald test statistic for joint alpha significance |
p_Wald |
p-value for Wald test |
GRS |
GRS test statistic (finite-sample F-test) |
p_GRS |
p-value for GRS test |
PY |
Pesaran and Yamagata test statistic |
p_PY |
p-value for PY test |
reject_fdr |
Logical vector indicating which assets have significant alphas after FDR correction |
fdr_p |
Adjusted p-values using Benjamini-Hochberg procedure |
power_proxy |
Number of significant assets after FDR correction |
set.seed(42) T <- 120 N <- 25 K <- 3 fac <- matrix(rnorm(T * K), T, K) beta <- matrix(rnorm(N * K), N, K) alpha <- rep(0, N) alpha[1:3] <- 0.4 / 100 # 3 non-zero alphas eps <- matrix(rnorm(T * N, sd = 0.02), T, N) ret <- alpha + fac %*% t(beta) + eps results <- factor.tests(ret, fac, q.fdr = 0.05) # View results cat("Wald test p-value:", results$p_Wald, "\n") cat("GRS test p-value:", results$p_GRS, "\n") cat("PY test p-value:", results$p_PY, "\n") cat("Significant assets after FDR:", results$power_proxy, "\n")set.seed(42) T <- 120 N <- 25 K <- 3 fac <- matrix(rnorm(T * K), T, K) beta <- matrix(rnorm(N * K), N, K) alpha <- rep(0, N) alpha[1:3] <- 0.4 / 100 # 3 non-zero alphas eps <- matrix(rnorm(T * N, sd = 0.02), T, N) ret <- alpha + fac %*% t(beta) + eps results <- factor.tests(ret, fac, q.fdr = 0.05) # View results cat("Wald test p-value:", results$p_Wald, "\n") cat("GRS test p-value:", results$p_GRS, "\n") cat("PY test p-value:", results$p_PY, "\n") cat("Significant assets after FDR:", results$power_proxy, "\n")
A data frame containing gene expression, lipid measurements, and grouping variables (diet and genotype) for 40 mice from a nutrigenomics study.
data(Nutrimouse)data(Nutrimouse)
A data frame with 40 observations on 143 variables:
120 numeric variables for gene expression
21 numeric variables for lipid measurements
2 categorical variables: diet and genotype
This dataset was created for integrative analysis of transcriptomic and lipidomic responses of mice to different diets and genotypes.
All numeric variables (genes and lipids) are centered and scaled. The categorical variables indicate the experimental design: five diet types and two genotypes.
This format is convenient for regression, classification, and dimension reduction techniques requiring a single data frame.
Extracted from the mixOmics package, based on: \ Martin, P. G. P., et al. (2007). A systems biology approach to the study of gene expression and lipid metabolism in mice fed high-fat diets. Journal of Lipid Research, 48(2), 360–377.
González, I., Déjean, S., Martin, P. G. P., and Baccini, A. (2009). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12), 1–14.
data(Nutrimouse) # View structure str(Nutrimouse) # Boxplot of a gene across diets boxplot(Nutrimouse[,1] ~ Nutrimouse$diet, main = "Gene 1 Expression by Diet") # PCA on all numeric variables (excluding factors) nutri_numeric <- Nutrimouse[, sapply(Nutrimouse, is.numeric)] pca_result <- prcomp(nutri_numeric, scale. = TRUE) # PCA plot plot(pca_result$x[,1:2], col = as.numeric(Nutrimouse$diet), pch = 19) legend("topright", legend = levels(Nutrimouse$diet), col = 1:5, pch = 19)data(Nutrimouse) # View structure str(Nutrimouse) # Boxplot of a gene across diets boxplot(Nutrimouse[,1] ~ Nutrimouse$diet, main = "Gene 1 Expression by Diet") # PCA on all numeric variables (excluding factors) nutri_numeric <- Nutrimouse[, sapply(Nutrimouse, is.numeric)] pca_result <- prcomp(nutri_numeric, scale. = TRUE) # PCA plot plot(pca_result$x[,1:2], col = as.numeric(Nutrimouse$diet), pch = 19) legend("topright", legend = levels(Nutrimouse$diet), col = 1:5, pch = 19)
A dataset containing biomedical voice measurements from people with Parkinson's disease and healthy controls. The goal is to analyze voice signal features for detecting and monitoring Parkinson's disease.
data(Parkinsons_Features)data(Parkinsons_Features)
A data frame with 5,876 observations on 22 variables. Each row corresponds to a voice recording from a subject.
subject_id
|
Identifier for the subject (factor or character) |
age
|
Age of the subject (numeric) |
sex
|
Sex of the subject (factor: Male/Female) |
test_time
|
Time of test (numeric, days since baseline) |
motor_UPDRS
|
Unified Parkinson's Disease Rating Scale motor score (numeric) |
total_UPDRS
|
Total UPDRS score (numeric) |
Jitter
|
Measure of frequency variation (numeric) |
Shimmer
|
Measure of amplitude variation (numeric) |
NHR
|
Noise-to-harmonics ratio (numeric) |
HNR
|
Harmonics-to-noise ratio (numeric) |
RPDE
|
Recurrence period density entropy (numeric) |
DFA
|
Detrended fluctuation analysis (numeric) |
PPE
|
Pitch period entropy (numeric) |
...
|
Additional voice signal features and measurements (numeric) |
All features are numerical except for identifiers and categorical variables.
This dataset was collected from subjects with Parkinson's disease and healthy controls. Multiple biomedical voice measurements were recorded over time to evaluate disease progression.
The features include various jitter, shimmer, noise, and entropy measures extracted from sustained vowel phonations.
The dataset is widely used for classification and regression models aiming to predict Parkinson's disease severity or presence.
UCI Machine Learning Repository: Parkinson's Disease Classification Data Set \ https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring
Tsanas, A., Little, M.A., McSharry, P.E., & Ramig, L.O. (2010). Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4), 884–893.
data(Parkinsons_Features) if (all(startsWith(names(Parkinsons_Features), "V"))) { colnames(Parkinsons_Features) <- Parkinsons_Features[1, ] Parkinsons_Features <- Parkinsons_Features[-1, ] } Parkinsons_Features[] <- lapply(Parkinsons_Features, type.convert, as.is = TRUE) summary(Parkinsons_Features$motor_UPDRS) boxplot(motor_UPDRS ~ sex, data = Parkinsons_Features, main = "Motor UPDRS by Sex", ylab = "Motor UPDRS")data(Parkinsons_Features) if (all(startsWith(names(Parkinsons_Features), "V"))) { colnames(Parkinsons_Features) <- Parkinsons_Features[1, ] Parkinsons_Features <- Parkinsons_Features[-1, ] } Parkinsons_Features[] <- lapply(Parkinsons_Features, type.convert, as.is = TRUE) summary(Parkinsons_Features$motor_UPDRS) boxplot(motor_UPDRS ~ sex, data = Parkinsons_Features, main = "Motor UPDRS by Sex", ylab = "Motor UPDRS")
The function supports various distribution types for generating the data, including: Skew-Normal Distribution, Skew-Cauchy Distribution, Skew-t Distribution.
SFM(n, p, m, xi, omega, alpha, distribution_type)SFM(n, p, m, xi, omega, alpha, distribution_type)
n |
Sample size. |
p |
Sample dimensionality. |
m |
Number of factors. |
xi |
A numerical parameter used exclusively in the "Skew-t" distribution, representing the distribution's xi parameter. |
omega |
A numerical parameter representing the omega parameter of the distribution, which affects the degree of skewness in the distribution. |
alpha |
A numerical parameter representing the alpha parameter of the distribution, which influences the shape of the distribution. |
distribution_type |
The type of distribution. |
A list containing:
data |
A matrix of generated data. |
A |
A matrix representing the factor loadings. |
D |
A diagonal matrix representing the unique variances. |
library(MASS) library(SOPC) library(sn) library(matrixcalc) library(psych) n <- 100 p <- 10 m <- 5 xi <- 5 omega <- 2 alpha <- 5 distribution_type <- "Skew-Normal Distribution" X <- SFM(n, p, m, xi, omega, alpha, distribution_type)library(MASS) library(SOPC) library(sn) library(matrixcalc) library(psych) n <- 100 p <- 10 m <- 5 xi <- 5 omega <- 2 alpha <- 5 distribution_type <- "Skew-Normal Distribution" X <- SFM(n, p, m, xi, omega, alpha, distribution_type)
The sparse online principal component can not only process online data sets, but also obtain a sparse solution of online data sets.
SOPC(data, m, gamma, eta)SOPC(data, m, gamma, eta)
data |
is a highly correlated online data set |
m |
is the number of principal component |
gamma |
is a sparse parameter |
eta |
is the proportion of online data to total data |
Aso,Dso
The sparse principal component can obtain sparse solutions of the eigenmatrix to better explain the relationship between principal components and original variables.
SPC(data, m, gamma)SPC(data, m, gamma)
data |
is a highly correlated data set |
m |
is the number of principal component |
gamma |
is a sparse parameter |
As,Ds
Data refer to chemical properties of 178 specimens of three types of wine produced in the Piedmont region of Italy.
data(wines)data(wines)
A data frame with 178 observations on the following 28 variables.
wine
|
wine name (categorical,
levels: Barbera, Barolo, Grignolino) |
alcohol
|
alcohol percentage (numeric) |
sugar
|
sugar-free extract (numeric) |
acidity
|
fixed acidity (numeric) |
tartaric
|
tartaric acid (numeric) |
malic
|
malic acid (numeric) |
uronic
|
uronic acids (numeric) |
pH
|
pH (numeric) |
ash
|
ash (numeric) |
alcal_ash
|
alcalinity of ash (numeric) |
potassium
|
potassium (numeric) |
calcium
|
calcium (numeric) |
magnesium
|
magnesium (numeric) |
phosphate
|
phosphate (numeric) |
cloride
|
chloride (numeric) |
phenols
|
total phenols (numeric) |
flavanoids
|
flavanoids (numeric) |
nonflavanoids
|
nonflavanoid phenols (numeric) |
proanthocyanins
|
proanthocyanins (numeric) |
colour
|
colour intensity (numeric) |
hue
|
hue (numeric) |
OD_dw
|
of diluted wines
(numeric) |
OD_fl
|
of flavanoids
(numeric) |
glycerol
|
glycerol (numeric) |
butanediol
|
2,3-butanediol (numeric) |
nitrogen
|
total nitrogen (numeric) |
proline
|
proline (numeric) |
methanol
|
methanol (numeric) |
The data represent 27 chemical measurements on each of 178 wine specimens belonging to three types of wine produced in the Piedmont region of Italy. The data have been presented and examined by Forina et al. (1986) and were freely accessible from the PARVUS web-site until it was active. These data or, more often, a subset of them are now available from various places, including some R packages. The present dataset includes all variables available on the PARVUS repository, which are the variables listed by Forina et al. (1986) with the exception of ‘Sulphate’. Moreover, it reveals the undocumented fact that the original dataset appears to include also the vintage year; see the final portion of the ‘Examples’ below.
Forina, M., Lanteri, S. Armanino, C., Casolino, C., Casale, M. and Oliveri, P. V-PARVUS 2008: an extendible package of programs for esplorative data analysis, classification and regression analysis. Dip. Chimica e Tecnologie Farmaceutiche ed Alimentari, Università di Genova, Italia. Web-site (not accessible as of 2014): ‘http://www.parvus.unige.it’
Forina M., Armanino C., Castino M. and Ubigli M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.
data(wines) pairs(wines[,c(2,3,16:18)], col=as.numeric(wines$wine)) # code <- substr(rownames(wines), 1, 3) table(wines$wine, code) # year <- as.numeric(substr(rownames(wines), 6, 7)) table(wines$wine, year) # coincides with Table 1(a) of Forina et al. (1986)data(wines) pairs(wines[,c(2,3,16:18)], col=as.numeric(wines$wine)) # code <- substr(rownames(wines), 1, 3) table(wines$wine, code) # year <- as.numeric(substr(rownames(wines), 6, 7)) table(wines$wine, year) # coincides with Table 1(a) of Forina et al. (1986)