| Title: | Distributed Online Covariance Matrix Tests for Truncated Factor Model |
|---|---|
| Description: | The truncated factor model is a statistical model designed to handle specific data structures in data analysis. 'DTFM' is a powerful tool designed to efficiently process and analyze distributed datasets. The philosophy of the package is described in Guo et al. (2023) <doi:10.1007/s00180-022-01270-z>. |
| Authors: | Beibei Wu [aut], Guangbao Guo [aut, cre] |
| Maintainer: | Guangbao Guo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.5 |
| Built: | 2026-05-10 07:42:20 UTC |
| Source: | https://github.com/cran/DTFM |
A dataset containing parameters relevant to graduate admission prediction.
admission_predict
A data frame representing student profiles.
A questionnaire survey on consumers' perceptions and intentions regarding Chinese herbal tea (CHT). The dataset is used in the real-data analysis for covariance-based hypothesis testing and related factor-structure exploration.
Chinese_Herbal_Tea
A data frame with 723 rows and 15 variables:
Perceived price reasonableness of Chinese herbal tea products (Likert 5-point; 1 = strongly disagree, 5 = strongly agree).
Perceived brand awareness/familiarity influencing evaluation or purchase intention (Likert 5-point).
Perceived attractiveness of packaging/appearance (Likert 5-point).
Perceived taste and flavor satisfaction (Likert 5-point).
Perceived nutritional value (Likert 5-point). Note: a rare value -2 appears and is recommended to be recoded to NA.
Perceived safety assurance (e.g., ingredient safety, quality control) (Likert 5-point).
Perceived health benefits (Likert 5-point).
Influence of celebrity endorsement on purchase/consumption intention (Likert 5-point).
Influence of IP/brand collaborations (co-branding) on purchase/consumption intention (Likert 5-point).
Influence of collaboration with traditional Chinese medicine (TCM) institutions/organizations on purchase/consumption intention (Likert 5-point).
Influence of discounts and promotions on purchase/consumption intention (Likert 5-point).
Agreement that government support/policies would increase willingness to purchase/consume Chinese herbal tea (Likert 5-point).
Future purchase/consumption intention (Likert 5-point).
Future intention to recommend Chinese herbal tea to others (Likert 5-point).
Overall satisfaction with Chinese herbal tea products/experience (Likert 5-point).
Most variables are measured on a 5-point Likert scale (coded as integers 1–5), where larger values
indicate stronger agreement/more positive evaluation. One variable contains a rare special code
(-2) that should be treated as missing/invalid in downstream analysis.
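The recoding recommended above can be sketched as follows. The data frame `tea` and its column names are hypothetical stand-ins, because the actual column names of Chinese_Herbal_Tea are not listed here; the sketch simply treats every value outside the valid range 1 to 5 as missing.

```r
# Hypothetical stand-in for a few Likert columns (NOT the real column names)
tea <- data.frame(
  price     = c(4L, 5L, 3L, 2L),
  nutrition = c(3L, -2L, 4L, 5L),
  safety    = c(5L, 4L, 4L, 1L)
)

# Treat any code outside the valid 1-5 range (including -2) as missing
tea_clean <- tea
tea_clean[] <- lapply(tea_clean, function(col) {
  col[!(col %in% 1:5)] <- NA
  col
})

# Count the missing values introduced per column
colSums(is.na(tea_clean))
```

Recoding before computing covariance matrices matters because a stray -2 on a 1 to 5 scale can noticeably inflate the variance of that item.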
Consumer survey dataset on Chinese herbal tea perceptions, marketing influences, and behavioral intentions.
Given two data matrices X and Y, where X is an $n_1 \times p$ matrix and Y is an $n_2 \times p$ matrix, this function tests the equality of two covariance matrices. The null hypothesis is

$$H_0: \Sigma_1 = \Sigma_2,$$

where $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of X and Y, respectively. The test is based on the method proposed by Cai, Liu and Xia (2013). When the p-value is smaller than the significance level (usually 0.05), the null hypothesis is rejected.
CLX(X, Y, alpha = 0.05)

| Argument | Description |
|---|---|
| X | A numeric matrix with $n_1$ rows and $p$ columns. |
| Y | A numeric matrix with $n_2$ rows and $p$ columns. |
| alpha | Significance level of the test. |
A list with the following components:
| Component | Description |
|---|---|
| stat | The test statistic. |
| pval | The p-value of the test. |
| power | The empirical power of the test. |
| FDR | The false discovery rate. |
    p <- 500
    n1 <- 100
    n2 <- 150
    X <- matrix(rnorm(n1 * p), ncol = p)
    Y <- matrix(rnorm(n2 * p), ncol = p)
    CLX(X, Y, alpha = 0.05)
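For intuition, the following is a minimal base-R sketch of the maximum-type statistic of Cai, Liu and Xia (2013) as we understand it: entrywise differences of the two sample covariance matrices are standardized by an estimate of their variance, and the maximum is compared with a Gumbel-type limit. The function name `clx_sketch` is hypothetical, this is not the package's implementation, and it returns only a statistic and an asymptotic p-value (no power or FDR).

```r
# Sketch of a two-sample max-type covariance test (assumed form, not DTFM code)
clx_sketch <- function(X, Y) {
  n1 <- nrow(X); n2 <- nrow(Y); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)   # column-centered data
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  S1 <- crossprod(Xc) / n1                       # sample covariances (1/n scaling)
  S2 <- crossprod(Yc) / n2
  # Entrywise variance estimates: mean of squared products minus squared mean
  T1 <- crossprod(Xc^2) / n1 - S1^2
  T2 <- crossprod(Yc^2) / n2 - S2^2
  M <- max((S1 - S2)^2 / (T1 / n1 + T2 / n2))    # maximum standardized difference
  stat <- M - 4 * log(p) + log(log(p))           # centered statistic
  pval <- 1 - exp(-exp(-stat / 2) / sqrt(8 * pi))  # Gumbel-type limit
  list(stat = stat, pval = pval)
}

set.seed(1)
X <- matrix(rnorm(100 * 50), ncol = 50)
Y <- matrix(rnorm(150 * 50), ncol = 50)
res <- clx_sketch(X, Y)
print(res)
```

Because the statistic is a maximum over all entries, this kind of test tends to be sensitive to sparse differences between the two covariance matrices.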
Given a data matrix, this function performs a one-sample test for the covariance matrix. The null hypothesis is

$$H_0: \Sigma = \Sigma_0,$$

where $\Sigma$ is the covariance matrix of the data and $\Sigma_0$ is a hypothesized covariance matrix. The test procedure is based on the method proposed by Cai and Ma (2013).
cm13(X, Sigma0, alpha)

| Argument | Description |
|---|---|
| X | A numeric data matrix with $n$ rows and $p$ columns. |
| Sigma0 | A $p \times p$ hypothesized covariance matrix. |
| alpha | Significance level of the test. |
A named list with the following components:
The test statistic.
The rejection threshold for the test.
Logical; TRUE if the null hypothesis is rejected,
and FALSE otherwise.
    p <- 5
    n <- 10
    X <- matrix(rnorm(n * p), ncol = p)
    alpha <- 0.05
    Sigma0 <- diag(ncol(X))
    cm13(X, Sigma0, alpha)
Data about the compressive strength of concrete based on its ingredients and age.
concrete
A data frame with component details and strength values.
This function performs Factor Analysis via Principal Component (FanPC) on a given data set. It calculates the estimated factor loading matrix (AF), specific variance matrix (DF), and the mean squared errors.
FanPC_TFM(data, m, A, D, p)

| Argument | Description |
|---|---|
| data | A matrix of input data. |
| m | The number of principal components. |
| A | The true factor loadings matrix. |
| D | The true uniquenesses matrix. |
| p | The number of variables. |
A list containing:
| Component | Description |
|---|---|
| AF | Estimated factor loadings. |
| DF | Estimated uniquenesses. |
| MSESigmaA | Mean squared error for factor loadings. |
| MSESigmaD | Mean squared error for uniquenesses. |
| LSigmaA | Loss metric for factor loadings. |
| LSigmaD | Loss metric for uniquenesses. |
    library(SOPC)
    library(MASS)
    set.seed(123)
    p <- 10
    m <- 3
    n <- 50
    A <- matrix(rnorm(p * m), nrow = p, ncol = m)
    D <- diag(runif(p, 0.2, 0.8))
    F_mat <- matrix(rnorm(n * m), nrow = n, ncol = m)
    E_mat <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = D)
    simulated_data <- F_mat %*% t(A) + E_mat
    results <- FanPC_TFM(data = simulated_data, m = m, A = A, D = D, p = p)
    print(results)
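To illustrate the general idea of principal-component factor estimation, here is a minimal base-R sketch. It uses the textbook construction (loadings from the leading eigenvectors of the sample covariance matrix, uniquenesses from the residual diagonal); the function name `pc_factor_sketch` is hypothetical, and the exact estimator, scaling, and error metrics used by FanPC_TFM may differ.

```r
# Textbook principal-component factor estimation (assumed form, not DTFM code)
pc_factor_sketch <- function(data, m) {
  S <- cov(data)                                  # sample covariance matrix
  eig <- eigen(S, symmetric = TRUE)
  idx <- seq_len(m)
  # Loadings: leading eigenvectors scaled by the square roots of their eigenvalues
  AF <- eig$vectors[, idx, drop = FALSE] %*% diag(sqrt(eig$values[idx]), m)
  # Uniquenesses: diagonal of the part of S not explained by the loadings
  DF <- diag(diag(S - AF %*% t(AF)))
  list(AF = AF, DF = DF)
}

set.seed(123)
p <- 10; m <- 3; n <- 50
A <- matrix(rnorm(p * m), nrow = p, ncol = m)
d <- runif(p, 0.2, 0.8)
F_mat <- matrix(rnorm(n * m), nrow = n, ncol = m)
E_mat <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% diag(sqrt(d))
sim <- F_mat %*% t(A) + E_mat
fit <- pc_factor_sketch(sim, m)
dim(fit$AF)
```

Note that loadings are identified only up to rotation and sign, so a direct entrywise comparison of `fit$AF` with the true `A` is not meaningful without alignment.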
Daily OHLCV data for Google (ticker: GOOG) from 2018-01-01 to 2020-12-31, split-adjusted.
GOOG
A data frame with 756 rows and 10 variables:
Raw OHLC prices (USD)
Split-adjusted OHLC prices (USD)
Raw trading volume (shares)
Split-adjusted trading volume (shares)
Given two data matrices X and Y, where X is an $n_1 \times p$ matrix and Y is an $n_2 \times p$ matrix, this function tests the equality of the covariance matrices of the two samples. The null hypothesis is:

$$H_0: \Sigma_1 = \Sigma_2$$
LC(X, Y, delta_sigma = NULL, alpha = 0.05)

| Argument | Description |
|---|---|
| X | A matrix of n1 by p. |
| Y | A matrix of n2 by p. |
| delta_sigma | A positive definite matrix. |
| alpha | Significance level. |
Here $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of X and Y, respectively. The test is based on the method proposed by Li and Chen (2012). When the p-value is less than the significance level (generally 0.05), the null hypothesis is rejected.
| Component | Description |
|---|---|
| stat | The test statistic. |
| pval | The p-value of the test. |
| power | The power of the test. |
| FDR | The false discovery rate of the test. |
    p <- 500
    n1 <- 100
    n2 <- 150
    X <- matrix(rnorm(n1 * p), ncol = p)
    Y <- matrix(rnorm(n2 * p), ncol = p)
    LC(X, Y)
A questionnaire survey on consumers' purchase intention toward new energy vehicles (NEVs) and its influencing factors. The dataset includes (i) household vehicle purchase history, (ii) attitudes toward policy/product/economic/firm factors measured on a 5-point Likert scale, and (iii) demographic information.
new_energy_vehicle
A data frame with 520 rows and multiple variables:
Whether the household has purchased an internal-combustion (fuel) vehicle (single choice).
Whether the household has purchased a new energy vehicle (single choice).
Effect of subsidy policies (e.g., toll exemptions, lower purchase price, low-interest loans) on NEV purchase intention (Likert 5-point).
Effect of license-plate policies (e.g., free registration, road-restriction privileges) on NEV purchase intention (Likert 5-point).
Effect of environmental concerns on NEV purchase intention (Likert 5-point).
Effect of charging infrastructure convenience on NEV purchase intention (Likert 5-point).
Effect of driving experience (product factor) on NEV purchase intention (Likert 5-point).
Effect of battery performance (range, lifespan, capacity, charging efficiency) on NEV purchase intention (Likert 5-point).
Effect of safety and technology maturity/reliability on NEV purchase intention (Likert 5-point).
Effect of depreciation/durability concerns (economic factor) on NEV purchase intention (Likert 5-point).
Effect of purchase price (economic factor) on NEV purchase intention (Likert 5-point).
Effect of charging cost (economic factor) on NEV purchase intention (Likert 5-point).
Effect of maintenance/repair cost (economic factor) on NEV purchase intention (Likert 5-point).
Effect of firm service (pre-sales and after-sales) on NEV purchase intention (Likert 5-point).
Effect of brand (firm factor) on NEV purchase intention (Likert 5-point).
Effect of perceived technological advantages (firm factor) on NEV purchase intention (Likert 5-point).
Stated intention to purchase an NEV (Likert 5-point).
Willingness to recommend NEVs to others (Likert 5-point).
Willingness to prioritize buying an NEV next time (Likert 5-point).
Gender (single choice).
Age group (single choice).
Education level (single choice).
Occupation (single choice).
Household registration type (rural/urban; single choice).
Average monthly household income (categorical; single choice).
The Likert scale options are: A = Strongly disagree, B = Disagree, C = Neutral, D = Agree, E = Strongly agree.
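If the Likert items are stored as the letter codes A to E (an assumption based on the coding listed above), they can be mapped to the integers 1 to 5 before computing covariance matrices. The vector `responses` below is a hypothetical example, not data from new_energy_vehicle.

```r
# Letter codes in increasing order of agreement
likert_levels <- c("A", "B", "C", "D", "E")

# Hypothetical answers to one Likert item
responses <- c("D", "E", "C", "A", "D")

# Position of each answer in likert_levels gives the numeric score (A = 1, ..., E = 5)
scores <- match(responses, likert_levels)
scores
```

Any answer that is not one of the five letters is mapped to NA by `match()`, which makes unexpected codes easy to spot.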
Consumer survey dataset on NEV purchase intention and influencing factors.
This is the Protein Data Set from the UCI Machine Learning Repository. It contains information about protein concentration in different samples.
protein
A data frame with 45730 rows and 10 columns.
SampleID: A unique identifier for each sample.
Protein1: Concentration of Protein 1.
Protein2: Concentration of Protein 2.
Protein3: Concentration of Protein 3.
Protein4: Concentration of Protein 4.
Protein5: Concentration of Protein 5.
Protein6: Concentration of Protein 6.
Protein7: Concentration of Protein 7.
Protein8: Concentration of Protein 8.
Protein9: Concentration of Protein 9.
Protein10: Concentration of Protein 10.
Historical market data of real estate valuation.
real_estate_valuation
A data frame with property attributes and price.
This dataset contains travel reviews from TripAdvisor.com, covering destinations in 10 categories across East Asia. Each traveler's rating is mapped to a scale from Terrible (0) to Excellent (4), and the average rating for each category per user is provided.
data(review)
A data frame with multiple rows and 10 columns.
1: Unique identifier for each user (categorical)
2: Average user feedback on art galleries
3: Average user feedback on dance clubs
4: Average user feedback on juice bars
5: Average user feedback on restaurants
6: Average user feedback on museums
7: Average user feedback on resorts
8: Average user feedback on parks and picnic spots
9: Average user feedback on beaches
10: Average user feedback on theaters
The dataset is populated by crawling TripAdvisor.com and includes reviews on destinations in 10 categories across East Asia. Each traveler's rating is mapped as follows:
Excellent (4)
Very Good (3)
Average (2)
Poor (1)
Terrible (0)
The average rating for each category per user is used.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
UCI Machine Learning Repository
    # Load the dataset
    data(review)
    # Print the first few rows
    head(review)
    # Access specific columns (note the backticks for numeric names)
    review$`1`        # User IDs
    mean(review$`5`)  # Average rating for restaurants
This dataset contains measurements of riboflavin (vitamin B2) production by Bacillus subtilis, a Gram-positive bacterium commonly used in industrial fermentation processes. The dataset includes $n = 71$ observations with $p = 4088$ predictors, representing the logarithm of the expression levels of 4088 genes. The response variable is the log-transformed riboflavin production rate.
data(riboflavin)
y: Log-transformed riboflavin production rate (original name: q_RIBFLV). This is a continuous variable indicating the efficiency of riboflavin production by the bacterial strain.
x: A matrix of dimension $71 \times 4088$ containing the logarithm of the expression levels of 4088 genes. Each column corresponds to a gene, and each row corresponds to an observation (experimental condition or time point).
The dataset is provided by DSM Nutritional Products Ltd., a leading company in the field of nutritional ingredients. The data have been preprocessed and normalized to account for technical variations in the microarray measurements.
    # Load the riboflavin dataset
    data(riboflavin)
    # Display the dimensions of the dataset
    print(dim(riboflavin$x))
    print(length(riboflavin$y))
This dataset is a subset of the riboflavin production data by Bacillus subtilis, containing $n = 71$ observations. It includes the response variable (log-transformed riboflavin production rate) and the 100 genes with the largest empirical variances from the original dataset.
data(riboflavinv100)
y: Log-transformed riboflavin production rate (original name: q_RIBFLV). This is a continuous variable indicating the efficiency of riboflavin production by the bacterial strain.
x: A matrix of dimension $71 \times 100$ containing the logarithm of the expression levels of the 100 genes with the largest empirical variances.
The dataset is provided by DSM Nutritional Products Ltd., a leading company in the field of nutritional ingredients. The data have been preprocessed and normalized.
    # Load the riboflavinv100 dataset
    data(riboflavinv100)
    # Display the dimensions of the dataset
    print(dim(riboflavinv100$x))
    print(length(riboflavinv100$y))
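The kind of variance-based screening described for this subset can be sketched in base R as follows. The matrix `x` here is simulated and the gene names are hypothetical; the sketch only illustrates how the 100 columns with the largest empirical variances might be selected, and is not the code used to build riboflavinv100.

```r
set.seed(1)
# Simulated stand-in for an expression matrix (rows = observations, columns = genes)
x <- matrix(rnorm(20 * 500), nrow = 20, ncol = 500)
colnames(x) <- paste0("gene", seq_len(ncol(x)))  # hypothetical gene names

# Empirical variance of every column
gene_var <- apply(x, 2, var)

# Indices of the 100 columns with the largest variance
top_idx <- order(gene_var, decreasing = TRUE)[seq_len(100)]
x_top <- x[, top_idx, drop = FALSE]

dim(x_top)
```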
Given a data matrix, this function performs a one-sample test for the covariance matrix. The null hypothesis is

$$H_0: \Sigma = \Sigma_0,$$

where $\Sigma$ is the covariance matrix of the data and $\Sigma_0$ is a hypothesized covariance matrix. The test is based on the procedure proposed by Srivastava, Yanagihara, and Kubokawa (2014).
syk(data, Sigma0, alpha)

| Argument | Description |
|---|---|
| data | An $n \times p$ data matrix. |
| Sigma0 | A $p \times p$ hypothesized covariance matrix. |
| alpha | Significance level of the test. |
A named list containing:
The test statistic.
The rejection threshold against which the test statistic is compared.
Logical; TRUE if the null hypothesis is rejected, and FALSE otherwise.
    p <- 5
    n <- 10
    data <- matrix(rnorm(n * p), ncol = p)
    alpha <- 0.05
    Sigma0 <- diag(ncol(data))
    syk(data, Sigma0, alpha)
Data recording taxi trip details and pricing.
taxi_trip_pricing
A data frame with trip duration, distance, and fare.
The TFM function generates truncated factor model data using methods
implemented in the tmvtnorm package. It currently supports truncated
multivariate normal and truncated multivariate Student-$t$ distributions.
TFM(n, mu, sigma, lower, upper, distribution_type, df = 4)

| Argument | Description |
|---|---|
| n | Total number of observations. |
| mu | Mean vector of the distribution. |
| sigma | Covariance matrix of the distribution. |
| lower | Lower bound of the truncation interval. |
| upper | Upper bound of the truncation interval. |
| distribution_type | A character string specifying the distribution type. Possible values are "truncated_normal" and "truncated_student". |
| df | Degrees of freedom for the truncated Student-$t$ distribution (default 4). |
A matrix containing the generated truncated factor model data.
    set.seed(123)
    n <- 100
    mu <- c(0, 1)
    sigma <- matrix(c(1, 0.7, 0.7, 3), 2, 2)
    lower <- c(-2, -3)
    upper <- c(3, 3)
    X_norm <- TFM(n, mu, sigma, lower, upper, distribution_type = "truncated_normal")
    X_t <- TFM(n, mu, sigma, lower, upper, distribution_type = "truncated_student", df = 5)
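To show what sampling from a truncated multivariate normal means, here is a minimal base-R sketch using plain rejection sampling: draw from the untruncated distribution and keep only the draws that fall inside the box defined by `lower` and `upper`. The function name `rtrunc_mvn_sketch` and the `max_iter` argument are hypothetical; TFM relies on tmvtnorm, whose algorithms may differ from this simple scheme.

```r
# Rejection sampler for a box-truncated multivariate normal (illustrative only)
rtrunc_mvn_sketch <- function(n, mu, sigma, lower, upper, max_iter = 1000) {
  p <- length(mu)
  R <- chol(sigma)                       # sigma = t(R) %*% R
  out <- matrix(numeric(0), nrow = 0, ncol = p)
  iter <- 0
  while (nrow(out) < n && iter < max_iter) {
    Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
    cand <- sweep(Z %*% R, 2, mu, "+")   # rows are N(mu, sigma) draws
    keep <- apply(cand, 1, function(x) all(x >= lower & x <= upper))
    out <- rbind(out, cand[keep, , drop = FALSE])
    iter <- iter + 1
  }
  if (nrow(out) < n) stop("acceptance rate too low for rejection sampling")
  out[seq_len(n), , drop = FALSE]
}

set.seed(123)
mu <- c(0, 1)
sigma <- matrix(c(1, 0.7, 0.7, 3), 2, 2)
lower <- c(-2, -3)
upper <- c(3, 3)
X_sketch <- rtrunc_mvn_sketch(100, mu, sigma, lower, upper)
apply(X_sketch, 2, range)
```

Rejection sampling is simple but becomes inefficient when the truncation region has small probability, which is one reason dedicated packages offer alternative samplers.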
This function performs a simple t-test for each variable in the dataset of a truncated factor model and calculates the False Discovery Rate (FDR) and power.
ttest.TFM(X, p, alpha = 0.05)

| Argument | Description |
|---|---|
| X | A matrix or data frame of simulated or observed data from a truncated factor model. |
| p | The number of variables (columns) in the dataset. |
| alpha | The significance level for the t-test. |
A list containing:
| Component | Description |
|---|---|
| FDR | The False Discovery Rate calculated from the rejected hypotheses. |
| Power | The power of the test, representing the proportion of true positives among the non-zero hypotheses. |
| pValues | A numeric vector of p-values obtained from the t-tests for each variable. |
| RejectedHypotheses | A logical vector indicating which hypotheses were rejected based on the specified significance level. |
    # Load necessary libraries
    library(MASS)
    library(mvtnorm)
    set.seed(100)
    # Set parameters for the simulation
    p <- 400              # Number of features
    n <- 120              # Number of samples
    K <- 5                # Number of latent factors
    true_non_zero <- 100  # Assume 100 features have non-zero means
    # Simulate factor loadings matrix B (p x K)
    B <- matrix(rnorm(p * K), nrow = p, ncol = K)
    # Simulate factor scores (n x K)
    FX <- MASS::mvrnorm(n, rep(0, K), diag(K))
    # Simulate noise U (n x p), assuming Student's t-distribution with 3 degrees of freedom
    U <- mvtnorm::rmvt(n, df = 3, sigma = diag(p))
    # Create the data matrix X based on the truncated factor model
    # Non-zero means for the first 100 features
    mu <- c(rep(1, true_non_zero), rep(0, p - true_non_zero))
    X <- rep(1, n) %*% t(mu) + FX %*% t(B) + U  # The observed data
    # Apply the t-test function on the data
    results <- ttest.TFM(X, p, alpha = 0.05)
    # Print the results
    print(results)
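The quantities returned above can be illustrated with a minimal base-R sketch. It runs a one-sample t-test on each column and, given a known truth vector, computes the false discovery proportion and the empirical power. The function name `ttest_fdr_sketch` and its `truth` argument are hypothetical; how ttest.TFM itself determines which hypotheses are truly non-null is not specified here.

```r
# Per-column t-tests with FDR and power against a known truth (illustrative only)
ttest_fdr_sketch <- function(X, truth, alpha = 0.05) {
  # Two-sided one-sample t-test of mean zero for each column
  pvals <- apply(X, 2, function(col) t.test(col, mu = 0)$p.value)
  rejected <- pvals < alpha
  # False discoveries divided by total discoveries (0 if nothing is rejected)
  fdr <- sum(rejected & !truth) / max(sum(rejected), 1)
  # True discoveries divided by the number of truly non-null variables
  power <- sum(rejected & truth) / max(sum(truth), 1)
  list(FDR = fdr, Power = power, pValues = pvals, RejectedHypotheses = rejected)
}

set.seed(100)
n <- 120; p <- 40
truth <- c(rep(TRUE, 10), rep(FALSE, p - 10))    # first 10 means are non-zero
mu <- ifelse(truth, 1, 0)
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p) + rep(1, n) %*% t(mu)
res_t <- ttest_fdr_sketch(X_sim, truth, alpha = 0.05)
res_t[c("FDR", "Power")]
```

No multiplicity correction is applied in this sketch, so the false discovery proportion can exceed `alpha` when many null variables are tested.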
Physicochemical tests of white Portuguese "Vinho Verde" wine.
winequality.white
A data frame with chemical properties and quality score.
Data concerning the hydrodynamics of sailing yachts.
yacht_hydrodynamics
A data frame with hull geometry and resistance.