| Title: | Online Principal Component Regression for Online Datasets |
|---|---|
| Description: | 'OPCreg' implements the online principal component regression method, which is designed to process online datasets efficiently. The method is particularly useful for large-scale, streaming data where traditional batch processing may be computationally infeasible. The philosophy of the package is described in 'Guo' (2025) <doi:10.1016/j.physa.2024.130308>. |
| Authors: | Guangbao Guo [aut, cre] (ORCID: <https://orcid.org/0000-0002-4115-6218>), Chunjie Wei [aut] |
| Maintainer: | Guangbao Guo <[email protected]> |
| License: | GPL-3 |
| Version: | 3.0.0 |
| Built: | 2026-06-01 07:03:17 UTC |
| Source: | https://github.com/cran/OPCreg |
This dataset contains measurements related to the slump test of concrete, including input variables (concrete ingredients) and output variables (slump, flow, and compressive strength).
concrete
A data frame with 103 rows and 10 columns.
Cement: Amount of cement (kg per m^3 of concrete).
Slag: Amount of slag (kg per m^3 of concrete).
Fly_ash: Amount of fly ash (kg per m^3 of concrete).
Water: Amount of water (kg per m^3 of concrete).
SP: Amount of superplasticizer (kg per m^3 of concrete).
Coarse_Aggr: Amount of coarse aggregate (kg per m^3 of concrete).
Fine_Aggr: Amount of fine aggregate (kg per m^3 of concrete).
SLUMP: Slump of the concrete (cm).
FLOW: Flow of the concrete (cm).
Compressive_Strength: 28-day compressive strength of the concrete (MPa).
The dataset includes 7 input variables (concrete ingredients) and 3 output variables (slump, flow, and compressive strength). The initial dataset had 78 data points, with an additional 25 data points added later.
The dataset assumes that all measurements are accurate and does not account for measurement errors. The slump flow of concrete is influenced by multiple factors, including water content and other ingredients.
Donor: I-Cheng Yeh
Email: icyeh 'at' chu.edu.tw
Institution: Department of Information Management, Chung-Hua University (Republic of China)
Other contact information: Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C.
    # Load the dataset
    data(concrete)
    # Print the first few rows of the dataset
    print(head(concrete))
The incremental principal component method can handle online data sets.
    IPC(data, m, eta)

| data | An online data set. |
|---|---|
| m | The number of principal components. |
| eta | The proportion of online data to total data. |

T2, T2k, V, Vhat, lambdahat, time

    library(MASS)

    # Simulate n observations of p variables driven by m latent factors
    n <- 2000; p <- 20; m <- 9
    mu <- t(matrix(rep(runif(p, 0, 1000), n), p, n))    # n x p, every row is the same mean vector
    mu0 <- as.matrix(runif(m, 0))
    sigma0 <- diag(runif(m, 1))
    F <- matrix(mvrnorm(n, mu0, sigma0), nrow = n)      # factor scores
    A <- matrix(runif(p * m, -1, 1), nrow = p)          # factor loadings
    D <- as.matrix(diag(rep(runif(p, 0, 1))))           # diagonal noise covariance
    epsilon <- matrix(mvrnorm(n, rep(0, p), D), nrow = n)
    data <- mu + F %*% t(A) + epsilon

    IPC(data = data, m = m, eta = 0.8)
The IPCR function implements an incremental Principal Component Regression (PCR) method designed to handle online datasets. It updates the principal components recursively as new data arrives, making it suitable for real-time data processing.
    IPCR(data, eta, m, alpha)

| data | A data frame where the first column is the response variable and the remaining columns are predictor variables. |
|---|---|
| eta | The proportion of the sample used to initialize the principal components (0 < eta < 1). Default is 0.0035. |
| m | The number of principal components to retain. Default is 3. |
| alpha | The significance level used for calculating critical values. Default is 0.05. |
The IPCR function performs the following steps:
1. Standardizes the predictor variables.
2. Initializes the principal components using the first n0 = round(eta * n) samples.
3. Recursively updates the principal components as each new sample arrives.
4. Fits a linear regression model using the principal component scores.
5. Back-transforms the regression coefficients to the original scale.
This method is particularly useful for datasets where new observations are continuously added, and the model needs to be updated incrementally.
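The five steps above can be traced in a short base R sketch. It is an illustration only, not the source of `IPCR`: the helper name `ipcr_sketch`, the rank-one covariance recursion used for step 3, and the guard that forces at least two initial rows are all assumptions made for this example. With n = 2000 and eta = 0.0035, step 2 gives n0 = round(7) = 7 initial samples.

```r
# Illustrative incremental PCR in base R (hypothetical helper, not IPCR() itself).
ipcr_sketch <- function(data, eta = 0.0035, m = 3) {
  y <- data[[1]]
  X <- as.matrix(data[, -1, drop = FALSE])
  n <- nrow(X)

  # Step 1: standardize the predictors.
  ctr <- colMeans(X)
  scl <- apply(X, 2, sd)
  Z <- scale(X, center = ctr, scale = scl)

  # Step 2: initialize mean and covariance from the first n0 rows
  # (assumption: at least 2 rows so that a covariance exists).
  n0 <- max(2, round(eta * n))
  mu <- colMeans(Z[1:n0, , drop = FALSE])
  S <- cov(Z[1:n0, , drop = FALSE])

  # Step 3: rank-one update of mean and covariance for each new row
  # (assumed update rule; the package may use a different recursion).
  if (n0 < n) {
    for (k in n0:(n - 1)) {
      d <- Z[k + 1, ] - mu
      S <- (k - 1) / k * S + tcrossprod(d) / (k + 1)
      mu <- mu + d / (k + 1)
    }
  }
  V <- eigen(S, symmetric = TRUE)$vectors[, seq_len(m), drop = FALSE]

  # Step 4: regress the response on the component scores.
  scores <- Z %*% V
  gamma <- coef(lm(y ~ scores))

  # Step 5: map the coefficients back to the original predictor scale.
  beta <- as.vector(V %*% gamma[-1]) / scl
  Bhat <- c(gamma[1] - sum(beta * ctr), beta)
  names(Bhat) <- c("(Intercept)", colnames(X))
  yhat <- as.vector(cbind(1, X) %*% Bhat)

  list(Bhat = Bhat, yhat = yhat, RMSE = sqrt(mean((y - yhat)^2)), S = S)
}

# Usage on simulated data (all values below are made up for the example).
set.seed(1)
n <- 300
X <- matrix(rnorm(n * 4), n, 4) %*% matrix(runif(16, -1, 1), 4, 4)
colnames(X) <- paste0("x", 1:4)
y <- as.vector(1 + X %*% c(2, -1, 0.5, 3) + rnorm(n))
sim <- data.frame(y = y, X)
res <- ipcr_sketch(sim, eta = 0.05, m = 2)
res$RMSE
```

Because the sketch standardizes with full-sample means and standard deviations, the covariance after the last update coincides with the batch covariance of the standardized predictors. A genuinely streaming version would also have to maintain running means and scales.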
A list containing the following elements:
| Bhat | The estimated regression coefficients, including the intercept. |
|---|---|
| RMSE | The Root Mean Square Error of the regression model. |
| summary | The summary of the linear regression model. |
| yhat | The predicted values of the response variable. |
lm: For fitting linear models.
eigen: For computing eigenvalues and eigenvectors.
    ## Not run: 
    set.seed(1234)
    library(MASS)
    n <- 2000
    p <- 10
    mu0 <- as.matrix(runif(p, 0))
    sigma0 <- as.matrix(runif(p, 0, 10))
    ro <- as.matrix(c(runif(round(p / 2), -1, -0.8), runif(p - round(p / 2), 0.8, 1)))
    R0 <- ro %*% t(ro)
    diag(R0) <- 1
    Sigma0 <- sigma0 %*% t(sigma0) * R0
    x <- mvrnorm(n, mu0, Sigma0)
    colnames(x) <- paste("x", 1:p, sep = "")
    e <- rnorm(n, 0, 1)
    B <- sample(1:3, (p + 1), replace = TRUE)
    en <- matrix(rep(1, n * 1), ncol = 1)
    y <- cbind(en, x) %*% B + e
    colnames(y) <- paste("y")
    data <- data.frame(cbind(y, x))
    result <- IPCR(data = data, m = 3, eta = 0.0035, alpha = 0.05)
    print(result$Bhat)
    print(result$yhat)
    print(result$RMSE)
    print(result$summary)
    ## End(Not run)
The PCR function performs Principal Component Regression (PCR) on a given dataset.
It standardizes the predictor variables, determines the number of principal components to retain based on a specified threshold,
and fits a linear regression model using the principal component scores.
    PCR(data, threshold)

| data | A data frame where the first column is the response variable and the remaining columns are predictor variables. |
|---|---|
| threshold | The proportion of variance to retain in the principal components (default is 0.95). |
The function performs the following steps:
1. Standardize the predictor variables.
2. Compute the covariance matrix of the standardized predictors.
3. Perform eigen decomposition on the covariance matrix to obtain principal components.
4. Determine the number of principal components to retain based on the cumulative explained variance exceeding the specified threshold.
5. Project the standardized predictors onto the retained principal components.
6. Fit a linear regression model using the principal component scores.
7. Back-transform the regression coefficients to the original scale.
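The seven steps can be followed line by line in a small base R function. This is a sketch under stated assumptions rather than the source of `PCR`: the name `pcr_sketch` is hypothetical, and the component count is taken as the first index at which the cumulative variance share reaches the threshold (using `>=`), which may differ from the package's exact rule.

```r
# Illustrative batch PCR in base R (hypothetical helper, not PCR() itself).
pcr_sketch <- function(data, threshold = 0.95) {
  y <- data[[1]]
  X <- as.matrix(data[, -1, drop = FALSE])

  # 1. Standardize the predictors.
  ctr <- colMeans(X)
  scl <- apply(X, 2, sd)
  Z <- scale(X, center = ctr, scale = scl)

  # 2-3. Covariance matrix of Z and its eigen decomposition.
  eig <- eigen(cov(Z), symmetric = TRUE)

  # 4. Number of components needed to reach the variance threshold
  #    (assumption: ">=" rather than strict ">").
  share <- cumsum(eig$values) / sum(eig$values)
  m <- which(share >= threshold)[1]

  # 5. Scores on the retained components.
  V <- eig$vectors[, seq_len(m), drop = FALSE]
  scores <- Z %*% V

  # 6. Linear regression on the scores.
  gamma <- coef(lm(y ~ scores))

  # 7. Back-transform: undo the rotation, then the scaling and centering.
  beta <- as.vector(V %*% gamma[-1]) / scl
  Bhat <- c(gamma[1] - sum(beta * ctr), beta)
  names(Bhat) <- c("(Intercept)", colnames(X))
  yhat <- as.vector(cbind(1, X) %*% Bhat)

  list(Bhat = Bhat, yhat = yhat, RMSE = sqrt(mean((y - yhat)^2)), m = m)
}

# Usage on simulated data (values are made up for the example).
set.seed(1)
n <- 300
X <- matrix(rnorm(n * 4), n, 4) %*% matrix(runif(16, -1, 1), 4, 4)
colnames(X) <- paste0("x", 1:4)
y <- as.vector(1 + X %*% c(2, -1, 0.5, 3) + rnorm(n))
sim <- data.frame(y = y, X)
res <- pcr_sketch(sim, threshold = 0.9)
res$m
res$Bhat
```

Step 7 follows from writing the fitted value as an intercept plus the standardized predictors times the rotated coefficients: dividing by the column standard deviations gives the slopes on the original scale, and the centering terms are absorbed into the intercept.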
A list containing the following elements:
| Bhat | The estimated regression coefficients, including the intercept. |
|---|---|
| RMSE | The Root Mean Square Error of the regression model. |
| summary | The summary of the linear regression model. |
| yhat | The predicted values of the response variable. |
lm: For fitting linear models.
eigen: For computing eigenvalues and eigenvectors.
    ## Not run: 
    # Example data
    library(MASS)  # provides mvrnorm()
    set.seed(1234)
    n <- 2000
    p <- 10
    mu0 <- as.matrix(runif(p, 0))
    sigma0 <- as.matrix(runif(p, 0, 10))
    ro <- as.matrix(c(runif(round(p / 2), -1, -0.8), runif(p - round(p / 2), 0.8, 1)))
    R0 <- ro %*% t(ro)
    diag(R0) <- 1
    Sigma0 <- sigma0 %*% t(sigma0) * R0
    x <- mvrnorm(n, mu0, Sigma0)
    colnames(x) <- paste("x", 1:p, sep = "")
    e <- rnorm(n, 0, 1)
    B <- sample(1:3, (p + 1), replace = TRUE)
    en <- matrix(rep(1, n * 1), ncol = 1)
    y <- cbind(en, x) %*% B + e
    colnames(y) <- paste("y")
    data <- data.frame(cbind(y, x))

    # Call the PCR function
    result <- PCR(data, threshold = 0.9)

    # Access the estimated regression coefficients
    print(Bhat <- result$Bhat)

    # Access the predicted values
    print(yhat <- result$yhat)

    # Print the summary of the regression model
    print(result$summary)

    # Print the RMSE
    print(paste("RMSE:", result$RMSE))
    ## End(Not run)
The perturbation principal component method can handle online data sets.
    PPC(data, m, eta)

| data | An online data set. |
|---|---|
| m | The number of principal components. |
| eta | The proportion of online data to total data. |

T2, T2k, V, Vhat, lambdahat, time

    library(MASS)

    # Simulate n observations of p variables driven by m latent factors
    n <- 2000; p <- 20; m <- 9
    mu <- t(matrix(rep(runif(p, 0, 1000), n), p, n))    # n x p, every row is the same mean vector
    mu0 <- as.matrix(runif(m, 0))
    sigma0 <- diag(runif(m, 1))
    F <- matrix(mvrnorm(n, mu0, sigma0), nrow = n)      # factor scores
    A <- matrix(runif(p * m, -1, 1), nrow = p)          # factor loadings
    D <- as.matrix(diag(rep(runif(p, 0, 1))))           # diagonal noise covariance
    epsilon <- matrix(mvrnorm(n, rep(0, p), D), nrow = n)
    data <- mu + F %*% t(A) + epsilon

    PPC(data = data, m = m, eta = 0.8)
This function performs Perturbation-based Principal Component Regression (PPCR) on the provided dataset. It combines Principal Component Analysis (PCA) with linear regression, incorporating perturbation to enhance robustness.
    PPCR(data, eta = 0.0035, m = 3, alpha = 0.05, perturbation_factor = 0.1)

| data | A data frame containing the response variable and predictors. |
|---|---|
| eta | A proportion (between 0 and 1) determining the initial sample size for PCA. |
| m | The number of principal components to retain. |
| alpha | Significance level (currently not used in the function). |
| perturbation_factor | A factor controlling the magnitude of perturbation added to the principal components. |
The function first standardizes the predictors, then performs PCA on an initial subset of the data. It iteratively updates the principal components by incorporating new observations and adding random perturbations. Finally, it fits a linear regression model using the principal components as predictors and transforms the coefficients back to the original space.
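The text does not say how the random perturbation enters the update, so the following is only one possible reading, not the behaviour of `PPCR`. It assumes Gaussian noise with standard deviation `perturbation_factor` is added to the loading matrix, after which the columns are re-orthonormalized with a QR decomposition. The helper name `perturb_loadings` is hypothetical.

```r
# One possible reading of "adding random perturbations" (assumption, not PPCR() itself).
perturb_loadings <- function(V, perturbation_factor = 0.1) {
  # Gaussian noise with sd = perturbation_factor, same shape as V (assumed noise model).
  noise <- matrix(rnorm(length(V), sd = perturbation_factor), nrow(V), ncol(V))

  # Re-orthonormalize the perturbed loadings.
  Q <- qr.Q(qr(V + noise))

  # Flip signs so each column points the same way as the original column.
  flip <- sign(colSums(Q * V))
  flip[flip == 0] <- 1
  sweep(Q, 2, flip, `*`)
}

# Usage: perturb the first three principal directions of simulated data.
set.seed(42)
X <- scale(matrix(rnorm(200 * 5), 200, 5))
V <- eigen(cov(X), symmetric = TRUE)$vectors[, 1:3]
V_pert <- perturb_loadings(V, perturbation_factor = 0.1)
crossprod(V_pert)  # columns remain orthonormal up to rounding error
```

A larger `perturbation_factor` moves the perturbed directions further from the original loadings, while the QR step keeps them orthonormal so they can still be used to compute scores.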
A list containing the following components:
| Bhat | Estimated regression coefficients in the original space. |
|---|---|
| RMSE | Root Mean Squared Error of the regression model. |
| summary | Summary of the linear regression model. |
| Vhat | Estimated principal components. |
| lambdahat | Estimated eigenvalues. |
| yhat | Predicted values from the regression model. |
lm: For linear regression models.
prcomp: For principal component analysis.
    ## Not run: 
    # Example data
    library(MASS)  # provides mvrnorm()
    set.seed(1234)
    n <- 2000
    p <- 10
    mu0 <- as.matrix(runif(p, 0))
    sigma0 <- as.matrix(runif(p, 0, 10))
    ro <- as.matrix(c(runif(round(p / 2), -1, -0.8), runif(p - round(p / 2), 0.8, 1)))
    R0 <- ro %*% t(ro)
    diag(R0) <- 1
    Sigma0 <- sigma0 %*% t(sigma0) * R0
    x <- mvrnorm(n, mu0, Sigma0)
    colnames(x) <- paste("x", 1:p, sep = "")
    e <- rnorm(n, 0, 1)
    B <- sample(1:3, (p + 1), replace = TRUE)
    en <- matrix(rep(1, n * 1), ncol = 1)
    y <- cbind(en, x) %*% B + e
    colnames(y) <- paste("y")
    data <- data.frame(cbind(y, x))

    # Call the PPCR function
    result <- PPCR(data, eta = 0.0035, m = 3, alpha = 0.05, perturbation_factor = 0.1)

    # Print results
    print(result$Bhat)     # Estimated regression coefficients
    print(result$RMSE)     # RMSE of the model
    print(result$summary)  # Summary of the regression model
    ## End(Not run)
This dataset contains protein sequences and their corresponding secondary structures, including beta-sheets (E), helices (H), and coils (_).
protein
A data frame with multiple rows and columns representing protein sequences and their secondary structures.
Sequence: Amino acid sequence (using 3-letter codes).
Structure: Secondary structure of the protein (E for beta-sheet, H for helix, _ for coil).
Parameters: Additional parameters for neural networks (to be ignored).
Biophysical_Constants: Biophysical constants (to be ignored).
The dataset is used for predicting protein secondary structures from amino acid sequences. The first few numbers in each sequence are parameters for neural networks and should be ignored. The '<' symbol is used as a spacer between proteins and to mark the beginning and end of sequences.
The biophysical constants included in the dataset were found to be unhelpful and are generally ignored in analysis.
Vince G. Sigillito, Applied Physics Laboratory, Johns Hopkins University.
    # Load the dataset
    data(protein)
    # Print the first few rows of the dataset
    print(head(protein))
The stochastic approximate component method can handle online data sets.
    SAPC(data, m, eta, alpha)

| data | An online data set. |
|---|---|
| m | The number of principal components. |
| eta | The proportion of online data to total data. |
| alpha | The step size. |

T2, T2k, V, Vhat, lambdahat, time
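To give a feel for what a step size does in a stochastic approximation of principal components, here is a textbook Oja-type update written in base R. Whether `SAPC` uses this exact rule is not stated here, so the update, the decaying step `alpha / k`, the random orthonormal start, and the helper name `sa_pca_sketch` are all assumptions made for illustration.

```r
# Oja-type stochastic approximation of the leading m principal directions
# (generic textbook scheme; an assumption, not SAPC() itself).
sa_pca_sketch <- function(X, m, alpha = 1) {
  # Simplification: center with the full-sample mean instead of a running mean.
  X <- scale(X, center = TRUE, scale = FALSE)
  p <- ncol(X)

  # Random orthonormal starting basis.
  V <- qr.Q(qr(matrix(rnorm(p * m), p, m)))

  for (k in seq_len(nrow(X))) {
    x <- X[k, ]
    # Move V towards x x' V with a step that shrinks as more rows arrive.
    V <- V + (alpha / k) * outer(x, drop(crossprod(V, x)))
    # Restore orthonormal columns after each step.
    V <- qr.Q(qr(V))
  }
  V
}

# Usage on simulated data with two dominant directions (made-up values).
set.seed(7)
X <- matrix(rnorm(500 * 6), 500, 6) %*% diag(c(5, 3, 1, 1, 1, 1))
V_hat <- sa_pca_sketch(X, m = 2, alpha = 1)
dim(V_hat)
```

Each observation is used once and then discarded, so memory use depends on p and m but not on the number of rows. A larger `alpha` makes the estimate react faster to new rows at the cost of more noise.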
    library(MASS)

    # Simulate n observations of p variables driven by m latent factors
    n <- 2000; p <- 20; m <- 9
    mu <- t(matrix(rep(runif(p, 0, 1000), n), p, n))    # n x p, every row is the same mean vector
    mu0 <- as.matrix(runif(m, 0))
    sigma0 <- diag(runif(m, 1))
    F <- matrix(mvrnorm(n, mu0, sigma0), nrow = n)      # factor scores
    A <- matrix(runif(p * m, -1, 1), nrow = p)          # factor loadings
    D <- as.matrix(diag(rep(runif(p, 0, 1))))           # diagonal noise covariance
    epsilon <- matrix(mvrnorm(n, rep(0, p), D), nrow = n)
    data <- mu + F %*% t(A) + epsilon

    SAPC(data = data, m = m, eta = 0.8, alpha = 1)
The stochastic principal component method can handle online data sets.
    SPCR(data, eta, m)

| data | A data frame containing the response variable and predictors. |
|---|---|
| eta | A proportion (between 0 and 1) determining the initial sample size for PCA. |
| m | The number of principal components to retain. |
A list containing the following elements:
| Bhat | The estimated regression coefficients, including the intercept. |
|---|---|
| RMSE | The Root Mean Square Error of the regression model. |
| summary | The summary of the linear regression model. |
| yhat | The predicted values of the response variable. |
    # Example data
    library(MASS); library(stats)
    set.seed(1234)
    n <- 2000
    p <- 10
    mu0 <- as.matrix(runif(p, 0))
    sigma0 <- as.matrix(runif(p, 0, 10))
    ro <- as.matrix(c(runif(round(p / 2), -1, -0.8), runif(p - round(p / 2), 0.8, 1)))
    R0 <- ro %*% t(ro)
    diag(R0) <- 1
    Sigma0 <- sigma0 %*% t(sigma0) * R0
    x <- mvrnorm(n, mu0, Sigma0)
    colnames(x) <- paste("x", 1:p, sep = "")
    e <- rnorm(n, 0, 1)
    B <- sample(1:3, (p + 1), replace = TRUE)
    en <- matrix(rep(1, n * 1), ncol = 1)
    y <- cbind(en, x) %*% B + e
    colnames(y) <- paste("y")
    data <- data.frame(cbind(y, x))
    result <- SPCR(data, eta = 0.0035, m = 3)
The stochastic principal component regression with varying learning-rate can handle online data sets.
    spcrl(data, m, eta, alpha)

| data | An online data set. |
|---|---|
| m | The number of principal components. |
| eta | The proportion of online data to total data. |
| alpha | The step size. |

T2, T2k, V, Vhat, lambdahat, time

    library(MASS)
    n <- 2000
    p <- 20
    m <- 9
    mu <- t(matrix(rep(runif(p, 0, 1000), n), p, n))
    mu0 <- as.matrix(runif(p, 0))
    sigma0 <- diag(runif(p, 1, 10))
    ro <- as.matrix(c(runif(round(p/2), -1, -0.8), runif(p - round(p/2), 0.8, 1)))
    R0 <- ro %*% t(ro)
    diag(R0) <- 1
    Sigma0 <- sigma0 %*% R0 %*% sigma0
    x <- mvrnorm(n, mu0, Sigma0)
    colnames(x) <- paste0("x", 1:p)
    e <- rnorm(n, 0, 1)
    B <- sample(1:3, (p + 1), replace = TRUE)
    en <- matrix(rep(1, n), ncol = 1)
    y <- cbind(en, x) %*% B + e
    colnames(y) <- "y"
    data <- data.frame(cbind(y, x))
    spcrl(data = data, m = m, eta = 0.8, alpha = 0.5)
This dataset contains the hydrodynamic characteristics of sailing yachts, including design parameters and performance metrics.
yacht_hydrodynamics
A data frame with 308 rows and 7 columns.
Residuary Resistance: Residuary resistance per unit weight of displacement (performance metric).
Longitudinal Position of Center of Buoyancy: Longitudinal position of the center of buoyancy.
Prismatic Coefficient: Prismatic coefficient.
Length-Displacement Ratio: Length-displacement ratio.
Beam-Draft Ratio: Beam-draft ratio.
Length-Beam Ratio: Length-beam ratio.
Froude Number: Froude number.
The dataset contains hydrodynamic data for sailing yachts, with the goal of predicting the residuary resistance from various design parameters.
The dataset is commonly used for regression analysis and machine learning tasks to model the relationship between design parameters and performance metrics.
UCI Machine Learning Repository
    # Load the dataset
    data(yacht_hydrodynamics)
    # Print the first few rows of the dataset
    print(head(yacht_hydrodynamics))