| Title: | The LIC Criterion for Optimal Subset Selection |
|---|---|
| Description: | The LIC criterion identifies the most informative subsets, so that the selected subset retains most of the information contained in the complete data. The philosophy of the package is described in Guo G. (2022) <doi:10.1080/02664763.2022.2053949>. |
| Authors: | Guangbao Guo [aut, cre], Yue Sun [aut], Guoqi Qian [aut], Qian Wang [aut] |
| Maintainer: | Guangbao Guo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.2 |
| Built: | 2026-05-19 06:50:27 UTC |
| Source: | https://github.com/cran/LIC |
The Airfoil self-noise data set
data("airfoil")
A data frame with 1503 observations on the following 6 variables.
V1: a numeric vector
V2: a numeric vector
V3: a numeric vector
V4: a numeric vector
V5: a numeric vector
V6: a numeric vector
The data set contains 1503 observations of 6 variables. The scaled sound pressure level is the dependent variable; the other five are independent variables.
The Airfoil Self-Noise data set is a NASA data set obtained from the UCI database.
T.F. Brooks, D.S. Pope, and A.M. Marcolini. Airfoil self-noise and prediction. Technical report, NASA RP-1218, July 1989.
data(airfoil) ## maybe str(airfoil) ; plot(airfoil) ...
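The functions in this package take a design matrix and a response vector rather than a data frame. The following is a minimal sketch of one way to split a data frame accordingly. `split_response` is a hypothetical helper, not part of the package, and the choice of `V6` as the scaled sound pressure level is an assumption based on the UCI column ordering.

```r
# Hypothetical helper (not part of LIC): split a data frame into a design
# matrix X and a one-column response matrix Y.
split_response <- function(df, response) {
  stopifnot(response %in% names(df))
  predictors <- setdiff(names(df), response)
  list(
    X = as.matrix(df[, predictors, drop = FALSE]),
    Y = as.matrix(df[[response]])
  )
}

# Only runs when the LIC package is installed.
if (requireNamespace("LIC", quietly = TRUE)) {
  data("airfoil", package = "LIC")
  # Assumption: V6 is the scaled sound pressure level (the response).
  parts <- split_response(airfoil, "V6")
  print(dim(parts$X))
}
```

The resulting `parts$X` and `parts$Y` can then be passed as the `X` and `Y` arguments of the functions documented below.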
The real estate valuation data set.
data("estate")
A data frame with 414 observations on the following 8 variables.
No: a numeric vector
X1.transaction.date: a numeric vector
X2.house.age: a numeric vector
X3.distance.to.the.nearest.MRT.station: a numeric vector
X4.number.of.convenience.stores: a numeric vector
X5.latitude: a numeric vector
X6.longitude: a numeric vector
Y.house.price.of.unit.area: a numeric vector
The real estate valuation data set contains 414 real estate price records with 5 independent variables. The dependent variable is the house price per unit area.
The data set is from Xindian District, New Taipei City, Taiwan.
Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
data(estate) ## maybe str(estate) ; plot(estate) ...
The gas turbine NOx emission data set.
data("gt2015")
A data frame with 7384 observations on the following 11 variables.
AT: a numeric vector
AP: a numeric vector
AH: a numeric vector
AFDP: a numeric vector
GTEP: a numeric vector
TIT: a numeric vector
TAT: a numeric vector
TEY: a numeric vector
CDP: a numeric vector
CO: a numeric vector
NOX: a numeric vector
To predict nitrogen oxide emissions, we use the gas turbine NOx emission data set from the UCI database, which contains 36,733 instances of 11 sensor measurements. The pollutant emission factors of gas turbines include 9 variables. We select 7,200 data points from 2015.
The gas turbine NOx emission data set is from the UCI database.
data(gt2015) ## maybe str(gt2015) ; plot(gt2015) ...
The LIC criterion identifies the most informative subsets, so that the selected subset retains most of the information contained in the complete data.
LIC(X, Y, alpha, K, nk)

| Argument | Description |
|---|---|
| X | a design matrix |
| Y | a random response vector of observed values |
| alpha | the significance level |
| K | the number of subsets |
| nk | the sample size of subsets |

MUopt, Bopt, MAEMUopt, MSEMUopt, opt, Yopt
library(LIC)
set.seed(12)
X <- matrix(data = sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)  # design matrix
b <- sample(1:3, 5, replace = TRUE)  # true coefficients
e <- rnorm(1200, 0, 1)               # standard normal noise
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K                       # 120 observations per subset
LIC(X, Y, alpha, K, nk)
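The arguments `K` and `nk` describe how many subsets are considered and how many observations each one holds. The base-R sketch below only illustrates that setup on the simulated data from the example. It is not the package's implementation: `make_subsets` is a hypothetical helper, and drawing the subsets as disjoint random blocks is an assumption.

```r
# Hypothetical helper (not part of LIC): split row indices 1..n into K
# disjoint random subsets of nk rows each.
make_subsets <- function(n, K, nk) {
  stopifnot(K * nk <= n)                 # cannot use more rows than exist
  shuffled <- sample.int(n)[seq_len(K * nk)]
  split(shuffled, rep(seq_len(K), each = nk))
}

set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
Y <- X %*% b + rnorm(1200, 0, 1)

K <- 10
nk <- 1200 / K
subsets <- make_subsets(nrow(X), K, nk)

# Rows of the first subset, ready for a per-subset fit.
X1 <- X[subsets[[1]], , drop = FALSE]
Y1 <- Y[subsets[[1]], , drop = FALSE]
dim(X1)
```

With `K * nk` equal to the number of rows, as here, every observation lands in exactly one subset.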
Opt1 chooses the optimal index subset by minimizing the interval length.
Opt1(X, Y, alpha, K, nk)

| Argument | Description |
|---|---|
| X | a design matrix |
| Y | a random response vector of observed values |
| alpha | the significance level |
| K | the number of subsets |
| nk | the sample size of subsets |

MUopt1, Bopt1, MAEMUopt1, MSEMUopt1, opt1, Yopt1
library(LIC)
set.seed(12)
X <- matrix(data = sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)  # design matrix
b <- sample(1:3, 5, replace = TRUE)  # true coefficients
e <- rnorm(1200, 0, 1)               # standard normal noise
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K                       # 120 observations per subset
Opt1(X, Y, alpha, K, nk)
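The idea of selecting a subset by interval length can be sketched in base R. The exact interval used by `Opt1` is not stated here, so the sketch makes an assumption: it scores each subset by the mean length of the (1 - alpha) confidence intervals of its least-squares coefficients and keeps the subset with the smallest score. `ci_length` is a hypothetical helper, and the disjoint random partition is also an assumption.

```r
# Hypothetical helper (not part of LIC): mean length of the (1 - alpha)
# confidence intervals of the least-squares coefficients on one subset.
ci_length <- function(Xk, Yk, alpha) {
  p <- ncol(Xk)
  dof <- nrow(Xk) - p
  XtX_inv <- solve(crossprod(Xk))
  beta <- XtX_inv %*% crossprod(Xk, Yk)
  sigma2 <- sum((Yk - Xk %*% beta)^2) / dof   # residual variance
  half_width <- qt(1 - alpha / 2, dof) * sqrt(sigma2 * diag(XtX_inv))
  mean(2 * half_width)
}

set.seed(12)
n <- 1200
X <- matrix(sample(1:3, n * 5, replace = TRUE), nrow = n, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
Y <- X %*% b + rnorm(n, 0, 1)
alpha <- 0.05
K <- 10
nk <- n / K

# Assumption: K disjoint random subsets of nk rows each.
idx <- split(sample.int(n), rep(seq_len(K), each = nk))
lens <- vapply(
  idx,
  function(i) ci_length(X[i, , drop = FALSE], Y[i, , drop = FALSE], alpha),
  numeric(1)
)
best <- which.min(lens)   # subset with the shortest intervals
```

Here `idx[[best]]` holds the row indices of the selected subset under this stand-in criterion.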
Opt2 chooses the optimal index subset by maximizing the information sub-matrix.
Opt2(X, Y, alpha, K, nk)

| Argument | Description |
|---|---|
| X | a design matrix |
| Y | a random response vector of observed values |
| alpha | the significance level |
| K | the number of subsets |
| nk | the sample size of subsets |

MUopt2, Bopt2, MAEMUopt2, MSEMUopt2, opt2, Yopt2
library(LIC)
set.seed(12)
X <- matrix(data = sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)  # design matrix
b <- sample(1:3, 5, replace = TRUE)  # true coefficients
e <- rnorm(1200, 0, 1)               # standard normal noise
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K                       # 120 observations per subset
Opt2(X, Y, alpha, K, nk)
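A possible reading of "maximized information sub-matrix" can be sketched in base R. The sketch assumes that the information matrix of a subset is the cross-product of its design rows and that its size is measured by the determinant; neither is confirmed by this page. `log_det_info` is a hypothetical helper, and the disjoint random partition is an assumption.

```r
# Hypothetical helper (not part of LIC): log-determinant of t(Xk) %*% Xk,
# computed on the log scale to avoid overflow.
log_det_info <- function(Xk) {
  as.numeric(determinant(crossprod(Xk), logarithm = TRUE)$modulus)
}

set.seed(12)
n <- 1200
X <- matrix(sample(1:3, n * 5, replace = TRUE), nrow = n, ncol = 5)
K <- 10
nk <- n / K

# Assumption: K disjoint random subsets of nk rows each.
idx <- split(sample.int(n), rep(seq_len(K), each = nk))
scores <- vapply(
  idx,
  function(i) log_det_info(X[i, , drop = FALSE]),
  numeric(1)
)
best <- which.max(scores)              # subset with the largest determinant
X_opt <- X[idx[[best]], , drop = FALSE]
```

Because the score depends only on the design rows, the response `Y` is not needed for this selection step in the sketch.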
OSA gives a simple average estimator by averaging the least squares estimators from all subsets.
OSA(X, Y, alpha, K, nk)

| Argument | Description |
|---|---|
| X | a design matrix |
| Y | a random response vector of observed values |
| alpha | the significance level |
| K | the number of subsets |
| nk | the sample size of subsets |

MUA, BetaA, MAEMUA, MSEMUA
library(LIC)
set.seed(12)
X <- matrix(data = sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)  # design matrix
b <- sample(1:3, 5, replace = TRUE)  # true coefficients
e <- rnorm(1200, 0, 1)               # standard normal noise
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K                       # 120 observations per subset
OSA(X, Y, alpha, K, nk)
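Averaging least squares estimators across subsets can be sketched in a few lines of base R. This is an illustration under assumptions, not the package's code: `subset_ls` is a hypothetical helper, the subsets are taken to be disjoint random blocks, and the model has no intercept, as in the simulated example.

```r
# Hypothetical helper (not part of LIC): least-squares coefficients on one subset.
subset_ls <- function(Xk, Yk) {
  drop(solve(crossprod(Xk), crossprod(Xk, Yk)))
}

set.seed(12)
n <- 1200
X <- matrix(sample(1:3, n * 5, replace = TRUE), nrow = n, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
Y <- X %*% b + rnorm(n, 0, 1)
K <- 10
nk <- n / K

# Assumption: K disjoint random subsets of nk rows each.
idx <- split(sample.int(n), rep(seq_len(K), each = nk))

# One column of coefficients per subset (a 5 x K matrix).
betas <- vapply(
  idx,
  function(i) subset_ls(X[i, , drop = FALSE], Y[i, , drop = FALSE]),
  numeric(ncol(X))
)

beta_avg <- rowMeans(betas)   # simple average of the K estimators
mu_avg <- X %*% beta_avg      # fitted mean response under the averaged estimator
```

Comparing `beta_avg` with `b` shows how close the averaged estimator is to the coefficients used in the simulation.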
OSM is a median-based processing method for the central processor.
OSM(X, Y, alpha, K, nk)

| Argument | Description |
|---|---|
| X | a design matrix |
| Y | a random response vector of observed values |
| alpha | the significance level |
| K | the number of subsets |
| nk | the sample size of subsets |

MUM, BetaM, MAEMUM, MSEMUM
library(LIC)
set.seed(12)
X <- matrix(data = sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)  # design matrix
b <- sample(1:3, 5, replace = TRUE)  # true coefficients
e <- rnorm(1200, 0, 1)               # standard normal noise
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K                       # 120 observations per subset
OSM(X, Y, alpha, K, nk)
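One plausible reading of a median method at the central processor is that each subset reports its own least squares estimate and the central step takes a coordinate-wise median. The base-R sketch below illustrates that reading only; it is an assumption, not the package's implementation, and `worker_fit` is a hypothetical name.

```r
# Hypothetical worker step (not part of LIC): least-squares fit on one subset.
worker_fit <- function(Xk, Yk) {
  drop(solve(crossprod(Xk), crossprod(Xk, Yk)))
}

set.seed(12)
n <- 1200
X <- matrix(sample(1:3, n * 5, replace = TRUE), nrow = n, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
Y <- X %*% b + rnorm(n, 0, 1)
K <- 10
nk <- n / K

# Assumption: K disjoint random subsets of nk rows each.
idx <- split(sample.int(n), rep(seq_len(K), each = nk))

# Each column is the estimate reported by one subset.
betas <- vapply(
  idx,
  function(i) worker_fit(X[i, , drop = FALSE], Y[i, , drop = FALSE]),
  numeric(ncol(X))
)

# Central step: coordinate-wise median across the K estimates.
beta_med <- apply(betas, 1, median)
mu_med <- X %*% beta_med
```

A median is less sensitive than a mean to a single subset that returns an unusual estimate, which is a common motivation for this kind of aggregation.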