Title: | The LIC Criterion for Optimal Subset Selection |
---|---|
Description: | The LIC criterion determines the most informative subsets, so that a chosen subset retains most of the information contained in the complete data. The philosophy of the package is described in Guo G. (2022) <doi:10.1080/02664763.2022.2053949>. |
Authors: | Guangbao Guo [aut, cre], Yue Sun [aut], Guoqi Qian [aut], Qian Wang [aut] |
Maintainer: | Guangbao Guo <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.2 |
Built: | 2025-02-16 05:01:09 UTC |
Source: | https://github.com/cran/LIC |
The Airfoil self-noise data set
data("airfoil")
A data frame with 1503 observations on the following 6 variables.
V1: a numeric vector
V2: a numeric vector
V3: a numeric vector
V4: a numeric vector
V5: a numeric vector
V6: a numeric vector
The data set contains 1503 data points on the 6 variables listed above. The scaled sound pressure level is the dependent variable; the other five are independent variables.
The Airfoil Self-Noise data set is a NASA data set taken from the UCI Machine Learning Repository.
T.F. Brooks, D.S. Pope, and A.M. Marcolini. Airfoil self-noise and prediction. Technical report, NASA RP-1218, July 1989.
data(airfoil) ## maybe str(airfoil) ; plot(airfoil) ...
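As a sketch, the data set can be passed to the package's LIC() function. This assumes V6 (the scaled sound pressure level) is the response and V1-V5 are the predictors, and that K and nk are chosen so that K * nk equals the sample size (1503 = 3 * 501):

```r
library(LIC)

data(airfoil)
X <- as.matrix(airfoil[, 1:5])   # five independent variables V1-V5
Y <- airfoil$V6                  # scaled sound pressure level (response)
K  <- 3                          # number of subsets
nk <- nrow(airfoil) / K          # 1503 / 3 = 501 observations per subset
res <- LIC(X, Y, alpha = 0.05, K = K, nk = nk)
str(res)
```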
The real estate valuation data set.
data("estate")
A data frame with 414 observations on the following 8 variables.
No: a numeric vector
X1.transaction.date: a numeric vector
X2.house.age: a numeric vector
X3.distance.to.the.nearest.MRT.station: a numeric vector
X4.number.of.convenience.stores: a numeric vector
X5.latitude: a numeric vector
X6.longitude: a numeric vector
Y.house.price.of.unit.area: a numeric vector
The real estate valuation data set contains 414 real estate records with six independent variables (X1-X6). The dependent variable is the house price per unit area.
The data set is from Xindian District, New Taipei City, Taiwan.
Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
data(estate) ## maybe str(estate) ; plot(estate) ...
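A sketch of using this data set with LIC(), assuming the "No" index column is dropped, X1-X6 serve as predictors, and K divides the sample size evenly (414 = 6 * 69):

```r
library(LIC)

data(estate)
X <- as.matrix(estate[, 2:7])            # predictors X1-X6 (drop the "No" index)
Y <- estate$Y.house.price.of.unit.area   # response: price per unit area
K  <- 6                                  # number of subsets
nk <- nrow(estate) / K                   # 414 / 6 = 69 observations per subset
LIC(X, Y, alpha = 0.05, K = K, nk = nk)
```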
The gas turbine NOx emission data set.
data("gt2015")
A data frame with 7384 observations on the following 11 variables.
AT: a numeric vector
AP: a numeric vector
AH: a numeric vector
AFDP: a numeric vector
GTEP: a numeric vector
TIT: a numeric vector
TAT: a numeric vector
TEY: a numeric vector
CDP: a numeric vector
CO: a numeric vector
NOX: a numeric vector
To predict nitrogen oxide emissions, we use the gas turbine NOx emission data set from the UCI database, which contains 36,733 instances of 11 sensor measurements. The pollutant emission factors of gas turbines comprise 9 variables. This subset contains the 7,384 data points recorded in 2015.
The gas turbine NOx emission data set is from the UCI Machine Learning Repository.
data(gt2015) ## maybe str(gt2015) ; plot(gt2015) ...
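A sketch of using this data set with LIC(). It assumes the first 9 columns (AT through CDP) are the predictors and NOX is the response, following the variable roles in the original UCI data, and a K that divides the sample size evenly (7384 = 8 * 923):

```r
library(LIC)

data(gt2015)
X <- as.matrix(gt2015[, 1:9])   # ambient and process variables AT ... CDP
Y <- gt2015$NOX                 # nitrogen oxide emissions (response)
K  <- 8                         # number of subsets
nk <- nrow(gt2015) / K          # 7384 / 8 = 923 observations per subset
LIC(X, Y, alpha = 0.05, K = K, nk = nk)
```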
The LIC criterion determines the most informative subsets, so that a chosen subset retains most of the information contained in the complete data.
LIC(X, Y, alpha, K, nk)
X: a design matrix
Y: a random response vector of observed values
alpha: the significance level
K: the number of subsets
nk: the sample size of each subset
MUopt, Bopt, MAEMUopt, MSEMUopt, opt, Yopt
set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
e <- rnorm(1200, 0, 1)
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K
LIC(X, Y, alpha, K, nk)
Opt1 chooses the optimal index subset by minimizing the confidence interval length.
Opt1(X, Y, alpha, K, nk)
X: a design matrix
Y: a random response vector of observed values
alpha: the significance level
K: the number of subsets
nk: the sample size of each subset
MUopt1, Bopt1, MAEMUopt1, MSEMUopt1, opt1, Yopt1
set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
e <- rnorm(1200, 0, 1)
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K
Opt1(X, Y, alpha, K, nk)
Opt2 chooses the optimal index subset by maximizing the information sub-matrix.
Opt2(X, Y, alpha, K, nk)
X: a design matrix
Y: a random response vector of observed values
alpha: the significance level
K: the number of subsets
nk: the sample size of each subset
MUopt2, Bopt2, MAEMUopt2, MSEMUopt2, opt2, Yopt2
set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
e <- rnorm(1200, 0, 1)
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K
Opt2(X, Y, alpha, K, nk)
OSA gives a simple average estimator by averaging the least squares estimators from all subsets.
OSA(X, Y, alpha, K, nk)
X: a design matrix
Y: a random response vector of observed values
alpha: the significance level
K: the number of subsets
nk: the sample size of each subset
MUA, BetaA, MAEMUA, MSEMUA
set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
e <- rnorm(1200, 0, 1)
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K
OSA(X, Y, alpha, K, nk)
OSM gives a median-based estimator, taking the median of the subset least squares estimators at the central processor.
OSM(X, Y, alpha, K, nk)
X: a design matrix
Y: a random response vector of observed values
alpha: the significance level
K: the number of subsets
nk: the sample size of each subset
MUM, BetaM, MAEMUM, MSEMUM
set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
e <- rnorm(1200, 0, 1)
Y <- X %*% b + e
alpha <- 0.05
K <- 10
nk <- 1200 / K
OSM(X, Y, alpha, K, nk)
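The estimators above can be compared on the same simulated data. This sketch assumes each function returns a list whose MSE component carries the name given in its Value section (MSEMUopt, MSEMUA, MSEMUM):

```r
library(LIC)

set.seed(12)
X <- matrix(sample(1:3, 1200 * 5, replace = TRUE), nrow = 1200, ncol = 5)
b <- sample(1:3, 5, replace = TRUE)
Y <- X %*% b + rnorm(1200)
K <- 10
nk <- 1200 / K

# compare mean squared errors of the three estimators
c(LIC = LIC(X, Y, 0.05, K, nk)$MSEMUopt,
  OSA = OSA(X, Y, 0.05, K, nk)$MSEMUA,
  OSM = OSM(X, Y, 0.05, K, nk)$MSEMUM)
```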