nyaggle.validation
- class nyaggle.validation.Nth(n, base_validator)
Returns N-th fold of the base validator
This validator wraps the base validator to take n-th (1-origin) fold.
- Parameters
n (int) – The 1-origin index of the fold to be taken.
base_validator (BaseCrossValidator) – The base validator to be wrapped.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Nth

>>> # take the 3rd fold
>>> folds = Nth(3, KFold(5))
>>> folds.get_n_splits()
1
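The 1-origin selection described above can be pictured as slicing a sequence of folds. The following is a minimal standard-library sketch, not the library's implementation: contiguous_kfold and nth_fold are hypothetical helpers, and the folds are unshuffled and contiguous by assumption.

```python
from itertools import islice


def contiguous_kfold(n_samples, n_splits):
    """Yield (train, test) index lists for unshuffled, contiguous folds (hypothetical helper)."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def nth_fold(folds, n):
    """Keep only the n-th fold (1-origin) of an iterable of folds (hypothetical helper)."""
    return list(islice(folds, n - 1, n))


third = nth_fold(contiguous_kfold(n_samples=10, n_splits=5), n=3)
train, test = third[0]
print(len(third), test)  # prints: 1 [4, 5]
```

Exactly one fold survives, which matches the get_n_splits() result of 1 in the example.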
- get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
- split(X, y=None, groups=None)
Generate indices to split data into training and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
- class nyaggle.validation.Skip(n, base_validator)
Skips the first N folds and returns the remaining folds
This validator wraps the base validator to skip first n folds.
- Parameters
n (int) – The number of folds to be skipped.
base_validator (BaseCrossValidator) – The base validator to be wrapped.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Skip

>>> # take the last 2 folds out of 5
>>> folds = Skip(3, KFold(5))
>>> folds.get_n_splits()
2
- get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
- split(X, y=None, groups=None)
Generate indices to split data into training and test set.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – The target variable for supervised learning problems.
groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set.
- Yields
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
- class nyaggle.validation.SlidingWindowSplit(source, train_from, train_to, test_from, test_to, n_windows, stride)
Sliding window time series cross-validator
Time Series cross-validator which provides train/test indices based on the sliding window to split variable interval time series data. Splitting for each fold will be as follows:
Folds  Training data                                      Testing data
1      (train_from-(N-1)*stride, train_to-(N-1)*stride)   (test_from-(N-1)*stride, test_to-(N-1)*stride)
...    ...                                                ...
N-1    (train_from-stride, train_to-stride)               (test_from-stride, test_to-stride)
N      (train_from, train_to)                             (test_from, test_to)
This class is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).
- Parameters
source (Union[Series, str]) – The column name or series of timestamp.
train_from (Union[datetime, str]) – Start datetime for the training data in the base split.
train_to (Union[datetime, str]) – End datetime for the training data in the base split.
test_from (Union[datetime, str]) – Start datetime for the testing data in the base split.
test_to (Union[datetime, str]) – End datetime for the testing data in the base split.
n_windows (int) – The number of windows (or folds) in the validation.
stride (timedelta) – Time delta between folds.
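As an illustration of the fold table, the window boundaries can be derived from the base split by shifting it back in multiples of stride. This is a hedged sketch using only the standard library: sliding_windows is a hypothetical helper, it accepts datetime objects only (the class also accepts str), and it computes the intervals without selecting any rows.

```python
from datetime import datetime, timedelta


def sliding_windows(train_from, train_to, test_from, test_to, n_windows, stride):
    """Return [((train_from, train_to), (test_from, test_to)), ...], oldest fold first."""
    windows = []
    for fold in range(1, n_windows + 1):
        # Fold N (the last one) is the base split; earlier folds are shifted back.
        offset = (n_windows - fold) * stride
        windows.append(((train_from - offset, train_to - offset),
                        (test_from - offset, test_to - offset)))
    return windows


windows = sliding_windows(
    train_from=datetime(2018, 1, 15), train_to=datetime(2018, 1, 22),
    test_from=datetime(2018, 1, 22), test_to=datetime(2018, 1, 29),
    n_windows=3, stride=timedelta(days=7),
)
for train, test in windows:
    print(train[0].date(), train[1].date(), "|", test[0].date(), test[1].date())
```

With three windows and a 7-day stride, the first fold starts 14 days before the base split and the last fold is the base split itself.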
- class nyaggle.validation.StratifiedGroupKFold(n_splits=3, shuffle=False, random_state=None)
Stratified K-Folds cross-validator with grouping
Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of GroupKFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class. Read more in the User Guide.
- Parameters
n_splits (int) – Number of folds. Must be at least 2.
Example
>>> from pprint import pprint
>>> import numpy as np
>>> from nyaggle.validation import StratifiedGroupKFold
>>> rng = np.random.RandomState(0)
>>> groups = [1, 1, 3, 4, 2, 2, 7, 8, 8]
>>> y = [1, 1, 1, 1, 2, 2, 2, 3, 3]
>>> X = np.empty((len(y), 0))
>>> sgkf = StratifiedGroupKFold(random_state=rng)
>>> skf_list = list(sgkf.split(X=X, y=y, groups=groups))
>>> pprint(skf_list)
[
 (np.array([2, 3, 4, 5, 6]), np.array([0, 1, 7, 8])),
 (np.array([0, 1, 2, 7, 8]), np.array([3, 4, 5, 6])),
 (np.array([0, 1, 3, 4, 5, 6, 7, 8]), np.array([2])),
]
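Because this validator is described as a variation of GroupKFold, one would expect that no group appears in both the training and the test side of a fold. The sketch below checks that property on the folds listed in the example output, using plain Python lists; shared_groups is a hypothetical helper and nothing here calls the library.

```python
groups = [1, 1, 3, 4, 2, 2, 7, 8, 8]

# (train, test) index pairs copied from the example output
folds = [
    ([2, 3, 4, 5, 6], [0, 1, 7, 8]),
    ([0, 1, 2, 7, 8], [3, 4, 5, 6]),
    ([0, 1, 3, 4, 5, 6, 7, 8], [2]),
]


def shared_groups(train, test, groups):
    """Return the set of group labels present on both sides of a split."""
    return {groups[i] for i in train} & {groups[i] for i in test}


overlaps = [shared_groups(train, test, groups) for train, test in folds]
tested_once = sorted(i for _, test in folds for i in test)
print(overlaps)     # prints: [set(), set(), set()]
print(tested_once)  # prints: [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Every sample lands in exactly one test fold, and each fold keeps its groups on one side only.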
- class nyaggle.validation.Take(n, base_validator)
Returns the first N folds of the base validator
This validator wraps the base validator to take first n folds.
- Parameters
n (int) – The number of folds.
base_validator (BaseCrossValidator) – The base validator to be wrapped.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.model_selection import KFold
>>> from nyaggle.validation import Take

>>> # take the first 3 folds out of 5
>>> folds = Take(3, KFold(5))
>>> folds.get_n_splits()
3
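Take and Skip are complementary: with the same n, they partition the base validator's folds into a head and a tail. A small sketch of that relationship, using string labels as stand-ins for real (train, test) splits; take and skip are hypothetical helper functions, not the library classes.

```python
from itertools import islice

# Stand-ins for the 5 splits that KFold(5) would produce
folds = [f"fold{i}" for i in range(1, 6)]


def take(folds, n):
    """Keep the first n folds."""
    return list(islice(folds, n))


def skip(folds, n):
    """Drop the first n folds and keep the rest."""
    return list(islice(folds, n, None))


first = take(folds, 3)
rest = skip(folds, 3)
print(first)  # prints: ['fold1', 'fold2', 'fold3']
print(rest)   # prints: ['fold4', 'fold5']
```

The lengths (3 and 2) agree with the get_n_splits() values shown for Take(3, KFold(5)) and Skip(3, KFold(5)).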
- class nyaggle.validation.TimeSeriesSplit(source, times=None)
Time Series cross-validator
Time Series cross-validator which provides train/test indices to split variable interval time series data. This class provides a low-level API for time series validation strategy. This class is compatible with sklearn’s BaseCrossValidator (base class of KFold, GroupKFold etc).
- Parameters
source (Union[Series, str]) – The column name or series of timestamp.
times (Optional[List[Tuple[Tuple[Union[datetime, str], Union[datetime, str]], Tuple[Union[datetime, str], Union[datetime, str]]]]]) – Splitting window, where times[i][0] and times[i][1] denote the train and test time intervals in the (i-1)th fold respectively. Each time interval should be a pair of datetime or str, and the validator generates indices of rows where the timestamp is in the half-open interval [start, end). For example, if times[i][0] = ('2018-01-01', '2018-01-03'), indices for the (i-1)th training data will be rows where the timestamp value meets 2018-01-01 <= t < 2018-01-03.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from nyaggle.validation import TimeSeriesSplit
>>> df = pd.DataFrame()
>>> df['time'] = pd.date_range(start='2018/1/1', periods=5)

>>> folds = TimeSeriesSplit('time',
...                         [(('2018-01-01', '2018-01-02'), ('2018-01-02', '2018-01-04')),
...                          (('2018-01-02', '2018-01-03'), ('2018-01-04', '2018-01-06'))])

>>> folds.get_n_splits()
2

>>> splits = folds.split(df)

>>> train_index, test_index = next(splits)
>>> train_index
[0]
>>> test_index
[1, 2]

>>> train_index, test_index = next(splits)
>>> train_index
[1]
>>> test_index
[3, 4]
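The half-open interval rule [start, end) explains why the row dated 2018-01-02 is excluded from the first training set but included in the first test set. The following standard-library sketch reproduces the indices from the example under that rule; rows_in_interval is a hypothetical helper and it assumes 'YYYY-MM-DD' strings.

```python
from datetime import datetime, timedelta


def rows_in_interval(timestamps, start, end):
    """Return positions of timestamps t with start <= t < end (half-open interval)."""
    fmt = "%Y-%m-%d"
    lo, hi = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    return [i for i, t in enumerate(timestamps) if lo <= t < hi]


# Five daily timestamps starting on 2018-01-01, as in the example
timestamps = [datetime(2018, 1, 1) + timedelta(days=d) for d in range(5)]
times = [
    (("2018-01-01", "2018-01-02"), ("2018-01-02", "2018-01-04")),
    (("2018-01-02", "2018-01-03"), ("2018-01-04", "2018-01-06")),
]
splits = [(rows_in_interval(timestamps, *train), rows_in_interval(timestamps, *test))
          for train, test in times]
print(splits)  # prints: [([0], [1, 2]), ([1], [3, 4])]
```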
- add_fold(train_interval, test_interval)
Append 1 split to the validator.
- Parameters
train_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of training data.
test_interval (Tuple[Union[datetime, str], Union[datetime, str]]) – start and end time of test data.
- nyaggle.validation.adversarial_validate(X_train, X_test, importance_type='gain', estimator=None, categorical_feature=None, cv=None)
Perform adversarial validation between X_train and X_test.
- Parameters
X_train (DataFrame) – Training data
X_test (DataFrame) – Test data
importance_type (str) – The type of feature importance calculated.
estimator (Optional[BaseEstimator]) – The custom estimator. If None, LGBMClassifier is automatically used. Only LGBMModel or CatBoost instances are supported.
categorical_feature (Optional[List[str]]) – List of categorical column names. If None, categorical columns are automatically determined by dtype.
cv (Union[int, Iterable, BaseCrossValidator, None]) – Cross validation split. If None, the first fold out of 5 folds is used as validation.
- Return type
ADVResult
- Returns
Namedtuple with the following members
- auc: float, ROC AUC score of adversarial validation.
- importance: pandas DataFrame, feature importance of adversarial model (ordered by importance)
Example
>>> from sklearn.model_selection import train_test_split
>>> from nyaggle.testing import make_regression_df
>>> from nyaggle.validation import adversarial_validate

>>> X, y = make_regression_df(n_samples=8)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> auc, importance = adversarial_validate(X_train, X_test)
>>>
>>> print(auc)
0.51078231
>>> importance.head()
feature importance
col_1   231.5827204
col_5   207.1837266
col_7   188.6920685
col_4   174.5668498
col_9   170.6438643
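To help read the returned auc, here is a rough pure-Python sketch of the idea behind adversarial validation: rows are labeled by origin (train as negative, test as positive) and ROC AUC measures how well a score separates the two. As a simplifying assumption, a single raw feature stands in for the classifier's predicted probability, so no model is trained here; roc_auc is a hypothetical helper.

```python
def roc_auc(scores_neg, scores_pos):
    """ROC AUC as the fraction of (negative, positive) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for neg in scores_neg:
        for pos in scores_pos:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_neg) * len(scores_pos))


train_feature = [0.1, 0.4, 0.5, 0.9]   # rows labeled 0 (train)
similar_test = [0.1, 0.4, 0.5, 0.9]    # rows labeled 1 (test), same distribution
shifted_test = [1.1, 1.4, 1.5, 1.9]    # rows labeled 1 (test), shifted distribution

print(roc_auc(train_feature, similar_test))  # prints: 0.5
print(roc_auc(train_feature, shifted_test))  # prints: 1.0
```

An AUC near 0.5, like the 0.51 in the example, suggests train and test are hard to tell apart; values approaching 1.0 point to a distribution shift, and the importance table indicates which features drive it.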
- nyaggle.validation.cross_validate(estimator, X_train, y, X_test=None, cv=None, groups=None, eval_func=None, logger=None, on_each_fold=None, fit_params=None, importance_type='gain', early_stopping=True, type_of_target='auto')
Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.
- Parameters
estimator (Union[BaseEstimator, List[BaseEstimator]]) – The object to be used in cross-validation. For list inputs, estimator[i] is trained on i-th fold.
X_train (Union[DataFrame, ndarray]) – Training data
y (Union[Series, ndarray]) – Target
X_test (Union[DataFrame, ndarray, None]) – Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
cv (Union[int, Iterable, BaseCrossValidator, None]) – int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
- None, to use the default KFold(5, random_state=0, shuffle=True),
- integer, to specify the number of folds in a (Stratified)KFold,
- CV splitter (the instance of BaseCrossValidator),
- An iterable yielding (train, test) splits as arrays of indices.
groups (Optional[Series]) – Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).
eval_func (Optional[Callable]) – Function used for logging and returning scores
logger (Optional[Logger]) – logger
on_each_fold (Optional[Callable[[int, BaseEstimator, DataFrame, Series], None]]) – called for each fold with (idx_fold, model, X_fold, y_fold)
fit_params (Union[Dict[str, Any], Callable, None]) – Parameters passed to the fit method of the estimator
importance_type (str) – The type of feature importance to be used to calculate result. Used only in LGBMClassifier and LGBMRegressor.
early_stopping (bool) – If True, eval_set will be added to fit_params for each fold. early_stopping_rounds = 100 will also be appended to fit_params if it does not already have one.
type_of_target (str) – The type of target variable. If auto, type is inferred by sklearn.utils.multiclass.type_of_target. Otherwise, binary, continuous, or multiclass are supported.
- Return type
CVResult
- Returns
Namedtuple with the following members
- oof_prediction (numpy array, shape (len(X_train),)): The predicted value on out-of-fold validation data.
- test_prediction (numpy array, shape (len(X_test),)): The predicted value on test data. None if X_test is None.
- scores (list of float, shape (nfolds+1,)): scores[i] denotes validation score in i-th fold. scores[-1] is the overall score. None if eval_func is not specified.
- importance (list of pandas DataFrame, shape (nfolds,)): importance[i] denotes feature importance in i-th fold model. If the estimator is not GBDT, empty array is returned.
Example
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.metrics import mean_squared_error
>>> from nyaggle.validation import cross_validate

>>> X, y = make_regression(n_samples=8)
>>> model = Ridge(alpha=1.0)
>>> pred_oof, pred_test, scores, _ = cross_validate(model,
...                                                 X_train=X[:3, :],
...                                                 y=y[:3],
...                                                 X_test=X[3:, :],
...                                                 cv=3,
...                                                 eval_func=mean_squared_error)
>>> print(pred_oof)
[-101.1123267 , 26.79300693, 17.72635528]
>>> print(pred_test)
[-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
>>> print(scores)
[71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
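The shapes of oof_prediction and scores may be easier to follow with a toy version of the loop. This is a simplified pure-Python sketch under stated assumptions, not the library's code: the "model" is just the mean of the training targets, kfold_indices and mse are hypothetical helpers, folds are interleaved rather than shuffled, and the overall score is assumed to be the metric evaluated on the full out-of-fold vector.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists; fold k tests every n_splits-th row starting at k."""
    for k in range(n_splits):
        test = list(range(k, n_samples, n_splits))
        train = [i for i in range(n_samples) if i % n_splits != k]
        yield train, test


def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof = [None] * len(y)   # one out-of-fold prediction per training row
scores = []
for train, test in kfold_indices(len(y), n_splits=3):
    fold_mean = sum(y[i] for i in train) / len(train)  # "fit": mean of the training targets
    for i in test:
        oof[i] = fold_mean                             # "predict" on the held-out rows only
    scores.append(mse([y[i] for i in test], [oof[i] for i in test]))
scores.append(mse(y, oof))  # assumed overall score, appended last

print(len(oof), len(scores))  # prints: 6 4
```

Each row receives its prediction from the one model that never saw it, so oof has one entry per training row, while scores has one entry per fold plus the trailing overall score, matching the four values printed in the example for cv=3.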