qstack.regression.cross_validate_results

Cross-validation results.

qstack.regression.cross_validate_results.cv_results(X, y, sigmaarr=[1.0, 3.1622776601683795, 10.0, 31.622776601683793, 100.0, 316.22776601683796, 1000.0, 3162.2776601683795, 10000.0, 31622.776601683792, 100000.0, 316227.7660168379, 1000000.0], etaarr=[1e-10, 3.162277660168379e-08, 1e-05, 0.0031622776601683794, 1.0], akernel='L', gkernel=None, gdict={'alpha': 1.0, 'normalize': 1, 'verbose': 0}, test_size=0.2, train_size=[0.125, 0.25, 0.5, 0.75, 1.0], splits=5, printlevel=0, adaptive=False, read_kernel=False, n_rep=5, save=False, preffix='unknown', save_pred=False, progress=False, sparse=None, seed0=0)[source]

Compute learning curves (LC) with random sampling and return the averaged performance.

Parameters:
  • X (numpy.ndarray[Nsamples,...]) – Array containing the representations of all Nsamples.

  • y (numpy.1darray[Nsamples]) – Array containing the target property of all Nsamples.

  • sigmaarr (list) – List of kernel widths for the grid search.

  • etaarr (list) – List of regularization strengths for the grid search.

  • akernel (str) – Local kernel (‘L’ for Laplacian, ‘G’ for Gaussian, ‘dot’, ‘cosine’).

  • gkernel (str) – Global kernel (None, ‘REM’, ‘avg’).

  • gdict (dict) – Parameters of the global kernels.

  • test_size (float or int) – Test set fraction (or number of samples).

  • train_size (list) – List of training set size fractions used to evaluate the points on the LC.

  • splits (int) – Number of splits K for the K-fold cross-validation.

  • printlevel (int) – Controls level of output printing.

  • adaptive (bool) – Whether to expand the optimization grid adaptively.

  • read_kernel (bool) – Whether ‘X’ is a precomputed kernel rather than an array of representations.

  • n_rep (int) – The number of repetitions for each point (using random sampling).

  • save (bool) – Whether to save intermediate LCs (.npy).

  • preffix (str) – The prefix to use for the filename when saving intermediate results.

  • save_pred (bool) – Whether to save the predicted targets for all LCs (.npy).

  • progress (bool) – Whether to print a progress bar.

  • sparse (int) – The number of reference environments to consider for sparse regression.

  • seed0 (int) – The initial seed used to produce the set of seeds for the random number generator.

Returns:

The averaged LC data points as a numpy.ndarray containing (train sizes, MAE, std)
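The shape of the returned data can be illustrated with a minimal NumPy sketch. This is not the qstack implementation: it skips the K-fold hyperparameter search over sigmaarr and etaarr, uses a single fixed sigma and eta, and assumes the kernel convention exp(-|x - x'|_1 / sigma) for the Laplacian kernel. The helper names laplacian_kernel and learning_curve are hypothetical.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    # Pairwise L1 distances, then exp(-d / sigma).
    # This convention is an assumption, not taken from qstack.
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-d / sigma)

def learning_curve(X, y, sigma=10.0, eta=1e-5, test_size=0.2,
                   train_size=(0.125, 0.25, 0.5, 0.75, 1.0),
                   n_rep=5, seed0=0):
    """Return rows of (train size, mean MAE, std of MAE)."""
    rng = np.random.default_rng(seed0)
    n = len(y)
    n_test = int(round(test_size * n))
    # Hold out one fixed test set; the rest is the training pool.
    perm = rng.permutation(n)
    test, pool = perm[:n_test], perm[n_test:]
    rows = []
    for frac in train_size:
        n_train = max(1, int(round(frac * len(pool))))
        maes = []
        for _ in range(n_rep):
            # Random sampling of the training subset for this repetition.
            train = rng.choice(pool, size=n_train, replace=False)
            K = laplacian_kernel(X[train], X[train], sigma)
            # Kernel ridge regression weights: (K + eta * I) alpha = y.
            alpha = np.linalg.solve(K + eta * np.eye(n_train), y[train])
            pred = laplacian_kernel(X[test], X[train], sigma) @ alpha
            maes.append(np.mean(np.abs(pred - y[test])))
        rows.append((n_train, np.mean(maes), np.std(maes)))
    return np.array(rows)

# Toy data: 60 samples with 3 features and a linear target.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(60, 3))
y_demo = X_demo.sum(axis=1)
lc = learning_curve(X_demo, y_demo)
print(lc)  # one row per training fraction: (train size, MAE, std)
```

Each row corresponds to one entry of train_size, and the standard deviation is taken over the n_rep random draws. At a training fraction of 1.0 every draw uses the whole pool, so the spread collapses to numerical noise.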

qstack.regression.cross_validate_results.main()[source]

Command-line entry point for full cross-validation with hyperparameter search.

Command-line use

This program runs a full cross-validation of the learning curves (hyperparameter search included).

usage: python3 -m qstack.regression.cross_validate_results [-h] --x REPR
                                                           --y PROP
                                                           [--eta ETA [ETA ...]]
                                                           [--sigma SIGMA [SIGMA ...]]
                                                           [--akernel {G,L,dot,cosine,G_sklearn,G_custom_c,L_sklearn,L_custom_c,L_custom_py,myG,myL,myLfast}]
                                                           [--gkernel {avg,rem}]
                                                           [--gdict [GDICT ...]]
                                                           [--test TEST_SIZE]
                                                           [--train TRAIN_SIZE [TRAIN_SIZE ...]]
                                                           [--ll]
                                                           [--readkernel]
                                                           [--sparse SPARSE]
                                                           [--splits SPLITS]
                                                           [--print PRINTLEVEL]
                                                           [--ada]
                                                           [--name NAMEOUT]
                                                           [--n N_REP]
                                                           [--save]
                                                           [--save-pred]

Named Arguments

--x

path to the representations file

--y

path to the properties file

--eta

eta array

Default: [1e-10, 3.162277660168379e-08, 1e-05, 0.0031622776601683794, 1.0]

--sigma

sigma array

Default: [1.0, 3.1622776601683795, 10.0, 31.622776601683793, 100.0, 316.22776601683796, 1000.0, 3162.2776601683795, 10000.0, 31622.776601683792, 100000.0, 316227.7660168379, 1000000.0]

--akernel

Possible choices: G, L, dot, cosine, G_sklearn, G_custom_c, L_sklearn, L_custom_c, L_custom_py, myG, myL, myLfast

local kernel type: “G” for Gaussian, “L” for Laplacian, “dot” for dot products, “cosine” for cosine similarity; “G_{sklearn,custom_c}” and “L_{sklearn,custom_c,custom_py}” select specific implementations. “L_custom_py” is suited to open-shell systems

Default: 'L'

--gkernel

Possible choices: avg, rem

global kernel type: “avg” for average, “rem” for REMatch

--gdict

dictionary-like input string to initialize global kernel parameters, e.g. “--gdict alpha=2 normalize=0”

Default: {'alpha': 1.0, 'normalize': 1, 'verbose': 0}

--test

test set fraction

Default: 0.2

--train

training set fractions

Default: [0.125, 0.25, 0.5, 0.75, 1.0]

--ll

whether to correct for the number of threads

Default: False

--readkernel

whether X is a kernel

Default: False

--sparse

regression basis size for sparse learning

--splits

k in k-fold cross validation

Default: 5

--print

printlevel

Default: 0

--ada

whether to adapt sigma

Default: False

--name

the name of the output file

--n

the number of repetitions for each point

Default: 5

--save

whether to save intermediate results in a .npy file

Default: False

--save-pred

whether to save the test-set predictions

Default: False
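The default --sigma and --eta values listed above coincide with log-spaced grids, so custom grids are convenient to generate the same way. The sketch below reproduces the defaults with numpy.logspace and assembles a command line from the documented options. The file names X.npy and y.dat and the output name lc_example are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex
import numpy as np

# The documented defaults match these log-spaced grids.
sigma_default = np.logspace(0, 6, 13)   # 1.0 ... 1e6, two points per decade
eta_default = np.logspace(-10, 0, 5)    # 1e-10 ... 1.0

# A coarser custom grid, formatted for the --sigma and --eta options.
sigma = np.logspace(0, 4, 5)
eta = np.logspace(-8, -2, 3)

cmd = [
    "python3", "-m", "qstack.regression.cross_validate_results",
    "--x", "X.npy",            # hypothetical representations file
    "--y", "y.dat",            # hypothetical properties file
    "--akernel", "G",
    "--sigma", *[f"{s:g}" for s in sigma],
    "--eta", *[f"{e:g}" for e in eta],
    "--splits", "5",
    "--n", "3",
    "--name", "lc_example",    # hypothetical output name
]

# Print a shell-ready command string for inspection.
print(shlex.join(cmd))
```

Building the argument list in Python keeps the grid definition and the command in one place, which helps when the same grid is reused across several runs.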

Note

If you built these docs yourself and the command-line section is empty, please make sure you have installed the right components of qstack.