APIs

AutoMS

Command line interface

Automatic Model Selection Using Cluster Indices

usage: automs [-h] [--oneshot | --subsampling] [--num_processes NUM_PROCESSES]
              [--warehouse_path WAREHOUSE_PATH] [--truef1]
              [--result RESULTS_FILENAME]
              dataset_filename

Positional Arguments

dataset_filename

Path to a CSV, LIBSVM or ARFF data file. The data file must have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.
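The naming rule above can be checked before AutoMS is invoked. The following is a minimal sketch: config_path_for and has_config are hypothetical helper names that are not part of AutoMS, and the sketch assumes that "suffixed" means appending ‘.config.py’ to the full dataset path, file extension included.

```python
import os

# Suffix stated in the documentation for dataset configuration files.
CONFIG_SUFFIX = ".config.py"


def config_path_for(dataset_filename):
    """Return the configuration file path expected for a dataset file.

    Assumption: the suffix is appended to the complete path, so
    'data/iris.csv' maps to 'data/iris.csv.config.py'.
    """
    return dataset_filename + CONFIG_SUFFIX


def has_config(dataset_filename):
    """Report whether the expected configuration file exists on disk."""
    return os.path.isfile(config_path_for(dataset_filename))
```

Calling has_config on a dataset path before running AutoMS gives an early, readable failure when the configuration file is missing.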

Named Arguments

--oneshot

Use the oneshot approach (mutually exclusive with --subsampling)

--subsampling

Use the sub-sampling approach (mutually exclusive with --oneshot)

--num_processes

Number of parallel processes or jobs to use

--warehouse_path

Location for storing intermediate files and results corresponding to the dataset being processed

--truef1

Whether to compute the true f1-scores for the dataset

--result

Path to the file in which to write the predicted classification complexity, the estimated f1-scores and the true f1-scores
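The options above can be assembled into a command line programmatically. The sketch below only builds the argument list from the flags shown in the usage text; build_automs_command is a hypothetical helper name, and the file names are placeholders. It does not run AutoMS itself.

```python
import shlex


def build_automs_command(dataset_filename, oneshot=False, subsampling=False,
                         num_processes=None, warehouse_path=None,
                         truef1=False, result=None):
    """Build an argument list for the `automs` command line interface."""
    # The usage text shows [--oneshot | --subsampling], i.e. at most one of them.
    if oneshot and subsampling:
        raise ValueError("--oneshot and --subsampling are mutually exclusive")

    cmd = ["automs"]
    if oneshot:
        cmd.append("--oneshot")
    if subsampling:
        cmd.append("--subsampling")
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    if warehouse_path is not None:
        cmd += ["--warehouse_path", warehouse_path]
    if truef1:
        cmd.append("--truef1")
    if result is not None:
        cmd += ["--result", result]
    # The dataset file is the single positional argument and goes last.
    cmd.append(dataset_filename)
    return cmd


# Placeholder file names, chosen only for illustration.
cmd = build_automs_command("dataset.csv", oneshot=True, num_processes=4,
                           truef1=True, result="results.txt")
print(" ".join(shlex.quote(part) for part in cmd))
# prints: automs --oneshot --num_processes 4 --truef1 --result results.txt dataset.csv
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting problems with paths that contain spaces.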

Python interface

automs.automs.automs(dataset_filename, oneshot=True, num_processes=1, warehouse_path=None, return_true_f1s=False)

Predicts the maximum achievable F1-score corresponding to various classifier models for a given dataset

Parameters
dataset_filename : str

Path to a CSV, LIBSVM or ARFF data file. The data file must have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.

oneshot : bool, optional

Whether to use oneshot approach or subsampling approach (default corresponds to the ‘approach’ specified in automs configuration, if automs is configured. Else, it defaults to True).

num_processes : int, optional

Number of parallel processes or jobs to use (default corresponds to ‘num processes’ specified in automs configuration, if automs is configured. Else, it defaults to 1).

warehouse_path : str, optional

Location for storing intermediate files and results corresponding to the dataset being processed (default corresponds to ‘warehouse path’ specified in automs configuration, if automs is configured. Else, it must be specified here).

return_true_f1s : bool, optional

Whether to also compute and return the true f1-scores for the dataset (default is False).

Returns
bool

Classification complexity of dataset. True implies dataset is hard to classify, False implies dataset is not hard to classify.

dict of {str : int}

Dictionary mapping each classification model name to its estimated f1 score.

dict of {str : int}, optional

Dictionary mapping each classification model name to its true f1 score. Only returned if return_true_f1s is True.

Raises
ValueError

If the dataset is not a binary classification dataset.

ValueError

If oneshot is False and the dataset has 500 or fewer samples.

Notes

AutoMS performs cluster analysis and generates cluster indices on the dataset or its subsamples, and estimates the f1-scores for various classification models from these clustering-based metafeatures by using fitted regression models.

AutoMS works only for binary classification datasets, although the idea can be extended to multi-class problems using a one-vs-rest strategy.

AutoMS with the sub-sampling approach works only for datasets with more than 500 samples.

Note

The default values of the keyword arguments oneshot, num_processes and warehouse_path of the function automs.automs.automs() are overridden by the defaults configured using the AutoMS configuration wizard.
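A call through the Python interface might look like the following sketch. It assumes that the documented return values come back as a tuple in the order listed under Returns; run_automs, unpack_result and best_model are hypothetical helper names, and the paths in the usage comment are placeholders.

```python
def run_automs(dataset_filename, **kwargs):
    """Call automs.automs.automs() with the documented keyword arguments."""
    # Imported lazily so that the helpers below can be used and tested
    # even where AutoMS is not installed.
    from automs.automs import automs
    return automs(dataset_filename, **kwargs)


def unpack_result(result, return_true_f1s=False):
    """Label the documented return values.

    Assumption: the values are returned as a tuple in the documented order,
    with the true f1-scores present only when return_true_f1s is True.
    """
    if return_true_f1s:
        is_hard, estimated_f1s, true_f1s = result
    else:
        is_hard, estimated_f1s = result
        true_f1s = None
    return {
        "is_hard_to_classify": is_hard,
        "estimated_f1s": estimated_f1s,
        "true_f1s": true_f1s,
    }


def best_model(f1_scores):
    """Return the model name with the highest f1-score in the dictionary."""
    return max(f1_scores, key=f1_scores.get)


# Usage (requires AutoMS and a dataset with its '.config.py' file):
#
#   raw = run_automs("dataset.csv", oneshot=True,
#                    warehouse_path="warehouse", return_true_f1s=True)
#   info = unpack_result(raw, return_true_f1s=True)
#   print(info["is_hard_to_classify"], best_model(info["estimated_f1s"]))
```

Passing warehouse_path explicitly keeps the call independent of whether the AutoMS configuration wizard has been run.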

Dataset Configuration

CSV Data Format

class automs.config.CsvConfig(sep=',', skiprows=None, header_row=None, usecols=None, target_col=-1, categorical_cols='infer', na_values=None, **kargs)

Dataset configuration class for CSV data format

Parameters
sep : str, optional

Column delimiter. Accepted values: None (autodetect the delimiter), '\s+' (any combination of spaces and tabs), or a regular expression. (default is ',')

skiprows : list of int or int, optional

List of line indices to skip or the number of starting lines to skip. (default value None implies don’t skip any lines)

header_row : int, optional

Relative Zero-Index (index of rows after skipping rows using skiprows parameter) of the row containing column names. Note: All preceding rows are ignored. (default value None implies no header row)

usecols : list, optional

List of column names (or column indices, if no header row specified) to consider. (default value None indicates use of all columns)

target_col : int, optional

Relative Zero-Index of column (after filtering columns using usecols parameter) to use as target values. None indicates absence of target value columns. (default value -1 implies use the last column as target values)

categorical_cols : ‘infer’ or list or str or int or ‘all’, optional

List (str or int if singleton) of column names (or absolute indices of columns, if no header row specified) of categorical columns to encode. Default value 'infer' autodetects nominal categorical columns. 'all' implies all columns are nominal categorical. None implies no nominal categorical columns exist.

na_values : scalar or str or list-like or dict, optional

Additional strings to recognize as NA/NaN. If dict is passed, it specifies per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’. (default value None implies no additional values to interpret as NaN)

**kargs

Other keyword arguments accepted by pandas.read_csv() such as comment and lineterminator.

Notes

  • The skiprows parameter uses absolute row indices, whereas the header_row parameter uses a relative index (i.e., the zero-index after removing the rows specified by the skiprows parameter).

  • The usecols and categorical_cols parameters use absolute column names (or indices, if there is no header row), whereas the target_col parameter uses relative column indices (or names) after filtering out the columns specified by the usecols parameter.

  • categorical_cols='infer' identifies and encodes nominal features (i.e., features of ‘string’ type, with fewer unique entries than a value heuristically determined from the number of data samples) and drops other ‘string’ and ‘date’ type features from the dataset. Use automs.eda.max_classes_nominal() to find the heuristically determined value of maximum number of distinct entries in nominal features for a given number of samples.

  • Data samples with any NA/NaN features are implicitly dropped.
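The difference between absolute and relative indices in the notes above is easy to get wrong, so the following sketch replays the documented order of operations on a small in-memory CSV using only the standard library. It is an illustration of the stated semantics, not AutoMS's implementation; resolve_csv_layout and the sample data are hypothetical.

```python
import csv
import io


def resolve_csv_layout(text, skiprows=None, header_row=None,
                       usecols=None, target_col=-1):
    """Show which columns and rows survive, following the documented rules."""
    rows = list(csv.reader(io.StringIO(text)))

    # skiprows: absolute line indices, or a count of leading lines to skip.
    if skiprows is None:
        skipped = set()
    elif isinstance(skiprows, int):
        skipped = set(range(skiprows))
    else:
        skipped = set(skiprows)
    rows = [row for i, row in enumerate(rows) if i not in skipped]

    # header_row: index relative to the rows left after skipping;
    # all rows preceding the header are ignored.
    if header_row is None:
        names = list(range(len(rows[0])))
        data = rows
    else:
        names = rows[header_row]
        data = rows[header_row + 1:]

    # usecols: absolute column names (or indices when there is no header).
    kept = list(names) if usecols is None else [n for n in names if n in usecols]

    # target_col: index relative to the columns kept by usecols.
    target = None if target_col is None else kept[target_col]
    features = [n for n in kept if n != target]
    return {"columns": kept, "target": target,
            "features": features, "n_rows": len(data)}


# Made-up sample: one junk line, then a header, then two data rows.
sample = "exported 2020\nid,height,weight,label\n1,1.7,60,yes\n2,1.8,80,no\n"
layout = resolve_csv_layout(sample, skiprows=[0], header_row=0,
                            usecols=["height", "weight", "label"],
                            target_col=-1)
print(layout)
```

Here line 0 is skipped by its absolute index, the header is row 0 of what remains, and target_col=-1 selects 'label' as the last of the three columns kept by usecols.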

LIBSVM Data Format

class automs.config.LibsvmConfig

Dataset configuration class for LIBSVM data format

ARFF Data Format

class automs.config.ArffConfig(target_attr='class', numeric_categorical_attrs=None)

Dataset configuration class for ARFF data format

Parameters
target_attr : str, optional

Attribute name of the target column. None implies no target columns. (default value is 'class')

numeric_categorical_attrs : list of str, optional

List of names of numeric attributes to be inferred as nominal and to be encoded. Note: All nominal attributes are implicitly encoded. (default value None implies no numeric attributes are to be inferred as nominal)

Notes

All nominal type attributes are implicitly encoded.