APIs

AutoMS

Command line interface

Automatic Model Selection Using Cluster Indices

usage: automs [-h] [--oneshot | --subsampling] [--num_processes NUM_PROCESSES]
              [--warehouse_path WAREHOUSE_PATH] [--truef1]
              [--result RESULTS_FILENAME]
              dataset_filename

Positional Arguments

dataset_filename: Path to a CSV, LIBSVM or ARFF data file. The dataset must also have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.

Named Arguments

--oneshot: Use the oneshot approach (mutually exclusive with --subsampling)
--subsampling: Use the sub-sampling approach (mutually exclusive with --oneshot)
--num_processes: Number of parallel processes or jobs to use
--warehouse_path: Location for storing intermediate files and results corresponding to the dataset being processed
--truef1: Also compute the true f1-scores for the dataset
--result: Path of the file to which the predicted classification complexity, estimated f1-scores and true f1-scores are written
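
As an illustration of how these flags combine, the following sketch assembles an `automs` command line from Python. It uses only the flags listed above; the helper name `build_automs_command` and the file names `data.csv` and `results.txt` are hypothetical, and the sketch assumes the `automs` executable is on the PATH when the command is eventually run.

```python
import shlex


def build_automs_command(dataset_filename, approach=None, num_processes=None,
                         warehouse_path=None, truef1=False, result=None):
    """Build an argument list for the automs command line (illustrative helper)."""
    # --oneshot and --subsampling are mutually exclusive in the usage string,
    # so a single "approach" value selects at most one of them.
    if approach not in (None, "oneshot", "subsampling"):
        raise ValueError("approach must be None, 'oneshot' or 'subsampling'")

    cmd = ["automs"]
    if approach is not None:
        cmd.append("--" + approach)
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    if warehouse_path is not None:
        cmd += ["--warehouse_path", str(warehouse_path)]
    if truef1:
        cmd.append("--truef1")
    if result is not None:
        cmd += ["--result", str(result)]
    # The dataset path is the only positional argument.
    cmd.append(str(dataset_filename))
    return cmd


# Hypothetical file names, for illustration only.
command = build_automs_command(
    "data.csv", approach="subsampling", num_processes=4,
    truef1=True, result="results.txt",
)
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)`; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.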

Python interface

automs.automs.automs(dataset_filename, oneshot=True, num_processes=1, warehouse_path=None, return_true_f1s=False)

Predicts the maximum achievable F1-score of various classifier models for a given dataset.

Parameters

dataset_filename (str): Path to a CSV, LIBSVM or ARFF data file. The dataset must also have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.
oneshot (bool, optional): Whether to use the oneshot approach (True) or the subsampling approach (False). Defaults to the ‘approach’ specified in the AutoMS configuration if AutoMS is configured; otherwise defaults to True.
num_processes (int, optional): Number of parallel processes or jobs to use. Defaults to the ‘num processes’ specified in the AutoMS configuration if AutoMS is configured; otherwise defaults to 1.
warehouse_path (str, optional): Location for storing intermediate files and results corresponding to the dataset being processed. Defaults to the ‘warehouse path’ specified in the AutoMS configuration if AutoMS is configured; otherwise it must be specified here.
return_true_f1s (bool, optional): If True, the true f1-scores for the dataset are computed and returned in addition to the estimated ones (default is False).

Returns

bool: Classification complexity of dataset. True implies dataset is hard to classify, False implies dataset is not hard to classify.
dict of {str: int}: Dictionary mapping each classification model name to its estimated f1 score.
dict of {str: int}, optional: Dictionary mapping each classification model name to its true f1 score. Only returned if return_true_f1s is True.

Raises

ValueError: If the dataset is not a binary classification dataset.
ValueError: If oneshot is False and the dataset has 500 or fewer samples.
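
Because the number of returned values depends on return_true_f1s, calling code has to unpack the result accordingly. The sketch below shows one way to do that. It assumes the values are returned together as a tuple, in the order listed under Returns; the helper names `unpack_automs_result`, `run_automs` and `best_model` are hypothetical and not part of AutoMS, and the model names in the example are placeholders.

```python
def unpack_automs_result(result, return_true_f1s=False):
    """Normalize the return value to (is_hard, estimated_f1s, true_f1s).

    Assumes a 2-tuple, or a 3-tuple when return_true_f1s is True.
    true_f1s is None when the true scores were not requested.
    """
    if return_true_f1s:
        is_hard, estimated_f1s, true_f1s = result
    else:
        is_hard, estimated_f1s = result
        true_f1s = None
    return is_hard, estimated_f1s, true_f1s


def run_automs(dataset_filename, return_true_f1s=False, **kwargs):
    """Call automs.automs.automs() and normalize its return value."""
    # Imported lazily so this module can be loaded without AutoMS installed.
    from automs.automs import automs

    try:
        result = automs(dataset_filename,
                        return_true_f1s=return_true_f1s, **kwargs)
    except ValueError as err:
        # Raised for non-binary datasets, or for sub-sampling on
        # datasets with 500 or fewer samples.
        raise SystemExit(f"AutoMS cannot process {dataset_filename}: {err}")
    return unpack_automs_result(result, return_true_f1s)


def best_model(estimated_f1s):
    """Return the model name with the highest estimated f1-score."""
    return max(estimated_f1s, key=estimated_f1s.get)


# Placeholder values standing in for a real automs() result.
fake_result = (True, {"model_a": 0.81, "model_b": 0.74})
is_hard, estimated, true = unpack_automs_result(fake_result)
print(is_hard, best_model(estimated), true)
```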

Notes

AutoMS performs cluster analysis and generates cluster indices on the dataset or its subsamples, and estimates the f1-scores for various classification models from these clustering-based metafeatures by using fitted regression models.

AutoMS works only for binary classification datasets, although the idea can be extended to multi-class problems using a one-vs-rest strategy.

AutoMS with the sub-sampling approach works only for datasets with more than 500 samples.
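
The one-vs-rest extension mentioned in these notes is not something AutoMS performs itself. As a rough sketch of the idea, a multi-class target can be split into one binary target per class, and each derived binary dataset could then be analyzed separately. The function name `one_vs_rest_targets` is hypothetical.

```python
def one_vs_rest_targets(labels):
    """Split multi-class labels into one binary target vector per class.

    For each class c, samples of class c are marked 1 and all others 0.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


labels = ["a", "b", "c", "a"]
binary_targets = one_vs_rest_targets(labels)
for cls, target in binary_targets.items():
    print(cls, target)
```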

Note

The default values of the keyword arguments oneshot, num_processes and warehouse_path of automs.automs.automs() are overridden by the defaults configured using the AutoMS configuration wizard.

Dataset Configuration

CSV Data Format

class automs.config.CsvConfig(sep=', ', skiprows=None, header_row=None, usecols=None, target_col=-1, categorical_cols='infer', na_values=None, **kargs)

Dataset configuration class for CSV data format

Parameters

sep (str, optional): Column delimiter. Accepted values: None (autodetect the delimiter), '\s+' (any combination of spaces and tabs), or any other regular expression. (default is ',')
skiprows (list of int or int, optional): List of line indices to skip, or the number of starting lines to skip. (default value None implies don’t skip any lines)
header_row (int, optional): Relative zero-based index (i.e., the index among the rows remaining after skipping rows using the skiprows parameter) of the row containing column names. Note: All preceding rows are ignored. (default value None implies no header row)
usecols (list, optional): List of column names (or column indices, if no header row is specified) to consider. (default value None indicates use of all columns)
target_col (int, optional): Relative zero-based index of the column (after filtering columns using the usecols parameter) to use as target values. None indicates absence of a target value column. (default value -1 implies use of the last column as target values)
categorical_cols ('infer' or list or str or int or 'all', optional): List (str or int if singleton) of column names (or absolute indices of columns, if no header row is specified) of categorical columns to encode. Default value 'infer' autodetects nominal categorical columns. 'all' implies all columns are nominal categorical. None implies no nominal categorical columns exist.
na_values (scalar or str or list-like or dict, optional): Additional strings to recognize as NA/NaN. If a dict is passed, it specifies per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’. (default value None implies no additional values to interpret as NaN)
**kargs: Other keyword arguments accepted by pandas.read_csv() such as comment and lineterminator.

Notes

The skiprows parameter uses absolute row indices, whereas the header_row parameter uses a relative index (i.e., the zero-based index after removing the rows specified by the skiprows parameter).
The usecols and categorical_cols parameters use absolute column names (or indices, if there is no header row), whereas the target_col parameter uses a relative column index, counted after filtering the columns with the usecols parameter.
categorical_cols='infer' identifies and encodes nominal features (i.e., features of ‘string’ type, with fewer unique entries than a value heuristically determined from the number of data samples) and drops other ‘string’ and ‘date’ type features from the dataset. Use automs.eda.max_classes_nominal() to find the heuristically determined value of maximum number of distinct entries in nominal features for a given number of samples.
Data samples with any NA/NaN features are implicitly dropped.
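
The difference between absolute and relative indices described in these notes can be made concrete with a small sketch. The two helpers below are hypothetical and not part of AutoMS; they only model the indexing rules stated above. The second helper additionally assumes that columns kept by usecols retain their order in the file, as pandas.read_csv() does.

```python
def absolute_header_line(skiprows, header_row, total_lines):
    """Map a relative header_row to the absolute line index in the file."""
    if header_row is None:
        return None
    if skiprows is None:
        skipped = set()
    elif isinstance(skiprows, int):
        # An int means "skip this many lines from the start of the file".
        skipped = set(range(skiprows))
    else:
        # A list holds absolute (zero-based) line indices to skip.
        skipped = set(skiprows)
    remaining = [i for i in range(total_lines) if i not in skipped]
    return remaining[header_row]


def resolve_target_column(all_columns, usecols, target_col):
    """Return the column that a relative target_col refers to."""
    if target_col is None:
        return None
    if usecols is None:
        kept = list(all_columns)
    else:
        wanted = set(usecols)
        # Assumption: kept columns stay in file order.
        kept = [c for c in all_columns if c in wanted]
    return kept[target_col]


# Lines 0 and 2 are skipped, so relative row 0 is absolute line 1.
print(absolute_header_line([0, 2], 0, total_lines=6))

# The last of the kept columns is the target, not the last column in the file.
columns = ["id", "age", "income", "label", "notes"]
print(resolve_target_column(columns, ["age", "income", "label"], -1))
```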

LIBSVM Data Format

class automs.config.LibsvmConfig: Dataset configuration class for LIBSVM data format

ARFF Data Format

class automs.config.ArffConfig(target_attr='class', numeric_categorical_attrs=None)

Dataset configuration class for ARFF data format

Parameters

target_attr (str, optional): Attribute name of the target column. None implies no target column. (default value is 'class')
numeric_categorical_attrs (list of str, optional): List of names of numeric attributes to be treated as nominal and encoded. Note: All nominal attributes are implicitly encoded. (default value None implies no numeric attributes are to be treated as nominal)

Notes

All nominal type attributes are implicitly encoded.