APIs
AutoMS
Command line interface
Automatic Model Selection Using Cluster Indices
usage: automs [-h] [--oneshot | --subsampling] [--num_processes NUM_PROCESSES]
[--warehouse_path WAREHOUSE_PATH] [--truef1]
[--result RESULTS_FILENAME]
dataset_filename
Positional Arguments
- dataset_filename
Path to a CSV, LIBSVM or ARFF data file. The dataset must also have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.
Named Arguments
- --oneshot
Whether to use oneshot approach
- --subsampling
Whether to use sub-sampling approach
- --num_processes
Number of parallel processes or jobs to use
- --warehouse_path
Location for storing intermediate files and results corresponding to the dataset being processed
- --truef1
Whether to compute the true f1-scores for the dataset
- --result
Path to the file in which to write the predicted classification complexity, the estimated f1-scores and the true f1-scores
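For scripted runs, the command line described above can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in the usage string; the helper name build_automs_command and the example paths are hypothetical and not part of AutoMS.

```python
import shlex


def build_automs_command(dataset_filename, approach=None, num_processes=None,
                         warehouse_path=None, truef1=False, result=None):
    """Assemble an argv list for the `automs` CLI (hypothetical helper)."""
    cmd = ["automs"]
    # --oneshot and --subsampling are mutually exclusive in the usage string;
    # passing neither leaves the choice to the configured default.
    if approach is not None:
        if approach not in ("oneshot", "subsampling"):
            raise ValueError("approach must be 'oneshot' or 'subsampling'")
        cmd.append("--" + approach)
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    if warehouse_path is not None:
        cmd += ["--warehouse_path", warehouse_path]
    if truef1:
        cmd.append("--truef1")
    if result is not None:
        cmd += ["--result", result]
    # The dataset path is the single positional argument.
    cmd.append(dataset_filename)
    return cmd


# Example paths are placeholders, not files shipped with AutoMS.
command = build_automs_command("data/example.csv", approach="subsampling",
                               num_processes=4, warehouse_path="warehouse",
                               truef1=True, result="results.txt")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once the automs executable is on the PATH.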
Python interface
- automs.automs.automs(dataset_filename, oneshot=True, num_processes=1, warehouse_path=None, return_true_f1s=False)
Predicts the maximum achievable F1-score corresponding to various classifier models for a given dataset
- Parameters
- dataset_filename : str
Path to a CSV, LIBSVM or ARFF data file. The dataset must also have an associated configuration file, named as dataset_filename suffixed with ‘.config.py’.
- oneshot : bool, optional
Whether to use the oneshot approach (True) or the sub-sampling approach (False). Defaults to the ‘approach’ specified in the AutoMS configuration if AutoMS is configured; otherwise defaults to True.
- num_processes : int, optional
Number of parallel processes or jobs to use. Defaults to the ‘num processes’ specified in the AutoMS configuration if AutoMS is configured; otherwise defaults to 1.
- warehouse_path : str, optional
Location for storing intermediate files and results corresponding to the dataset being processed. Defaults to the ‘warehouse path’ specified in the AutoMS configuration if AutoMS is configured; otherwise it must be specified here.
- return_true_f1s : bool, optional
Whether to also compute and return the true f1-scores for the dataset (default is False).
- Returns
- bool
Classification complexity of dataset. True implies dataset is hard to classify, False implies dataset is not hard to classify.
- dict of {str : int}
Dictionary mapping each classification model name to its estimated f1 score.
- dict of {str : int}, optional
Dictionary mapping each classification model name to its true f1 score. Only returned if return_true_f1s is True.
- Raises
- ValueError
If the dataset is not a binary classification dataset.
- ValueError
If oneshot is set to False and the dataset has 500 or fewer samples.
Notes
AutoMS performs cluster analysis and generates cluster indices on the dataset or its subsamples, and estimates the f1-scores for various classification models from these clustering-based metafeatures by using fitted regression models.
AutoMS works only for binary classification datasets, although the idea can be extended to multi-class problems using a one-vs-rest strategy.
AutoMS with the sub-sampling approach works only for datasets with more than 500 samples.
Note
The default values for the keyword arguments oneshot, num_processes and warehouse_path of the function automs.automs.automs() will be overridden by the default values configured using the AutoMS configuration wizard.
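A call to the function documented above might look like the following sketch. It assumes the documented return values arrive as a tuple in the order listed under Returns; the helper names choose_oneshot, rank_models and run_automs are hypothetical, and the AutoMS import is deferred so that the helpers can be tried without the package installed.

```python
def choose_oneshot(num_samples):
    """Sub-sampling needs more than 500 samples, so use oneshot otherwise."""
    return num_samples <= 500


def rank_models(estimated_f1s):
    """Return model names sorted by estimated f1-score, best first."""
    return sorted(estimated_f1s, key=estimated_f1s.get, reverse=True)


def run_automs(dataset_filename, num_samples, warehouse_path):
    """Hypothetical wrapper around automs.automs.automs()."""
    # Imported lazily so the helpers above work without AutoMS installed.
    from automs.automs import automs

    # ASSUMPTION: with return_true_f1s left at False, two values are
    # returned: the complexity flag and the estimated f1-score dictionary.
    is_hard, estimated_f1s = automs(
        dataset_filename,
        oneshot=choose_oneshot(num_samples),
        warehouse_path=warehouse_path,
    )
    return is_hard, rank_models(estimated_f1s)
```

Choosing the approach from the sample count up front avoids the ValueError that the sub-sampling approach raises on small datasets.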
Dataset Configuration
CSV Data Format
- class automs.config.CsvConfig(sep=', ', skiprows=None, header_row=None, usecols=None, target_col=-1, categorical_cols='infer', na_values=None, **kargs)
Dataset configuration class for CSV data format
- Parameters
- sep : str, optional
Column delimiter. Accepted values: None implies autodetect delimiter, '\s+' uses a combination of spaces and tabs, and regular expressions are also accepted. (default is ',')
- skiprows : list of int or int, optional
List of line indices to skip or the number of starting lines to skip. (default value None implies don’t skip any lines)
- header_row : int, optional
Relative zero-index (index of rows after skipping rows using the skiprows parameter) of the row containing column names. Note: All preceding rows are ignored. (default value None implies no header row)
- usecols : list, optional
List of column names (or column indices, if no header row specified) to consider. (default value None indicates use of all columns)
- target_col : int, optional
Relative zero-index of the column (after filtering columns using the usecols parameter) to use as target values. None indicates absence of target value columns. (default value -1 implies use the last column as target values)
- categorical_cols : ‘infer’ or list or str or int or ‘all’, optional
List (str or int if singleton) of column names (or absolute indices of columns, if no header row specified) of categorical columns to encode. Default value 'infer' autodetects nominal categorical columns. 'all' implies all columns are nominal categorical. None implies no nominal categorical columns exist.
- na_values : scalar or str or list-like or dict, optional
Additional strings to recognize as NA/NaN. If a dict is passed, it specifies per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’. (default value None implies no additional values to interpret as NaN)
- **kargs
Other keyword arguments accepted by pandas.read_csv() such as comment and lineterminator.
Notes
The skiprows parameter uses absolute row indices, whereas the header_row parameter uses a relative index (i.e., zero-index after removing rows specified by the skiprows parameter).
The usecols and categorical_cols parameters use absolute column names (or indices, if no header row), whereas the target_col parameter uses relative column indices (or names) after filtering columns using the usecols parameter.
categorical_cols='infer' identifies and encodes nominal features (i.e., features of ‘string’ type, with fewer unique entries than a value heuristically determined from the number of data samples) and drops other ‘string’ and ‘date’ type features from the dataset. Use automs.eda.max_classes_nominal() to find the heuristically determined value of the maximum number of distinct entries in nominal features for a given number of samples.
Data samples with any NA/NaN features are implicitly dropped.
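The distinction between absolute and relative indices in these notes can be illustrated with a small, self-contained sketch. It does not use AutoMS or pandas; the function names are hypothetical, and it assumes that columns selected with usecols keep their order in the file.

```python
def resolve_header_row(num_lines, skiprows=None, header_row=None):
    """Map a relative header_row to an absolute line index in the file."""
    if header_row is None:
        return None  # no header row
    if skiprows is None:
        skipped = set()
    elif isinstance(skiprows, int):
        skipped = set(range(skiprows))  # skip the first `skiprows` lines
    else:
        skipped = set(skiprows)  # explicit absolute line indices
    kept = [i for i in range(num_lines) if i not in skipped]
    return kept[header_row]


def resolve_target_col(num_cols, usecols=None, target_col=-1):
    """Map a relative target_col to an absolute column index in the file."""
    if target_col is None:
        return None  # no target column
    # ASSUMPTION: retained columns keep their order in the file.
    kept = list(range(num_cols)) if usecols is None else sorted(usecols)
    return kept[target_col]


# Lines 0 and 2 are skipped, so relative row 1 is absolute line 3.
print(resolve_header_row(10, skiprows=[0, 2], header_row=1))  # 3
# Only columns 0, 2 and 4 are kept, so the last kept column is column 4.
print(resolve_target_col(6, usecols=[0, 2, 4], target_col=-1))  # 4
```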
LIBSVM Data Format
- class automs.config.LibsvmConfig
Dataset configuration class for LIBSVM data format
ARFF Data Format
- class automs.config.ArffConfig(target_attr='class', numeric_categorical_attrs=None)
Dataset configuration class for ARFF data format
- Parameters
- target_attr : str, optional
Attribute name of the target column. None implies no target columns. (default value is 'class')
- numeric_categorical_attrs : list of str, optional
List of names of numeric attributes to be inferred as nominal and to be encoded. Note: All nominal attributes are implicitly encoded. (default value None implies no numeric attributes are to be inferred as nominal)
Notes
All nominal type attributes are implicitly encoded.
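Each dataset needs a configuration file named as the dataset path suffixed with ‘.config.py’. The sketch below writes such a file next to an ARFF dataset. The file naming comes from the description of dataset_filename; the contents of the file, in particular the module-level variable name config, are an assumption made for illustration and should be checked against the AutoMS documentation. The helper names are hypothetical.

```python
import os
import tempfile


def config_path_for(dataset_filename):
    """Configuration file path: the dataset path suffixed with '.config.py'."""
    return dataset_filename + ".config.py"


# ASSUMPTION: the configuration file exposes an instance of the format's
# configuration class; the variable name `config` is hypothetical.
ARFF_CONFIG_SOURCE = (
    "from automs.config import ArffConfig\n"
    "\n"
    "config = ArffConfig(target_attr='class', numeric_categorical_attrs=None)\n"
)


def write_config(dataset_filename, source=ARFF_CONFIG_SOURCE):
    """Write the configuration source next to the dataset and return its path."""
    path = config_path_for(dataset_filename)
    with open(path, "w") as handle:
        handle.write(source)
    return path


# Demonstrate the naming convention in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_config(os.path.join(tmp, "example.arff"))
    print(os.path.basename(written))  # example.arff.config.py
```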