Outlier removal in the Pattern Recognition Toolbox

Real data sets often contain outliers: samples that, for whatever reason, do not accurately represent the underlying data. It is often beneficial to remove outliers before training, because they can badly skew results. The Pattern Recognition Toolbox provides several objects for removing outliers.

Removal of missing or non-finite data

Some data sets contain samples with missing or non-finite elements that need to be removed. Missing data is commonly marked as NaN. As a simple example, consider the following:

dataSet = prtDataGenUnimodal;               % Load a data set
outlier = prtDataSetClass([NaN NaN],1);     % Create an outlier and
dataSet = catObservations(dataSet,outlier); % append it to the data set

% Create the prtOutlierRemoval object, specifying that on run, any NaNs
% will be removed.
outRemove = prtOutlierRemovalMissingData('runMode','removeObservation');

outRemove = outRemove.train(dataSet);    % Train and run
dataSetNew = outRemove.run(dataSet);
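
To make the effect of this step concrete, the following is a minimal sketch in base MATLAB of what removing observations with missing or non-finite values amounts to. It does not use the toolbox and is not its implementation; the matrix X and the variable names are hypothetical, chosen only for illustration.

```matlab
% Illustrative sketch only: base MATLAB, not the toolbox's implementation.
X = [1.0 2.0; NaN 0.5; 3.0 Inf; 4.0 5.0];   % hypothetical 4-by-2 data matrix

% Flag an observation (row) if any of its features is NaN or Inf.
badRows = any(~isfinite(X), 2);

% Keep only the rows in which every feature is finite.
Xclean = X(~badRows, :);
```

Here rows 2 and 3 are dropped, so Xclean keeps only the first and last observations.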

Run mode options

The above code removes every observation containing a NaN from the data set when the run function is called. The runMode property specifies what the outlier removal object should do when a data element is determined to be an outlier. There are four options for this property:

       'noAction' - When running the outlier removal action, do
       nothing.  This ensures that the outlier removal action outputs
       data sets of the same size as the input data set.

       'replaceWithNan' - When running the outlier removal action,
       replace outlier values with NaNs.  This ensures that the
       outlier removal action outputs data sets of the same size as
       the input data set.

       'removeObservation' - When running the outlier removal action,
       remove observations where any feature value is flagged as an
       outlier.  This can change the size of the data set during
       running and can result in invalid cross-validation folds.

       'removeFeature' - When running the outlier removal action,
       remove features where any observation contains an outlier.
       This can change the number of features in the data set.
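
The four run modes can be pictured as four ways of applying a per-element outlier mask to a data matrix. The sketch below uses base MATLAB only; the matrix X, the threshold rule X > 50, and all variable names are hypothetical and stand in for whatever rule a particular outlier removal object applies.

```matlab
% Illustrative sketch only: base MATLAB, not the toolbox's implementation.
X = [1 2 3; 4 100 6; 7 8 9];   % hypothetical data; 100 is the outlier
isOutlier = X > 50;            % hypothetical per-element outlier rule

% 'noAction': the data are passed through unchanged.
Xnone = X;

% 'replaceWithNan': flagged values become NaN; the size is unchanged.
Xnan = X;
Xnan(isOutlier) = NaN;

% 'removeObservation': drop every row that contains a flagged value.
Xobs = X(~any(isOutlier, 2), :);

% 'removeFeature': drop every column that contains a flagged value.
Xfeat = X(:, ~any(isOutlier, 1));
```

Only the last two modes change the dimensions of the output: Xobs loses a row and Xfeat loses a column.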

Removal of outliers beyond a number of standard deviations

Another common outlier removal technique is to remove any observations that lie more than a certain number of standard deviations from the mean of the data. For example:

dataSet = prtDataGenUnimodal;               % Load a data set
outlier = prtDataSetClass([-10 -10],1);     % Create an outlier and
dataSet = catObservations(dataSet,outlier); % append it to the data set

% Create the prtOutlierRemoval object
nStdRemove = prtOutlierRemovalNStd('runMode','removeObservation');

nStdRemove = nStdRemove.train(dataSet);    % Train and run
dataSetNew = nStdRemove.run(dataSet);

% Plot the results
subplot(2,1,1); plot(dataSet);
title('Original Data');
subplot(2,1,2); plot(dataSetNew);
title('NstdOutlierRemove Data');

In the resulting plot, the outlier at [-10 -10] has been removed.
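
The standard-deviation rule itself can be sketched in a few lines of base MATLAB. This is an illustration of the general technique, not the toolbox's implementation: the data, the variable names, and the threshold of 3 standard deviations are all assumptions made for the example.

```matlab
% Illustrative sketch only: base MATLAB, not the toolbox's implementation.
t = (1:20)';
X = [sin(t) cos(t)];           % hypothetical well-behaved data, 20-by-2
X(end+1, :) = [-10 -10];       % append one obvious outlier as row 21

nStd  = 3;                     % assumed threshold, in standard deviations
mu    = mean(X, 1);            % per-feature mean (outlier included)
sigma = std(X, 0, 1);          % per-feature sample standard deviation

% Distance of every element from its feature mean, in standard deviations.
z = abs(bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma));

% Flag an observation if any feature exceeds the threshold, then drop it.
isOut  = any(z > nStd, 2);
Xclean = X(~isOut, :);
```

Note that the mean and standard deviation are computed with the outlier still in the data, so a sufficiently extreme point inflates the very statistics used to detect it. With very few samples, this can prevent any point from exceeding the threshold.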

All outlier removal objects in the Pattern Recognition Toolbox share the API discussed above. For a list of all the different objects, and links to their individual help entries, see A list of commonly used functions.