- If you pulled the most recent version of the PRT since October 9, 2013, the PRT no longer shows up in the built-in MATLAB help browser, or
- If you updated to MATLAB 2013B, MATLAB freezes during startup.
Here’s the skinny – there’s a bug in the new version of MATLAB (2013B) that, according to TMW:

This is a known issue with MATLAB 8.2 (R2013b) in the way that the MATLAB startup handles the info.xml file. There is a deadlock between the Help System initialization and the path changes (raised by the startup.m) at the start of MATLAB.

The result of this bug is that MATLAB seems to start up fine, but then just sits there, and never accepts inputs. This is pretty much as bad a bug as can be (besides wrong answers), since it makes MATLAB completely unusable, and you can’t even easily determine what is causing it since… MATLAB is unusable.

There is a patch to fix this bug, but if you don’t have the patch already installed it’s **very** difficult to figure out what is going wrong, and it can be very frustrating. Alternatively, removing (or moving) the XML files will let MATLAB start up, but the automatic help search, prtDoc, etc. will no longer work. We thought that moving the XML files would cause the least amount of pain overall, so that’s what we did. Our XML files that used to live in prtRoot now live in fullfile(prtRoot,'xml'), which is by default **not** on the prtPath, so it should not cause you any issues.

If you still want to use the PRT documentation in the MATLAB help browser, simply install the patch by following the instructions below, then move the .xml files from fullfile(prtRoot,'xml') to prtRoot. That should get everything running!

Alternatively just use prtDoc instead of doc to open up the help in your browser (or the MATLAB web browser, depending on your version of MATLAB). For example:

```
prtDoc prtClassRvm
```

Sorry about any headaches anyone encountered, and a big “thank you” to Cesar from TMW who helped us get to the bottom of this quickly and professionally. Here are the patch installation instructions from Cesar:

To work around this issue, unzip the patch into your MATLAB root directory (most likely C:\Program Files\MATLAB\R2013b).

verboseStorage is a logical flag of prtAction that specifies whether the training dataset should be stored within the action. The default value is true. Let’s see an example.

First let’s get a toy dataset and plot it to see what we are talking about.

```
ds = prtDataGenUnimodal;
plot(ds);
```

Let’s train a prtClassMap and plot the resulting decision contours.

```
c = train(prtClassMap, ds);
plot(c);
title('Classifier Decision Contours with Training Data Set','FontSize',16);
```

You can see that even though we only plotted the trained classifier, the dataset appears in the plot. The training dataset is stored in the read-only property dataSet:

```
c.dataSet
```

```
ans = 
  prtDataSetClass with properties:

               nFeatures: 2
             featureInfo: []
                    data: [400x2 double]
                 targets: [400x1 double]
         observationInfo: []
           nObservations: 400
       nTargetDimensions: 1
               isLabeled: 1
                    name: 'prtDataGenUnimodal'
             description: ''
                userData: [1x1 struct]
                nClasses: 2
           uniqueClasses: [2x1 double]
    nObservationsByClass: [2x1 double]
              classNames: {2x1 cell}
                 isUnary: 0
                isBinary: 1
                  isMary: 0
               isZeroOne: 1
            hasUnlabeled: 0
```

Let’s try this again, this time setting verboseStorage to false.

```
cVerboseStorageFalse = train(prtClassMap('verboseStorage',false), ds);
plot(cVerboseStorageFalse);
title('Classifier Decision Contours without Training Data Set','FontSize',16);
```

Plotting the classifier contours without a data set is sometimes useful for examples. Now you can see that the classifier’s dataSet property is empty.

```
cVerboseStorageFalse.dataSet
```

```
ans =
     []
```

Earlier versions of the PRT had no verboseStorage property, and the dataSet was always saved. You can see how this might create problems when dataSets get large. We originally used the dataSet to determine plot limits and other things for the classifier plot as well. Now we use the dataSetSummary field to create plots. All prtDataSets must have a summarize() method that yields a structure that can be used by other actions when plotting. You can see that for the examples above, the value of verboseStorage does not change the dataSetSummary. This is how prtClass.plot() knows what image bounds to use for plotting.

```
c.dataSetSummary
cVerboseStorageFalse.dataSetSummary
```

```
ans = 
          upperBounds: [4.7304 4.7950]
          lowerBounds: [-4.0730 -3.5644]
            nFeatures: 2
    nTargetDimensions: 1
        nObservations: 400
        uniqueClasses: [2x1 double]
             nClasses: 2
               isMary: 0

ans = 
          upperBounds: [4.7304 4.7950]
          lowerBounds: [-4.0730 -3.5644]
            nFeatures: 2
    nTargetDimensions: 1
        nObservations: 400
        uniqueClasses: [2x1 double]
             nClasses: 2
               isMary: 0
```

Since verboseStorage is a property of prtAction, it is also a property of prtAlgorithm. When you set the verboseStorage property for an algorithm, you are actually setting the verboseStorage property for all actions within the algorithm. If verboseStorage is true for a prtAlgorithm, you can use prtAlgorithm.plot() to explore what the data coming into any stage of the algorithm (the training data) looks like. Here is a quick example. Note: plotting prtAlgorithms requires GraphViz, and there may be issues with the current version of GraphViz and the PRT; if you hit one, please file an issue on GitHub.

```
algo = prtPreProcZmuv + prtClassRvm/prtClassPlsda + prtClassLogisticDiscriminant;
algo.verboseStorage = true; % Just for clarity, this is the default
trainedAlgo = train(algo, ds);
plot(trainedAlgo);
```

Boxes with bold outlines are clickable. Double-clicking one will open another figure and call plot on that action. For example, double-clicking on the RVM plots the resulting decision contours (and the dataSet after it has been preprocessed using ZMUV; notice the X and Y labels).

Similarly you can plot the PLSDA decision contour (also with the preprocessed ZMUV data).

The fusion of the two classifiers is shown by clicking on the prtClassLogisticDiscriminant.

The total confidence provided by the output of the algorithm is shown as a function of the input dataSet by clicking on the output block. Notice here that the features are the original input features and the contours show the contours of the entire algorithm.

If we repeat the whole process with verboseStorage false, you will see that the resulting plots do not include the dataSet, just like before.

```
algo = prtPreProcZmuv + prtClassRvm/prtClassPlsda + prtClassLogisticDiscriminant;
algo.verboseStorage = false; % This will feed through to all actions
trainedAlgo = train(algo, ds);
plot(trainedAlgo);
```

As an example here is just the final output contours of the algorithm.

Well, that’s verboseStorage. If you have a big dataset you probably want to turn it off, but if you don’t, it can be useful for fully exploring an algorithm. Let us know what you think.

Our classifier currently only allows standard batch back-propagation learning. It should be relatively easy to include new training approaches, but we haven’t done so yet. The current prtClassNnet only allows for three-layer (one hidden-layer) networks. Depending on who you ask, this is either very important, or not important at all. In either case, we hope to expand the capabilities here eventually. The current formulation only works for binary classification problems. Extensions to enable multi-class classification are also in progress.

prtClassNnet acts pretty much the same as any other classifier. As you might expect, we can set the number of neurons in the hidden layer, the min and max number of training epochs, and the tolerance to check for convergence:

```
nnet = prtClassNnet;
nnet.nHiddenUnits = 10;
nnet.minIters = 10000;
nnet.relativeErrorChangeThreshold = 1.0000e-04; % check for convergence if nIters > minIters
nnet.maxIters = 100000; % kick out after this many, no matter what
```

The activation functions are an important part of neural network design. The prtClassNnet object allows you to manually specify the activation function, but you need to set both the “forward function” and the first derivative of the forward function. These can be specified using function handles in the fields fwdFn and fwdFnDeriv. The “classic” formulation of a neural network uses a sigmoid activation function, so the parameters can be set like so:

```
sigmoidFn = @(x) 1./(1 + exp(-x));
nnet.fwdFn = sigmoidFn;
nnet.fwdFnDeriv = @(x) sigmoidFn(x).*(1-sigmoidFn(x));
```

prtClassNnet enables automatic visualization of the algorithm’s progress as learning proceeds. You can set how often (or whether) this visualization occurs by setting nnet.plotOnIter to a scalar; the scalar represents how often to update the plots. Use 0 to disable visualization.

So, what does the resulting process look like? Let’s give it a whirl with a standard X-OR data set:

```
dsTrain = prtDataGenXor;
dsTest = prtDataGenXor;
nnet = prtClassNnet('nHiddenUnits',10,'plotOnIter',1000,'relativeErrorChangeThreshold',1e-4);
nnet = nnet.train(dsTrain);
yOut = nnet.run(dsTest);
```

We hope that using prtClassNnet enables you to do some new, neat things. If you like it, please help us re-write the code to overcome our current restrictions!

Happy coding.

“All this is well and good, but how do I know whether my problem is appropriate for use with an SVM? I’m doing object tracking – is that an SVM-like problem?”

This question is extremely deep and subtle, and it comes up *a lot*. Let’s break it down into some related sub-questions:

- What do we mean when we talk about “SVMs” or RVMs, or random forests, neural networks, or other ‘supervised learning’ approaches? And what types of problems are these intended to solve?
- Is my problem one of those problems? (or, “What kind of problem is my problem?”)
- Is that all “machine learning” is? What other kinds of problems are there?

As we mentioned, these questions may only admit rather theoretical-sounding answers, but we’ll try and give a quick overview in easy-to-understand language.

So, what are we talking about when we talk about ‘machine learning’? 90% of the time, when someone is talking about machine learning, pattern recognition, or statistical inference, they’re really referring to a set of problems that can be boiled down to a label-prediction problem.

Assume we have a number of objects, and a number of different measurements we collect for each object. Let’s use i to index the objects (1 through N) and j to index the measurements, (1 through P). Then the j’th measurement for object i is just x{i,j}.

Let’s use a simple example to cement ideas (this example is stolen from Duda, Hart, and Stork). Pretend that we’re running a fish-processing plant, and we want to automatically distinguish between salmon and tuna as they come down our conveyor belt. I don’t know anything about fish, but we might consider measuring something about each fish, like its size, weight, and color, as it comes down the belt, and we’d like to make an automatic decision based on that information. In that case, x{3,2} might represent, say, the weight (in lbs.) of the 3rd fish. Similarly, x{4,2} is the weight of the fourth fish, and x{1,1} is the size of the first fish. We can use x{i} to represent all the measurements of the i’th fish.

Note that if we assume each x{i} is a 1xP vector, we can form a matrix, X, out of all N of the x{i}’s. X will be size N x P.

So, for each fish, we have x{i}, and in addition to that information, we’ve also collected a bunch of ‘labeled examples’ where we also have y{i}. Each y{i} provides the label of the corresponding x{i}, e.g., y{i} is either ‘tuna’ or ‘salmon’ depending on whether x{i} was measured from a tuna or a salmon – y{i} is the value we’re trying to decipher from x{i}. Usually we use different integers to mean different classes – so y{i} = 0 might indicate tuna, while y{i} = 1 means salmon. Note that we can form a vector, Y, of all N y{i}’s. Y will be size N x 1.

Now, if we’re clever, we’re going to have a lot of labeled examples to get started – this set is called our training set – {X,Y} = { x{i},y{i} } for i = 1…N.
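To make the notation concrete, here is a tiny sketch of a training set (in Python/numpy rather than MATLAB, purely to illustrate the bookkeeping; the fish measurements and labels are invented):

```python
import numpy as np

# Each row is one fish (an observation); each column is one measurement
# (a feature): [size, weight, color-score]. Values are made up.
X = np.array([
    [30.0,   8.2, 0.1],  # x{1}: fish 1
    [95.0,  41.0, 0.7],  # x{2}: fish 2
    [28.0,   7.5, 0.2],  # x{3}: fish 3
    [102.0, 44.5, 0.8],  # x{4}: fish 4
])

# y{i} = 0 means tuna, y{i} = 1 means salmon (also invented)
Y = np.array([0, 1, 0, 1])

N, P = X.shape      # N observations, P measurements per observation
print(N, P)         # 4 3
print(X[2, 1])      # x{3,2}: the weight of the 3rd fish -> 7.5
```

The point is just that once each observation is a fixed-length vector, the whole training set is an N x P matrix X plus an N x 1 label vector Y.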

The goal of supervised learning is to develop techniques for predicting y’s based on x’s. E.g., given the training set, {X,Y}, we’d like to develop a function, f:

```
(guess) y{i} = f(x{i})
```

That’s it. That’s supervised learning. Maybe this problem sounds super simple the way we’ve described it here. I assure you, the general problem is quite complicated, subtle, and interesting. But the basic outline is always the same – you have a training set of data and labels, {X,Y}, and want to learn how to guess y’s given x’s. Here is the vocabulary we’ll use:

- Number of Observations – the number of unique objects (fish) measured (N)
- Dimensionality – the number of measurements taken for each object (P)
- Feature – any column of X, e.g., all the ‘weight’ measurements.
- Label – the value of Y, and the value we want to infer from X
- Observation – any row of X, e.g., all the measurements for object i

Supervised learning is very well studied, and we can divide it up into a number of special cases.

If the set of Y’s you want to guess form a discrete set, e.g., {Tuna, Salmon}, or {Sick, Healthy}, or {Titanium, Aluminum, Tungsten}, you have what’s called a classification problem, and your y{i} values are usually some subset of the integers. More on wikipedia.

If the set of Y’s you want to guess form a continuous set, e.g., you have x{i} values and y{i} correspond to some other object measurement – say, height, or weight, you have what’s called a regression problem, and your y{i} values are usually some subset of the reals. More on wikipedia.

If you have a number of sets of data {X,Y}{k}, where each classification problem is similar, but not the same (say, in a nearby plant, you want to tell swordfish from halibut), and you want to leverage things you learned in plant 1 to help in plant K, you may have a multi-task learning problem. More on wikipedia.

If you only have labels for sets of observations (and not for individual observations), you probably have a multiple-instance problem. More on wikipedia.

Above we made the explicit assumption that each of the observations you made could be sorted into meaningful vectors of length P, and concatenated to form X, where each column of X corresponds to a unique measurement. That’s not always the case. For example, you might have measured:

- Time-series
- Written text
- Tweets
- ‘Likes’
- Images
- Radar data
- MRI data
- Etc.

Under these scenarios, you need to perform specialized application-specific processing to extract the features that make supervised learning tractable. More on wikipedia.

Now that you know a little about supervised learning, some of the design decisions in the PRT might make a little more sense. For example, in prtDataSetStandard we always use a matrix to store your data, X. That’s because in standard supervised learning problems, X can always be stored as a matrix! Similarly, your labels, Y, are stored in a vector of size N x 1, as should be clear from the discussion above.

Also, prtDataSetClass and prtDataSetRegress make a separation between the classification and regression problems outlined above.

Furthermore, the PRT makes it easy to swap in and out any techniques that fall under the rubric of supervised learning – since algorithms that are appropriate for one task may be completely inadequate for another.

It depends. Maybe? That’s kind of up to you. A whole lot of problems are close to supervised learning problems. Even if your specific problem isn’t exactly supervised learning, most really interesting statistical problems use supervised learning somewhere inside them, so learning some supervised learning is pretty much always a good idea. If you’re not sure if your problem is ‘supervised learning’, maybe an explicit list of other kinds of problems might help…

There are lots and lots of problems out there. Your problem might be much closer to one of them than it is to classic supervised learning. If so, you should explore the literature in that specific sub-field, and see what techniques you can leverage there. But if your problem is far removed from supervised learning, the PRT may not be the right tool for the job – in fact, your problem may require its own set of tools and techniques, and maybe it’s time for you to write a new toolbox!

Here are a few examples of problems that don’t fit cleanly into classic supervised learning although they may make use of supervised learning.

- System Control
- Reinforcement Learning
- Natural language processing
- Network prediction / Matrix completion
- Computer vision
- Video Tracking

And here’s a great paper, from 2007, that’s still quite relevant: Structured Machine Learning: 10 Problems for the Next 10 Years.

We hope this makes at least some of what we mean by “supervised learning” make a little more sense – when it’s appropriate, when it’s not, and whether your problem fits into it.

If your problem is a supervised learning problem, we hope you’ll consider the PRT!

This blog entry will serve two purposes: 1) to provide an introduction to practical issues you (as an engineer or scientist) may encounter when using an SVM on your data, and 2) to be the first in a series of similar “for Engineers & Scientists” posts dedicated to helping engineers understand the tradeoffs, assumptions, and practical details of using various machine learning approaches on their data.

Throughout this post, we’ll be using prtClassLibSvm, which is built directly on top of the fantastic LibSVM library, available here:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

The parameter nomenclature we’re using matches theirs pretty closely, so feel free to leverage their documentation as well.

Typical SVM formulations assume that you have a set of n-dimensional real training vectors, {x_i} for i = 1…N, and corresponding labels {y_i}, y_i \in {-1,1}. Let x_ik represent the k’th element of the vector x_i.

Also assume that you have a relevant kernel function (https://en.wikipedia.org/wiki/Kernel_methods), P, which takes two n-dimensional real vectors as input and outputs a scalar metric: P(x_i,x_j) = z_ij. The most common choice of P is a radial basis function (http://en.wikipedia.org/wiki/Radial_basis_function): P(x_i,x_j) = exp( -( \sum_{k} (x_ik - x_jk)^2 ) / s^2 )

SVMs perform prediction of new labels by calculating:

f(x) = \hat{y} = ( \sum_{i} w_i*P(x_i,x) ) - b > 0

e.g., the SVM learns a representation for the labels (y) based on the data (x) with a linear combination (w) of a set of functions of the training data (x_i) and the test data (x).
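As a sanity check on the equations above, here is a small numerical sketch (plain Python/numpy, not PRT or LibSVM code; the support vectors and weights are invented purely to exercise the formula):

```python
import numpy as np

def rbf_kernel(xi, xj, s=1.0):
    """P(x_i, x_j) = exp(-sum_k (x_ik - x_jk)^2 / s^2)."""
    return np.exp(-np.sum((xi - xj) ** 2) / s ** 2)

def svm_decide(x, support_vectors, w, b, s=1.0):
    """f(x) = (sum_i w_i * P(x_i, x)) - b > 0, mapped to labels {-1, +1}."""
    f = sum(wi * rbf_kernel(xi, x, s) for wi, xi in zip(w, support_vectors)) - b
    return 1 if f > 0 else -1

# Two made-up support vectors, one per class, with hand-picked weights:
sv = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
w = [-1.0, 1.0]   # negative weight on the class -1 support vector
b = 0.0

print(svm_decide(np.array([0.1, -0.1]), sv, w, b))  # near the -1 point -> -1
print(svm_decide(np.array([1.9, 2.1]), sv, w, b))   # near the +1 point -> 1
```

In a real SVM the w_i and b come from training, of course; the sketch only shows how the prediction is a thresholded, weighted sum of kernel evaluations against the training data.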

Binary/M-Ary: Typically, SVMs are appropriate for binary classification problems - multi-class problems require some extensions of SVMs, although in the PRT, SVMs can be used in prtClassBinaryToMaryOneVsAll to emulate multi-class classification.

Data: SVM formulations often assume vector-valued training data; however, as long as a suitable kernel function can be constructed, SVMs can be used on arbitrary data (e.g., string-match distances can be used as a kernel for calculating the distances between character strings). Note, however, that SVMs do assume that the kernel used is a Mercer kernel, so some functions are not appropriate as SVM kernels - http://en.wikipedia.org/wiki/Mercer's_theorem.

Computational Considerations: Depending on the kernel, and particular algorithm under consideration, training an SVM can be very time-consuming for very large data sets. Proper selection of SVM parameters can significantly improve training time. At run-time, SVMs are typically very fast, with computational complexity that grows approximately linearly with the size of the training data set.

As you might imagine, several SVM parameters will have significant effect on overall classification performance. Good performance requires careful selection of each of these; though some general rules-of-thumb can help provide reasonable performance with a minimum of headaches.

Internally, the SVM is going to try and ignore a whole bunch of your training data, by setting their corresponding w_i to zero. This might sound counter-intuitive, but it’s very important, because it makes for fast run-time, and also (it turns out) that setting a bunch of w’s to zero is fundamental to why the SVM performs so well in general (see any number of articles on V-C Theory for more information).

Unfortunately, this presents a dilemma - how much should the SVM try to make w’s zero vs. how much should it try to classify your data absolutely perfectly? Fewer zero-w’s might improve performance on the training set, but reduce the performance of the SVM on an unseen testing set!

The “Cost” parameter in the SVM enables you to control this trade-off. Higher cost leads to more non-zero w’s, and more correctly classified training points, while lower costs tend to generate w vectors with lots of zeros, and slightly worse performance on training data (though performance on testing data may be better).

We usually run a number of experiments for different cost values across a range of, say, 0.01 to 100, though if performance has not plateaued it might make sense to extend this range. The following figures show how the SVM decision boundaries change with varying costs in the PRT.

```
close all;
ds = prtDataGenUnimodal;
c = prtClassLibSvm;
count = 1;
for w = logspace(-2,2,4);
    c.cost = w;
    c = c.train(ds);
    subplot(2,2,count);
    plot(c);
    legend off;
    title(sprintf('Cost: %.2f',c.cost));
    count = count + 1;
end
```

In typical discussions of “cost”, errors in both classes are treated equally – e.g., it’s equally bad to call a “-1” a “1” and vice-versa. In realistic operations, that may not be the case – for example, failing to detect a landmine is significantly worse than calling a coke-can a landmine.

Luckily, SVMs enable us to specify class-specific error costs, so if class 1 has error cost of 1, and class -1 has an error cost of 100, it’s 100x as bad to mistake a “-1” for a “1” as the opposite.

LibSVM implements these class-specific weights using parameters called “w-1”, “w1”, etc. In the PRT, these are implemented as a vector, weights. The following example shows how the effects of changing the error weight on class 1 affects the overall SVM contours. Clearly, as the cost on class 1 increases, the SVM spends more effort to correctly classify red elements.

```
close all;
c = prtClassLibSvm;
count = 1;
for w = logspace(-1,1,4);
    c.weight = [1 w]; % Class0: 1, Class1: w
    c = c.train(ds);
    subplot(2,2,count);
    c.plot();
    legend off;
    title(sprintf('Weight: [%.2f,%.2f]',c.weight(1),c.weight(2)));
    count = count + 1;
end
```

The proper choice of kernel makes a huge difference in the resulting performance of your classifier. We tend to stick with RBF and linear kernels (kernelType = 2 or 0, respectively, in prtClassLibSvm), but several other options (including hand-made kernels) are also possible. The linear kernel doesn’t have any parameters to set, but the RBF has a parameter that can significantly impact performance. In most formulations, the parameter is referred to as sigma, but in LibSVM, the parameter is gamma, and it’s equivalent to 1/sigma. For the RBF, you can set it to any positive value. You can also use the special character ‘k’, and specify a coefficient as a string. ‘k’ will evaluate to the number of features in the data set – e.g., ‘5k’ evaluates to 10 for a 2-dimensional data set.

In general, we find that for normalized data (see below), the default gamma value of ‘k’ (the number of dimensions) works well.
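The ‘k’ convention described above is just string substitution. A hypothetical helper (this is an illustration of the convention, not the PRT’s actual parsing code) might look like:

```python
def resolve_gamma(gamma, n_features):
    """Resolve a gamma spec: a plain number, 'k', or a '<coeff>k' string.

    'k' stands for the number of features in the data set,
    so '5k' -> 5 * n_features.  (Illustrative only.)
    """
    if isinstance(gamma, str) and gamma.endswith('k'):
        coeff = gamma[:-1]
        return (float(coeff) if coeff else 1.0) * n_features
    return float(gamma)

print(resolve_gamma('5k', 2))  # 10.0, matching the example above
print(resolve_gamma('k', 3))   # 3.0 (the default: number of dimensions)
print(resolve_gamma(0.5, 3))   # 0.5 (plain numbers pass through)
```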

The following example code generates 4 example images for SVM decision boundaries for varying gamma parameters.

```
close all;
c = prtClassLibSvm;
count = 1;
d = prtDataGenUnimodal;
for kk = logspace(-1,.5,4);
    c.gamma = sprintf('%.2fk',kk);
    c = c.train(d);
    subplot(2,2,count);
    c.plot();
    title(sprintf('\\gamma = %s',c.gamma));
    legend off;
    count = count + 1;
end
```

Note that for many kernel choices (e.g., the RBF, and many others; see http://en.wikipedia.org/wiki/Kernel_methods#Popular_kernels), the kernel output P(x_i,x_j) depends strongly and non-linearly on the magnitudes of the data vectors. E.g., exp(-1000) is not equal to 1000*exp(-1). In fact, if you refer to the RBF equation above, you’ll notice that if two elements of your vectors have a difference approaching 1000, P(x1,x2) will be dominated by a term like exp(-1000), which by any reasonable metric (and certainly in floating point precision) is exactly 0. This is a bad thing™.
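You can verify this numerically. A quick check (plain Python, independent of the PRT) of how the RBF collapses when feature magnitudes get large:

```python
import math

def rbf(xi, xj, s=1.0):
    # One-dimensional RBF: exp(-(xi - xj)^2 / s^2)
    return math.exp(-((xi - xj) ** 2) / s ** 2)

# Moderate scale: the kernel still carries useful similarity information.
print(rbf(0.5, -0.5))    # exp(-1), about 0.3679

# Large scale: exp of a huge negative number underflows to exactly 0.0,
# so every far-apart pair of points looks identically dissimilar.
print(rbf(50.0, -50.0))  # 0.0 in double precision
```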

In general, non-linear kernel functions should only be applied to data that is guaranteed to be in a reasonable range (e.g., -10 to 10), or data that has been pre-processed to remove outliers or control for data magnitude. The PRT makes several such techniques available – compare and contrast the performance in the following example:

```
close all;
ds = prtDataGenBimodal;
ds.X = 100*ds.X; % scale the data
yOutNaive = kfolds(prtClassLibSvm,ds,3);
yOutNorm = kfolds(prtPreProcZmuv + prtClassLibSvm,ds,3);
[pfNaive,pdNaive] = prtScoreRoc(yOutNaive);
[pfNorm,pdNorm] = prtScoreRoc(yOutNorm);
h = plot(pfNaive,pdNaive,pfNorm,pdNorm);
set(h,'linewidth',3);
legend(h,{'Naive','Pre-Proc'});
title('ROC Curves for Naive and Pre-Processed Application of SVM to Bimodal Data');
```

Clearly, performance on un-normalized data is atrocious, but simple re-scaling achieves good results.

The general procedure in developing an SVM is to optimize both the cost and gamma parameters for your particular data set. You can do this using two for-loops and the PRT:

```
close all;
gammaVec = logspace(-2,1,10);
costVec = logspace(-2,1,10);
ds = prtDataGenUnimodal;
auc = nan(length(gammaVec),length(costVec));
kfoldsInds = ds.getKFoldKeys(3);
for gammaInd = 1:length(gammaVec);
    for costInd = 1:length(costVec);
        c = prtClassLibSvm;
        c.cost = costVec(costInd);
        c.gamma = gammaVec(gammaInd);
        yOut = crossValidate(c,ds,kfoldsInds);
        auc(gammaInd,costInd) = prtScoreAuc(yOut);
        imagesc(auc,[.95 1]);
        colorbar;
        drawnow;
    end
end
title('AUC vs. Gamma Index (Vertical) and Cost Index (Horizontal)');
```

In general, you may not have the time to optimize over your SVM parameters, or may simply not want to. In that case, you can usually get by using ZMUV pre-processing and the default SVM parameters (RBF kernel, cost = 1, gamma = ‘k’):

```
algo = prtPreProcZmuv + prtClassLibSvm;
```

We hope this entry helps you make sense of how to use an SVM in real-world scenarios, and how to optimize the SVM parameters for your particular data set. As always, proper cross-validation is fundamental to good generalizability.

Happy coding.

Simply put, observationInfo is a structure array stored in a prtDataSetStandard that has a number of entries equal to the number of observations in the dataSet. This structure array has user-defined fields that store side information. When observations are removed from a dataSet, the observationInfo structure is also properly indexed. Here is a quick example.

First, let’s make a simple dataset with only 4 observations:

```
X = cat(1,prtRvUtilMvnDraw([0 0],eye(2),2),prtRvUtilMvnDraw([2 2],eye(2),2));
Y = [0; 0; 1; 1];
ds = prtDataSetClass(X,Y);
plot(ds);
```

Now let’s create a structure of observationInfo and set it:

```
obsInfo = struct('fileIndex',{1,2,3,4}','timeOfDay',{'day','night','day','night'}');
ds.observationInfo = obsInfo;
```

If we retain (or remove) observations from this dataSet, the observation info is properly indexed.

```
dsSub = ds.retainObservations([1; 4]);
dsSub.observationInfo.fileIndex
```

```
ans =
     1

ans =
     4
```

A hidden method of prtDataSetClass allows us to “select” observations from a dataSet by evaluating a function on the observationInfo. select() takes a function handle that is evaluated for each entry in the dataSet and returns a logical index. A dataSet containing only the observations for which the function was true is returned.

```
dsDayOnly = ds.select(@(s)strcmpi(s.timeOfDay,'day'));
{dsDayOnly.observationInfo.timeOfDay}
```

```
ans = 
    'day'    'day'
```

Sometimes, with complex dataSets containing lots of observationInfo, sorting through the data can be difficult. There is also a graphical way to view observationInfo and create a function handle for select. This functionality is currently in beta, so make sure you include those directories in your path:

```
prtUiDataSetStandardObservationInfoSelect(ds);
```

Right-clicking (ctrl+click in OS X) in the table allows you to graphically select observations that will be returned by select.

This is a simple example of what observationInfo is and how it can be used. We use it all of the time to manage all of our side information. Because all calls to retainObservations correctly index the observationInfo, the side information is available within actions even during cross-validation. Making use of observationInfo is a quick way to fake a custom type of dataSet that contains other side information. Let us know how you use observationInfo and if you have any ideas to improve it.

Consider a binary classification problem with an M-ary classifier (prtClassMap).

```
dsTrain = prtDataGenUnimodal;
dsTest = prtDataGenUnimodal;
classifier = prtClassMap;
trainedClassifier = train(classifier, dsTrain);
plot(trainedClassifier);
title('Binary Classification with MAP','FontSize',16);
output = run(trainedClassifier, dsTest);
output.nFeatures
```

```
ans =
     1
```

As you can see, the output only has one feature. But consider the same M-ary classifier with a 3-class problem.

```
dsTrainMary = prtDataGenMary;
dsTestMary = prtDataGenMary;
trainedClassifierMary = train(classifier, dsTrainMary);
plot(trainedClassifierMary);
title('M-ary Classification with MAP','FontSize',16);
outputMary = run(trainedClassifierMary, dsTestMary);
outputMary.nFeatures
```

```
ans =
     3
```

Now the output has 3 features. The trend continues with more classes, because prtClassMap declares itself as a native M-ary classifier:

```
classifier.isNativeMary
```

```
ans =
     1
```

Since it is a native M-ary classifier, the output (as it should) has a column corresponding to the confidence of each class. Binary classification is an exception.

By default, the PRT checks when M-ary classifiers are run on binary data to see if it should output a single confidence or binary confidences. The mode of operation is stored in twoClassParadigm, which by default is set to the string ‘binary’.

```
classifier.twoClassParadigm
```

```
ans =
binary
```

If we return to the binary classification problem and set twoClassParadigm to ‘mary’, we can see that we get the two outputs that we expect.

```
classifier.twoClassParadigm = 'mary';
trainedClassifierBinaryActingAsMary = train(classifier, dsTrain);
outputBinaryActingAsMary = run(trainedClassifierBinaryActingAsMary, dsTest);
outputBinaryActingAsMary.nFeatures
```

```
ans =
     2
```

We debated this, but ultimately we decided that it was less confusing this way. It is rarer to want both outputs for a binary classification problem than it is to expect a single one. Most users expect code like this to run out of the box:

```
classifier = prtClassMap;
trainedClassifier = train(classifier, dsTrain);
output = run(trainedClassifier, dsTest);
prtScoreRoc(output);
title('Binary Classification with an M-ary Classifier','FontSize',16);
```

If you change the classifier to a native binary classifier (prtClassGlrt, for example), you expect the same code to work by changing only the classifier declaration.

```
classifier = prtClassGlrt;
trainedClassifier = train(classifier, dsTrain);
output = run(trainedClassifier, dsTest);
prtScoreRoc(output);
title('Binary Classification with a Binary Classifier','FontSize',16);
```

Without the twoClassParadigm system, it wouldn’t be as easy to switch classifiers for binary problems, as an extra step would be required to manually select which column of the output should be used to calculate the ROC.

Well that’s twoClassParadigm. It’s a convenient feature of the PRT that not many people know about because they don’t have to. In the rare cases when you want the confidence assigned to each class, you now know how to get them.

A few weeks ago we talked about clustering with K-Means, and using K-Means distances as a pre-processing step. K-Means is great when Euclidean distance in your input feature-space is meaningful, but what if your data instead lies on a high-dimensional manifold?

We recently introduced new clustering and distance-metric approaches suitable for these cases: spectral clustering. The theory behind spectral clustering is beyond the scope of this entry, but as usual, the Wikipedia page has a good summary: http://en.wikipedia.org/wiki/Spectral_clustering.

Although I'm writing the blog entry, all of the code in this demo was written by one of our graduate students at Duke University, Dmitry Kalika, who's a new convert to the PRT. Welcome, Dima!

Throughout the following, and in the PRT's spectral clustering code, we make use of the excellent Bengio et al., 2003 paper, "Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering": http://www.iro.umontreal.ca/~lisa/pointeurs/tr1238.pdf

In particular, we use that extension to approximate out-of-sample spectral embeddings.

Spectral clustering typically relies upon what’s referred to as a spectral embedding; this is a low-dimensional representation of a high-dimensional proximity graph.
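To make "spectral embedding" concrete, here is a minimal NumPy sketch of one standard construction (an RBF affinity graph, followed by the top eigenvectors of its symmetrically normalized form). This is an illustration of the idea, not the PRT's exact implementation:

```python
import numpy as np

def spectral_embed(X, sigma=1.0, n_dims=2):
    # RBF affinity graph between all pairs of observations
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    # Symmetric normalization: D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ W @ D_inv_sqrt
    # Leading eigenvectors form the low-dimensional embedding
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return vecs[:, -n_dims:]

X = np.random.RandomState(0).randn(20, 3)
emb = spectral_embed(X, sigma=1.0, n_dims=2)   # 20 observations, 2 features
```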

We can use features derived from spectral embeddings like so:

```
ds = prtDataGenBimodal;
dsTest = prtDataGenBimodal(10);
algo = prtPreProcSpectralEmbed;
algo = algo.train(ds);
yOut = algo.run(ds);
plot(yOut);
```

While spectral embedding provides a feature space for additional processing, we can also use prtClusterSpectralKmeans to perform direct clustering in the spectral space.

For example, the Moon data set (see prtDataGenMoon) creates two crescent-moon shapes that are not well separated by Euclidean distance metrics, but can be easily separated in spectral-cluster space.

```
ds = prtDataGenMoon;
preProc = prtPreProcZmuv;
preProc = preProc.train(ds);
dsNorm = preProc.run(ds);
kmeans = prtClusterKmeans('nClusters',2);
kmeansSpect = prtClusterSpectralKmeans('nClusters',2);
kmeans = kmeans.train(dsNorm);
kmeansSpect = kmeansSpect.train(dsNorm);
subplot(1,2,1);
plot(kmeans);
title('K-Means Clusters');
subplot(1,2,2);
plot(kmeansSpect);
title('Spect-K-Means Clusters');
```

Spectral clustering provides a very useful technique for non-linear and non-Euclidean clustering. Right now our spectral clustering approaches are constrained to using RBF kernels, though nothing prevents future versions from supporting alternate kernels.

As always, let us know if you have questions or comments.

Note that, unlike most of our other objects, prtClusterMeanShift requires the Bioinformatics Toolbox.

As you might expect, we start by generating some data, and a prtClusterMeanShift object:

```
ds = prtDataGenUnimodal;
ms = prtClusterMeanShift;
```

We can train, run, and plot the mean-shift algorithm just like anything else:

```
ms = ms.train(ds);
plot(ms);
```

In the above figure, the mean-shift algorithm correctly identified two clusters. We can vary the Gaussian bandwidth parameter (sigma) to see how it affects the number of clusters mean-shift finds:

```
sigmaVec = [.1 .3 .6 1 2 5];
for ind = 1:length(sigmaVec)
    ms = prtClusterMeanShift;
    ms.sigma = sigmaVec(ind);
    ms = ms.train(ds);
    subplot(2,3,ind);
    plot(ms);
    prtPlotUtilFreezeColors;
    title(sprintf('sigma = %.2f',sigmaVec(ind)));
end
```

Note how changing the sigma value can drastically alter the number of clusters that mean-shift finds. Careful tuning of that parameter may be necessary for your particular application.
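For intuition, here is a minimal Gaussian mean-shift sketch in NumPy (my own illustration, not the PRT implementation, which also handles convergence detection and mode merging): each point is repeatedly moved to the kernel-weighted mean of the data, and the bandwidth sigma controls how many modes survive.

```python
import numpy as np

def mean_shift(X, sigma=1.0, n_iters=50):
    modes = X.copy()
    for _ in range(n_iters):
        # Gaussian weights from each current mode to all original data points
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * sigma ** 2))
        # Move each mode to the weighted mean of the data
        modes = (w @ X) / w.sum(axis=1, keepdims=True)
    return modes

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 5, rng.randn(20, 2) + 5])  # two blobs
modes = mean_shift(X, sigma=2.0)
# With sigma comparable to the cluster spread, points in each blob collapse
# toward a single shared mode; a much smaller sigma would leave many modes.
```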

We mentioned before that you can use mean shift in image processing – here’s a quick and dirty example applying mean shift to the famous “cameraman” photo:

```
I = imread('cameraman.tif');
I = imresize(I,0.25);
I = double(I);
[II,JJ] = meshgrid(1:size(I,2),1:size(I,1));
ij = bsxfun(@minus,cat(2,II(:),JJ(:)),size(I));

ds = prtDataSetClass(cat(2,I(:)-128,ij));
ms = train(prtClusterMeanShift('sigma',200),ds);
out = run(ms, ds);
[~,out] = max(out.X,[],2);

figure('position',[479 447 1033 366]);
subplot(1,2,1);
imagesc(I);
colormap(gray(256));
prtPlotUtilFreezeColors;
title('Cameraman.tif','FontSize',16);

subplot(1,2,2);
imagesc(reshape(out,size(I)));
colormap(prtPlotUtilClassColors(ms.nClusters));
prtPlotUtilFreezeColors;
title('Cameraman.tif - Mean Shift','FontSize',16);
```

Determining convergence in a mean-shift scenario can actually be pretty subtle; the code we provide is based on: Miguel A. Carreira-Perpinan, "Fast Nonparametric Clustering with Gaussian Blurring Mean-Shift," ICML 2006. http://dl.acm.org/citation.cfm?id=1143864

That's all for now. If you have the Bioinformatics Toolbox, have fun with prtClusterMeanShift. If you don't, we need to find or write a replacement for graphconncomp to decouple mean shift from that toolbox. One day, hopefully.

Today we'd like to talk about using K-Means as a non-linear feature extraction algorithm. This is becoming a pretty popular way to deal with a number of classification tasks, since K-Means followed by linear classification is relatively easy to parallelize and works well on very large data sets.

We'll leave large-data-set processing for another time and, for now, just look at a new prtPreProc object: prtPreProcKmeans.

You may have used prtClusterKmeans before and wonder why we need prtPreProcKmeans; the answer is a little subtle. prtCluster* objects are expected to output the maximum a posteriori cluster assignments. For feature extraction, however, we actually want to output the distance from each observation to each cluster center (rather than the class outputs). You can see the difference in the following:

```
ds = prtDataGenBimodal;
cluster = prtClusterKmeans('nClusters',4);
preProc = prtPreProcKmeans('nClusters',4);
cluster = cluster.train(ds);
preProc = preProc.train(ds);
dsCluster = cluster.run(ds);
dsPreProc = preProc.run(ds);
subplot(1,2,1);
imagesc(dsCluster);
title('Cluster Assignments');
subplot(1,2,2);
imagesc(dsPreProc);
title('Cluster Distances');
```
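The distinction is easy to state concretely. In this NumPy sketch (illustrative; `cluster_features` is my name, not a PRT function), the hard cluster assignment is just the argmin of the same per-center distance matrix that the pre-processing view keeps as features:

```python
import numpy as np

def cluster_features(X, centers):
    # Distance from every observation to every cluster center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)  # what a clusterer would output
    return assignments, dists           # dists: the nClusters-dim feature space

X = np.array([[0.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]])
assignments, dists = cluster_features(X, centers)
```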

We can combine prtPreProcKmeans with any classifier. Let's try it with a logistic discriminant and see how well we can do:

```
algoSimple = prtClassLogisticDiscriminant;
algoKmeans = prtPreProcKmeans('nClusters',4) + prtClassLogisticDiscriminant;
yOutSimple = kfolds(algoSimple,ds,5);
yOutKmeans = kfolds(algoKmeans,ds,5);

yOutAll = catFeatures(yOutSimple,yOutKmeans);
[pf,pd] = prtScoreRoc(yOutAll);
subplot(1,1,1);
h = prtUtilCellPlot(pf,pd);
set(h,'linewidth',3);
legend(h,{'Log Disc','K-Means + Log-Disc'});
xlabel('Pfa');
ylabel('Pd');
```

We can visualize the resulting decision boundary using a hidden (and undocumented) method of prtAlgorithm that lets us plot algorithms as though they were classifiers, as long as certain conditions are met.

Here’s an example:

```
algoKmeans = algoKmeans.train(ds);
algoKmeans.plotAsClassifier;
title('K-Means + Logistic Discriminant');
```

K-Means pre-processing is a potentially powerful way to combine simple clustering and simple classification algorithms to form powerful non-linear classifiers.

We’re working on some big additions to the PRT in the next few weeks… especially dealing with very large data sets. Stay tuned.

For a lot of high-dimensional data sets, it turns out that creating an observations x features image of the data is a great way to visualize and understand your data. This week we made that process a little easier and cleaner by introducing a new method of prtDataSetClass: imagesc.

The method takes care of a number of things that were a little tricky to do previously: first, it makes sure the observations are sorted by class index; next, it creates an image of all the data with black bars denoting the class boundaries; and finally, it makes the y-tick marks contain the relevant class names.

It’s now easy to generate clean visualizations like so:

```
ds = prtDataGenCylinderBellFunnel;
ds.imagesc;
```

Of course, you can do the same thing with other data sets, too. Look at how easy it is to see which features are important in prtDataGenFeatureSelection:

```
ds = prtDataGenFeatureSelection;
ds.imagesc;
```

That’s it for this week. We use imagesc-based visualization all the time, and hopefully you’ll find it interesting and useful, too.

prtDataGenSandP500 generates data containing stock-price information from the S&P 500. The information dates back to January 3, 1950, and includes the index's open, close, volume, and other features.

Check it out:

```
ds = prtDataGenSandP500;
ds.featureNames
spClose = ds.retainFeatures(5);
plot(spClose.X,'linewidth',2);
title('S&P 500 Closing Value vs. Days since 1/3/1950');
```

ans = 'Date' 'Open' 'High' 'Low' 'Close' 'Volume' 'AdjClose'

If you can do decent prediction on that data… you might be able to make some money :)

prtDataGenCylinderBellFunnel generates a synthetic data set containing a number of time series, each of which has either a flat plateau (cylinder), a rising slope (bell), or a falling slope (funnel).

You can find the specification we used to generate the data here: http://www.cse.unsw.edu.au/~waleed/phd/html/node119.html

And the data was used in an important paper in the data-mining community – Keogh and Lin, Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research. http://www.cs.ucr.edu/~eamonn/meaningless.pdf

```
ds = prtDataGenCylinderBellFunnel;
imagesc(ds.X);
title('Cylinders (1:266), Bells (267:532), and Funnels (533:798)');
```

That’s all for now. Hope you enjoy these new data sets, we’re always adding new data to the PRT; let us know what you’d like to see!

"Kernel" has a very precise meaning in certain contexts (Mercer kernels, for example), so it is important that we define exactly what we mean by a prtKernel. A prtKernel is a standard prtAction. That means it supports the train operation, which takes a dataset and outputs a prtKernel with modified parameters, and the run operation, which takes a dataset and outputs a modified dataset. What makes prtKernels different from other prtActions is that they typically transform a dataset into a different dimensionality. The new features are usually the distances to a collection of training examples, and most kernels differ in their choice of distance function. The most widely used kernel is the radial basis function (RBF).
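Concretely, running a trained RBF kernel maps each observation to its similarity to every stored training example, so the output dimensionality equals the number of training points. A NumPy sketch of that run step, using one common parameterization, exp(-d^2/sigma^2) (the PRT's exact convention may differ):

```python
import numpy as np

def rbf_kernel_run(X_train, X_test, sigma=2.0):
    # Squared distance from each test observation to each training example
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (sigma ** 2))   # one similarity column per training point

X_train = np.random.RandomState(0).randn(30, 2)
K = rbf_kernel_run(X_train, X_train, sigma=2.0)  # 30 observations -> 30 features
```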

Let’s look at using prtKernelRbf.

```
ds = prtDataGenBimodal;
kernel = prtKernelRbf('sigma',2);       % Set the kernel parameter
trainedKernel = kernel.train(ds);       % Train the kernel using the input data
kernelTransformedData = trainedKernel.run(ds);
subplot(2,1,1);
plot(ds);
subplot(2,1,2);
imagesc(kernelTransformedData.X);
colormap(hot);
title('Kernel Transformation');
ylabel('observation');
xlabel('feature');
```

You can see in the image space that there is a checkerboard pattern highlighting the multi-modal nature of the data.

```
kernelRbf = prtKernelRbf('sigma',2);
trainedKernelRbf = kernelRbf.train(ds);
kernelTransformedDataRbf = trainedKernelRbf.run(ds);

subplot(2,2,1);
imagesc(kernelTransformedDataRbf.X);
title('RBF Kernel Transformation');
ylabel('observation');
xlabel('feature');

kernelHyp = prtKernelHyperbolicTangent;
trainedKernelHyp = kernelHyp.train(ds);
kernelTransformedDataHyp = trainedKernelHyp.run(ds);

subplot(2,2,2);
imagesc(kernelTransformedDataHyp.X);
title('Hyperbolic Tangent Kernel Transformation');
ylabel('observation');
xlabel('feature');

kernelPoly = prtKernelPolynomial;
trainedKernelPoly = kernelPoly.train(ds);
kernelTransformedDataPoly = trainedKernelPoly.run(ds);

subplot(2,2,3);
imagesc(kernelTransformedDataPoly.X);
title('Polynomial Kernel Transformation');
ylabel('observation');
xlabel('feature');

kernelDirect = prtKernelDirect;
trainedKernelDirect = kernelDirect.train(ds);
kernelTransformedDataDirect = trainedKernelDirect.run(ds);

subplot(2,2,4);
imagesc(kernelTransformedDataDirect.X);
title('Direct Kernel Transformation');
ylabel('observation');
xlabel('feature');
```

You can see how the choice of kernel (and kernel parameters) can really affect the resulting feature space. It is also interesting to note that the direct kernel is not actually a kernel at all; it just uses the data as the output feature space (essentially doing nothing). This is useful for combining the original feature space with kernel-transformed data using kernel sets.

Kernels can be combined using the & operator to create prtKernelSets, which perform column-wise concatenation of several kernels. This lets you build a single kernel transformation out of several prtKernels. In theory you could use / to make a parallel prtAlgorithm that accomplishes the same task, but using & allows the kernels to work within prtClassRvm and prtClassSvm while remaining efficient at run-time.
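Conceptually, running a kernel set just column-concatenates the outputs of its member kernels. A NumPy sketch of that idea (the function and the toy kernel definitions are my illustrations, not PRT code):

```python
import numpy as np

def kernel_set_run(kernel_fns, X_train, X_test):
    # Column-wise concatenation of each member kernel's output
    return np.hstack([k(X_train, X_test) for k in kernel_fns])

# DC "kernel": a constant column; RBF kernel: similarity to each training point
dc = lambda Xtr, Xte: np.ones((len(Xte), 1))
rbf = lambda Xtr, Xte: np.exp(-((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1))

X = np.random.RandomState(0).randn(10, 2)
K = kernel_set_run([dc, rbf], X, X)   # 1 + nTrain = 11 columns
```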

```
clf; % Clear those subplots from earlier
kernel = prtKernelDc & prtKernelRbf('sigma',1) & prtKernelHyperbolicTangent;
trainedKernel = kernel.train(ds);   % Train the kernel using the input data
kernelTransformedData = trainedKernel.run(ds);
imagesc(kernelTransformedData.X);
title('A Kernel Set Transformation');
ylabel('observation');
xlabel('feature');
```

You can see that the different transformed feature spaces are concatenated together.

In prtClassRvm, the "kernels" property can be set to the prtKernel of our choosing. The RVM is essentially a sparse linear classifier (it tries to drive most coefficients to zero) that operates on kernel-transformed data. Let's look at some classification results on prtDataGenBimodal using several different choices for the kernel.

```
subplot(2,2,1);
plot(train(prtClassRvm('kernels',prtKernelRbf('sigma',2)),ds));
title('RBF Kernel RVM');

subplot(2,2,2);
plot(train(prtClassRvm('kernels',prtKernelHyperbolicTangent),ds));
title('Hyperbolic Tangent Kernel RVM');

subplot(2,2,3);
plot(train(prtClassRvm('kernels',prtKernelPolynomial),ds));
title('Polynomial Kernel RVM');

subplot(2,2,4);
plot(train(prtClassRvm('kernels',prtKernelDirect),ds));
title('Direct Kernel RVM');
```

As you can see, the correct choice of kernel is very important for robust classification. The RBF kernel is a common choice, but even it has the sigma parameter, which can greatly impact performance. One interesting variant of the RBF kernel is called prtKernelRbfNeighborhoodScaled. This kernel sets the sigma parameter differently for each data point, depending on the local neighborhood of the training point.

```
clf; % Clear those subplots from earlier
plot(train(prtClassRvm('kernels',prtKernelRbfNeighborhoodScaled),ds));
title('Locally Scaled RBF Kernel RVM');
```

In the forum the other day, someone asked if we could do non-linear regression with multi-dimensional output. Sadly, the answer is "not directly," but using kernels you can: by transforming the data to kernel space and then applying a linear regression technique, you can perform non-linear regression. I won't copy the content over here, but check out the answer on the forum: http://www.newfolderconsulting.com/node/412

This was a pretty quick overview of the things you can do with kernels in the PRT. We don't have every kernel, but we have quite a few. If there's something you think we should add, let us know.

Principal component analysis (PCA) is a widely used technique in the statistics and signal processing literature. Even if you haven't heard of PCA, if you know some linear algebra you may have heard of the singular value decomposition (SVD), or, if you come from the signal processing literature, you've probably heard of the Karhunen-Loeve transform (KLT). Both of these are identical in form to PCA. It turns out a lot of different groups have re-created the same algorithm in a lot of different fields!

We won't have time to delve into the nitty-gritty of PCA here. For our purposes it's enough to say that given a (zero-mean) data set X of size nObservations x nFeatures, we often want to find a linear transformation of X, S = X*Z, for a matrix Z of size nFeatures x nPca, where:

1) nPca < nFeatures

2) The resulting data, S, contains "most of the information from" X.

As you can imagine, the phrase "most of the information" is vague and subject to interpretation. Mathematically, PCA considers "most of the information in X" to be equivalent to "explains most of the variance in X." It turns out that this statement of the problem has some very nice mathematical solutions; e.g., the columns of Z can be taken to be the dominant eigenvectors of the covariance matrix of X!
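This is the standard SVD route to PCA, sketched in NumPy (a textbook derivation, not the PRT source): zero-mean the columns of X, take the SVD, and use the top right singular vectors as the columns of Z, so that S = X*Z.

```python
import numpy as np

def pca_project(X, n_pca):
    Xc = X - X.mean(axis=0)                   # zero-mean the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Vt[:n_pca].T                          # nFeatures x nPca projection matrix
    return Xc @ Z, Z                          # S = X*Z, and Z itself

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
S, Z = pca_project(X, 2)
# Columns of Z are orthonormal; columns of S are uncorrelated by construction.
```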

You can find out more about PCA in the fabulous Wikipedia article: https://en.wikipedia.org/wiki/Principal_component_analysis.

PCA is implemented in the PRT using prtPreProcPca. Older versions of prtPreProcPca used different algorithms for different sized data sets (there are a lot of ways to do PCA quickly depending on matrix dimensions). Since 2012, we found that the MATLAB function SVDS was beating all of our approaches in terms of speed and accuracy, so we have switched over to using SVDS to solve for the principal component vectors.

Let’s take a quick look at some PCA projections. First, we’ll need some data:

```
ds = prtDataGenUnimodal;
```

We also need to make a prtPreProcPca object, and we’ll use 2 components in the PCA projection:

```
pca = prtPreProcPca('nComponents',2);
```

prtPreProc* objects can be trained and run just like any other objects:

```
pca = pca.train(ds);
```

Let's visualize the results. First we'll look at the original data and the vectors from the PCA analysis:

```
plot(ds);
hold on;
h1 = plot([0 pca.pcaVectors(1,1)],[0,pca.pcaVectors(2,1)],'k');
h2 = plot([0 pca.pcaVectors(1,2)],[0,pca.pcaVectors(2,2)],'k--');
set([h1,h2],'linewidth',3);
hold off;
axis equal;
title('Original Data & Two PCA Vectors');
```

From this plot, we can see that the first PCA vector is oriented along the dimension of largest variance in the data (the diagonal with a positive slope), and the second PCA vector is orthogonal to the first.

We can project our data onto this space using the run method:

```
dsPca = pca.run(ds);
plot(dsPca);
title('PCA-Projected Data');
```

In general, it might be somewhat complicated to determine how many PCA components are necessary to explain most of the variance in a particular data set. Above we used 2, but for higher dimensional data sets, how many should we use in general?

We can measure how much variance each principal component explains by exploring the vector pca.totalPercentVarianceCumulative, which is set during training. This vector contains the percent of the data set's total variance explained by the first 1:N PCA components. For example, totalPercentVarianceCumulative(3) contains the percent variance explained by components 1 through 3. When this metric plateaus, that's a pretty good sign that we have enough components.
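The same curve can be computed directly from the singular values of the zero-mean data. A NumPy sketch of the math behind the property (my illustration; the PRT's internal computation may differ in detail):

```python
import numpy as np

def total_percent_variance_cumulative(X):
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2                              # variance captured per component
    return 100.0 * np.cumsum(var) / var.sum()

rng = np.random.RandomState(0)
# Four features with very different scales: most variance in the first one
X = rng.randn(200, 4) * np.array([5.0, 2.0, 1.0, 0.1])
curve = total_percent_variance_cumulative(X)
# curve is nondecreasing and ends at 100%; the plateau tells you how many
# components you need.
```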

For example:

```
ds = prtDataGenProstate;
pca = prtPreProcPca('nComponents',ds.nFeatures);
pca = pca.train(ds);
stem(pca.totalPercentVarianceCumulative,'linewidth',3);
xlabel('#Components');
ylabel('Percent Variance Explained');
title('Prostate Data - PCA Percent Variance Explained');
```

For PCA to be meaningful, the data must have zero-mean columns, and prtPreProcPca takes care of that for you (so you don't have to zero-mean the columns yourself). However, different authors disagree about whether the columns provided to PCA should also be scaled to the same variance before analysis; depending on normalization, you can get very different PCA projections. To leave the option open, the PRT does **not** automatically normalize the columns of the input data to have uniform variance. You can enforce this yourself before PCA processing with prtPreProcZmuv.

Here’s a simplified example, where we do the two processes separately to show the differences.

```
ds = prtDataGenProstate;
dsNorm = rt(prtPreProcZmuv,ds);
pca = prtPreProcPca('nComponents',ds.nFeatures);
pca = pca.train(ds);
pcaNorm = pca.train(dsNorm);

subplot(2,1,1);
stem(pca.totalPercentVarianceCumulative,'linewidth',3);
xlabel('#Components');
ylabel('Percent Variance Explained');
title('Prostate Data - PCA Percent Variance Explained');

subplot(2,1,2);
stem(pcaNorm.totalPercentVarianceCumulative,'linewidth',3);
xlabel('#Components');
ylabel('Percent Variance Explained');
title('Prostate Data - PCA Percent Variance Explained (Normalized Data)');
```

As you can see, processing normalized and un-normalized data results in quite different assessments of how many PCA components are required to summarize the data.

Our recommendation: if your data comes from different sources with different sensor ranges or variances (as in the prostate data), it's imperative that you perform standard-deviation normalization prior to PCA processing. Otherwise, it's worthwhile to try both with and without ZMUV pre-processing and see which gives better performance.

That’s about it for PCA processing. Of course, you can use PCA as a pre-processor for any algorithm you’re developing, to reduce the dimensionality of your data, for example:

```
algo = prtPreProcPca + prtClassLibSvm;
```

Let us know if you have questions or comments about using prtPreProcPca.

`rt()` stands for "run, train". You can use it to perform a "run on train" operation when you only need the output dataset. In other words, it's a quick-and-dirty method for when you would otherwise need to call `run(train(action, dataset), dataset)`. To see how one might use `rt()`, let's calculate the first two principal components of Fisher's iris dataset.

```
ds = prtDataGenIris;
pca = train(prtPreProcPca('nComponents',2), ds);
dsPca = run(pca, ds);
```

If we didn’t really care about keeping the (trained) PCA object around we could have done this all in one line.

```
dsPca = run(train(prtPreProcPca('nComponents',2), ds),ds);
```

That string at the end, `ds),ds)`, is odd-looking, and this is where `rt()` comes to the rescue.

```
dsPca = rt(prtPreProcPca('nComponents',2), ds);
```

That’s how you can use `rt()` in a nutshell. It’s a nice method to keep in your back pocket when you are just beginning to explore a dataset and cross-validation isn’t yet on your mind. Remember `rt()` is a (hidden) method of `prtAction` and therefore can be used with all classifiers, pre-processors etc. Let us know if you find any use for it. We do.

The data we're going to use comes from a Kaggle competition running from now (March 28, 2013) until April 15, 2013. Kaggle is a company that specializes in connecting data analysts with interesting data; it's a great way for hobbyists and individuals to get started with some data, and potentially win some money! And they generally have a lot of cool data from a lot of interesting problems.

The data we’re going to use is based on identifying an author’s gender from samples of their handwriting. Here’s the URL for the competition home page, which gives some details on the data:

http://www.kaggle.com/c/icdar2013-gender-prediction-from-handwriting

The competition includes several sets of images, as well as some pre-extracted features. The image files can be gigantic, so we're only going to use the pre-extracted features today. Go ahead and download train.csv, train_answers.csv, and test.csv from the link above, and put them in:

```
fullfile(prtRoot,'dataGen','dataStorage','kaggleTextGender_2013');
```

Once the files are in the correct location, you should be able to use:

```
[dsTrain,dsTest] = prtDataGenTextGender;
```

to load in the data.

Obviously, prtDataGenTextGender.m is new, as are a number of other files we’re going to use throughout this example. These include prtEvalLogLoss.m, prtScoreLogLoss.m, prtUtilAccumArrayLike.m, and prtUtilAccumDataSetLike.m. You’ll need to update your PRT to the newest version (as of March, 2013, anyway) to get access to these files. You can always get the PRT here: http://github.com/newfolder/PRT

Once you’ve done all that, go ahead and try the following:

```
[dsTrain,dsTest] = prtDataGenTextGender;
```

That should load in the data. As always, we can visualize the data using something simple, like PCA:

```
pca = prtPreProcPca;
pca = pca.train(dsTrain);
dsPca = pca.run(dsTrain);
plot(dsPca);
title('Kaggle Handwriting/Gender ICDAR 2013 Data');
```

Kaggle competitions often provide a baseline performance metric for some standard classification algorithms. In this case, they told us that the baseline random forest performance they observed obtains a log-loss of about 0.65. We can confirm this using our random forest, 3-fold cross-validation, and our new function prtScoreLogLoss:

```
yOut = kfolds(prtClassTreeBaggingCap,dsTrain,3);
logLossInitialRf = prtScoreLogLoss(yOut);
fprintf('Random Forest LogLoss: %.2f\n',logLossInitialRf);
```

Random Forest LogLoss: 0.64

About 0.65, so we’re right in the ball-park. Can we do better?
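For reference, the binary log-loss that prtScoreLogLoss reports is the mean negative log-likelihood of the labels under the predicted confidences. A NumPy sketch of the standard formula (the PRT's clipping and edge-case handling may differ):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    # y: labels in {0, 1}; p: predicted P(y == 1), clipped away from 0 and 1
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
loss = log_loss(y, p)   # lower is better; 0.693 (= ln 2) is chance for p = 0.5
```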

That performance wasn’t that great. And the leaderboard shows us that some clever people have already done significantly better than the basic random forest.

Let’s investigate the data a little and see what’s going on. First, what is the standard deviation of the features?

```
stem(log(std(dsTrain.X)));
xlabel('Feature Number');
ylabel('Log-\sigma');
title('Log(\sigma) vs. Feature Number');
```

Wow, there are a lot of features with a standard deviation of zero! That means that we can’t learn anything from these features, since they always take the exact same value in the training set. Let’s go ahead and remove these features.

```
fprintf('There are %d features that only take one value... \n',length(find(std(dsTrain.X)==0)));
removeFeats = std(dsTrain.X) == 0;
dsTrainRemove = dsTrain.removeFeatures(removeFeats);
dsTestRemove = dsTest.removeFeatures(removeFeats);
```

There are 2414 features that only take one value…

What happens if we re-run the random forest on this data with the new features removed? The random forest is pretty robust to meaningless features, but not totally impervious… let’s try it:

```
yOutRf = kfolds(prtClassTreeBaggingCap,dsTrainRemove,3);
logLossRfFeatsRemoved = prtScoreLogLoss(yOutRf);
fprintf('Random Forest LogLoss with meaningless features removed: %.2f\n',logLossRfFeatsRemoved);
```

Random Forest LogLoss with meaningless features removed: 0.61

Hey! That did marginally better – our log-loss went from about 0.65 to 0.61 or so. Nothing to write home about, but a slight improvement. What else can we do?

If you pay attention to the data set, you'll notice something interesting: we have four writing samples from each writer. And since our real goal is to identify the gender of each writer, we should be able to average our classifications over each writer and get better performance.

This blog entry introduces a new function called "prtUtilAccumDataSetLike", which acts a lot like "accumarray" in base MATLAB. Basically, prtUtilAccumDataSetLike takes a set of keys of size dataSet.nObservations x 1 and, for the observations corresponding to each unique key, aggregates the data in X and Y and outputs a new data set.

It’s a little complicated to explain – take a look at the help entry for accumarray, and then take a look at this example:

```
writerIds = [dsTrainRemove.observationInfo.writerId]';
yOutAccum = prtUtilAccumDataSetLike(writerIds,yOutRf,@(x)mean(x));
```

The code above outputs a new data set generated by averaging the confidences in yOutRf across sets of writerIds.
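The key-wise averaging itself is simple; here is a NumPy sketch of the idea (illustrative only; the PRT version also aggregates targets and carries metadata along):

```python
import numpy as np

def accum_like(keys, X, fn=np.mean):
    # For each unique key, aggregate the rows of X sharing that key
    keys = np.asarray(keys)
    u_keys = np.unique(keys)
    out = np.vstack([fn(X[keys == k], axis=0) for k in u_keys])
    return u_keys, out

writer_ids = np.array([7, 7, 9, 9])            # hypothetical writer IDs
conf = np.array([[0.2], [0.4], [0.9], [0.7]])  # per-sample confidences
u, averaged = accum_like(writer_ids, conf)     # one averaged row per writer
```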

Does this help performance?

```
logLossAccum = prtScoreLogLoss(yOutAccum);
fprintf('Writer ID Accumulated Random Forest LogLoss: %.2f\n',logLossAccum);
```

Writer ID Accumulated Random Forest LogLoss: 0.59

That’s marginally better still! What else can we try…

When a random forest seems to be doing somewhat poorly, it's often a good idea to take a step back and run a linear classifier in lieu of the fancy random forest. I'm partial to PLSDA as a classifier (see the help entry for prtClassPlsda for more information).

PLSDA has one parameter, the number of components, that we should optimize over. Since each kfolds run is random, we'll run 10 experiments of 3-fold cross-validation for each of 1-30 components in PLSDA. This might take a little while depending on your computer.

We're also going to do something a little tricky here: PLSDA is a linear classifier and won't output values between zero and one by default, but its outputs should be linearly correlated with the confidence that the author of a particular text was male. We can translate PLSDA outputs into values with probabilistic interpretations by attaching a logistic discriminant to the end of our PLSDA classifier. That's easy to do in the PRT:

```
classifier = prtClassPlsda('nComponents',nComp) + prtClassLogisticDiscriminant;
```

```
nIter = 10;
maxComp = 30;
logLossPlsda = nan(maxComp,nIter);
logLossPlsdaAccum = nan(maxComp,nIter);
for nComp = 1:maxComp
    classifier = prtClassPlsda('nComponents',nComp) + prtClassLogisticDiscriminant;
    classifier.showProgressBar = false;
    for iter = 1:nIter
        yOutPlsda = kfolds(classifier,dsTrainRemove,3);
        logLossPlsda(nComp,iter) = prtScoreLogLoss(yOutPlsda);
        yOutAccum = prtUtilAccumDataSetLike(writerIds,yOutPlsda,@(x)mean(x));
        logLossPlsdaAccum(nComp,iter) = prtScoreLogLoss(yOutAccum);
    end
    fprintf('%d ',nComp);
end
fprintf('\n');
```

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Let's take a look at the PLSDA classifier performance as a function of the number of components we used. The following code generates box plots (recall, we ran 3-fold cross-validation 10 times for each number of components between 1 and 30):

```
boxplot(logLossPlsdaAccum');
hold on;
h2 = plot(1:maxComp,repmat(logLossInitialRf,1,maxComp),'k:',...
    1:maxComp,repmat(logLossRfFeatsRemoved,1,maxComp),'b:',...
    1:maxComp,repmat(logLossAccum,1,maxComp),'g:');
hold off;
legend(h2,{'Random Forest Log-Loss','Random Forest - Removed Features','Random Forest - Removed Features - Accum'});
h = findobj(gca,'type','line');
set(h,'linewidth',2);
xlabel('#PLSDA Components');
ylabel('Log-Loss');
title('Log-Loss For PLSDA With Accumulation (vs. # Components) and Random Forest');
```

Wow! The dotted lines here represent the random forest performance we’ve seen, and the boxes represent the performance we get with PLSDA – PLSDA is significantly outperforming our RF classifier on this data!

PLSDA performance seems to plateau around 17 components, so we’ll use 17 from now on.

I think we might have something here – our code gets Log-Losses around 0.46 many times. Let’s actually submit an experiment to Kaggle.

First we’ll train our classifier and test it:

```
classifier = prtClassPlsda('nComponents',17) + prtClassLogisticDiscriminant;
classifier = classifier.train(dsTrainRemove);
yOutTest = classifier.run(dsTestRemove);
writerIdsTest = [dsTestRemove.observationInfo.writerId]';
```

Don’t forget to accumulate:

```
[yOutPlsdaTestAccum,uKeys] = prtUtilAccumDataSetLike(writerIdsTest,yOutTest,@(x)mean(x));
matrixOut = cat(2,uKeys,yOutPlsdaTestAccum.X);
```

And write the output the way Kaggle wants us to:

```
csvwrite('outputPlsda.csv',matrixOut);
```

I made a screen-cap of the results from the output above – here it is:

```
imshow('leaderBoard_2013_03_20.PNG');
```

That’s me in the middle there – #38 out of about 100. And way better than the naive random forest implementation – not too bad!

Can we do better?

One way we can reduce variance and improve performance is to not use all 4652 features left in our data set; we can use feature selection to pick the ones we want!

I'm going to warn you up front: don't run this code unless you want to leave it running overnight. It takes forever... but it gets the job done:

```
warning off;
c = prtClassPlsda('nComponents',17,'showProgressBar',false);
sfs = prtFeatSelSfs('nFeatures',100,'evaluationMetric',@(ds)-1*prtEvalLogLoss(c,ds,2));
sfs = sfs.train(dsTrainRemove);
```

Instead, we already ran that code and saved the results in sfs.mat, which you can download at the end of this post.

For now, let’s look at how performance is affected by the number of features retained:

```
load sfs.mat sfs
set(gcf,'position',[403 246 560 420]); % fix from IMSHOW
plot(-sfs.performance);
xlabel('# Features');
ylabel('Log-Loss');
title('Log-Loss vs. # Features Retained');
```

It looks like performance bottoms out around 60 or so features, and anything past that isn't adding much (though maybe if we selected 1000 or 2000 features we could do better!).

We can confirm this with the following code, which also takes quite a while to run a bunch of experiments on all the sub-sets SFS found for us:

```
logLossClassifierFeatSel = nan(100,10);
for nFeats = 1:100
    for iter = 1:10
        dsTrainRemoveFeatSel = dsTrainRemove.retainFeatures(sfs.selectedFeatures(1:nFeats));
        yOutPlsdaFeatSel = classifier.kfolds(dsTrainRemoveFeatSel,3);
        xOutAccum = prtUtilAccumArrayLike(writerIds,yOutPlsdaFeatSel.X,[],@(x)mean(x));
        yOutAccum = prtUtilAccumArrayLike(writerIds,yOutPlsdaFeatSel.Y,[],@(x)unique(x));
        yOutAccumFeatSel = prtDataSetClass(xOutAccum,yOutAccum);
        logLossClassifierFeatSel(nFeats,iter) = prtScoreLogLoss(yOutAccumFeatSel);
    end
end
```

```
boxplot(logLossClassifierFeatSel');
drawnow;

ylabel('Log-Loss');
xlabel('# Features');
title('Log-Loss vs. # Features');
```

In a minute we’ll down-select the number of features we want to use – but first let’s do one more thing. Recall that we added a logistic discriminant at the end of our PLSDA classifier. That was clever, but after that, we accumulated a bunch of data together. We might be able to run **another** logistic discriminant after the accumulation to do even better!

Let’s see what that code looks like:

```
classifier = prtClassPlsda('nComponents',17) + prtClassLogisticDiscriminant;
classifier = classifier.train(dsTrainRemove);
yOutPlsdaKfolds = classifier.kfolds(dsTrainRemove,3);
yOutAccum = prtUtilAccumDataSetLike(writerIds,yOutPlsdaKfolds,@(x)mean(x));
yOutAccumLogDisc = kfolds(prtClassLogisticDiscriminant,yOutAccum,3);
```

```
logLossPlsdaAccum = prtScoreLogLoss(yOutAccum);
logLossPlsdaAccumLogDisc = prtScoreLogLoss(yOutAccumLogDisc);
fprintf('Without post-Log-Disc: %.3f; With: %.3f\n',logLossPlsdaAccum,logLossPlsdaAccumLogDisc);
```

Without post-Log-Disc: 0.479; With: 0.461

That’s a slight improvement, too!

Let’s put everything together, and see what happens:

First, pick the right # of features based on our big experiment above:

```
[minVal,nFeatures] = min(mean(logLossClassifierFeatSel'));
dsTrainTemp = dsTrainRemove.retainFeatures(sfs.selectedFeatures(1:nFeatures));
dsTestTemp = dsTestRemove.retainFeatures(sfs.selectedFeatures(1:nFeatures));
```

Now, train a classifier, and a logistic discriminant:

```
classifier = prtClassPlsda('nComponents',17) + prtClassLogisticDiscriminant;
classifier = classifier.train(dsTrainRemove);
yOutPlsdaKfolds = classifier.kfolds(dsTrainRemove,3);
yOutAccum = prtUtilAccumDataSetLike(writerIds,yOutPlsdaKfolds,@(x)mean(x));
yOutPostLogDisc = kfolds(prtClassLogisticDiscriminant,yOutAccum,3);
postLogDisc = train(prtClassLogisticDiscriminant,yOutAccum);
logLossPlsdaEstimate = prtScoreLogLoss(yOutPostLogDisc);
```

And run the same classifier, followed by the post-processing logistic discriminant:

```
yOut = classifier.run(dsTestRemove);
[xOutAccumSplit,uLike] = prtUtilAccumDataSetLike(writerIdsTest,yOut,@(x)mean(x));
dsTestPost = prtDataSetClass(xOutAccumSplit);
yOutPost = postLogDisc.run(dsTestPost);
matrixOut = cat(2,uLike,yOutPost.X);
csvwrite('outputPlsda_FeatSelPostProc.csv',matrixOut);
```

We submitted this version to Kaggle also. The results are shown below:

```
imshow('leaderBoard_2013_03_21.PNG');
```

That bumped us up by just a little bit in terms of overall log-loss, but quite a bit in the leader-list!

A lot of people are still doing way better than this blog entry (and have gotten better since a week ago, when we did this analysis!), but that’s not bad performance for what turns out to be about 20 lines of code, don’t you think?

If you have any success with the text/gender analysis data using the PRT, let us know – either post on the Kaggle boards, or here, or drop us an e-mail.

]]>In general, the term “mixture model” implies that each observation of data has an associated hidden variable, and that the observation is drawn from a distribution that depends on that hidden variable. In general statistics, a mixture model can have either continuous or discrete hidden variables. In the PRT, our prtRvMixture only considers discrete mixtures. That is, there is a fixed number of “components”, each with a mixing proportion, and each component is itself a parameterized distribution, like a Gaussian.

The most common mixture is the Gaussian mixture model (GMM). A Gaussian mixture model with K components has a K-dimensional discrete distribution for the mixing variable and has K individual Gaussian components. Today’s post will focus on working with Gaussian mixture models.

prtRvMixture has two main properties of interest: “components” and “mixingProportions”.

components should be an array of prtRvs that also inherit from prtRvMembershipModel. Without getting too in-depth, prtRvMembershipModel is a special attribute of some prtRvs that specifies that the RV knows how to work inside of a mixture model. As we mentioned before, we are focusing on mixtures of prtRvMvn objects to make a Gaussian mixture model. Luckily, prtRvMvn inherits from prtRvMembershipModel and therefore knows how to work in a mixture.

mixingProportions is the discrete mixing density for the mixture model. It should be a vector that sums to one, with the same length as the “components” array.
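To make the role of mixingProportions concrete, the density a mixture represents is just a weighted sum of its component densities. Here is a hand-rolled sketch using mvnpdf from the Statistics Toolbox (independent of the PRT); the means, covariance, and evaluation point are illustrative choices, not values the PRT requires:

```
% Evaluate a two-component 2-D Gaussian mixture density by hand:
%   p(x) = pi_1 * N(x; mu_1, Sigma_1) + pi_2 * N(x; mu_2, Sigma_2)
mixingProportions = [0.5 0.5];   % must sum to one
mu = {[-2 -2], [2 2]};           % one mean per component
Sigma = [1 -0.5; -0.5 2];        % shared covariance, for simplicity

x = [0 0];                       % point at which to evaluate the pdf
p = 0;
for k = 1:2
    p = p + mixingProportions(k)*mvnpdf(x,mu{k},Sigma);
end
```

This is exactly what prtRvMixture's plotPdf visualizes over a grid of points.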

To get started let’s make an array of 2D MVN RV objects with different means and a non-diagonal covariance matrix.

```
gaussianSet1 = repmat(prtRvMvn('sigma',[1 -0.5; -0.5 2;]),2,1);
gaussianSet1(1).mu = [-2 -2];
gaussianSet1(2).mu = [2 2];
```

Then we can make a mixture by specifying some mixingProportions:

```
mixture1 = prtRvMixture('components',gaussianSet1,'mixingProportions',[0.5 0.5]);
```

Because prtRvMixtures are prtRvs, we get all of the nice plotting that comes along with them. Let’s take a look at the density of our prtRv:

```
plotPdf(mixture1);
```

To show how we can do classification with these mixtures, let’s make another mixture with different parameters. Then we will draw some data from both mixtures and plot our classification dataset.

```
gaussianSet2 = repmat(prtRvMvn('sigma',[1 0.5; 0.5 3;]),2,1);
gaussianSet2(1).mu = [2 -2];
gaussianSet2(2).mu = [-2 2];
mixture2 = prtRvMixture('components',gaussianSet2,'mixingProportions',[0.5 0.5]);
```

```
ds = prtDataSetClass( cat(1,mixture1.draw(500),mixture2.draw(500)), prtUtilY(500,500)); % Draw 500 samples from each mixture
plot(ds)
```

Like we showed in part 1 of this series, we can set the “rvs” property of prtClassMap to any prtRv object and use that rv for classification. Let’s force prtClassMap to use a mixture of prtRvMvn objects.

```
emptyMixture = prtRvMixture('components',repmat(prtRvMvn,2,1)); % 2 component mixture
classifier = prtClassMap('rvs',emptyMixture);

trainedClassifier = train(classifier, ds);

plot(trainedClassifier);
```

As you can see, it looks like this classifier would perform quite well. We can see that the learned means of the class 0 data (blue) closely match the means that we specified above for gaussianSet1. So things appear to be working well.

```
cat(1,trainedClassifier.rvs(1).components.mu)

ans =

   -1.9712   -2.0488
    2.0139    1.9548
```

Since Gaussian mixture models are the most common type of mixture, a number of techniques have been established that help them perform better when working with limited and/or high dimensional data. To help facilitate some of those tweaks there is prtRvGmm. It works much the same way as prtRvMixture, only the components must be prtRvMvns.

```
mixture1Gmm = prtRvGmm('components',gaussianSet1,'mixingProportions',[0.5 0.5]);
plotPdf(mixture1Gmm);
```

One of the available tweaks is that the covariance structure of all components is controlled by a single parameter, “covarianceStructure”. See the documentation for prtRvMvn to learn how this works. Let’s see how changing the covarianceStructure of all of our components changes the appearance of our density.

```
mixture1GmmMod = mixture1Gmm;
mixture1GmmMod.covarianceStructure = 'spherical'; % Force independence with a shared variance.
plotPdf(mixture1GmmMod);
```

Using prtRvGmm for classification is a little easier than prtRvMixture because we only need to specify the number of components. We don’t have to build the array of components ourselves.

Let’s redo the same problem as before.

```
classifier = prtClassMap('rvs',prtRvGmm('nComponents',2));
trainedClassifier = train(classifier,ds);
plot(trainedClassifier);
```

As you can see, things look nearly identical (as they should). Now, let’s make use of a few of the extra tweaks offered by prtRvGmm and see how they change our decision contours.

```
classifier = prtClassMap('rvs',prtRvGmm('nComponents',2,'covarianceStructure','spherical','covariancePool',true));
trainedClassifier = train(classifier,ds);
plot(trainedClassifier);
```

You can see that the decision contours are more regular now. This may help classification performance in the presence of limited and/or high dimensional data.

We hope this post showed how you can use mixture models just like any other prtRv object when doing classification using prtClassMap.

Chances are that you want to use prtRvGmm for your mixture modeling needs, but as you might guess, prtRvMixture is much more general, allowing you to make mixture models out of general prtRvs. However, at this time the only prtRv that can be used is prtRvMvn. We are interested in making more prtRvs compatible, but we want to know what people want to use. Let us know if you need something specific.

In the next part of this series we will look at prtRvHmm and how it can be used for time-series classification.

]]>The PRT comes with a utility function to generate data that has about 10 features, and for which only 5 of the features are actually informative. The function is prtDataGenFeatureSelection, I’ll let the help entry explain how it works:

```
help prtDataGenFeatureSelection
```

```
 prtDataGenFeatureSelection  Generates some unimodal example data for the prt.

   DataSet = prtDataGenFeatureSelection

   The data is distributed:
       H0: N([0 0 0 0 0 0 0 0 0 0],eye(10))
       H1: N([0 2 0 1 0 2 0 1 0 2],eye(10))

   Syntax: [X, Y] = prtDataGenFeatureSelection(N)

   Inputs:
       N ~ number of samples per class (200)

   Outputs:
       X - 2Nx2 Unimodal data
       Y - 2Nx1 Class labels

   Example:
       DataSet = prtDataGenFeatureSelection;
       explore(DataSet)

   Other m-files required: none
   Subfunctions: none
   MAT-files required: none

   See also: prtDataGenUnimodal
```

As you can see from the help, only dimensions 2, 4, 6, 8, and 10 are actually informative in this data set. And we can use feature selection to help us pick out what features are actually useful.

Feature selection objects are prefaced in the PRT with prtFeatSel*, and they act just like any other prtAction objects. During training, a feature selection action will typically perform some iterative search over the feature space and determine which features are most informative. At run-time, the same object will return a prtDataSet containing only the features the algorithm considered “informative”.

For example:

```
ds = prtDataGenFeatureSelection;
featSel = prtFeatSelSfs;
featSel = featSel.train(ds);
dsFeatSel = featSel.run(ds);
plot(dsFeatSel);
title('Three Most Informative Features');
```

How does the feature selection algorithm determine what features are informative? For many (but not necessarily all) feature selection objects, the interesting field is “evaluationMetric”.

Let’s take a look:

```
featSel.evaluationMetric

ans =

    @(DS)prtEvalAuc(prtClassFld,DS)
```

Obviously, evaluationMetric is a function handle – in particular it represents a call to a prtEval* method. prtEval* methods typically take 2 or 3 input arguments – a classifier to train and run, a data set to train and run on, and (optionally) a number of folds (or fold specification) to use for cross-validation.

Feature selection objects iteratively search through the features available – in this case, all 10 of them, and apply the prtEval* method to the sub-sets of data formed by retaining a sub-set of the available features. The exact order in which the features are retained and removed depends on the feature selection approach – in SFS, the algorithm first iteratively searches through the features – 1,2,3…,10. Then it remembers which single feature provided the best performance – say it was feature 2. Next, the SFS algorithm iteratively searches through all 9 combinations of other features with feature 2: { {1,2},{3,2},{4,2},…,{10,2}} And remembers which of **those** performed best. This process is iterated, and features continually added to the set being evaluated until nFeatures are selected.
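The greedy search described above can be sketched in a few lines of plain MATLAB. This is an illustrative re-implementation, not the PRT’s prtFeatSelSfs source; evalFun stands in for the evaluationMetric function handle, and nFeaturesToSelect for the nFeatures property:

```
% Sketch of sequential forward selection (SFS).
% evalFun(ds) should return a scalar score to maximize (e.g., AUC).
selected = [];
remaining = 1:ds.nFeatures;
for iFeat = 1:nFeaturesToSelect
    score = -inf(1,length(remaining));
    for iCand = 1:length(remaining)
        trySet = [selected remaining(iCand)];
        score(iCand) = evalFun(ds.retainFeatures(trySet));
    end
    [~,best] = max(score);                 % keep the single best addition...
    selected = [selected remaining(best)];
    remaining(best) = [];                  % ...and never revisit it
end
```

Note the greedy nature: once a feature is added it is never removed, which is why SFS is fast but not guaranteed to find the globally best subset.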

The resulting performance is then stored in “performance”, and the features selected are stored in “selectedFeatures”. Let’s force the SFS approach to look for 10 features (so it will eventually select all of them).

```
ds = prtDataGenFeatureSelection;
featSel = prtFeatSelSfs;
featSel.nFeatures = ds.nFeatures;
featSel.evaluationMetric = @(DS)prtEvalAuc(prtClassFld,DS,3);
featSel = featSel.train(ds);

h = plot(1:ds.nFeatures,featSel.performance);
set(h,'linewidth',3);
set(gca,'xtick',1:ds.nFeatures);
set(gca,'xticklabel',featSel.selectedFeatures);
xlabel('Features Selected');
title('AUC vs. Features Selected');
```

The features that get selected tend to favor features 2, 6, and 10, then features 4 and 8, which makes sense, as these are the three most informative features followed by the two moderately informative ones!

There are a number of prtFeatSel* actions available, but not as many as we’d like. We’re constantly on the look-out for new ones, and we’d like to one day include “K-forward, L-Backward” searches, but just haven’t had the time recently.*

*Also, this example only used prtEvalAuc as the performance metric, but there are a number of prtEval* functions you can use, or, of course – feel free to write your own!

Take a look at prtEvalAuc to see how they work and how to create your own!
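Writing your own metric really is just a matter of building a function handle that maps a data set to a scalar score. For instance, here is a hedged sketch that swaps in a cross-validated log-loss metric; since SFS maximizes its evaluationMetric and log-loss should be minimized, we negate it (the same sign convention used with prtEvalLogLoss earlier in this post):

```
% A custom evaluation metric: negated 3-fold cross-validated log-loss
% of a PLSDA classifier. Higher is better, so negate the loss.
myMetric = @(DS)-prtEvalLogLoss(prtClassPlsda,DS,3);

featSel = prtFeatSelSfs;
featSel.evaluationMetric = myMetric;
featSel = featSel.train(prtDataGenFeatureSelection);
```

Any function with that same signature, including one of your own m-files, can be dropped in the same way.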

Enjoy!

]]>If you have already read through our previous post, you know how to get the Microsoft Research Cambridge Object Recognition Image Database (MSRCORID), which is really a fantastic resource for image processing and classification.

Once you’ve downloaded it, we can run the following code, which was for the most part ripped right out of the previous blog post:

```
ds = prtDataGenMsrcorid;

patchSize = [8 8];
col = [];
for imgInd = 1:ds.nObservations
    img = ds.X{imgInd};
    img = rgb2gray(img);
    img = imresize(img,.5);
    col = cat(1,col,im2col(img,patchSize,'distinct')');
end
dsCol = prtDataSetClass(double(col));

preProc = prtPreProcZeroMeanRows + prtPreProcStdNormalizeRows('varianceOffset',10) + prtPreProcZca;
preProc = preProc.train(dsCol);
dsNorm = preProc.run(dsCol);

skm = prtClusterSphericalKmeans('nClusters',50);
skm = skm.train(dsNorm);
```

Last time, we used a simple bag-of-words model to do classification based on the feature vectors in each image. That’s definitely an interesting way to proceed, but most image-processing techniques make use of something called “max-pooling” to aggregate feature vectors over small regions of an image.

The process can be accomplished in MATLAB using blockproc.m, which is in the Image Processing Toolbox. (If you don’t have the Image Processing Toolbox, it’s not too hard to write a replacement for blockproc.)
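If you do need such a replacement, the distinct-block case is only a couple of loops. This is our own sketch (myBlockProc is a hypothetical name, not a toolbox function), and it assumes the block size evenly divides the image and that fun returns a 1 x 1 x nFeats result, as in the blockproc calls below:

```
function out = myBlockProc(img,blockSize,fun)
% Minimal stand-in for blockproc with distinct (non-overlapping) blocks.
% fun receives a struct with a .data field, mirroring blockproc's interface.
nRows = size(img,1)/blockSize(1);
nCols = size(img,2)/blockSize(2);
out = [];
for r = 1:nRows
    for c = 1:nCols
        block.data = img((r-1)*blockSize(1)+1 : r*blockSize(1), ...
                         (c-1)*blockSize(2)+1 : c*blockSize(2), :);
        out(r,c,:) = fun(block);
    end
end
end
```

Unlike the real blockproc, this sketch does no edge padding, so ragged borders are silently dropped by the integer division.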

The goal of max-pooling is to aggregate feature vectors over local regions of an image. For example, we can take the MAX of the cluster memberships over each 8x8 region in an image using something like:

```
featsBp = blockproc(feats,[8 8],@(x)max(max(x.data,[],1),[],2));
```

Where we’ve assumed that feats is size nx x ny x nFeats.

Max pooling is nice because it reduces the dependency of the feature vectors on their exact placement in an image (each element of each 8x8 block gets treated about the same), and it also maintains a lot of the information that was in each of the feature vectors, especially when the feature vectors are expected to be sparse (e.g., have a lot of zeros; see http://www.ece.duke.edu/~lcarin/Bo12.3.2010.ppt).

There’s a lot more to max-pooling than we have time to get into here, for example, you can max-pool, and then re-cluster, and then re-max-pool! This is actually a super clever technique to reduce the amount of spatial variation in your image, and also capture information about the relative placements of various objects.

```
featVec = nan(ds.nObservations,skm.nClusters*20);
clusters = skm.run(dsNorm);
for imgInd = 1:ds.nObservations
    img = ds.X{imgInd};
    img = rgb2gray(img);
    imgSize = size(img);

    % Extract the sub-patches
    col = im2col(img,patchSize,'distinct');
    col = double(col);
    dsCol = prtDataSetClass(col');
    dsCol = run(preProc,dsCol);
    dsFeat = skm.run(dsCol);
    dsFeat.X = max(dsFeat.X,.05); % Max Pool!

    % Feats will be size 30 x 40 x nClusters
    % featsBp will be size [4 x 5] x nClusters (because of the way
    % blockproc handles edges)
    feats = reshape(dsFeat.X,imgSize(1)/8,imgSize(2)/8,[]);
    featsBp = blockproc(feats,[8 8],@(x)max(max(x.data,[],1),[],2));

    % We'll cheat a little here, and use the whole max-pooled feature set
    % as our feature vector. Instead, we might want to re-cluster, and
    % re-max-pool, and repeat this process a few times. For now, we'll
    % keep it simple:
    featVec(imgInd,:) = featsBp(:);
end
```

Now that we’ve max-pooled, we can use our extracted features for classification - we’ll use a simple PLSDA + MAP classifier and decision algorithm here:

```
dsFeat = prtDataSetClass(featVec,ds.targets);
dsFeat.classNames = ds.classNames;
yOut = kfolds(prtClassPlsda + prtDecisionMap,dsFeat,3);
close all;
prtScoreConfusionMatrix(yOut)
```

Almost 99% correct! We’ve improved performance over our previous work with bag-of-words models, and an SVM, by just (1) max-pooling, and (2) replacing the SVM with a PLSDA classifier.

Until now we’ve focused on just two classes in MSRCORID. But there are a lot of types of objects in the MSRCORID database. In the following, we just repeat a bunch of the code from above, and run it on a data set containing images of benches, buildings, cars, chimneys, clouds and doors:

```
ds = prtDataGenMsrcorid({'benches_and_chairs','buildings','cars\front view','cars\rear view','cars\side view','chimneys','clouds','doors'});

patchSize = [8 8];
col = [];
for imgInd = 1:ds.nObservations
    img = ds.X{imgInd};
    img = rgb2gray(img);
    img = imresize(img,.5);
    col = cat(1,col,im2col(img,patchSize,'distinct')');
end
dsCol = prtDataSetClass(double(col));
```

```
preProc = prtPreProcZeroMeanRows + prtPreProcStdNormalizeRows('varianceOffset',10) + prtPreProcZca;
preProc = preProc.train(dsCol);
dsNorm = preProc.run(dsCol);

skm = prtClusterSphericalKmeans('nClusters',50);
skm = skm.train(dsNorm);

featVec = nan(ds.nObservations,skm.nClusters*20);
clusters = skm.run(dsNorm);
```

```
for imgInd = 1:ds.nObservations
    img = ds.X{imgInd};
    img = rgb2gray(img);
    imgSize = size(img);

    % Extract the sub-patches
    col = im2col(img,patchSize,'distinct');
    col = double(col);
    dsCol = prtDataSetClass(col');
    dsCol = run(preProc,dsCol);
    dsFeat = skm.run(dsCol);
    dsFeat.X = max(dsFeat.X,.05); % Max Pool!

    % Feats will be size 30 x 40 x nClusters
    % featsBp will be size [4 x 5] x nClusters (because of the way
    % blockproc handles edges)
    feats = reshape(dsFeat.X,imgSize(1)/8,imgSize(2)/8,[]);
    featsBp = blockproc(feats,[8 8],@(x)max(max(x.data,[],1),[],2));

    % We'll cheat a little here, and use the whole max-pooled feature set
    % as our feature vector. Instead, we might want to re-cluster, and
    % re-max-pool, and repeat this process a few times. For now, we'll
    % keep it simple:
    featVec(imgInd,:) = featsBp(:);
end
```

```
dsFeat = prtDataSetClass(featVec,ds.targets);
dsFeat.classNames = ds.classNames;

yOut = kfolds(prtClassPlsda('nComponents',10) + prtDecisionMap,dsFeat,3);
yOut.classNames = cellfun(@(s)s(1:min([length(s),10])),yOut.classNames,'uniformoutput',false);
close all;
prtScoreConfusionMatrix(yOut);
set(gcf,'position',[426 125 777 558]);
```

Now we’re doing some image processing! Overall we got about 90% correct, and that includes a lot of confusions between cars\front and cars\rear. That makes sense since the front and backs of cars look pretty similar, and there are only 23 car front examples in the whole data set.

The code in a lot of this blog entry is pretty gross – for example we have to constantly be taking data out of, and putting it back into the appropriate image sizes.

At some point in the future, we’d like to introduce a good prtDataSet that will handle cell-arrays containing images properly. We’re not there yet, but when we are, we’ll let you know on this blog!

Happy coding!

Adam Coates and Andrew Y. Ng, Learning Feature Representations with K-means, G. Montavon, G. B. Orr, K.-R. Muller (Eds.), Neural Networks: Tricks of the Trade, 2nd edn, Springer LNCS 7700, 2012

]]>One of the reasons that we like to use prtPath in a startup file is that prtPath does more than just add the PRT and all of its subfolders to your path.

prtPath is actually very basic. It calls the function prtRoot and uses the MATLAB function genpath() to get a list of all of the subdirectories in the PRT folder. It then eventually calls addpath() to add that list of directories to the MATLAB path (it does not save the path for future sessions of MATLAB).

Before prtPath() calls addpath() though, it selectively removes some directories from the subdirectory list.

First it removes any directory that starts with a “.”. This was added to prevent any hidden folders (like those from source control systems) from showing up in the MATLAB path.

More importantly though, it removes any folders from the list that start with a “]”. This is something special that we put in to add some extra functionality to the PRT.

Most of our users want to stick to code that is well tested and is known to behave nicely. But as we go about our jobs and use the PRT we need to add some new functionality. We typically add things like: new classifiers, new datatypes or new pre processing techniques.

Some of our users want access to this newest code so it gets added to the PRT in the “]alpha” and eventually the “]beta” folder. By default prtPath will not include these folders in the path. Instead you have to tell prtPath that you are willing to accept the responsibilities of the bleeding edge. You do this by giving prtPath a list of “]” folders that you want to include. (Or rather not exclude).

For example:

```
prtPath('alpha', 'beta');
```

will add both the “]alpha” and “]beta” folders (and their subfolders) to the path.
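The filtering prtPath performs can be pictured in a few lines of plain MATLAB. This is a simplified sketch of the idea, not the actual prtPath source (in particular, the real function also accepts the “]” folder names you want to keep, as shown above):

```
% Sketch: build a path list the way prtPath does, dropping any directory
% whose path contains a folder starting with "." (hidden) or "]" (experimental).
pathList = strsplit(genpath(prtRoot),pathsep);
pathList = pathList(~cellfun(@isempty,pathList)); % genpath leaves a trailing separator
keep = true(size(pathList));
for iDir = 1:length(pathList)
    parts = strsplit(pathList{iDir},filesep);
    keep(iDir) = ~any(cellfun(@(s)~isempty(s) && any(s(1)=='.]'),parts));
end
addpath(strjoin(pathList(keep),pathsep));
```

Checking every folder along each path (not just the leaf) matters, because excluding “]alpha” should also exclude all of its subfolders.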

Currently in the PRT we have one other “]” folder, “]internal”. In the internal folder you will find some code for unit testing and documentation building. You probably won’t be interested in much that’s in there, so I probably wouldn’t clutter my path with it.

We were searching for a character that is a valid path (folder) name character on all major operating systems and is at the same time a character that most people wouldn’t start a directory name with. MATLAB already uses “@” for (old style) class definitions and “+” for packages. We thought “]” fit all of these criteria.

We hope that cleared up a little of why we recommend prtPath over pathtool(), at least for the PRT. In general, just call prtPath() by itself, but if you want to see what might lie ahead for the PRT, check out the ]alpha and ]beta folders. In some future posts we will talk about some things in these folders that might be of interest to you. Maybe that will entice you to explore the ].

]]>