In this notebook we're going to show two nice applications of Julia: a multithreaded sorting algorithm (merge sort) and a machine-learning classification project that highlights how easy Julia makes this kind of workflow.
Have you forgotten how to set and use threads? Let's refresh our memory together.
Threads.nthreads()
Right now we're using only one thread; that's completely serial! Let's change that. Keep in mind that the JULIA_NUM_THREADS environment variable has to be set before Julia (or the Jupyter kernel) starts; setting it from inside a running session has no effect. In a shell you would run:
export JULIA_NUM_THREADS=4
(On Julia 1.5 and later you can equivalently start Julia with julia --threads 4.)
And let's check it out
Threads.nthreads()
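As a quick sanity check (a small sketch added here, not part of the original notebook), we can record which thread each iteration of a threaded loop runs on; with 4 threads you should see more than one distinct id.
using Base.Threads                      # nthreads, threadid, @threads

ids = zeros(Int, 2 * nthreads())        # one slot per loop iteration
@threads for i in eachindex(ids)
    ids[i] = threadid()                 # record the thread that handled this iteration
end
unique(ids)                             # more than one value means the loop really ran in parallel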
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi                          # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100_000                 # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end

    mid = (lo+hi)>>>1                    # find the midpoint

    half = @spawn psort!(v, lo, mid)     # task to sort the lower half; will run
    psort!(v, mid+1, hi)                 # in parallel with the current call sorting
                                         # the upper half

    wait(half)                           # wait for the lower half to finish

    temp = v[lo:mid]                     # workspace for merging

    i, k, j = 1, lo, mid+1               # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j                # copy any remaining elements of the lower half
        v[k] = temp[i]
        k += 1
        i += 1
    end

    return v
end
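Before timing anything, a quick correctness check (a small addition, not in the original): psort! should agree with the built-in serial sort.
check = rand(1_000_000)
psort!(copy(check)) == sort(check)   # should return true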
a = rand(20000000);
b = copy(a); @time sort!(b, alg = MergeSort); # single-threaded
b = copy(a); @time psort!(b); # multithreaded (the first call also includes compilation time)
b = copy(a); @time psort!(b); # multithreaded, already compiled
Put simply, classification is the task of predicting a label for a given observation. For example: given certain physical measurements of an animal, your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.
We will use several different classifiers and compare them at the end of this notebook. To get it out of the way, we define our accuracy function right now: a simple ratio of the number of correctly classified observations to the total number of predictions.
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)
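For example, with a made-up toy input where 3 of 4 predictions match the ground truth, the function returns 0.75:
findaccuracy([1, 2, 3, 3], [1, 2, 3, 1])   # 3 correct out of 4 predictions -> 0.75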
import Pkg; # install the necessary packages
Pkg.add("GLMNet")
Pkg.add("RDatasets")
Pkg.add("MLBase")
Pkg.add("Plots")
Pkg.add("DecisionTree")
Pkg.add("Distances")
Pkg.add("NearestNeighbors")
Pkg.add("LinearAlgebra")
Pkg.add("DataStructures")
Pkg.add("LIBSVM")
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM
We're going to use the iris dataset, which contains measurements of flowers (sepal length, sepal width, petal length and petal width) along with the species of each specimen.
iris = dataset("datasets", "iris")
As you can see, the first 4 columns of our dataset (ignoring the row numbers shown when the DataFrame is displayed) are the features of the flowers and the last one is the species. Now we're going to set up our data by storing the feature values in X and the species values in irislabels.
X = Matrix(iris[:,1:4])
irislabels = iris[:,5]
X
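A quick look at the shapes and labels confirms the setup (an added check): 150 observations, 4 features, 3 species.
size(X)              # (150, 4)
unique(irislabels)   # setosa, versicolor, virginica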
For practical reasons we're going to convert our labels to a discrete numeric type, assigning each species a number (1-3) in place of the names (setosa, versicolor and virginica). This can be done with the following functions:
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)
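If you ever need to go back from the integer codes to the species names (for example when reporting predictions), MLBase's labeldecode reverses the encoding; a small round-trip sketch:
labeldecode(irislabelsmap, y[1:3])   # recovers the original species labels for the first three rows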
In classification, we often want to use some of the data to fit a model and the rest of the data to validate it (commonly known as training and testing data). We will get this data ready now so that we can easily use it in the rest of this notebook.
function perclass_splits(y, at)
    uids = unique(y)
    keepids = []
    for ui in uids
        curids = findall(y .== ui)          # rows belonging to this class
        rowids = randsubseq(curids, at)     # keep each row independently with probability `at`
        push!(keepids, rowids...)
    end
    return keepids
end
?randsubseq
trainids = perclass_splits(y,0.7)
testids = setdiff(1:length(y),trainids)
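As a quick sanity check (added here, not in the original notebook), the split should contain roughly 70% of the rows, with every class represented in the training set:
length(trainids), length(testids)                  # roughly 105 vs. 45 of the 150 rows
[count(==(c), y[trainids]) for c in unique(y)]     # training rows per class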
We will need one more helper: a function that assigns a class to a continuous predicted value by picking whichever of the labels 1, 2, or 3 it is closest to.
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1,2,3]))
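For instance (an illustrative call): a continuous prediction of 2.3 is closest to label 2, so it gets assigned class 2.
assign_class(2.3)   # returns 2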
The first method is lasso regression with GLMNet (alpha = 1, the default). We fit the regularization path, run cross-validation, and pick the lambda with the lowest mean loss to predict with.
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],lambda=[mylambda]);
q = X[testids,:];
predictions_lasso = GLMNet.predict(path,q)
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso,y[testids])
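Accuracy alone hides which classes get confused with which. Since MLBase is already loaded, we can also look at a confusion matrix (an extra check added here, assuming the 3 integer-coded classes):
confusmat(3, y[testids], predictions_lasso)   # 3x3 matrix: rows are ground-truth classes, columns are predicted classes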
For ridge regression we will use the same function but set alpha to zero.
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids])
For elastic net we will use the same function but set alpha to 0.5 (a combination of the lasso and ridge penalties).
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids])
For a decision tree classifier we will use the DecisionTree package.
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT,y[testids])
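If you also want class probabilities rather than hard labels, DecisionTree's ScikitLearn-style API exposes predict_proba; this is an optional extra, not part of the original flow:
DecisionTree.predict_proba(model, q)   # one row per test observation, one column of probability per class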
The RandomForestClassifier is available through the DecisionTree package as well.
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])
For the kNN classifier we will use the NearestNeighbors package here.
Xtrain = X[trainids,:]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')                                   # NearestNeighbors expects observations as columns, hence the transpose
queries = X[testids,:]
idxs, dists = knn(kdtree, queries', 5, true)               # indices and distances of the 5 nearest training points per query
c = ytrain[hcat(idxs...)]                                  # labels of those neighbors, as a 5 x n matrix
possible_labels = map(i->counter(c[:,i]),1:size(c,2))      # tally the neighbor votes for each query point
predictions_NN = map(i->parse(Int,string(argmax(DataFrame(possible_labels[i])[1,:]))),1:size(c,2))   # pick the label with the most votes
findaccuracy(predictions_NN,y[testids])
We will use the LIBSVM package here.
Xtrain = X[trainids,:]
ytrain = y[trainids]
model = svmtrain(Xtrain', ytrain)                          # LIBSVM also expects features as columns
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])
Putting all the results together:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)
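Since Plots is already loaded, we can also visualize the scoreboard; a minimal sketch (the styling keywords are just suggestions):
bar(methods, overall_accuracies, legend = false, ylabel = "accuracy", title = "Accuracy per classifier")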
We used multiple methods to run classification on the iris dataset, which contains three species of iris flowers. We split the data into training and testing sets and ran each method. Here is the scoreboard:
method | accuracy score |
---|---|
lasso | 1.0 |
ridge | 1.0 |
EN | 1.0 |
DT | 0.960784 |
RF | 0.980392 |
kNN | 1.0 |
SVM | 1.0 |