Julia's applications

In this notebook we're going to show two nice applications of Julia: a multi-threaded sorting algorithm (merge sort) and a machine-learning classification project that shows how easy it is to build one in Julia.

Merge sort

Have you forgotten how to set up and use threads? Let's remember together.

In [40]:
Threads.nthreads()
Out[40]:
1

Now we see that we're using only one thread; that is so serial! Let's change it a little bit. The thread count is controlled by the JULIA_NUM_THREADS environment variable, which has to be set before Julia (or the Jupyter kernel) starts, for example from the shell:

In [71]:
export JULIA_NUM_THREADS=4

And let's check it out. Since the variable only takes effect for a freshly started Julia process, the already-running kernel still reports a single thread:

In [72]:
Threads.nthreads()
Out[72]:
1
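
To actually get the extra threads, the variable has to be in place before the kernel starts (or, on Julia 1.5 and newer, you can pass a --threads flag). A minimal sketch, assuming a Unix shell and a standard Jupyter setup:

# In a terminal, before launching the notebook server:
#     JULIA_NUM_THREADS=4 jupyter notebook
# or, for a plain REPL on Julia 1.5+:
#     julia --threads 4
# Then, in the fresh session:
Threads.nthreads()   # should now report 4
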
In [73]:
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi                       # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100000               # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end
    mid = (lo+hi)>>>1               # find the midpoint
    half = @spawn psort!(v, lo, mid)# task to sort the lower half; will run
    psort!(v, mid+1, hi)            # in parallel with the current call sorting
                                    # the upper half
    wait(half)                      # wait for the lower half to finish
    temp = v[lo:mid]                  # workspace for merging
    i, k, j = 1, lo, mid+1            # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j
        v[k] = temp[i]
        k += 1
        i += 1
    end
    return v
end
Out[73]:
psort! (generic function with 3 methods)
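
Before timing it on 20 million elements, a quick sanity check on a small throwaway vector (just a sketch; x is introduced only for this check):

x = rand(1:100, 1_000)
psort!(copy(x)) == sort(x)   # should return true
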
In [74]:
a = rand(20000000);
b = copy(a); @time sort!(b, alg = MergeSort);   # single-threaded
  2.605274 seconds (3 allocations: 76.294 MiB)
In [60]:
b = copy(a); @time psort!(b);    # parallel merge sort (only 1 thread available in this session)
  2.989460 seconds (3.34 k allocations: 686.956 MiB, 1.79% gc time)
In [61]:
b = copy(a); @time psort!(b);    # parallel merge sort again; this run is dominated by garbage collection
 14.794936 seconds (3.34 k allocations: 686.956 MiB, 81.05% gc time)

Since this kernel is still running with a single thread, psort! cannot actually sort the two halves in parallel; it only adds task and allocation overhead (note the ~687 MiB of temporaries and, in the second run, 81% of the time spent in garbage collection), which is why it is slower than the serial sort! here. Launched with JULIA_NUM_THREADS set, the same code would spread the recursive calls across threads.

Classification

Put simply, classification is the task of predicting a label for a given observation. For example: you are given certain physical descriptions of an animal, and your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.

We will use several different classifiers and compare them at the end of this notebook. To get it out of the way, we define our accuracy function right now: a simple function that returns the ratio of correctly classified observations to the total number of predictions.

In [1]:
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)
Out[1]:
findaccuracy (generic function with 1 method)
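
For instance, with three out of four predictions correct this returns 0.75 (a throwaway example, not part of the iris workflow):

findaccuracy([1, 2, 3, 3], [1, 2, 3, 2])   # 3 of 4 labels match -> 0.75
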
In [3]:
import Pkg; #Downloading necessary packages
Pkg.add("GLMNet")
Pkg.add("RDatasets")
Pkg.add("MLBase")
Pkg.add("Plots")
Pkg.add("DecisionTree")
Pkg.add("Distances")
Pkg.add("NearestNeighbors")
Pkg.add("LinearAlgebra")
Pkg.add("DataStructures")
Pkg.add("LIBSVM")
   Updating registry at `C:\Users\oscam\.julia\registries\General`

   Updating git-repo `https://github.com/JuliaRegistries/General.git`
    Fetching: [========================================>]  100.0 %
  Resolving package versions...
  Installed Rmath_jll ──────────────────── v0.2.2+0
  Installed Rmath ──────────────────────── v0.6.1
  Installed ZeroMQ_jll ─────────────────── v4.3.2+4
  Installed Arpack ─────────────────────── v0.4.0
  Installed GLMNet ─────────────────────── v0.5.1
  Installed CompilerSupportLibraries_jll ─ v0.3.3+0
  Installed OpenSpecFun_jll ────────────── v0.5.3+3
  Installed StatsBase ──────────────────── v0.33.0
  Installed PDMats ─────────────────────── v0.9.12
  Installed FillArrays ─────────────────── v0.8.10
  Installed QuadGK ─────────────────────── v2.3.1
  Installed Arpack_jll ─────────────────── v3.5.0+3
  Installed glmnet_jll ─────────────────── v5.0.0+0
  Installed StatsFuns ──────────────────── v0.9.5
  Installed SpecialFunctions ───────────── v0.10.3
  Installed Distributions ──────────────── v0.23.4
  Installed OpenBLAS_jll ───────────────── v0.3.9+4
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [8d5ece8b] + GLMNet v0.5.1
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [7d9fca2a] + Arpack v0.4.0
  [68821587] + Arpack_jll v3.5.0+3
  [e66e0078] + CompilerSupportLibraries_jll v0.3.3+0
  [31c24e10] + Distributions v0.23.4
  [1a297f60] + FillArrays v0.8.10
  [8d5ece8b] + GLMNet v0.5.1
  [4536629a] + OpenBLAS_jll v0.3.9+4
  [efe28fd5] + OpenSpecFun_jll v0.5.3+3
  [90014a1f] + PDMats v0.9.12
  [1fd47b50] + QuadGK v2.3.1
  [79098fc4] + Rmath v0.6.1
  [f50d1b31] + Rmath_jll v0.2.2+0
  [276daf66] + SpecialFunctions v0.10.3
  [2913bbd2] + StatsBase v0.33.0
  [4c63d2b9] + StatsFuns v0.9.5
  [8f1865be] ↑ ZeroMQ_jll v4.3.2+3 ⇒ v4.3.2+4
  [78c6b45d] + glmnet_jll v5.0.0+0
  [4607b0f0] + SuiteSparse 
  Resolving package versions...
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
 [no changes]
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
 [no changes]
  Resolving package versions...
  Installed IterTools ─ v1.3.0
  Installed MLBase ──── v0.8.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [f0e99cf1] + MLBase v0.8.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [c8e1da08] + IterTools v1.3.0
  [f0e99cf1] + MLBase v0.8.0
  Resolving package versions...
  Installed FFMPEG ──────────── v0.3.0
  Installed FreeType2_jll ───── v2.10.1+2
  Installed FFMPEG_jll ──────── v4.1.0+3
  Installed FixedPointNumbers ─ v0.8.0
  Installed PlotUtils ───────── v1.0.4
  Installed PlotThemes ──────── v2.0.0
  Installed Plots ───────────── v1.3.7
  Installed StaticArrays ────── v0.12.3
  Installed NaNMath ─────────── v0.3.3
  Installed libfdk_aac_jll ──── v0.1.6+2
  Installed Measures ────────── v0.3.1
  Installed x265_jll ────────── v3.0.0+1
  Installed LibVPX_jll ──────── v1.8.1+1
  Installed FriBidi_jll ─────── v1.0.5+3
  Installed Showoff ─────────── v0.3.1
  Installed Contour ─────────── v0.5.3
  Installed LAME_jll ────────── v3.100.0+1
  Installed OpenSSL_jll ─────── v1.1.1+2
  Installed IniFile ─────────── v0.5.0
  Installed RecipesPipeline ─── v0.1.10
  Installed HTTP ────────────── v0.8.15
  Installed Colors ──────────── v0.12.1
  Installed GeometryTypes ───── v0.8.3
  Installed libass_jll ──────── v0.14.0+2
  Installed x264_jll ────────── v2019.5.25+2
  Installed GR ──────────────── v0.50.1
  Installed Opus_jll ────────── v1.3.1+1
  Installed Ogg_jll ─────────── v1.3.4+0
  Installed Bzip2_jll ───────── v1.0.6+2
  Installed ColorTypes ──────── v0.10.3
  Installed libvorbis_jll ───── v1.3.6+4
  Installed ColorSchemes ────── v3.9.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [91a5bcdd] + Plots v1.3.7
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [6e34b625] + Bzip2_jll v1.0.6+2
  [35d6a980] + ColorSchemes v3.9.0
  [3da002f7] + ColorTypes v0.10.3
  [5ae59095] + Colors v0.12.1
  [d38c429a] + Contour v0.5.3
  [c87230d0] + FFMPEG v0.3.0
  [b22a6f82] + FFMPEG_jll v4.1.0+3
  [53c48c17] + FixedPointNumbers v0.8.0
  [d7e528f0] + FreeType2_jll v2.10.1+2
  [559328eb] + FriBidi_jll v1.0.5+3
  [28b8d3ca] + GR v0.50.1
  [4d00f742] + GeometryTypes v0.8.3
  [cd3eb016] + HTTP v0.8.15
  [83e8ac13] + IniFile v0.5.0
  [c1c5ebd0] + LAME_jll v3.100.0+1
  [dd192d2f] + LibVPX_jll v1.8.1+1
  [442fdcdd] + Measures v0.3.1
  [77ba4419] + NaNMath v0.3.3
  [e7412a2a] + Ogg_jll v1.3.4+0
  [458c3c95] + OpenSSL_jll v1.1.1+2
  [91d4177d] + Opus_jll v1.3.1+1
  [ccf2f8ad] + PlotThemes v2.0.0
  [995b91a9] + PlotUtils v1.0.4
  [91a5bcdd] + Plots v1.3.7
  [01d81517] + RecipesPipeline v0.1.10
  [992d4aef] + Showoff v0.3.1
  [90137ffa] + StaticArrays v0.12.3
  [0ac62f75] + libass_jll v0.14.0+2
  [f638f0a6] + libfdk_aac_jll v0.1.6+2
  [f27f6e37] + libvorbis_jll v1.3.6+4
  [1270edf5] + x264_jll v2019.5.25+2
  [dfaa095f] + x265_jll v3.0.0+1
   Building GR ───→ `C:\Users\oscam\.julia\packages\GR\Atztx\deps\build.log`
   Building Plots → `C:\Users\oscam\.julia\packages\Plots\Xnzc7\deps\build.log`
  Resolving package versions...
  Installed ScikitLearnBase ─ v0.5.0
  Installed DecisionTree ──── v0.10.2
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [7806a523] + DecisionTree v0.10.2
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [7806a523] + DecisionTree v0.10.2
  [6e75b9c4] + ScikitLearnBase v0.5.0
  Resolving package versions...
  Installed Distances ─ v0.9.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [b4f34e82] + Distances v0.9.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [b4f34e82] + Distances v0.9.0
  Resolving package versions...
  Installed NearestNeighbors ─ v0.4.5
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [b8a86587] + NearestNeighbors v0.4.5
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [b8a86587] + NearestNeighbors v0.4.5
  Resolving package versions...
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [37e2e46d] + LinearAlgebra 
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
 [no changes]
  Resolving package versions...
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [864edb3b] + DataStructures v0.17.17
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
 [no changes]
  Resolving package versions...
  Installed LIBSVM ──── v0.4.0
  Installed LIBLINEAR ─ v0.5.1
   Updating `C:\Users\oscam\.julia\environments\v1.4\Project.toml`
  [b1bec4e5] + LIBSVM v0.4.0
   Updating `C:\Users\oscam\.julia\environments\v1.4\Manifest.toml`
  [2d691ee1] + LIBLINEAR v0.5.1
  [b1bec4e5] + LIBSVM v0.4.0
   Building LIBLINEAR → `C:\Users\oscam\.julia\packages\LIBLINEAR\yTdp5\deps\build.log`
   Building LIBSVM ───→ `C:\Users\oscam\.julia\packages\LIBSVM\5Z99T\deps\build.log`
In [4]:
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM
┌ Info: Precompiling GLMNet [8d5ece8b-de18-5317-b113-243142960cc6]
└ @ Base loading.jl:1260
┌ Info: Precompiling MLBase [f0e99cf1-93fa-52ec-9ecc-5026115318e0]
└ @ Base loading.jl:1260
┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80]
└ @ Base loading.jl:1260
┌ Info: Precompiling DecisionTree [7806a523-6efd-50cb-b5f6-3fa6f1930dbb]
└ @ Base loading.jl:1260
┌ Info: Precompiling Distances [b4f34e82-e78d-54a5-968a-f98e89d6e8f7]
└ @ Base loading.jl:1260
┌ Info: Precompiling NearestNeighbors [b8a86587-4115-5ab1-83bc-aa920d37bbce]
└ @ Base loading.jl:1260
┌ Info: Precompiling LIBSVM [b1bec4e5-fd48-53fe-b0cb-9723c09d164b]
└ @ Base loading.jl:1260

We're going to use a dataset called iris, which contains measurements of flowers (sepal length, sepal width, petal length and petal width) along with the species of each specimen.

In [6]:
iris = dataset("datasets", "iris")
Out[6]:

150 rows × 5 columns

 Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species
     │ Float64     │ Float64    │ Float64     │ Float64    │ Cat…
─────┼─────────────┼────────────┼─────────────┼────────────┼────────
   1 │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa
   2 │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa
   3 │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa
   4 │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa
   5 │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa
   6 │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa
   7 │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ setosa
   8 │ 5.0         │ 3.4        │ 1.5         │ 0.2        │ setosa
   9 │ 4.4         │ 2.9        │ 1.4         │ 0.2        │ setosa
  10 │ 4.9         │ 3.1        │ 1.5         │ 0.1        │ setosa
  11 │ 5.4         │ 3.7        │ 1.5         │ 0.2        │ setosa
  12 │ 4.8         │ 3.4        │ 1.6         │ 0.2        │ setosa
  13 │ 4.8         │ 3.0        │ 1.4         │ 0.1        │ setosa
  14 │ 4.3         │ 3.0        │ 1.1         │ 0.1        │ setosa
  15 │ 5.8         │ 4.0        │ 1.2         │ 0.2        │ setosa
  16 │ 5.7         │ 4.4        │ 1.5         │ 0.4        │ setosa
  17 │ 5.4         │ 3.9        │ 1.3         │ 0.4        │ setosa
  18 │ 5.1         │ 3.5        │ 1.4         │ 0.3        │ setosa
  19 │ 5.7         │ 3.8        │ 1.7         │ 0.3        │ setosa
  20 │ 5.1         │ 3.8        │ 1.5         │ 0.3        │ setosa
  21 │ 5.4         │ 3.4        │ 1.7         │ 0.2        │ setosa
  22 │ 5.1         │ 3.7        │ 1.5         │ 0.4        │ setosa
  23 │ 4.6         │ 3.6        │ 1.0         │ 0.2        │ setosa
  24 │ 5.1         │ 3.3        │ 1.7         │ 0.5        │ setosa
  25 │ 4.8         │ 3.4        │ 1.9         │ 0.2        │ setosa
  26 │ 5.0         │ 3.0        │ 1.6         │ 0.2        │ setosa
  27 │ 5.0         │ 3.4        │ 1.6         │ 0.4        │ setosa
  28 │ 5.2         │ 3.5        │ 1.5         │ 0.2        │ setosa
  29 │ 5.2         │ 3.4        │ 1.4         │ 0.2        │ setosa
  30 │ 4.7         │ 3.2        │ 1.6         │ 0.2        │ setosa
   ⋮ │ (only the first 30 of 150 rows are shown)

As you can see, the first 4 columns of our dataset (ignoring the row index) are the features of the flowers and the last one is the species. Now we're going to set up our data by storing the feature values in X and the species labels in irislabels.

In [7]:
X = Matrix(iris[:,1:4])
irislabels = iris[:,5]
Out[7]:
150-element CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
In [8]:
X
Out[8]:
150×4 Array{Float64,2}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 ⋮              
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

For practical reasons we're going to encode the labels as a discrete numeric type, assigning to each species a corresponding number (1-3) instead of the strings (setosa, versicolor and virginica). This can be done with the following functions from MLBase:

In [12]:
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)
Out[12]:
150-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
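
If we ever need the original strings back, MLBase also provides the inverse mapping; a quick sketch using the objects defined above:

labeldecode(irislabelsmap, y)   # should recover the original species strings
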

In classification, we often want to use some of the data to fit a model, and the rest of the data to validate (commonly known as training and testing data). We will get this data ready now so that we can easily use it in the rest of this notebook.

In [13]:
function perclass_splits(y, at)
    uids = unique(y)                    # the distinct class labels
    keepids = []
    for ui in uids
        curids = findall(y .== ui)      # indices of the observations in this class
        rowids = randsubseq(curids, at) # keep each index with probability `at`
        push!(keepids, rowids...)
    end
    return keepids
end
Out[13]:
perclass_splits (generic function with 1 method)
In [14]:
?randsubseq
search: randsubseq randsubseq! RandomSub StratifiedRandomSub

Out[14]:
randsubseq([rng=GLOBAL_RNG,] A, p) -> Vector

Return a vector consisting of a random subsequence of the given array A, where each element of A is included (in order) with independent probability p. (Complexity is linear in p*length(A), so this function is efficient even if p is small and A is large.) Technically, this process is known as "Bernoulli sampling" of A.

Examples

julia> rng = MersenneTwister(1234);

julia> randsubseq(rng, collect(1:8), 0.3)
2-element Array{Int64,1}:
 7
 8
In [15]:
trainids = perclass_splits(y,0.7)
testids = setdiff(1:length(y),trainids)
Out[15]:
34-element Array{Int64,1}:
  13
  14
  26
  29
  34
  38
  39
  49
  50
  51
  52
  54
  60
   ⋮
  88
  91
  96
 100
 102
 104
 110
 111
 117
 121
 140
 141

We will need one more function: one that assigns a class based on a predicted value when the predictions are continuous, as they will be for the GLMNet models below.

In [16]:
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1,2,3]))
Out[16]:
assign_class (generic function with 1 method)
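
For instance (throwaway calls, just to illustrate how continuous predictions get rounded to the nearest class):

assign_class(1.2)   # -> 1
assign_class(2.7)   # -> 3
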

⚫ Method 1: Lasso

We fit an L1-penalized (lasso) linear model with GLMNet, pick the penalty λ by cross-validation, and then predict on the held-out test rows.

In [17]:
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
Out[17]:
Least Squares GLMNet Cross Validation
72 models for 4 predictors in 10 folds
Best λ 0.001 (mean loss 0.050, std 0.007)
In [18]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)]

path = glmnet(X[trainids,:], y[trainids],lambda=[mylambda]);
In [19]:
q = X[testids,:];
predictions_lasso = GLMNet.predict(path,q)
Out[19]:
34×1 Array{Float64,2}:
 0.9146683630742383
 0.9027034924929771
 0.9989459600781313
 0.9072236516239828
 0.8313765151883691
 0.8727250193938867
 0.998934696413957
 0.9041861447141626
 0.9361291549187013
 2.215471751594401
 2.299734330379395
 2.212020807713042
 2.264340376293498
 ⋮
 2.2120358259319413
 2.2323069388267367
 2.1405846303725884
 2.187033114965047
 2.7813647257591096
 2.770792080136811
 3.170316552025599
 2.733060904097049
 2.7179756084531945
 3.0098009089429016
 2.8223405451162105
 3.0751583179089907
In [20]:
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso,y[testids])
Out[20]:
0.9705882352941176

⚫ Method 2: Ridge

We will use the same function but set alpha to zero, which corresponds to ridge (L2-penalized) regression.

In [21]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids])
Out[21]:
1.0

⚫ Method 3: Elastic Net

We will use the same function but set alpha to 0.5 (a mix of the lasso and ridge penalties).

In [22]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids])
Out[22]:
0.9705882352941176

⚫ Method 4: Decision Trees

We will use the DecisionTree package, fitting a shallow tree (max_depth = 2).

In [23]:
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
Out[23]:
DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [1, 2, 3]
root:                     Decision Tree
Leaves: 3
Depth:  2
In [24]:
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT,y[testids])
Out[24]:
0.9705882352941176
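
Beyond a single accuracy number, MLBase (already loaded above) can also produce a confusion matrix; a quick sketch for the decision-tree predictions:

confusmat(3, y[testids], predictions_DT)   # rows = true class, columns = predicted class
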

⚫ Method 5: Random Forests

The RandomForestClassifier is available through the DecisionTree package as well.

In [25]:
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
Out[25]:
RandomForestClassifier
n_trees:             20
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             [1, 2, 3]
ensemble:            Ensemble of Decision Trees
Trees:      20
Avg Leaves: 6.3
Avg Depth:  4.5
In [26]:
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])
Out[26]:
0.9705882352941176

⚫ Method 6: Using a Nearest Neighbor method

We will use the NearestNeighbors package here.

In [32]:
Xtrain = X[trainids,:]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')
Out[32]:
KDTree{StaticArrays.SArray{Tuple{4},Float64,1,4},Euclidean,Float64}
  Number of points: 116
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true
In [33]:
queries = X[testids,:]
Out[33]:
34×4 Array{Float64,2}:
 4.8  3.0  1.4  0.1
 4.3  3.0  1.1  0.1
 5.0  3.0  1.6  0.2
 5.2  3.4  1.4  0.2
 5.5  4.2  1.4  0.2
 4.9  3.6  1.4  0.1
 4.4  3.0  1.3  0.2
 5.3  3.7  1.5  0.2
 5.0  3.3  1.4  0.2
 7.0  3.2  4.7  1.4
 6.4  3.2  4.5  1.5
 5.5  2.3  4.0  1.3
 5.2  2.7  3.9  1.4
 ⋮              
 6.3  2.3  4.4  1.3
 5.5  2.6  4.4  1.2
 5.7  3.0  4.2  1.2
 5.7  2.8  4.1  1.3
 5.8  2.7  5.1  1.9
 6.3  2.9  5.6  1.8
 7.2  3.6  6.1  2.5
 6.5  3.2  5.1  2.0
 6.5  3.0  5.5  1.8
 6.9  3.2  5.7  2.3
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
In [34]:
idxs, dists = knn(kdtree, queries', 5, true)
Out[34]:
([[2, 10, 39, 30, 27], [36, 9, 41, 3, 4], [30, 10, 2, 27, 39], [25, 33, 1, 16, 8], [29, 14, 15, 13, 6], [5, 1, 34, 8, 16], [9, 36, 4, 41, 3], [11, 25, 18, 40, 20], [8, 33, 31, 1, 30], [42, 65, 56, 47, 57]  …  [72, 66, 71, 44, 61], [72, 71, 66, 69, 61], [109, 84, 90, 116, 62], [106, 97, 82, 101, 114], [110, 111, 76, 104, 93], [114, 86, 57, 112, 106], [106, 114, 82, 97, 83], [110, 93, 111, 83, 76], [83, 112, 108, 93, 114], [111, 83, 110, 77, 93]], [[0.1414213562373099, 0.17320508075688815, 0.19999999999999998, 0.20000000000000037, 0.244948974278318], [0.31622776601683816, 0.34641016151377546, 0.4795831523312718, 0.5000000000000003, 0.519615242270663], [0.17320508075688762, 0.19999999999999993, 0.22360679774997896, 0.22360679774997916, 0.3000000000000002], [0.14142135623730964, 0.14142135623730995, 0.14142135623730995, 0.17320508075688806, 0.22360679774997916], [0.3464101615137755, 0.3605551275463992, 0.38729833462074176, 0.412310562561766, 0.4795831523312721], [0.14142135623730925, 0.24494897427831727, 0.26457513110645886, 0.2645751311064591, 0.29999999999999954], [0.14142135623730948, 0.20000000000000018, 0.29999999999999954, 0.2999999999999996, 0.3605551275463989], [0.10000000000000053, 0.22360679774997896, 0.2449489742783178, 0.24494897427831785, 0.2828427124746191], [0.14142135623730964, 0.17320508075688762, 0.22360679774997877, 0.22360679774997896, 0.24494897427831747], [0.26457513110645914, 0.33166247903553986, 0.45825756949558427, 0.5196152422706637, 0.5567764362830021]  …  [0.14142135623730964, 0.1732050807568884, 0.33166247903554, 0.3741657386773941, 0.43588989435406733], [0.14142135623730995, 0.17320508075688815, 0.22360679774997935, 0.26457513110645864, 0.26457513110645864], [0.0, 0.2645751311064589, 0.31622776601683755, 0.33166247903553997, 0.3605551275463989], [0.24494897427831802, 0.3316624790355402, 0.38729833462074154, 0.4242640687119288, 0.4999999999999996], [0.6324555320336759, 0.7071067811865474, 0.7549834435270749, 0.806225774829855, 0.8124038404635958], [0.22360679774997935, 0.374165738677394, 0.4242640687119286, 0.4242640687119287, 0.46904157598234314], [0.1414213562373093, 0.3605551275463988, 0.38729833462074154, 0.3872983346207416, 0.42426406871192845], [0.22360679774997935, 0.2999999999999998, 0.30000000000000016, 0.3605551275463991, 0.3999999999999997], [0.1732050807568879, 0.360555127546399, 0.36055512754639935, 0.4123105625617659, 0.46904157598234336], [0.24494897427831785, 0.34641016151377513, 0.346410161513776, 0.360555127546399, 0.374165738677394]])
In [35]:
c = ytrain[hcat(idxs...)]                                     # labels of the 5 nearest neighbors, one column per query point
possible_labels = map(i->counter(c[:,i]),1:size(c,2))        # count how often each label appears among the neighbors
predictions_NN = map(i->parse(Int,string(argmax(DataFrame(possible_labels[i])[1,:]))),1:size(c,2))  # take the label with the highest count (majority vote)
findaccuracy(predictions_NN,y[testids])
Out[35]:
0.9705882352941176
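
The majority-vote line above is fairly dense. A more readable way to take the same vote is StatsBase.mode; this is just a sketch (StatsBase is an extra dependency you would need to Pkg.add, and predictions_NN_alt is a name introduced only here):

import Pkg; Pkg.add("StatsBase")
using StatsBase
predictions_NN_alt = [mode(c[:, i]) for i in 1:size(c, 2)]   # most frequent label among the 5 neighbors
findaccuracy(predictions_NN_alt, y[testids])
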

⚫ Method 7: Support Vector Machines

We will use the LIBSVM package here.

In [36]:
Xtrain = X[trainids,:]
ytrain = y[trainids]
Out[36]:
116-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
In [37]:
model = svmtrain(Xtrain', ytrain)
Out[37]:
LIBSVM.SVM{Int64}(SVC, LIBSVM.Kernel.RadialBasis, nothing, 4, 3, [1, 2, 3], Int32[1, 2, 3], Float64[], Int32[], LIBSVM.SupportVectors{Int64,Float64}(39, Int32[6, 16, 17], [1, 1, 1, 1, 1, 1, 2, 2, 2, 2  …  3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5.7 5.7 … 6.5 5.9; 4.4 3.8 … 3.0 3.0; 1.5 1.7 … 5.2 5.1; 0.4 0.3 … 2.0 1.8], Int32[14, 17, 22, 23, 35, 38, 42, 43, 44, 45  …  96, 98, 100, 102, 103, 107, 108, 113, 114, 116], LIBSVM.SVMNode[LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 4.8), LIBSVM.SVMNode(1, 4.5), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 6.3)  …  LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 7.2), LIBSVM.SVMNode(1, 7.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 6.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 5.9)]), 0.0, [0.448029161923002 0.9556421011989508; 0.03976149509479696 0.0; … ; -0.0 -1.0; -0.0 -1.0], Float64[], Float64[], [0.03446786611956108, 0.16792461061618347, 0.17802546496124236], 3, 0.25, 200.0, 0.001, 1.0, 0.5, 0.1, true, false)
In [38]:
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])
Out[38]:
0.9705882352941176

Putting all the results together:

In [39]:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)
Out[39]:
7×2 Array{Any,2}:
 "lasso"  0.970588
 "ridge"  1.0
 "EN"     0.970588
 "DT"     0.970588
 "RF"     0.970588
 "kNN"    0.970588
 "SVM"    0.970588

🥳 One cool finding

We used multiple methods to run classification on the iris dataset, a dataset of flowers containing three iris species. We split the data into training and testing sets and ran our methods. Here is the scoreboard from the run above:

method   accuracy score
lasso    0.970588
ridge    1.0
EN       0.970588
DT       0.970588
RF       0.970588
kNN      0.970588
SVM      0.970588

The exact numbers depend on the random train/test split, but all seven methods classify the held-out iris flowers with high accuracy, and on this particular split ridge regression even gets every test flower right.