In this notebook we're going to show two nice applications of Julia: one implements a sorting algorithm (parallel merge sort), and the other shows how easy Julia makes machine-learning classification projects.
Have you forgotten how to set and use threads? Let's refresh our memory together.
Threads.nthreads()
1
Now we see that we're using only one thread; that is so serial! Let's change it a little bit.
export JULIA_NUM_THREADS=4
And let's check it out
Threads.nthreads()
1
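Still 1! That's because JULIA_NUM_THREADS is read only once, when Julia starts, so exporting it from inside a running session (or a notebook cell) has no effect. Instead, set the variable before launching Julia; a sketch for a Unix-like shell (on Windows, use set or the system environment settings):

export JULIA_NUM_THREADS=4
julia

Then, in the fresh session:

julia> Threads.nthreads()
4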
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi                       # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100000               # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end
    mid = (lo+hi)>>>1                 # find the midpoint
    half = @spawn psort!(v, lo, mid)  # task to sort the lower half; will run
    psort!(v, mid+1, hi)              # in parallel with the current call sorting
                                      # the upper half
    wait(half)                        # wait for the lower half to finish
    temp = v[lo:mid]                  # workspace for merging
    i, k, j = 1, lo, mid+1            # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j             # copy any remaining elements of `temp`
        v[k] = temp[i]
        k += 1
        i += 1
    end
    return v
end
psort! (generic function with 3 methods)
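Before timing it on big data, a quick sanity check that psort! actually sorts:

julia> psort!([3, 1, 2])
3-element Array{Int64,1}:
 1
 2
 3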
a = rand(20000000);
b = copy(a); @time sort!(b, alg = MergeSort); # single-threaded
2.605274 seconds (3 allocations: 76.294 MiB)
b = copy(a); @time psort!(b); # parallel merge sort (but only one thread here)
2.989460 seconds (3.34 k allocations: 686.956 MiB, 1.79% gc time)
b = copy(a); @time psort!(b); # psort! again (still only one thread)
14.794936 seconds (3.34 k allocations: 686.956 MiB, 81.05% gc time)
With only one thread available, the tasks created by @spawn cannot actually run in parallel, so psort! is no faster than the serial sort. Worse, its merge workspace allocates far more memory (687 MiB vs. 76 MiB), and on the second run the garbage collector had to clean up after the first, which is why 81% of the time went to GC. With JULIA_NUM_THREADS set before launching Julia, you should instead see psort! outperform sort!.
Put simply, classification is the task of predicting a label for a given observation. For example: you are given certain physical descriptions of an animal, and your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.
We will use several different classifiers and compare them at the end of this notebook. Let's define our accuracy function now to get it out of the way: a simple function that returns the ratio of correctly classified observations to the total number of predictions.
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)
findaccuracy (generic function with 1 method)
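For example, if two out of three predictions match the ground truth:

julia> findaccuracy([1, 2, 3], [1, 2, 2])
0.6666666666666666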
import Pkg; # downloading the necessary packages
Pkg.add("GLMNet")
Pkg.add("RDatasets")
Pkg.add("MLBase")
Pkg.add("Plots")
Pkg.add("DecisionTree")
Pkg.add("Distances")
Pkg.add("NearestNeighbors")
Pkg.add("LinearAlgebra")
Pkg.add("DataStructures")
Pkg.add("LIBSVM")
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM
We're going to use the iris dataset, which contains measurements of flowers (sepal length, sepal width, petal length, and petal width) together with the species of each specimen.
iris = dataset("datasets", "iris")
  | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
  | Float64 | Float64 | Float64 | Float64 | Cat… |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
As you can see, the first four columns (ignoring the index column) are the features of the flowers, and the last one is the species. Now we're going to set up our dataset by storing the feature values in X and the species values in irislabels.
X = Matrix(iris[:,1:4])
irislabels = iris[:,5]
150-element CategoricalArray{String,1,UInt8}: "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" ⋮ "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica" "virginica"
X
150×4 Array{Float64,2}: 5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 5.0 3.6 1.4 0.2 5.4 3.9 1.7 0.4 4.6 3.4 1.4 0.3 5.0 3.4 1.5 0.2 4.4 2.9 1.4 0.2 4.9 3.1 1.5 0.1 5.4 3.7 1.5 0.2 4.8 3.4 1.6 0.2 4.8 3.0 1.4 0.1 ⋮ 6.0 3.0 4.8 1.8 6.9 3.1 5.4 2.1 6.7 3.1 5.6 2.4 6.9 3.1 5.1 2.3 5.8 2.7 5.1 1.9 6.8 3.2 5.9 2.3 6.7 3.3 5.7 2.5 6.7 3.0 5.2 2.3 6.3 2.5 5.0 1.9 6.5 3.0 5.2 2.0 6.2 3.4 5.4 2.3 5.9 3.0 5.1 1.8
For practical reasons we're going to convert our labels to a discrete numeric type, assigning each species (setosa, versicolor, and virginica) a corresponding number (1-3). This can be done with the following functions:
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)
150-element Array{Int64,1}: 1 1 1 1 1 1 1 1 1 1 1 1 1 ⋮ 3 3 3 3 3 3 3 3 3 3 3 3
In classification, we often want to use some of the data to fit a model and the rest of the data to validate it (commonly known as training and testing data). We will get this data ready now so that we can easily use it in the rest of this notebook.
function perclass_splits(y, at)
    uids = unique(y)                     # the distinct class labels
    keepids = []
    for ui in uids
        curids = findall(y .== ui)       # indices belonging to this class
        rowids = randsubseq(curids, at)  # keep each index with probability `at`
        push!(keepids, rowids...)
    end
    return keepids
end
perclass_splits (generic function with 1 method)
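Each class is subsampled independently, keeping every index with probability at. A small illustrative run (the exact indices, and even how many are kept, vary from run to run since the subsequence is random):

julia> perclass_splits([1, 1, 1, 2, 2, 2], 0.5)
4-element Array{Any,1}:
 1
 3
 4
 6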
?randsubseq
search: randsubseq randsubseq! RandomSub StratifiedRandomSub

randsubseq([rng=GLOBAL_RNG,] A, p) -> Vector

Return a vector consisting of a random subsequence of the given array A, where each element of A is included (in order) with independent probability p. (Complexity is linear in p*length(A), so this function is efficient even if p is small and A is large.) Technically, this process is known as "Bernoulli sampling" of A.

julia> rng = MersenneTwister(1234);

julia> randsubseq(rng, collect(1:8), 0.3)
2-element Array{Int64,1}:
 7
 8
trainids = perclass_splits(y,0.7)
testids = setdiff(1:length(y),trainids)
34-element Array{Int64,1}: 13 14 26 29 34 38 39 49 50 51 52 54 60 ⋮ 88 91 96 100 102 104 110 111 117 121 140 141
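With at = 0.7, roughly 70% of each class ends up in the training set; here that leaves 34 of the 150 observations for testing:

julia> length(trainids), length(testids)
(116, 34)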
We need one more helper: a function that assigns a class to an observation when the predicted value is continuous, by picking the nearest label.
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1,2,3]))
assign_class (generic function with 1 method)
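For instance, a continuous prediction of 1.8 is nearest to class 2:

julia> assign_class(1.8)
2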
Our first method is lasso regression, fit with the GLMNet package:
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
Least Squares GLMNet Cross Validation 72 models for 4 predictors in 10 folds Best λ 0.001 (mean loss 0.050, std 0.007)
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],lambda=[mylambda]);
q = X[testids,:];
predictions_lasso = GLMNet.predict(path,q)
34×1 Array{Float64,2}: 0.9146683630742383 0.9027034924929771 0.9989459600781313 0.9072236516239828 0.8313765151883691 0.8727250193938867 0.998934696413957 0.9041861447141626 0.9361291549187013 2.215471751594401 2.299734330379395 2.212020807713042 2.264340376293498 ⋮ 2.2120358259319413 2.2323069388267367 2.1405846303725884 2.187033114965047 2.7813647257591096 2.770792080136811 3.170316552025599 2.733060904097049 2.7179756084531945 3.0098009089429016 2.8223405451162105 3.0751583179089907
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso,y[testids])
0.9705882352941176
Next, ridge regression: we use the same function but set alpha to zero (GLMNet's alpha mixes the two penalties; alpha = 1, the default, gives the lasso, and alpha = 0 gives ridge).
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids])
1.0
Then the elastic net: the same function with alpha set to 0.5, a blend of the lasso and ridge penalties.
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids])
0.9705882352941176
Next, a decision tree, using the DecisionTree package.
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
DecisionTreeClassifier max_depth: 2 min_samples_leaf: 1 min_samples_split: 2 min_purity_increase: 0.0 pruning_purity_threshold: 1.0 n_subfeatures: 0 classes: [1, 2, 3] root: Decision Tree Leaves: 3 Depth: 2
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT,y[testids])
0.9705882352941176
The RandomForestClassifier is available through the DecisionTree package as well.
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])
RandomForestClassifier n_trees: 20 n_subfeatures: -1 partial_sampling: 0.7 max_depth: -1 min_samples_leaf: 1 min_samples_split: 2 min_purity_increase: 0.0 classes: [1, 2, 3] ensemble: Ensemble of Decision Trees Trees: 20 Avg Leaves: 6.3 Avg Depth: 4.5
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])
0.9705882352941176
For k-nearest neighbors (kNN) we will use the NearestNeighbors package.
Xtrain = X[trainids,:]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')
KDTree{StaticArrays.SArray{Tuple{4},Float64,1,4},Euclidean,Float64} Number of points: 116 Dimensions: 4 Metric: Euclidean(0.0) Reordered: true
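Note that NearestNeighbors stores one point per column, which is why we pass the transposed matrices. To query a single flower you could do something like the following (a sketch; the variable names idxs1/dists1 are ours, and the returned indices depend on the random train/test split):

idxs1, dists1 = knn(kdtree, [5.1, 3.5, 1.4, 0.2], 3, true)  # indices of, and distances to, the 3 nearest training flowers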
queries = X[testids,:]
34×4 Array{Float64,2}: 4.8 3.0 1.4 0.1 4.3 3.0 1.1 0.1 5.0 3.0 1.6 0.2 5.2 3.4 1.4 0.2 5.5 4.2 1.4 0.2 4.9 3.6 1.4 0.1 4.4 3.0 1.3 0.2 5.3 3.7 1.5 0.2 5.0 3.3 1.4 0.2 7.0 3.2 4.7 1.4 6.4 3.2 4.5 1.5 5.5 2.3 4.0 1.3 5.2 2.7 3.9 1.4 ⋮ 6.3 2.3 4.4 1.3 5.5 2.6 4.4 1.2 5.7 3.0 4.2 1.2 5.7 2.8 4.1 1.3 5.8 2.7 5.1 1.9 6.3 2.9 5.6 1.8 7.2 3.6 6.1 2.5 6.5 3.2 5.1 2.0 6.5 3.0 5.5 1.8 6.9 3.2 5.7 2.3 6.9 3.1 5.4 2.1 6.7 3.1 5.6 2.4
idxs, dists = knn(kdtree, queries', 5, true)
([[2, 10, 39, 30, 27], [36, 9, 41, 3, 4], [30, 10, 2, 27, 39], [25, 33, 1, 16, 8], [29, 14, 15, 13, 6], [5, 1, 34, 8, 16], [9, 36, 4, 41, 3], [11, 25, 18, 40, 20], [8, 33, 31, 1, 30], [42, 65, 56, 47, 57] … [72, 66, 71, 44, 61], [72, 71, 66, 69, 61], [109, 84, 90, 116, 62], [106, 97, 82, 101, 114], [110, 111, 76, 104, 93], [114, 86, 57, 112, 106], [106, 114, 82, 97, 83], [110, 93, 111, 83, 76], [83, 112, 108, 93, 114], [111, 83, 110, 77, 93]], [[0.1414213562373099, 0.17320508075688815, 0.19999999999999998, 0.20000000000000037, 0.244948974278318], [0.31622776601683816, 0.34641016151377546, 0.4795831523312718, 0.5000000000000003, 0.519615242270663], [0.17320508075688762, 0.19999999999999993, 0.22360679774997896, 0.22360679774997916, 0.3000000000000002], [0.14142135623730964, 0.14142135623730995, 0.14142135623730995, 0.17320508075688806, 0.22360679774997916], [0.3464101615137755, 0.3605551275463992, 0.38729833462074176, 0.412310562561766, 0.4795831523312721], [0.14142135623730925, 0.24494897427831727, 0.26457513110645886, 0.2645751311064591, 0.29999999999999954], [0.14142135623730948, 0.20000000000000018, 0.29999999999999954, 0.2999999999999996, 0.3605551275463989], [0.10000000000000053, 0.22360679774997896, 0.2449489742783178, 0.24494897427831785, 0.2828427124746191], [0.14142135623730964, 0.17320508075688762, 0.22360679774997877, 0.22360679774997896, 0.24494897427831747], [0.26457513110645914, 0.33166247903553986, 0.45825756949558427, 0.5196152422706637, 0.5567764362830021] … [0.14142135623730964, 0.1732050807568884, 0.33166247903554, 0.3741657386773941, 0.43588989435406733], [0.14142135623730995, 0.17320508075688815, 0.22360679774997935, 0.26457513110645864, 0.26457513110645864], [0.0, 0.2645751311064589, 0.31622776601683755, 0.33166247903553997, 0.3605551275463989], [0.24494897427831802, 0.3316624790355402, 0.38729833462074154, 0.4242640687119288, 0.4999999999999996], [0.6324555320336759, 0.7071067811865474, 0.7549834435270749, 0.806225774829855, 0.8124038404635958], [0.22360679774997935, 0.374165738677394, 0.4242640687119286, 0.4242640687119287, 0.46904157598234314], [0.1414213562373093, 0.3605551275463988, 0.38729833462074154, 0.3872983346207416, 0.42426406871192845], [0.22360679774997935, 0.2999999999999998, 0.30000000000000016, 0.3605551275463991, 0.3999999999999997], [0.1732050807568879, 0.360555127546399, 0.36055512754639935, 0.4123105625617659, 0.46904157598234336], [0.24494897427831785, 0.34641016151377513, 0.346410161513776, 0.360555127546399, 0.374165738677394]])
c = ytrain[hcat(idxs...)]  # 5×34 matrix: the training labels of each query's 5 nearest neighbors
possible_labels = map(i -> counter(c[:,i]), 1:size(c,2))  # tally the neighbor labels per query
# majority vote: take the most frequent label for each query
predictions_NN = map(i -> parse(Int, string(argmax(DataFrame(possible_labels[i])[1,:]))), 1:size(c,2))
findaccuracy(predictions_NN,y[testids])
0.9705882352941176
Last, a support vector machine, using the LIBSVM package.
Xtrain = X[trainids,:]
ytrain = y[trainids]
116-element Array{Int64,1}: 1 1 1 1 1 1 1 1 1 1 1 1 1 ⋮ 3 3 3 3 3 3 3 3 3 3 3 3
model = svmtrain(Xtrain', ytrain)
LIBSVM.SVM{Int64}(SVC, LIBSVM.Kernel.RadialBasis, nothing, 4, 3, [1, 2, 3], Int32[1, 2, 3], Float64[], Int32[], LIBSVM.SupportVectors{Int64,Float64}(39, Int32[6, 16, 17], [1, 1, 1, 1, 1, 1, 2, 2, 2, 2 … 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5.7 5.7 … 6.5 5.9; 4.4 3.8 … 3.0 3.0; 1.5 1.7 … 5.2 5.1; 0.4 0.3 … 2.0 1.8], Int32[14, 17, 22, 23, 35, 38, 42, 43, 44, 45 … 96, 98, 100, 102, 103, 107, 108, 113, 114, 116], LIBSVM.SVMNode[LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 4.8), LIBSVM.SVMNode(1, 4.5), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 6.3) … LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 7.2), LIBSVM.SVMNode(1, 7.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 6.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 5.9)]), 0.0, [0.448029161923002 0.9556421011989508; 0.03976149509479696 0.0; … ; -0.0 -1.0; -0.0 -1.0], Float64[], Float64[], [0.03446786611956108, 0.16792461061618347, 0.17802546496124236], 3, 0.25, 200.0, 0.001, 1.0, 0.5, 0.1, true, false)
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])
0.9705882352941176
Putting all the results together:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)
7×2 Array{Any,2}: "lasso" 0.970588 "ridge" 1.0 "EN" 0.970588 "DT" 0.970588 "RF" 0.970588 "kNN" 0.970588 "SVM" 0.970588
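Since Plots is already loaded, we can also visualize the comparison with a quick bar chart (a minimal sketch):

bar(methods, overall_accuracies, legend = false, ylims = (0, 1.05),
    ylabel = "accuracy", title = "Classification accuracy on iris")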
We used multiple methods to run classification on the iris dataset, which contains three species of iris flowers. We split the data into training and testing sets and ran our methods. Here is the scoreboard:
method | accuracy score |
---|---|
lasso | 0.970588 |
ridge | 1.0 |
EN | 0.970588 |
DT | 0.970588 |
RF | 0.970588 |
kNN | 0.970588 |
SVM | 0.970588 |
In this run, ridge regression classified the whole test set correctly, while every other method misclassified a single flower (33/34 ≈ 0.9706).