Anomaly Detection Using IsolationForest

IsolationForest is an anomaly detection algorithm offered by ScikitLearn and used for detecting anomalies or outliers in high-dimension data. Anomalies are detected by isolating "observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature." It returns a score of 1 or -1. A score of 1 denotes an inlier whereas a score of -1 denotes an outlier.

Below is an example:

Create a sample dataset

import pandas as pd
data = {'X': [4, 2, 1, 5, 2, 3, 1, 300, 103, 4], 
     'Y': [3, 4, 2, 1, 3, 5, 100, 500, 1, 3]}
data = pd.DataFrame(data=data)
data

Train a dataset with outliers

from sklearn.ensemble import IsolationForest
model=IsolationForest()
model.fit(data[['X']])

Predict outliers

Notice that we're using a new column that the model has not seen before.

data['anomaly_score'] = model.predict(data[['Y']]) 
anomalies = data[data['anomaly_score']==-1].head() #Only keep the outliers
anomalies

#All the code

import pandas as pd
from sklearn.ensemble import IsolationForest

#Create a dataset
data = {'X': [4, 2, 1, 5, 2, 3, 1, 300, 103, 4], 
     'Y': [3, 4, 2, 1, 3, 5, 100, 500, 1, 3]}
data = pd.DataFrame(data=data)
#data

#Train data on the X column
model=IsolationForest()
model.fit(data[['X']])


#Predict outliers using the Y column 
data['anomaly_score'] = model.predict(data[['Y']]) 
anomalies = data[data['anomaly_score']==-1].head() #Only keep the outliers
#anomalies