Cricket Score Prediction
As the picture above shows, run rate on its own is a poor predictor of the final score in a limited-overs cricket match. In ODI and T-20 cricket, many factors decide what the final score will be: the runs scored in previous overs, the wickets left, the current scores of the striker and non-striker, the nature of the pitch, the weather conditions, and so on.
Data Collection
I downloaded the data from cricsheet, a site that provides ball-by-ball details of matches. I then wrote custom code to keep only the features used here.
The dataset contains ball by ball coverage of:
- 1188 ODI matches: data/odi.csv
- 1474 T-20 matches: data/t20.csv
- 617 IPL matches: data/ipl.csv
Each dataset consists of the following features:
- mid: Unique ID for match
- date: Date when the match was played
- venue: Stadium where the match was played
- bat_team: Batting team
- bowl_team: Bowling team
- batsman: Batter on strike (the term "batsman" has since been replaced by "batter")
- bowler: Bowler
- runs: Runs scored by the batting team at that point
- wickets: Wickets fallen at that point
- overs: Overs bowled at that point
- runs_last_5: Total runs scored in last 5 overs
- wickets_last_5: Total wickets that fell in last 5 overs
- striker: max(runs scored by striker, runs scored by non-striker)
- non-striker: min(runs scored by striker, runs scored by non-striker)
- total: Total runs scored by batting team after first innings
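The overs column appears to use cricket notation (the data shown further down contains values such as 0.1 and 49.6), where the digit after the point counts balls in the current over rather than tenths of an over. Below is a minimal sketch of converting that notation to a ball count, in case a strictly linear measure of innings progress is wanted. `overs_to_balls` is a hypothetical helper, not part of the original code, and it assumes six-ball overs.

```python
def overs_to_balls(overs):
    """Convert cricket overs notation (e.g. 19.3 = 19 overs and 3 balls) to balls bowled.

    Hypothetical helper; assumes six legal balls per over.
    """
    completed = int(overs)
    # round() guards against float noise such as 0.30000000000000004
    extra_balls = round((overs - completed) * 10)
    if not 0 <= extra_balls <= 6:
        raise ValueError(f"invalid overs value: {overs}")
    return completed * 6 + extra_balls


print(overs_to_balls(0.1))
print(overs_to_balls(19.3))
print(overs_to_balls(49.6))
```

With this representation, 49.6 (the last ball of a 50-over innings) maps to 300 balls.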
Prepare Data for consumption
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Lasso
Meet and Greet Data
This is the meet-and-greet step. Get to know your data on a first-name basis and learn a little about it: what it looks like (datatypes and values), what makes it tick (the independent/feature variables), and what its goals in life are (the dependent/target variable).
dataset = pd.read_csv('data/odi.csv')
X = dataset.iloc[:,[7,8,9,12,13]].values #Input features
y = dataset.iloc[:, 14].values #Label
I used the odi.csv dataset here to predict scores in ODI cricket. Only runs, wickets, overs, striker, and non-striker are used as inputs, because the other features made little difference to the results.
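The positional indices above correspond to named columns in the feature list. Here is a sketch of the same selection by column name, which is easier to audit. The one-row DataFrame is made up purely to keep the example self-contained and stands in for data/odi.csv.

```python
import pandas as pd

columns = ["mid", "date", "venue", "bat_team", "bowl_team", "batsman", "bowler",
           "runs", "wickets", "overs", "runs_last_5", "wickets_last_5",
           "striker", "non-striker", "total"]

# One made-up row standing in for data/odi.csv
demo = pd.DataFrame([[1, "2006-06-13", "Stormont", "England", "Ireland",
                      "ME Trescothick", "DT Johnston",
                      6, 0, 0.5, 6, 0, 0, 0, 301]], columns=columns)

feature_cols = ["runs", "wickets", "overs", "striker", "non-striker"]
X_named = demo[feature_cols].values
y_named = demo["total"].values

# Same result as the positional selection used above
X_pos = demo.iloc[:, [7, 8, 9, 12, 13]].values
y_pos = demo.iloc[:, 14].values
print((X_named == X_pos).all(), (y_named == y_pos).all())
```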
print(dataset.shape)
print(dataset.head())
Out:
(350899, 15)
mid date venue bat_team bowl_team \
0 1 2006-06-13 Civil Service Cricket Club, Stormont England Ireland
1 1 2006-06-13 Civil Service Cricket Club, Stormont England Ireland
2 1 2006-06-13 Civil Service Cricket Club, Stormont England Ireland
3 1 2006-06-13 Civil Service Cricket Club, Stormont England Ireland
4 1 2006-06-13 Civil Service Cricket Club, Stormont England Ireland
batsman bowler runs wickets overs runs_last_5 \
0 ME Trescothick DT Johnston 0 0 0.1 0
1 ME Trescothick DT Johnston 0 0 0.2 0
2 ME Trescothick DT Johnston 4 0 0.3 4
3 ME Trescothick DT Johnston 6 0 0.4 6
4 ME Trescothick DT Johnston 6 0 0.5 6
wickets_last_5 striker non-striker total
0 0 0 0 301
1 0 0 0 301
2 0 0 0 301
3 0 0 0 301
4 0 0 0 301
Data Visualization
scatter_matrix(dataset)
plt.show()
dataset.hist() #histogram
plt.show()
Out:
Splitting data into training and testing set
The data is split 75/25 into training and test sets. The features are then standardized with a scaler fitted on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train, X_test, y_train, y_test
Out:
(array([[ 0.09370533, 0.44565669, 0.18010786, 0.84849475, 0.3048047 ],
[ 0.01649844, 0.44565669, -0.2551692 , -0.89540433, -0.76114359],
[ 1.97240628, 1.31445566, 1.61933043, 0.35023787, -0.56127828],
...,
[ 0.55694666, -0.85754176, 0.03267531, 2.02295739, 2.56994483],
[-0.53681759, 0.01125721, -0.47982834, -0.9309941 , -0.76114359],
[-0.8971164 , -0.85754176, -0.90808481, -0.75304522, -0.02830414]]),
array([[-0.72983481, 1.31445566, -0.2692104 , -0.468327 , -0.82776536],
[-0.43387507, 0.88005618, -0.03753067, -0.21919856, 0.1049394 ],
[-1.23167959, -1.29194125, -1.30123829, -0.78863499, -0.69452182],
...,
[-0.44674289, -1.29194125, -0.96424959, 0.13669921, 1.70386184],
[-1.41182899, -1.29194125, -1.66630938, -1.2513021 , -0.82776536],
[-1.30888647, -1.29194125, -1.39250606, -1.00217366, -0.42803475]]),
array([252, 230, 295, ..., 401, 302, 247]),
array([ 80, 220, 256, ..., 307, 336, 321]))
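StandardScaler is fitted on the training split only, and the test split is transformed with the training mean and standard deviation. Below is a NumPy sketch of the same arithmetic on made-up numbers. StandardScaler uses the population standard deviation, which is NumPy's default (ddof=0).

```python
import numpy as np

# Made-up feature rows: runs, wickets
train = np.array([[100.0, 2.0], [150.0, 4.0], [200.0, 6.0]])
test = np.array([[150.0, 3.0]])

mean = train.mean(axis=0)   # learned from the training split only
std = train.std(axis=0)     # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(axis=0))    # approximately 0 per column
print(test_scaled)
```

Reusing the training statistics keeps information about the test rows out of the preprocessing step.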
X,y
Out:
(array([[0.00e+00, 0.00e+00, 1.00e-01, 0.00e+00, 0.00e+00],
[0.00e+00, 0.00e+00, 2.00e-01, 0.00e+00, 0.00e+00],
[4.00e+00, 0.00e+00, 3.00e-01, 0.00e+00, 0.00e+00],
...,
[2.01e+02, 8.00e+00, 4.94e+01, 5.90e+01, 1.80e+01],
[2.02e+02, 8.00e+00, 4.95e+01, 5.90e+01, 1.80e+01],
[2.03e+02, 8.00e+00, 4.96e+01, 5.90e+01, 1.80e+01]]),
array([301, 301, 301, ..., 203, 203, 203]))
def custom_accuracy(y_test, y_pred, threshold):
    # Percentage of predictions within `threshold` runs of the actual total
    right = 0
    l = len(y_pred)
    for i in range(0, l):
        if abs(y_pred[i] - y_test[i]) <= threshold:
            right += 1
    return (right / l) * 100
This custom function is used to test the models. Custom accuracy is based on the difference between the predicted score and the actual score: if the absolute difference is at most a given threshold, the prediction counts as correct.
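The same metric can be written without an explicit loop. This is a sketch using NumPy, assuming both inputs are array-likes of equal length. `custom_accuracy_np` is a name chosen here for illustration.

```python
import numpy as np

def custom_accuracy_np(y_true, y_pred, threshold):
    """Percentage of predictions within `threshold` runs of the actual total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) <= threshold) * 100)

# Made-up totals: absolute errors are 10, 50, 10 and 0 runs
actual = [100, 200, 300, 400]
predicted = [110, 250, 290, 400]
print(custom_accuracy_np(actual, predicted, 25))
```

Three of the four made-up predictions fall within 25 runs, so the call returns 75.0.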
R-squared is a statistic that describes the goodness of fit of a model. In regression, the R-squared coefficient of determination measures how well the regression predictions approximate the real data points. An R-squared value of 1 means the predictions fit the data perfectly.
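For reference, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the quantity the score() method of sklearn regressors returns. Here is a small sketch on made-up numbers, with `r_squared` as an illustrative helper name.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

actual = [200, 250, 300]
print(r_squared(actual, actual))            # perfect predictions
print(r_squared(actual, [250, 250, 250]))   # always predicting the mean
print(r_squared(actual, [210, 250, 290]))   # small errors
```

A model that always predicts the mean total scores 0, so R-squared can be read as the improvement over that baseline.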
Model
Linear Regression
lin = LinearRegression()
lin.fit(X_train,y_train)
y_pred = lin.predict(X_test)
score = lin.score(X_test,y_test)*100
print("R-squared value:" , score)
lin_acc=custom_accuracy(y_test,y_pred,25)
print("Custom accuracy:" , lin_acc)
Out:
R-squared value: 52.737657811129445
Custom accuracy: 51.78797378170419
Random Forest
ran = RandomForestRegressor(n_estimators=100,max_features=None)
ran.fit(X_train,y_train)
y_pred = ran.predict(X_test)
score = ran.score(X_test,y_test)*100
print("R-squared value:" , score)
ran_acc=custom_accuracy(y_test,y_pred,25)
print("Custom accuracy:" , ran_acc)
Out:
R-squared value: 79.56968334784949
Custom accuracy: 81.6129951553149
K Nearest Neighbours
# KNeighborsClassifier treats every distinct total as a class label,
# so score() returns mean classification accuracy rather than R-squared.
knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
score = knn.score(X_test,y_test)*100
print("Accuracy:" , score)
knn_acc=custom_accuracy(y_test,y_pred,25)
print("Custom accuracy:" , knn_acc)
Out:
Accuracy: 64.7044742091764
Custom accuracy: 77.92533485323455
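Since the target is a run total, a nearest-neighbour regressor is a natural alternative to the classifier. This is a sketch of a swapped-in technique, not what produced the numbers above. KNeighborsRegressor averages the targets of the nearest neighbours, and its score() method returns R-squared. The toy data is made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature data: overs bowled -> final total
X_toy = np.array([[10.0], [20.0], [30.0], [40.0]])
y_toy = np.array([220.0, 240.0, 260.0, 280.0])

knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_toy, y_toy)

# The two nearest neighbours of 24 overs are the rows at 20 and 30 overs,
# so the prediction is the mean of their totals
pred = knn_reg.predict(np.array([[24.0]]))
print(pred)
```

Unlike the classifier, the regressor can predict totals that never appear in the training labels.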
Lasso Regression
las = Lasso(alpha=0.01, max_iter=1000000)  # max_iter must be an integer; 10e5 is a float
las.fit(X_train,y_train)
y_pred = las.predict(X_test)
score = las.score(X_test,y_test)*100
print("R-squared value:" , score)
las_acc=custom_accuracy(y_test,y_pred,37)  # note: threshold is 37 here, 25 for the other models
print("Custom accuracy:" , las_acc)
Out:
R-squared value: 52.73751630496808
Custom accuracy: 67.71045882017668
Testing
cur_score=int(input('Current Score: '))
cur_wickets=int(input('Current Wickets: '))
cur_overs=float(input('Current Overs: '))  # float, so values such as 19.3 are accepted
cur_striker=int(input('Current Striker Score: '))
cur_non_striker=int(input('Current Non-Striker Score: '))
Out:
Current Score: 130
Current Wickets: 7
Current Overs: 19
Current Striker Score: 12
Current Non-Striker Score: 21
if cur_score<cur_striker+cur_non_striker or cur_overs<0 or cur_overs>50:
    print('Error in Input')
else:
    # Scale the inputs once with the scaler fitted on the training data
    features = sc.transform(np.array([[cur_score,cur_wickets,cur_overs,cur_striker,cur_non_striker]]))
    new_prediction = lin.predict(features)
    print("Linear Regression - Prediction score:" , new_prediction)
    new_prediction = ran.predict(features)
    print("Random Forest Regression - Prediction score:" , new_prediction)
    new_prediction = knn.predict(features)
    print("K Nearest Neighbours - Prediction score:" , new_prediction)
    new_prediction = las.predict(features)
    print("Lasso Regression - Prediction score:" , new_prediction)
Out:
Linear Regression - Prediction score: [234.70655237]
Random Forest Regression - Prediction score: [185.76]
K Nearest Neighbours - Prediction score: [130]
Lasso Regression - Prediction score: [234.502072]
a=[]
a.append(lin_acc)
a.append(ran_acc)
a.append(knn_acc)
a.append(las_acc)
b=['Linear','Random Forest','KNN','LASSO']
plt.bar(b, a, color ='purple',width = 0.4)
plt.show()
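Comparing the custom accuracies of all four algorithms shows that Random Forest Regression gives the highest value (about 81.6%). The Lasso figure was computed with a threshold of 37 runs instead of 25, so it is not directly comparable with the other three.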
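The custom accuracy depends on the threshold passed in (the Lasso cell above uses 37, the other cells 25), and a wider threshold can never lower it. Below is a sketch on made-up predictions showing how one and the same set of predictions scores at both thresholds. `within_threshold` is an illustrative name.

```python
def within_threshold(y_true, y_pred, threshold):
    """Percentage of predictions whose absolute error is at most `threshold`."""
    hits = sum(abs(p - t) <= threshold for t, p in zip(y_true, y_pred))
    return hits / len(y_pred) * 100

# Made-up totals and predictions; absolute errors are 10, 30, 35 and 50 runs
actual = [250, 250, 250, 250]
predicted = [260, 220, 285, 300]

for threshold in (25, 37):
    print(threshold, within_threshold(actual, predicted, threshold))
```

For a like-for-like comparison, every model would need to be scored with the same threshold.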
I implemented Linear Regression from scratch, without using sklearn functions
Linear Regression is a supervised learning algorithm that predicts a real-valued output y from a given input x. It models the relationship between the dependent variable y and the independent variables xi (the features). The hypothesis function used for prediction is represented by h( x ).
h( x ) = w * x + b
here, b is the bias.
x represents the feature vector
w represents the weight vector.
Mathematical Intuition
The cost function (or loss function) measures the performance of a machine learning model by quantifying the error between the expected values and the values predicted by the hypothesis function. The cost function for Linear Regression is represented by J.
J = ( 1 / m ) * sum over i = 1..m of ( y^i - h( x^i ) )^2
Here, m is the total number of training examples in the dataset.
y^i represents the value of the target variable for the ith training example, and x^i is its feature vector.
Algorithm
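repeat until convergence {
    tmpi = wi - alpha * dwi
    wi = tmpi
}
where alpha is the learning rate and dwi is the partial derivative of J with respect to wi. The bias b is updated in the same way using db.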
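For reference, here is a sketch of where the gradient terms come from, assuming J is the mean squared error. That assumption matches the dW and db expressions used in the implementation below.

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right)^2,
\qquad h(x) = w \cdot x + b

\frac{\partial J}{\partial w_j} = -\frac{2}{m} \sum_{i=1}^{m} x_j^{(i)} \left( y^{(i)} - h(x^{(i)}) \right),
\qquad
\frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right)
```

In matrix form the first gradient is -(2/m) X^T (Y - Y_pred), which is the expression the update step computes.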
Implementation
class LinearRegression() :
    # Note: this name shadows the LinearRegression imported from sklearn above
    def __init__( self, learning_rate, iterations ) :
        self.learning_rate = learning_rate
        self.iterations = iterations

    def fit( self, X, Y ) :
        # m = number of training examples, n = number of features
        self.m, self.n = X.shape
        self.W = np.zeros( self.n )
        self.b = 0
        self.X = X
        self.Y = Y
        # Gradient descent for a fixed number of iterations
        for i in range( self.iterations ) :
            self.update_weights()
        return self

    def update_weights( self ) :
        Y_pred = self.predict( self.X )
        # Gradients of the cost with respect to W and b
        dW = - ( 2 * ( self.X.T ).dot( self.Y - Y_pred ) ) / self.m
        db = - 2 * np.sum( self.Y - Y_pred ) / self.m
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self

    def predict( self, X ) :
        return X.dot( self.W ) + self.b
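A quick sanity check for this kind of implementation is to fit noise-free data with known coefficients. The sketch below repeats the same update rule in a compact function (`fit_gd` is an illustrative name, and the learning rate and iteration count are chosen for the toy data) and recovers y = 3x + 2 from made-up points.

```python
import numpy as np

def fit_gd(X, y, learning_rate=0.1, iterations=2000):
    """Batch gradient descent for linear regression with a mean squared error cost."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        error = y - (X.dot(w) + b)
        dw = -2 * X.T.dot(error) / m
        db = -2 * np.sum(error) / m
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Noise-free toy data generated from y = 3x + 2
x = np.linspace(-1.0, 1.0, 50)
X_toy = x.reshape(-1, 1)
y_toy = 3 * x + 2

w, b = fit_gd(X_toy, y_toy)
print(w, b)  # close to [3.] and 2.0
```

If the recovered weight and bias drift away from the known values, the gradient signs or the learning rate are the first things to check.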
Fitting the model
model = LinearRegression(learning_rate=0.01,iterations=1000)
model.fit(X_train,y_train)
new_prediction = model.predict(sc.transform(np.array([[cur_score,cur_wickets,cur_overs,cur_striker,cur_non_striker]])))
print("Linear Regression - Prediction score:" , new_prediction)
# Score the scratch model on its own test-set predictions
y_pred = model.predict(X_test)
model_acc=custom_accuracy(y_test,y_pred,25)
print("Custom accuracy:" , model_acc)
Out (recorded from the original cell, which printed lin_acc, the sklearn model's accuracy):
Custom accuracy: 51.78797378170419
c=[]
c.append(lin_acc)
c.append(model_acc)
d=['sklearn Linear','Scratch Linear']
plt.bar(d, c, color ='purple',width = 0.4)
print(c)
plt.show()
Out:
Authors
Resources: