Recognizing Handwritten Digits
Recognizing handwritten text is a problem that dates back to the first automatic machines that needed to read individual characters in handwritten documents. Classifying handwritten text or numbers matters in many real-world scenarios. For example, a postal service can scan the postal codes on envelopes to group together mail that has to be sent to the same place. This article shows how to recognize handwritten digits (0 to 9) from the well-known Digits data set in Scikit-Learn, using a classifier called Logistic Regression.
Scikit-Learn is a Python library that contains numerous useful algorithms that can be easily applied and adapted for classification and other machine learning tasks.
One of the most fascinating things about the Scikit-Learn library is that it has a 4-step modeling pattern that makes it easy to code a machine learning classifier:
1. Import the model you want to use. In Scikit-Learn, all machine learning models are implemented as Python classes.
2. Make an instance of the model.
3. Train the model on the data, storing the information learned from the data.
4. Predict the labels of new data, using the information learned during training.
Pre-Requisites
If you already have Jupyter Notebook and the necessary Python libraries (Scikit-Learn, Matplotlib, and Seaborn) installed, you are ready to get started.
Implementation
Loading the dataset
The Scikit-Learn library provides numerous datasets, among which we will be using a data set of images called Digits. This data set consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.
We load the dataset as shown below:
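A minimal sketch of the loading step, assuming the variable name `digits` (the name is a choice made here for illustration). `load_digits` ships with Scikit-Learn, so nothing is downloaded:

```python
from sklearn.datasets import load_digits

# Load the Digits dataset that is bundled with Scikit-Learn
digits = load_digits()

# The returned object exposes the flattened pixels, the 8x8 images and the labels
print(digits.keys())
```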
Now that we have loaded the dataset, we can run the following command to check the shape of the dataset:
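One way to inspect the shapes, again assuming the dataset is held in a variable called `digits`. Each 8x8 image is stored as a flattened row of 64 pixel values, with one label per image:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# 1,797 images, each flattened from 8x8 into 64 features
print("Image data shape:", digits.data.shape)    # (1797, 64)

# One integer label (0 to 9) per image
print("Label data shape:", digits.target.shape)  # (1797,)
```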
Visualizing the images and labels in our dataset
We can display the grayscale images using the Matplotlib library:
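A possible sketch that draws the first five digits with their labels. The number of images, the figure size and the title text are illustrative choices rather than anything fixed by the dataset. In a Jupyter notebook the figure renders inline; in a plain script you would call `plt.show()` afterwards.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# One row of five panels, one per digit image
fig, axes = plt.subplots(1, 5, figsize=(20, 4))

# digits.images already holds each sample as an 8x8 array
for ax, image, label in zip(axes, digits.images[:5], digits.target[:5]):
    ax.imshow(image, cmap=plt.cm.gray)
    ax.set_title(f"Training: {label}", fontsize=20)
    ax.axis("off")
```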
The above command gives the output:
Splitting our dataset into training and testing sets
Now let’s split our dataset into training and test sets so that, after we train our model, we can check how well it generalizes to data it has not seen.
The Scikit-Learn 4-Step Modeling Pattern
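The split can be sketched with `train_test_split`. The `test_size=0.25` and `random_state=0` values below are assumptions made for this example; a different split will change the exact scores reported later.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out a quarter of the samples for testing (assumed ratio);
# random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

print("Training set:", x_train.shape, y_train.shape)
print("Test set:", x_test.shape, y_test.shape)
```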
Step 1: Importing the model we want to use
Here we will be using Logistic Regression. Logistic Regression is a linear classifier, so it works best when the classes can be separated, at least approximately, by linear boundaries in the feature space.
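In Scikit-Learn the classifier lives in the `linear_model` module, so the import for this step looks like this:

```python
# Logistic Regression is implemented as a Python class
from sklearn.linear_model import LogisticRegression
```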
Step 2: Making an instance of the model
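A sketch of this step, assuming the instance is named `logistic_regr` (a name chosen here). Raising `max_iter` above the library default is also an assumption, made so that the solver has room to converge on the unscaled pixel values; the default settings work too but may emit a convergence warning.

```python
from sklearn.linear_model import LogisticRegression

# Create the model object; nothing is learned until fit() is called.
# max_iter=5000 is an assumed value, not a requirement of the method.
logistic_regr = LogisticRegression(max_iter=5000)
```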
Step 3: Training the model
Here the model learns the relationship between the digits (x_train) and their labels (y_train).
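Training comes down to a single `fit` call. The sketch below repeats the loading and splitting so that it runs on its own; the split parameters and `max_iter` are the same assumed values as before.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

logistic_regr = LogisticRegression(max_iter=5000)

# Learn one weight vector per digit class from the training images
logistic_regr.fit(x_train, y_train)

# The learned weights: one row per class, one column per pixel
print(logistic_regr.coef_.shape)
```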
Step 4: Predicting the labels of new data
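Once the model is fitted, `predict` returns a label for each row it is given. The self-contained sketch below predicts a single test image and then the whole test set; variable names and split parameters are assumptions carried over from the earlier steps.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

logistic_regr = LogisticRegression(max_iter=5000)
logistic_regr.fit(x_train, y_train)

# predict() expects a 2D array, so a single image is reshaped to (1, 64)
first_prediction = logistic_regr.predict(x_test[0].reshape(1, -1))
print("Predicted:", first_prediction[0], "Actual:", y_test[0])

# Predict the label of every image in the test set at once
predictions = logistic_regr.predict(x_test)
```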
Measuring the performance of our model
To measure the accuracy of our predictions, we can use accuracy_score.
The accuracy is the fraction of digits in the test sample that are classified into the right category; here, we get 95.11% of the digits correct.
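A sketch of the accuracy computation with `accuracy_score` from `sklearn.metrics`. Because the split parameters and `max_iter` here are assumed values, the number printed may differ somewhat from the figure quoted in this article.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

logistic_regr = LogisticRegression(max_iter=5000)
logistic_regr.fit(x_train, y_train)
predictions = logistic_regr.predict(x_test)

# Fraction of test images whose predicted label matches the true label
score = accuracy_score(y_test, predictions)
print(f"Accuracy: {score:.2%}")
```

For a classifier, `logistic_regr.score(x_test, y_test)` reports the same mean accuracy in one call.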
Confusion Matrix
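A confusion matrix is a table that is often used to evaluate a classification model: each cell counts how many samples of a given true class were assigned to a given predicted class, so correct predictions fall on the diagonal. We can use Seaborn or Matplotlib to plot the confusion matrix. We will be using Seaborn for our confusion matrix.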
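One way to build and plot the matrix, assuming the same split and model settings as in the earlier steps. The color map, figure size and title are illustrative choices. In a notebook the heatmap renders inline; in a script, call `plt.show()` afterwards.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

logistic_regr = LogisticRegression(max_iter=5000)
logistic_regr.fit(x_train, y_train)
predictions = logistic_regr.predict(x_test)
score = accuracy_score(y_test, predictions)

# Rows are the actual digits, columns are the predicted digits
cm = confusion_matrix(y_test, predictions)

fig, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=0.5,
            square=True, cmap="Blues_r", ax=ax)
ax.set_ylabel("Actual label")
ax.set_xlabel("Predicted label")
ax.set_title(f"Accuracy Score: {score:.4f}", size=15)
```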
The above code displays the confusion matrix as shown below
Conclusion
From this article, we can see how easy it is to import a dataset, build a model using Scikit-Learn, train the model, make predictions with it, and find the accuracy of our predictions (which we got as 95.11%).
Thank you for reading my article!
For the source code, see https://github.com/tanisha15/Recognizing-Handwritten-digits-using-scikit-learn