Classification using TensorFlow's Estimator API
Classification
Classification is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The output variables are often called labels or categories.
For example, an email can be classified as spam or not spam, or a transaction can be classified as fraudulent or authorized.
Commonly used classification models include Logistic Regression, Decision Trees, Random Forests, and Neural Networks.
TensorFlow
TensorFlow is an open source software library for numerical computation using data flow graphs. More about TensorFlow can be found in the official documentation.
Estimators are a high-level API that reduces much of the boilerplate code you previously needed to write when training a TensorFlow model.
Loading the dataset
The dataset we’ll be using is the Census Income Dataset. This dataset will help us predict whether income exceeds $50K/yr based on census data.
Since the task is binary classification, we will use the Pandas apply function to convert the income_bracket column to a label whose value is 1 if the income is above $50K and 0 otherwise.
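The label conversion described above can be sketched as follows. This is a minimal illustration, not the original code: the tiny DataFrame is hand-made sample data standing in for the real census file, and the substring test on ">50K" is an assumption about how the bracket strings are written.

```python
import pandas as pd

# Hand-made sample rows standing in for the census data (illustrative values only).
census = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "income_bracket": [" <=50K", " >50K", " <=50K", " >50K"],
})

# Label is 1 when the bracket string contains ">50K", 0 otherwise.
# A substring test tolerates stray whitespace around the value.
census["label"] = census["income_bracket"].apply(lambda x: int(">50K" in x))

print(census[["income_bracket", "label"]])
```

Using a substring check rather than strict equality is a design choice that keeps the conversion robust to padding in the raw strings.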
Train and Test Split
Let’s use sklearn’s train_test_split function to split the data into training and test sets.
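A minimal sketch of the split, assuming a 70/30 ratio and a fixed random_state for reproducibility (neither value comes from the original post). The DataFrame here is made-up sample data with a label column already attached.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample data; in the post this would be the full census DataFrame.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
    "hours_per_week": [40, 50, 40, 40, 30, 30, 40, 32, 40, 10],
    "label": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})

# Separate the features from the target label.
X = df.drop(columns="label")
y = df["label"]

# Hold out 30% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))
```

Features and labels are split together in one call, so their row indices stay aligned.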
Categorical Feature Columns
A feature column is an abstract representation of any raw or derived variable that can be used to predict the target label. To define a feature column for a categorical feature, we can create a CategoricalColumn using the tf.feature_column API.
If we know all the possible values of a categorical feature, we can use categorical_column_with_vocabulary_list; if we do not, we can use categorical_column_with_hash_bucket with hash_bucket_size defined.
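The two options can be sketched like this. The column names sex and occupation and the bucket size of 1000 are assumptions chosen for illustration. Note that tf.feature_column is deprecated in recent TensorFlow releases, so this sketch assumes a version where it is still available.

```python
import tensorflow as tf

# Small, fully known set of values: list the vocabulary explicitly.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    "sex", ["Female", "Male"]
)

# Large or unknown set of values: hash each string into one of 1000 buckets.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000
)

print(sex.key, occupation.key)
```

With hashing, different values can collide in the same bucket, so hash_bucket_size is usually set well above the expected number of distinct values.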
Numeric Feature Columns
For numeric features, we can use numeric_column for each continuous feature column.
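As a sketch, the numeric columns can be built in a single list comprehension. The feature names below are assumed for illustration and should match the column names in your own DataFrame.

```python
import tensorflow as tf

# Assumed names of the continuous features in the census data.
numeric_features = [
    "age", "education_num", "capital_gain", "capital_loss", "hours_per_week",
]

# One numeric_column per continuous feature.
numeric_columns = [
    tf.feature_column.numeric_column(name) for name in numeric_features
]

print([column.key for column in numeric_columns])
```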
Building the Logistic Regression Model
Before building the model, we first need to define an input function. The input_fn is used to pass features and labels to the train, evaluate and predict methods of the Estimator.
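One way to write such an input function is with tf.data, sketched below. The helper name make_input_fn and its parameters are hypothetical, and this is not necessarily how the original post built its input_fn. The usage part assumes TensorFlow 2.x with eager execution, so that a batch can be inspected directly.

```python
import pandas as pd
import tensorflow as tf

def make_input_fn(features_df, labels, batch_size=32, shuffle=True, num_epochs=None):
    """Return an input_fn that yields (features dict, labels) batches."""
    def input_fn():
        # Each DataFrame column becomes one entry in the features dict.
        dataset = tf.data.Dataset.from_tensor_slices((dict(features_df), labels))
        if shuffle:
            dataset = dataset.shuffle(buffer_size=len(features_df))
        # num_epochs=None repeats indefinitely, which suits step-based training.
        return dataset.repeat(num_epochs).batch(batch_size)
    return input_fn

# Made-up sample data to show what one batch looks like.
features = pd.DataFrame({"age": [25.0, 40.0, 58.0], "sex": ["Male", "Female", "Male"]})
labels = pd.Series([0, 1, 1])

input_fn = make_input_fn(features, labels, batch_size=3, shuffle=False, num_epochs=1)
batch_features, batch_labels = next(iter(input_fn()))
print(sorted(batch_features.keys()), batch_labels.numpy())
```

Returning a closure lets the same helper serve training (shuffled, repeated) and evaluation (unshuffled, single pass).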
Train and Evaluate
Training a model is just a single command using the tf.estimator API.
We can measure the model’s accuracy on our test data set using the evaluate() method.
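Training and evaluation can be sketched end to end as below. This assumes a TensorFlow version that still ships tf.estimator (it was removed in TensorFlow 2.16), uses LinearClassifier as the logistic regression model, and runs on a handful of made-up rows, so the accuracy it prints will not match the figure reported for the real census data. The helper make_input_fn, the step count, and the column names are hypothetical.

```python
import tempfile

import pandas as pd
import tensorflow as tf

# Made-up training and test rows (illustrative values only).
train_df = pd.DataFrame({
    "age": [25.0, 38.0, 28.0, 44.0, 18.0, 34.0, 29.0, 63.0],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
})
train_labels = pd.Series([0, 0, 1, 1, 0, 0, 0, 1])
test_df = pd.DataFrame({"age": [24.0, 55.0, 36.0], "sex": ["Female", "Male", "Male"]})
test_labels = pd.Series([0, 1, 0])

feature_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.categorical_column_with_vocabulary_list("sex", ["Female", "Male"]),
]

def make_input_fn(features_df, labels, shuffle=True, num_epochs=None, batch_size=32):
    """Return an input_fn that yields (features dict, labels) batches."""
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(features_df), labels))
        if shuffle:
            dataset = dataset.shuffle(buffer_size=len(features_df))
        return dataset.repeat(num_epochs).batch(batch_size)
    return input_fn

# LinearClassifier is a logistic regression model for binary labels.
model = tf.estimator.LinearClassifier(
    feature_columns=feature_columns, model_dir=tempfile.mkdtemp()
)

# Training is a single call; steps bounds the otherwise endless input stream.
model.train(input_fn=make_input_fn(train_df, train_labels), steps=50)

# Evaluate with one unshuffled pass over the held-out rows.
results = model.evaluate(
    input_fn=make_input_fn(test_df, test_labels, shuffle=False, num_epochs=1)
)
print(results["accuracy"])
```

evaluate() returns a dictionary of metrics, so other entries such as the loss can be read from the same result.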
Accuracy on the test set is 0.836174, i.e. 83.62%. You can check out the entire code at my GitHub repo.