Anitya Gangurde Portfolio Website

How to approach any Data Science Problem?
A Data Science problem can have multiple facets, from determining the relevant features and understanding the gist of the dataset to implementing a model around it.

Aug 3, 2021 · 6 min read

I have created this guide to make the Data Science process simpler and more systematic. I will also walk through a practical example using our old and lovely Titanic dataset.

Let us start by understanding the Titanic dataset, because knowing what we are working with should be the first step in analyzing anything.

PART 1: Understanding The Dataset

Below you can see the .csv file for the Titanic dataset.

The first column is PassengerID, which is just a unique ID given to every passenger on the Titanic. Next we have the Survived column, which is a boolean: 0 means the passenger died and 1 means they lived. The Pclass column denotes the class the passenger travelled in: 1 for first class, 2 for second class and 3 for third class.

The columns Name, Sex and Age are self-explanatory. The next column, SibSp, refers to the number of siblings or spouses aboard. Similarly, the column Parch means the number of parents or children aboard.

The column Embarked gives the port at which the passenger boarded the ship: C denotes Cherbourg, Q denotes Queenstown and S denotes Southampton. Finally, the column Ticket is the ticket number, Fare is the price paid for the ticket and Cabin is the cabin number.

So, what we have here is a mixed dataset with numerical as well as categorical values.

PART 2: Data Cleaning

Now that we have got the gist of the dataset, we can move on to data cleaning: strategically removing columns that are not useful, handling any null values in the data and making the data more model friendly.

When choosing what to train a model on, we can see that some features may not be useful at all. Features such as PassengerID, Name, Ticket and Cabin would not contribute much to predicting whether a person will survive or not.

So, we will delete these columns from our dataset to make it simpler and less memory intensive for our model.
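As a minimal sketch of this cleaning step, assuming pandas is available: the rows below are invented purely for illustration (they are not real passenger records), the column is spelled `PassengerId` as in the commonly distributed CSV, and filling missing values with the median or the most frequent value is just one possible strategy, not the only one.

```python
import pandas as pd

# Toy rows invented for illustration only; NOT real passenger records.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Drop the identifier-like columns we decided not to use.
cleaned = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# One possible way to handle nulls (an assumption, not the only option):
# median for a numerical column, most frequent value for a categorical one.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

# Count the remaining nulls per column.
print(cleaned.isnull().sum())
```

On the real file you would replace the toy frame with something like `pd.read_csv(...)` pointing at your own copy of the data.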

Next, we can go deeper in understanding which features are actually numerical and which are categorical. Just by looking at the rows of the dataset we can identify which features repeat a small set of values and which vary freely.

Going through the dataset, features such as Age, SibSp, Parch and Fare have values which are not divided into classes. Hence, we can call them numerical features. Though SibSp and Parch look categorical, they are really discrete counts. Features like Survived, Pclass, Sex and Embarked, on the other hand, are categorical in nature.
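A rough way to do this check in code is to count distinct values per column. The sketch below assumes pandas and uses made-up rows; the cut-off `MAX_CATEGORIES` is a hypothetical threshold chosen for this tiny sample and would need adjusting (and some judgment, e.g. for SibSp and Parch) on the full data.

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 2, 3, 1],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 4.0],
    "Fare": [7.25, 71.28, 7.92, 13.0, 51.86, 16.7],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
})

# Few distinct values relative to the row count hints at a categorical feature.
n_unique = df.nunique()
print(n_unique)

# Hypothetical cut-off picked for this tiny sample only.
MAX_CATEGORIES = 3
categorical = [c for c in df.columns if n_unique[c] <= MAX_CATEGORIES]
numerical = [c for c in df.columns if n_unique[c] > MAX_CATEGORIES]
print("categorical:", categorical)
print("numerical:", numerical)
```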

Now we understand the data more clearly and deeply. We can further delete the features which, logically, should not affect the survival rate. But we can’t always rely on our logic, and that is when hypothesis testing comes into play.

We can create hypotheses, for example that gender played a role in whether a person survived. We can test such a hypothesis with statistical techniques such as computing a p-value. If the p-value falls below the significance level we chose (commonly 0.05), we reject the null hypothesis (here, that gender made no difference) in favour of the hypothesis we proposed; otherwise we fail to reject it.
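To make the gender example concrete, here is a small sketch of a chi-square test of independence written with the standard library only. The counts are the ones commonly reported for the 891-row Kaggle training file, which is an assumption here, so recompute them from your own copy. No continuity correction is applied, and the significance level `ALPHA` is a conventional choice rather than something fixed by the method.

```python
import math

# Survival counts by sex, as commonly reported for the 891-row Kaggle
# training file (an assumption; recompute them from your own copy).
observed = {
    "female": {"survived": 233, "died": 81},
    "male": {"survived": 109, "died": 468},
}

row_totals = {sex: sum(cells.values()) for sex, cells in observed.items()}
col_totals = {
    outcome: sum(observed[sex][outcome] for sex in observed)
    for outcome in ("survived", "died")
}
total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where "expected" is the count we would see if sex and survival were independent.
chi2 = 0.0
for sex, cells in observed.items():
    for outcome, count in cells.items():
        expected = row_totals[sex] * col_totals[outcome] / total
        chi2 += (count - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, for which the chi-square
# survival function reduces to erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

ALPHA = 0.05  # conventional significance level
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
print("reject the null hypothesis" if p_value < ALPHA else "fail to reject the null hypothesis")
```

In practice a statistics library would usually do this for you; the hand-rolled version is only meant to show what the test computes.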

PART 3: Feature Engineering

So, after differentiating between categorical and numerical features we will need to do some feature extraction and keep the features that will be the most useful. A data scientist’s domain knowledge about the data comes into play at this step.

We can start with Exploratory Data Analysis or EDA to find patterns among the data and make visualizations. This will help us know how a particular feature affects some other feature and we will know which features are the most effective in predicting the target values.

Here we can also create new features by combining two other features.
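For instance, one commonly used combination on this dataset is a family-size feature built from SibSp and Parch. The sketch below assumes pandas; the names `FamilySize` and `IsAlone` are illustrative choices, not columns that exist in the original file.

```python
import pandas as pd

# Toy values invented for illustration.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical engineered feature: relatives aboard plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# A second derived flag: 1 if the passenger travelled with no relatives.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```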

Next, we will have to convert our categorical variables into dummy variables. This is important because most models cannot work directly with features that are not numerical. Hence, we will need to convert the categories into binary numbers.

We can use one-hot encoding (one binary dummy column per category) for features whose labels have no natural order (e.g. Embarked) and ordinal encoding for categories with an ordered range (e.g. Pclass). We will use these encoded variables to create our feature dataset and pass it in as input to our model.

But before that we will remove all the columns which we think would not be useful in the feature dataset. We will also create a separate dataset for our target variable and remove it from the features.

We now have a feature dataset and respective target values which can be used for further training.
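A minimal sketch of the encoding and feature/target separation, assuming pandas and using invented rows: `pd.get_dummies` creates one binary column per category, while Pclass is left as its existing integer code since 1, 2 and 3 already carry the order. The variable names `X` and `y` are just the usual convention.

```python
import pandas as pd

# Toy rows invented for illustration; not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# Dummy (one-hot) encode the unordered categories as 0/1 columns.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)

# Pclass is already an ordered integer code (1 < 2 < 3), so it stays as is.

# Separate the target from the features.
y = encoded["Survived"]
X = encoded.drop(columns=["Survived"])

print(X.columns.tolist())
```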

We will also split the dataset into training, validation and test sets now, before passing it into the model. The validation set lets us tune the model, and the held-out test set tells us how robust it is on data it has never seen.
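One way to get the three sets, assuming scikit-learn is installed, is to call `train_test_split` twice. The 60/20/20 proportions, the `random_state` and the stand-in data below are all illustrative assumptions, not values prescribed by this guide.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with a balanced binary target (not Titanic rows).
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First carve out 20% of the data as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ...then take 25% of the remainder (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the proportion of survivors roughly the same in every split, which matters when the classes are imbalanced.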

PART 4: Selecting A Model, Training And Testing

This part is the actual machine learning part where we select an appropriate model, train it on the data and then test it. We can select ML models such as SVM, Logistic Regression, etc.

Every model has some benefits and some drawbacks. It is better to study which one suits the problem, or one can train multiple models and compare their results to select the one which performs best.

We will use our training data, which contains our features and the targets, to train the model. Depending on the machine learning algorithm we are using, the model will learn weights or other parameters to come up with a generalized solution.
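As a sketch of training a couple of candidates and comparing them, assuming scikit-learn: the data here is synthetic (generated with `make_classification` as a stand-in for the encoded Titanic features), and the two models and their settings are arbitrary examples rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded feature matrix and target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

# Fit each candidate on the training split and score it (accuracy)
# on the validation split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)

best = max(scores, key=scores.get)
print(scores, "->", best)
```

The comparison is done on the validation split so that the test split stays untouched for the final evaluation.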

To test the validity of this solution, we can use the test data to compute the accuracy and other metrics such as precision, recall and F1-score, which tell us how well the trained model generalizes to unseen data.
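To show what these metrics measure, here is a small hand computation using only plain Python. The label lists are made up for illustration, with 1 standing for "survived"; in a real project a library function would normally compute the same values from the model's predictions on the test set.

```python
# Made-up true labels and predictions for eight passengers (1 = survived).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are right
precision = tp / (tp + fp)          # of predicted survivors, how many survived
recall = tp / (tp + fn)             # of actual survivors, how many we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```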

Thus…

This is how we can approach a Data Science problem and come up with a solution by following the above steps in order. Try implementing this guide with other datasets and I’m sure it will be helpful.