From today onwards, I will take notes and share what I have learned. The resources and books I cover will be mentioned here.
| S.N. | Project Title | Status |
|---|---|---|
| 1. | Email/SMS spam detection | |
Today, I covered the basics of neural networks and the intuition behind them from Andrew Ng's course on Coursera.
Definition: A neural network in machine learning is a computational model inspired by the way biological neural networks in the human brain process information. Basically, neural networks are a type of algorithm that tries to mimic the brain.
- Input Layer: This layer receives the input data. Each node in the input layer represents a feature of the input data.
- Hidden Layers: These layers contain neurons that process the input received from the previous layer. The number of hidden layers and the number of neurons per layer are crucial design choices.
- Output Layer: The final layer produces the output, such as a predicted value or classification result. The number of neurons in this layer corresponds to the number of possible outputs.
- Activation Function: The term 'a' stands for activation, a term from neuroscience that refers to how much a neuron is sending a high output to other neurons downstream from it. Each neuron in the hidden and output layers uses an activation function to transform the weighted sum of its inputs. Common activation functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh. The activation function introduces non-linearity, which is essential for the network to learn complex patterns.
- When building a neural network, one of the decisions you need to make is how many hidden layers you want and how many neurons you want each hidden layer to have. This choice is a question of the architecture of the neural network.
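As a rough illustration of the activation functions named above, here is a minimal sketch in plain Python. The function names and the sample inputs, weights, and bias are assumptions chosen for illustration, not values from the course.

```python
import math

def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through and clip negatives to 0."""
    return max(0.0, z)

def tanh(z):
    """Squash z into the open interval (-1, 1)."""
    return math.tanh(z)

# A neuron applies the activation to the weighted sum of its inputs.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]   # hypothetical weights
bias = 0.0               # hypothetical bias
z = sum(w * x for w, x in zip(weights, inputs)) + bias  # 0.5 - 0.5 = 0.0
a = sigmoid(z)           # sigmoid(0) is 0.5
```

Swapping `sigmoid` for `relu` or `tanh` in the last line changes only how the weighted sum `z` is transformed, which is why the activation is treated as a per-layer design choice.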
As the picture above shows, a network can have more than one hidden layer; such a network is sometimes referred to as a multilayer perceptron. Multiple hidden layers enable neural networks to learn complex, hierarchical features from data, making them powerful for tasks like image classification, speech recognition, and language understanding. Greater depth significantly enhances a network's ability to model complex patterns, but it also introduces challenges such as overfitting, which call for careful training and regularization techniques.
I learned how neural network models are designed and picked up some important notation that is needed when working with neural networks.
- In the picture above, using logistic regression as the activation function essentially means using the sigmoid function as the activation function for the neurons, which is typical for binary classification problems.
- The picture contains only one hidden layer, and that hidden layer has three neurons. The output layer contains only one neuron, and the output of the hidden layer is the input to the output layer.
- `a^[1]` is the vector of activation values from layer 1.
- Let's look closer! Whenever you see a superscript square bracket 1, i.e. `a^[1]`, it refers to a quantity associated with layer 1 of the neural network. If you see a superscript square bracket 2, i.e. `a^[2]`, it refers to a quantity associated with layer 2, and similarly for the other layers.
- As you can see, `a^[1]` becomes the input to layer 2. So, the input to layer 2 is the output of layer 1. In general, the input to a layer is the output vector of the previous layer.
- In the picture above, there are 4 layers. The vector `x` is the input to the neural network and is not normally counted as an activation vector. If we treat `x` as the activation of layer 0, i.e. `a^[0]`, then we can generalize: the output of layer `l-1` (the previous layer) is the input to layer `l`.
- Parameters `w` and `b` of layer `l`, unit `j`: each layer contains a number of neurons (units), indexed by the subscript `j`, much like the rows of a matrix. We write `j` for each neuron in layer `l`, and the activation of layer `l-1` is the input to layer `l`.
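Putting the notation above together, the activation of unit `j` in layer `l` can be written in one line. This is a sketch of the usual form, assuming `g` denotes the activation function (for example, the sigmoid) and that `x` is treated as the activation of layer 0:

```latex
a_j^{[l]} = g\left(\vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]}\right),
\qquad \vec{a}^{[0]} = \vec{x}
```

The superscript in square brackets names the layer and the subscript names the unit, so the formula reads: take the previous layer's activation vector, combine it with this unit's weights and bias, and apply `g`.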
I went through this Notebook, learned some TensorFlow and Keras basics, and understood how logistic regression and neural networks differ.
"One of the remarkable things about neural networks is that the same algorithm can be applied to so many different applications."
Today, I looked at how TensorFlow tensors and NumPy arrays differ. In order to build a model, let's first understand how TensorFlow and NumPy work.
- NumPy was created first and became the standard library for linear algebra in Python.
- As shown above, a matrix is laid out in rows and columns, and a NumPy array can represent it.
- NumPy arrays support broadcasting and slicing, which make matrix computations efficient.
- TensorFlow and NumPy should be used with care: their workflows look similar, which makes it easy to confuse one with the other.
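To make the points about matrices, slicing, and broadcasting concrete, here is a small NumPy sketch. The matrix `A` and vector `b` are made-up values for illustration.

```python
import numpy as np

# A 2x3 matrix: 2 rows, 3 columns
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Slicing: take the first column and the second row
first_col = A[:, 0]      # array([1., 4.])
second_row = A[1, :]     # array([4., 5., 6.])

# Broadcasting: the 1-D vector b is added to every row of A
b = np.array([10.0, 20.0, 30.0])
C = A + b                # shape stays (2, 3)
```

Broadcasting avoids writing an explicit loop over the rows, which is one reason NumPy code stays short and fast.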
Tensorflow is a machine learning package developed by Google. In 2019, Google integrated Keras into Tensorflow and released Tensorflow 2.0. Keras is a framework developed independently by François Chollet that creates a simple, layer-centric interface to Tensorflow.
- In order to use the `Dense` object, we first import it: `from tensorflow.keras.layers import Dense`.
- `Dense` is another name for the layers of a neural network that we've learned about so far. As we learn more about neural networks, we will learn about other types of layers as well.
- A tensor here is a data type that the TensorFlow team created in order to store and carry out computations on matrices efficiently. So, whenever you see a tensor, just think of the matrices in the images above. Technically, a tensor is a little more general than a matrix.
- We can convert a tensor into a NumPy array, as the code below shows:

```python
import numpy as np
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
# Convert the tensor object into a NumPy array
a1.numpy()  # e.g. array([[0.2, 0.7, 0.3]], dtype=float32); actual values depend on the weights
```

- `.numpy()` returns the data as a NumPy array rather than as a TensorFlow tensor.
Now that we have collected the basics, we are ready to discuss the model implementation. See the code for the digit classification model in the picture below:
- First, we create the layers needed to design the model and combine them with TensorFlow's `Sequential` object; after calling `compile`, we can `fit` and `predict` easily. I haven't explained these objects in detail yet and will cover them in a later session.
Today, I explored an important concept from Coursera's Machine Learning Specialization and then dived into a from-scratch implementation of a neural network. All the code and core concepts from today's session are mentioned below:
- Core Concept: Forward and backward propagation are the core mechanics that allow the network to learn from data and improve its performance over time.
- Forward propagation is where the network makes a guess based on input.
- Backward propagation is where the network learns from its mistakes by adjusting how it makes those guesses.
- These steps happen over and over, and each time, the network gets better at making accurate predictions. For today, I implemented forward propagation only; backward propagation will be covered later in this specialization.
The picture shows how we can implement a neural network using only NumPy, although the code is quite lengthy. We don't do it this way at production level, where frameworks such as TensorFlow and PyTorch make our work easier. But to be a good machine learning engineer, it is good practice to implement an algorithm from scratch at least once, because that helps with debugging.
To make things easier, I have included the code below:
```python
import numpy as np

# Let's define the sigmoid function
def g(z):
    return 1 / (1 + np.exp(-z))

# Also define a dense (fully connected) layer
def dense(a_in, W, b):
    units = W.shape[1]          # one column of W per unit
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:, j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)         # store the activation of unit j
    return a_out

# Definition of the sequential function
# (W1..W4 and b1..b4 are assumed to be defined elsewhere)
def sequential(x):
    a1 = dense(x, W1, b1)
    a2 = dense(a1, W2, b2)
    a3 = dense(a2, W3, b3)
    a4 = dense(a3, W4, b4)
    f_x = a4
    return f_x
```

I used to feel that AI was already superior, but the material above broadened my thinking. AGI stands for Artificial General Intelligence, which aims at something as complex as the human brain. Today's AI focuses on solving one particular problem, whereas AGI would generalize, i.e. solve any type of problem the way a human can. This concept fascinates me, and I will learn more about it in the upcoming days.
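To check the shapes flowing through the scratch implementation above, here is a small self-contained usage sketch. It repeats `g` and `dense` so that it runs on its own; the weights `W1`, `b1`, `W2`, `b2` and the input `x` are made-up values, not parameters from the course.

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-z))

def dense(a_in, W, b):
    """One fully connected layer: column j of W holds the weights of unit j."""
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        a_out[j] = g(np.dot(W[:, j], a_in) + b[j])
    return a_out

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output
W1 = np.array([[1.0, -3.0, 5.0],
               [2.0, 4.0, -6.0]])
b1 = np.array([-1.0, 1.0, 2.0])
W2 = np.array([[0.5], [-0.5], [1.0]])
b2 = np.array([0.0])

x = np.array([0.0, 0.0])   # with x = 0, each hidden unit outputs g(b)
a1 = dense(x, W1, b1)      # shape (3,)
a2 = dense(a1, W2, b2)     # shape (1,)
```

Feeding an all-zero input is a convenient sanity check, because the weighted sum drops out and each hidden activation must equal the sigmoid of its bias.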
With this, I completed week 1 of the Advanced Learning Algorithms course. From tomorrow onwards, I will move on to week 2 of the learning journey. Stay tuned!
In this specialization, I dived into week 2 with the full implementation of a neural network in TensorFlow. Taking the digit classification problem again, today's update covers three systematic steps.
Here, we define a model and import all the TensorFlow dependencies, as mentioned in the code above. `Sequential` and `Dense` from Keras make the program efficient to write and run.
The loss depends on the type of problem. When predicting numbers (regression), the loss is the mean squared error. For binary classification, binary cross-entropy (also known as logistic loss) measures how "wrong" or "right" the model's prediction was compared with the actual truth (for example, whether an email is actually spam or not).
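The two losses mentioned above can be sketched in plain Python. This is a minimal illustration with made-up numbers; it averages over the examples with a factor of 1/m (the course's cost function uses 1/(2m) for squared error, which only rescales the value).

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred):
    """Average logistic loss; y_pred holds probabilities strictly between 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Regression example: the predictions are off by 1 and by 2
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])   # (1 + 4) / 2 = 2.5

# Spam example (label 1 = spam): a confident correct guess costs little,
# a confident wrong guess costs a lot.
good = binary_cross_entropy([1], [0.9])
bad = binary_cross_entropy([1], [0.1])
```

The gap between `good` and `bad` shows why cross-entropy suits classification: it punishes confident mistakes far more than mild ones.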
We can visualize the cost function J(w) over the weights w and find its minimum using partial derivatives (gradient descent). The learning rate alpha should be chosen so that each update to the weights and biases is small enough for the cost to keep decreasing. Epochs is the number of steps in gradient descent.
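As a minimal sketch of gradient descent, the loop below minimizes a toy cost with a single weight. The cost J(w) = (w - 3)^2, the learning rate, and the starting point are assumptions chosen so the arithmetic is easy to follow.

```python
def cost(w):
    """Toy cost J(w) = (w - 3)^2, minimized at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative dJ/dw of the toy cost."""
    return 2.0 * (w - 3.0)

alpha = 0.1      # learning rate (assumed value)
epochs = 100     # number of gradient descent steps
w = 0.0          # initial guess

history = []
for _ in range(epochs):
    history.append(cost(w))
    w = w - alpha * gradient(w)   # step downhill along the gradient
```

With this learning rate the cost recorded in `history` shrinks at every step and `w` ends up very close to 3; a much larger `alpha` would overshoot the minimum instead.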
Activation functions play a crucial role when modeling a neural network and are at the heart of neural network models. Today, I dived into some of the important types of activation functions. They are explained below:
- The linear activation function, g(z) = z, is the simplest activation. Because the output equals the input, some people describe it as using no activation function at all. It is useful when the predicted number can be either negative or positive.
- ReLU (Rectified Linear Unit) is another very common activation function, defined as `ReLU(z) = max(0, z)`. For example, when predicting a house price we can use ReLU because a price can't be negative. It is very efficient because of its simplicity and is mostly used for the hidden layers.
- Sigmoid is another popular activation function, mainly used for binary classification problems. Its output is a continuous value between 0 and 1. However, it is not suitable for multi-class classification problems such as handwritten digit recognition. For multi-class classification we use the softmax activation function, which is the generalized form of the sigmoid function. A comparison between logistic regression and softmax, including the difference in cost, is shown in the images below.
Multi-class classification is a type of machine learning problem where the goal is to categorize data into one of more than two possible classes. For example, classifying an image as either a dog, cat, or bird is a multiclass problem because there are multiple categories (classes) to choose from.
Softmax is a mathematical function that helps solve this by turning the output of a model (raw scores, also called logits) into probabilities. It does this by exponentiating the scores and then normalizing them so that all the probabilities sum to 1. This way, each class has a probability, and the class with the highest probability is chosen as the model's prediction.
In short: Softmax helps in multiclass classification by converting model outputs into understandable probabilities for each class. The class with the highest probability is selected as the predicted class.
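The description above translates almost directly into code. This is a minimal sketch in plain Python; the class names and logits are made-up values for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["dog", "cat", "bird"]
logits = [2.0, 1.0, 0.1]          # hypothetical model outputs
probs = softmax(logits)
predicted = classes[probs.index(max(probs))]   # class with the highest probability
```

Because exponentiation preserves order, the largest logit always receives the largest probability; softmax changes the scale of the scores, not their ranking.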
The image above shows the full implementation of a neural network on the MNIST dataset using softmax. However, Andrew Ng doesn't recommend using this version because of numerical round-off errors in the computed values. Instead, use it with certain modifications, some of which are highlighted in the image below.
The changes are: the output layer's activation becomes linear, and the loss is `SparseCategoricalCrossentropy` with `from_logits=True`, so the softmax is folded into the loss computation instead of being applied in the output layer. These modifications give more numerically accurate results.
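To see why computing the loss directly from the logits is more accurate, here is a sketch of the general idea in plain Python. It is not TensorFlow's actual implementation; the function names and the extreme logits are assumptions chosen to make the numerical problem visible.

```python
import math

def naive_loss(logits, label):
    """Softmax first, then -log(probability of the true class)."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))

def loss_from_logits(logits, label):
    """Same loss rearranged as log-sum-exp minus the true logit."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

small = [2.0, 1.0, 0.1]
large = [1000.0, 0.0, -1000.0]   # math.exp(1000.0) overflows

stable = loss_from_logits(large, 1)   # works even for extreme logits
try:
    naive_loss(large, 1)
    naive_failed = False
except OverflowError:
    naive_failed = True
```

For moderate logits the two functions agree, but only the rearranged version survives extreme values, which is the kind of freedom a framework gains when it receives logits rather than probabilities.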
In a Dense layer, each neuron's output is a function of all the activation outputs of the previous layer. In a convolutional layer, by contrast, each neuron only looks at part of the previous layer's output. This has several advantages:
- Faster computation
- Need less training data
- Less prone to overfitting
Gradient descent requires the derivative of the cost with respect to each parameter in the network. Neural networks can have millions or even billions of parameters. The back propagation algorithm is used to compute those derivatives. Computation graphs are used to simplify the operation.
A computation graph simplifies the computation of complex derivatives by breaking them into smaller steps. If you want to look deeper into backward propagation, the computation graph gives you better intuition.
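A tiny computation graph can be worked through by hand. The sketch below uses a single neuron with a squared-error cost and made-up values for `w`, `b`, `x`, and `y`; each forward step becomes one node, and the backward pass applies the chain rule node by node.

```python
# Forward pass: break J = 0.5 * (w*x + b - y)^2 into small steps
w, b = 2.0, 8.0          # parameters (illustrative values)
x, y = -2.0, 2.0         # one training example

c = w * x                # c = -4
a = c + b                # a = 4
d = a - y                # d = 2
J = 0.5 * d ** 2         # J = 2

# Backward pass: apply the chain rule from right to left
dJ_dd = d                # derivative of 0.5 * d^2 with respect to d
dJ_da = dJ_dd * 1.0      # because d = a - y
dJ_dc = dJ_da * 1.0      # because a = c + b
dJ_db = dJ_da * 1.0      # because a = c + b
dJ_dw = dJ_dc * x        # because c = w * x

# Numerical check: nudge w slightly and see how much J changes
eps = 1e-6
J_nudged = 0.5 * ((w + eps) * x + b - y) ** 2
numeric_dJ_dw = (J_nudged - J) / eps
```

Each backward line reuses the derivative computed on the line before it, which is what lets backpropagation handle very large networks without recomputing every path through the graph.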
In week 2, I learned about the different types of activation functions, training details, multi-class classification, and the details of backward propagation. From tomorrow onward, I will move on to week 3 of Coursera's Advanced Learning Algorithms course.
I started week 3 today with a short session and learned how to evaluate a model. With imbalanced datasets, it is very difficult for a model to generalize to new data, even when we achieve low bias during training. Cross-validation techniques such as K-fold help us detect and reduce overfitting.
Diagnostic: a test that you run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance on improving its performance.
In the picture above, note that the mean squared error used to evaluate the model does not include the regularization term.
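Written out, the evaluation errors look like the following. This is a sketch assuming the course's squared-error convention with a 1/(2m) factor; the point is that neither expression contains the regularization term, even if the model was trained with one.

```latex
J_{train}(\vec{w}, b) = \frac{1}{2 m_{train}} \sum_{i=1}^{m_{train}}
  \left( f_{\vec{w},b}\left(\vec{x}_{train}^{(i)}\right) - y_{train}^{(i)} \right)^2,
\qquad
J_{test}(\vec{w}, b) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}}
  \left( f_{\vec{w},b}\left(\vec{x}_{test}^{(i)}\right) - y_{test}^{(i)} \right)^2
```

Regularization is a tool for fitting the parameters; when measuring how well the fitted model predicts, only the prediction error itself is of interest.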
All the details are shown in the picture above and are easy to follow.
Today, I learned about bias and variance, a crucial concept in the machine learning cycle. In machine learning, bias and variance are two sources of error that affect how well your model performs. The goal is to find the right balance.
- Too much bias: Your model is underfitting and missing important patterns.
- Too much variance: Your model is overfitting and capturing noise instead of the true signal.
By understanding bias and variance, you can tweak your model to make it more accurate and reliable!
Today, I delved into a fundamental that every ML engineer should know: how bias and variance change as we vary the regularization parameter lambda.
- In the figure above, CV stands for cross-validation, a sampling technique that helps estimate how well the model will perform on an independent dataset.
- The figure also shows two plots that are roughly mirror images of each other. They show how the training and CV losses behave as we vary lambda and as we vary the degree of the polynomial.
- When lambda is large, the algorithm is highly motivated to keep the parameters `w` very small, so `w1`, `w2`, and really all the parameters end up very close to zero. The model then underfits (high bias).
- Similarly, when lambda is very small, there is effectively no regularization term, so we are just fitting the fourth-order polynomial and end up with the curve shown in the picture above. The model then has high variance (overfits) because it fits all the training data and fails to generalize to new data.
- The middle of the picture shows the case that generalizes best, with an intermediate value of lambda, where both the CV loss and the training loss are small.
Learning curves help in diagnosing underfitting and overfitting in a machine learning model. A learning curve is a graphical representation of how a model's performance changes as it learns from more data or as training progresses. Let's look at the curves for the high bias and high variance cases:
- High bias
- High variance
- A high training loss indicates the model doesn't fit the data correctly.
- A high testing loss indicates the model doesn't generalize well.
Someone once said that, after a lot of work experience at a few different companies, he realized that bias and variance is one of those concepts that takes a short time to learn but a lifetime to master.
Above, we had to balance model complexity, that is, the degree of the polynomial or the regularization parameter lambda. But it turns out that neural networks offer a way out of this bias-variance tradeoff dilemma, with some caveats.
- When you increase the complexity of a neural network (e.g., adding more layers or neurons), bias typically decreases (the model becomes more capable of capturing complex patterns), but variance increases (the model becomes more prone to overfitting).
- When you decrease the complexity of a neural network, bias increases (the model becomes too simple to capture complex patterns), but variance decreases (the model becomes less sensitive to the training data and more robust to noise).
- Large neural networks are low bias machines. The images above illustrate the ideas behind a neural network's bias and variance.
The iterative nature of machine learning development revolves around trial and error, continuous learning, and improvement. Through experimentation, error analysis, and constant refinements, machine learning models can be progressively optimized. This iterative approach, combined with solid data and model evaluation, is central to building successful machine learning applications.
Beyond collecting brand new training examples (x, y), there is another technique, widely used especially for image and audio data, that can increase your training set size significantly: data augmentation. Creating additional examples like this helps the learning algorithm do a better job of learning how to recognize the letter A in the figure above. I learned about some important topics related to data augmentation:
- Data augmentation by introducing distortions.
- Data augmentation for speech recognition.
Today, I learned about synthetic data. Synthetic data is artificial data created to improve the performance of the model; basically, artificial inputs are used to create new training examples. All other learning updates are mentioned below:
Most machine learning researchers' attention has been on the conventional model-centric approach. See the image below for reference:
A machine learning system or AI system includes both the code that implements your model and the data that you train it on. Over the last few decades, most researchers doing machine learning research would download a dataset and hold the data fixed while they focused on improving the code, the algorithm, or the model. Sometimes it can be more fruitful to spend more of your time taking a data-centric approach, in which you focus on engineering the data used by your algorithm.
The figure illustrates the full cycle of a machine learning project, which consists of four key stages:
- Scope Project (Define project):
  - This step involves defining the problem you are trying to solve and determining the project's objectives.
  - Key questions addressed here include: What is the goal of the project? What business or research problem are you solving? What will success look like?
- Collect Data (Define and collect data):
  - In this stage, relevant data is identified, collected, and prepared for use.
  - Data may be sourced from databases, APIs, manual collection, or other means. Data cleaning and preprocessing (e.g., handling missing values, normalization) are also part of this step.
- Train Model (Training, error analysis & iterative improvement):
  - The collected data is used to train machine learning models.
  - Error analysis is performed to evaluate model performance and identify potential improvements.
  - Iterative refinement and tuning (e.g., hyperparameter optimization, feature engineering) take place to achieve better accuracy and reliability.
- Deploy in Production (Deploy, monitor, and maintain system):
  - Once the model is ready, it is deployed into a production environment where it serves real-world tasks.
  - Monitoring and maintaining the deployed model is essential to ensure consistent performance, including adapting to changes in data patterns or requirements.
The arrows between the stages indicate that this cycle is iterative—you may need to revisit earlier steps based on results or new requirements (e.g., refining the project scope, collecting additional data, or retraining the model). This ensures continuous improvement and adaptation to changing conditions.
Model deployment is the process of integrating a trained machine learning model into a production environment where it can deliver predictions or decisions in real-world applications. The image below shows the details of deployment:
In general terms, an API (Application Programming Interface) acts as a mediator between the inference server and the mobile app or website. At larger scale, software engineering is needed to meet the following objectives:
- Ensure reliable and efficient predictions.
- Scaling
- Logging
- System monitoring
- Model updates
When a dataset is imbalanced (skewed), accuracy fails to give a proper picture. For example, if 95% of the examples are labeled positive and only 5% are labeled negative, the dataset is imbalanced. In that case, we should move on to other metrics that are more informative for imbalanced datasets. Use metrics that give you a clearer picture, like:
- Confusion Matrix: A table that shows how many predictions were correct, and where the model made mistakes.
- Precision: How many of the predicted positives were correct?
- Recall: How many of the actual positives did the model catch?
- F1-Score: A balance of precision and recall. The image below makes this crystal clear.
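The metrics listed above can be computed by hand from the four cells of the confusion matrix. This is a minimal sketch in plain Python with made-up labels for a skewed dataset; the helper name `confusion_counts` is an assumption for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up skewed data: only 2 of 10 examples are positive
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)                   # looks good: 0.8
precision = tp / (tp + fp)                           # 1 / 2 = 0.5
recall = tp / (tp + fn)                              # 1 / 2 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.5
```

Accuracy looks high here only because negatives dominate; precision, recall, and F1 reveal that the model finds just half of the positives.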

These metrics help you understand the real performance of your model beyond accuracy alone. With this, I completed week 3 as well, and I will move on to the week 4 material from tomorrow onwards.
Today, I moved on to week 4 of Coursera's Advanced Learning Algorithms and learned about decision trees. A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It is a tree-like structure where each internal node represents a decision based on a feature, each branch represents an outcome of the decision, and each leaf node represents a final prediction (a class label for classification or a numerical value for regression).
There are two important decisions we must make when building one:
- Decision 1: How to choose what features to split on at each node?
- Decision 2: When do you stop splitting?

I also revisited neural network concepts and the practice lab session. I will go deeper into decision trees from tomorrow.
Note:
Programmatically speaking, a decision tree is nothing but a giant structure of nested if-else conditions.
Mathematically speaking, a decision tree uses hyperplanes that run parallel to one of the axes to cut the coordinate system into hypercuboids.
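The nested if-else view can be made literal. Below is a hand-written sketch of a small tree for a cat vs. not-cat example; the features, their values, and the split order are assumptions for illustration, not a tree learned from data.

```python
def is_cat(ear_shape, face_shape, whiskers):
    """A hand-written tree: each `if` is a decision node, each `return` a leaf."""
    if ear_shape == "pointy":
        if face_shape == "round":
            return True
        else:
            return False
    else:
        if whiskers == "present":
            return True
        else:
            return False

prediction = is_cat("pointy", "round", "absent")
```

A learning algorithm does the same thing automatically: it chooses which feature to test at each `if` and decides where to stop and place a `return`.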
Today, I dived into the important concepts that we have to consider while building a decision tree model.
Basically, entropy is a measure of disorder or impurity. As you can see, the entropy curve opens downwards like an inverted parabola and reaches its maximum value of H(p1 = 0.5) = 1 in the middle. Intuitively, the data is most mixed in the middle. Entropy is lowest at p1 = 0.0 and p1 = 1.0, which means only one category is present. Here, p1 denotes the fraction of examples that are positive, where positive means the class you are trying to infer.
We calculate the entropy for each branch of a split. On its own, though, that doesn't tell us which feature to choose for better predictions, so this is where information gain comes into the picture. Information gain measures the reduction in entropy: the weighted average entropy of the branches is subtracted from the entropy of the parent node. The lower the entropy after a split, the higher the information gain. Information gain can be thought of as knowledge: the more mixed the data remains after a split, the less we know about our dataset.
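Entropy and information gain can be sketched in a few lines of plain Python. The labels below are made up (1 = positive, 0 = negative), and the helper names are assumptions for illustration; the sketch also assumes neither branch of the split is empty.

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits; p1 is the fraction of positive examples."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0          # a pure node has no disorder
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two branches.

    Each argument is a non-empty list of 0/1 labels.
    """
    def frac_pos(labels):
        return sum(labels) / len(labels)

    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    children = w_left * entropy(frac_pos(left)) + w_right * entropy(frac_pos(right))
    return entropy(frac_pos(parent)) - children

# Made-up node with 5 positives out of 10, split into two groups of 5
left = [1, 1, 1, 1, 0]     # 4 of 5 positive
right = [1, 0, 0, 0, 0]    # 1 of 5 positive
gain = information_gain(left + right, left, right)   # roughly 0.28
```

To choose a feature, the same calculation is repeated for every candidate split and the one with the largest gain wins.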
I also explored one-hot encoding, which plays a vital role when a categorical feature takes more than two discrete values. Most machine learning algorithms cannot work on categorical data directly, so this technique converts nominal data into binary (0/1) features, one per category, so that each value can be distinguished easily.
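Here is a minimal sketch of one-hot encoding in plain Python. The feature values are made up, and the helper `one_hot` is an assumption for illustration rather than a library function.

```python
def one_hot(values):
    """Map each category to a list of 0/1 flags, one flag per category."""
    categories = sorted(set(values))            # fixed column order
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

ear_shapes = ["pointy", "floppy", "oval", "pointy"]   # made-up feature
columns, rows = one_hot(ear_shapes)
# columns -> ['floppy', 'oval', 'pointy']
# rows    -> [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row contains exactly one 1, which is where the name comes from; a feature with k categories becomes k binary features.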
I learned about tree ensembles. If we already have the decision tree algorithm, why do we need an ensemble? A single decision tree is highly sensitive to small changes in the data. Tree ensembles address this limitation: an ensemble is simply a collection of multiple decision trees, which makes the overall model less sensitive and more robust. The key idea is that an ensemble of trees works better than an individual tree by averaging their predictions (in regression) or using a majority vote (in classification).
Random Forest is a powerful tree ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
Key Features of Random Forest:
- Uses bagging (Bootstrap Aggregation): Each tree is trained on a random sample of the data drawn with replacement.
- Uses feature randomness: Each split in a tree considers a random subset of features.
- Final prediction is made by majority voting (classification) or averaging (regression).
- Handles missing data and high-dimensional datasets well.
- Reduces overfitting compared to a single decision tree.
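Two of the ideas above, bootstrap sampling and majority voting, can be sketched without any machine learning library. The data, the seed, and the pretend votes are made up for illustration; no actual trees are trained here.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (some repeat, some are left out)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)              # fixed seed so the run is repeatable
data = list(range(10))              # stand-in for 10 training examples
samples = [bootstrap_sample(data, rng) for _ in range(3)]   # one sample per tree

# Pretend three trees voted on one example
label = majority_vote(["cat", "not cat", "cat"])
```

Because every tree sees a slightly different sample, the trees make different mistakes, and voting tends to cancel those mistakes out.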
XGBoost (eXtreme Gradient Boosting) is a high-performance, scalable tree-based algorithm that improves upon traditional Gradient Boosting by being faster, more accurate, and optimized for large datasets.
Key Advantages of XGBoost
- Fast and efficient (optimized for speed and memory).
- Prevents overfitting (L1 & L2 regularization).
- Handles missing values and large datasets well.
- Works for both classification & regression.
When to use decision tree and neural network
Everything is summarized in the images below:
After learning all of this, I completed part 2 of the Machine Learning Specialization on Coursera and earned the certificate for the course. Here is the link to my accomplishment: Completion of Advanced Learning Algorithms.
Today, I worked on optimizing Random Forest for crop prediction while improving training speed. Initially, GridSearchCV took a long time to train, so I applied the following optimizations:
- Reduced Search Space – Limited the range of hyperparameters to focus on the most impactful ones.
- Lowered n_estimators – Used 50–150 trees instead of 500 to reduce computation.
- Limited max_depth – Set maximum depth to 10–20 to prevent overfitting and speed up training.
- Decreased n_iter – Reduced the search to 10 iterations. Note that `n_iter` is a parameter of `RandomizedSearchCV`; `GridSearchCV` always tries every combination in the grid.
- Used n_jobs=-1 – Leveraged all CPU cores for parallel processing.
- Lowered Cross-Validation (cv=3) – Reduced the number of folds to minimize training time.
These improvements significantly reduced training time by 50-70% while still achieving good accuracy.
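To get a feel for why a smaller search space and fewer folds save so much time, the sketch below simply counts how many model fits an exhaustive grid search would need. The grids are hypothetical, not the ones from the notebook, and `count_fits` is a helper written for this illustration.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Number of model fits for an exhaustive grid search with cv folds."""
    combos = list(product(*param_grid.values()))
    return len(combos) * cv

# Hypothetical grids for illustration only
large_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
}
small_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

before = count_fits(large_grid, cv=5)   # 5 * 5 * 3 * 5 = 375 fits
after = count_fits(small_grid, cv=3)    # 3 * 2 * 2 * 3 = 36 fits
```

The number of fits grows multiplicatively with every hyperparameter, so trimming a few values from each list shrinks the total much faster than it might seem.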
Here is a snippet of my code:
See the code and key points in my notebook: Notebook For Random Forest. I downloaded the dataset from Kaggle.
I've started a new topic that goes beyond supervised learning. The course, designed by Coursera co-founder Andrew Ng, is part of the Machine Learning Specialization: Unsupervised Learning, Recommender Systems, Reinforcement Learning. It is divided into three weeks:
- Unsupervised Learning
  - Clustering
  - Anomaly detection
- Recommender Systems
- Reinforcement Learning
I started with unsupervised learning. Unsupervised learning is a type of machine learning where a model is trained on unlabeled data to discover hidden patterns, structures, or relationships. Unlike supervised learning, there are no predefined labels or target variables; the algorithm must infer the structure of the data on its own.
- No labeled data: The model learns patterns without explicit supervision.
- Finds hidden structures: Clusters similar data points or reduces dimensionality.
- Exploratory: Used for understanding data distributions and trends.
- Commonly used for feature engineering: Helps create meaningful features for supervised learning.
Clustering is a fundamental technique in unsupervised learning, where the goal is to group data points into clusters based on their similarities. Since clustering is unsupervised, there are no labeled outputs; the algorithm discovers inherent patterns within the dataset.
K-Means is a method used to group similar data points together into K clusters. It works by:
- Picking K random points as starting centers (centroids).
- Assigning each data point to the nearest centroid.
- Updating the centroids by taking the average position of the points in each cluster.
- Repeating the process until the clusters stop changing.
- Grouping similar news
- DNA analysis
- Astronomical data analysis
In K-means clustering, the optimization process revolves around minimizing a cost function (the distortion) that measures how well the clusters represent the data. The standard cost function for K-Means is the sum of squared distances (SSD) between each data point and its assigned cluster center. This cost function, also called inertia or within-cluster sum of squares (WCSS), is defined in the images below:
- The objective of K-Means is to minimize this function by iteratively updating cluster assignments and centroids.
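For reference, the distortion can be written as follows. This sketch assumes the course's notation, where `c^(i)` is the index of the cluster assigned to example `x^(i)`, `mu_k` is the centroid of cluster `k`, and the sum is averaged over the `m` examples (inertia or WCSS is the same sum without the 1/m factor).

```latex
J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right)
  = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2
```

The assignment step lowers J by changing the `c^(i)` while the centroids stay fixed, and the update step lowers it by moving the centroids while the assignments stay fixed, so the cost never increases from one iteration to the next.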
The Elbow Method is a simple way to find the best number of clusters in K-Means. It works by measuring how much the data points in each cluster differ from their assigned center (called a centroid). The idea is:
- If you use too few clusters, the points within a cluster will be very far apart, meaning the groups aren’t well-defined.
- If you use too many clusters, each cluster will only have a few points, which isn’t useful.
To find the best number of clusters (k), we calculate a cost function called WCSS (Within-Cluster Sum of Squares).
In the figure above, the curve stops bending sharply after K = 3 (the elbow), so we choose three clusters.
K-Means clustering is available in the open-source library scikit-learn. But to understand how it works in depth, it helps to implement the algorithm from scratch. I did use scikit-learn, though, to create a clustering dataset.
In this code, I created 100 data points using the `make_blobs` function. The number of clusters is two, and the algorithm iterates up to 100 times to arrive at accurate cluster centroids. I designed a class named `KMeans`, whose constructor takes `n_centroids` and `max_iter` and initially sets the centroids to `None`. Inside the `KMeans` class, I defined three functions: `fit_predict`, `assign_cluster`, and `move_cluster`. `fit_predict` is the main function where all the training happens; the other two are helper functions.
- assign_cluster: In this function, I compute the Euclidean distance from each data point to each cluster centroid, find the minimum distance, and append the index of the nearest centroid to a list named `cluster_group`. The `cluster_group` list consists of 100 elements. Finally, the function returns the `cluster_group` list.
- move_cluster: This function takes all the data points and the `cluster_group` list that we got from the `assign_cluster` function. I separate the records by their assigned centroid, then calculate the mean of each group, assign it as the new centroid, and return the new centroids.
- If the old centroids and the current centroids remain the same, you can stop iterating. With these brief ideas, we can complete our scratch implementation of KMeans.
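The description above can be reconstructed roughly as follows. This is a sketch in plain Python that mirrors the names from the text (`KMeans`, `n_centroids`, `max_iter`, `fit_predict`, `assign_cluster`, `move_cluster`, `cluster_group`); it is not the code from the notebook. The `seed` argument, the empty-cluster handling, and the six sample points are assumptions added for illustration.

```python
import math
import random

class KMeans:
    def __init__(self, n_centroids=2, max_iter=100, seed=0):
        self.n_centroids = n_centroids
        self.max_iter = max_iter
        self.centroids = None
        self.rng = random.Random(seed)   # assumed: fixed seed for repeatable runs

    def fit_predict(self, X):
        # Start from n_centroids distinct data points chosen at random
        self.centroids = self.rng.sample(X, self.n_centroids)
        for _ in range(self.max_iter):
            cluster_group = self.assign_cluster(X)
            new_centroids = self.move_cluster(X, cluster_group)
            if new_centroids == self.centroids:   # nothing moved: converged
                break
            self.centroids = new_centroids
        return cluster_group

    def assign_cluster(self, X):
        # Index of the nearest centroid (Euclidean distance) for each point
        cluster_group = []
        for point in X:
            distances = [math.dist(point, c) for c in self.centroids]
            cluster_group.append(distances.index(min(distances)))
        return cluster_group

    def move_cluster(self, X, cluster_group):
        # New centroid = mean of the points assigned to each cluster
        new_centroids = []
        for k in range(self.n_centroids):
            members = [p for p, g in zip(X, cluster_group) if g == k]
            if not members:                       # keep an empty cluster's centroid
                new_centroids.append(self.centroids[k])
                continue
            dims = len(members[0])
            new_centroids.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
            )
        return new_centroids

# Two made-up, well-separated groups of 2-D points
X = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = KMeans(n_centroids=2, max_iter=100).fit_predict(X)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.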
The images below show how I came up with the idea.

Output is shown below:
Resources: CampusX - KMeans Clustering Algorithm From Scratch in Python
Anomaly detection is usually treated as an unsupervised learning problem because anomalies are rare and labeled data is often unavailable. However, it can also be approached with supervised or semi-supervised learning, depending on the availability of labeled data.
In other words, anomaly detection refers to the process of identifying data points, events, or patterns that significantly deviate from the normal or expected behavior in a dataset. These anomalies, also known as outliers, can indicate critical insights, such as fraud, system failures, cyber intrusions, or rare events. Some examples are shown in the images below.
Before diving deeper into anomaly detection, we should first understand density estimation. Studying density estimation first helps build intuition about what is normal, how anomalies stand out, and how models can identify them.
- Density estimation is the process of determining a probability distribution from a given dataset. It aims to estimate the probability density function (PDF) of a continuous random variable. Non-parametric methods do this without assuming a predefined distribution, while parametric methods assume a distribution family and estimate its parameters. So, let's discuss the well-known normal distribution below, which is an example of parametric density estimation.
The normal distribution is one of the most important probability distributions we'll learn about, since a countless number of statistical methods rely on it. It applies to more real-world situations than other distributions. Its shape is a bell curve, as shown in the picture above.
Important properties
- It's symmetrical, so the left side is a mirror image of the right.
- The area beneath the curve is 1 (Area = 1).
- The probability never hits 0, even if it looks like it does at the tail ends.
- It is described by its mean and standard deviation.
When a normal distribution has a mean of 0 and a standard deviation of 1, it is a special distribution called the standard normal distribution, as illustrated in the image below.
Above, if the mean changes, the curve shifts along the axis. Increasing the standard deviation flattens the curve, and decreasing it squeezes the curve. With these statistical ideas we are good to go with the anomaly detection algorithm. Tomorrow, I'll explore the algorithm of anomaly detection.
The Gaussian distribution is the crucial part of this algorithm. We fit a Gaussian to the data and choose a threshold called epsilon (ε). If the estimated probability density p(x) of a data point is less than ε, the algorithm flags it as an anomaly; if p(x) is greater than or equal to ε, the point is considered normal (non-anomalous). The model is fitted on the normal examples, because the dataset is heavily skewed and anomalies are too rare to learn from. Anomaly detection is also easy to confuse with supervised learning. As a rule of thumb, if the dataset splits roughly 99:1 between normal and anomalous examples, the problem suits anomaly detection. If the dataset splits into nearly equal proportions, we should choose a supervised classification model. The image below gives the more mathematical, step-by-step algorithm:
Applying the algorithm to two features gives the result shown in the image below:
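The thresholding rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the features are treated as independent Gaussians, the training data is synthetic, and the value of `epsilon` is hypothetical (in practice it would be tuned on a labeled validation set).

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the per-feature mean and variance from normal examples."""
    return X.mean(axis=0), X.var(axis=0)

def density(X, mu, var):
    """p(x) as the product of per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * var))
    return np.prod(coef * expo, axis=1)

# Hypothetical training set of normal (non-anomalous) examples, two features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, var = fit_gaussian(X_train)

epsilon = 1e-4  # hypothetical threshold
X_new = np.array([[10.0, 5.0],   # close to the mean
                  [16.0, 9.0]])  # far from the mean in both features
p = density(X_new, mu, var)
is_anomaly = p < epsilon
print(p, is_anomaly)
```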
However, real-world features rarely follow a well-known distribution such as the Gaussian exactly. We can often bring a feature closer to that distribution with a feature transformation.
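As an illustration of such a transformation, the sketch below applies a log transform to a right-skewed feature and compares the skewness before and after. The log-normal data and the choice of `np.log` are assumptions for this example; other transforms (for instance a square root) may suit other features better.

```python
import numpy as np

def skewness(x):
    """Sample skewness: about 0 for a symmetric, Gaussian-like feature."""
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / x.std() ** 3)

# Hypothetical right-skewed feature (log-normally distributed).
rng = np.random.default_rng(1)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

transformed = np.log(feature)  # assumed transform; requires positive values

skew_before = skewness(feature)
skew_after = skewness(transformed)
print(round(skew_before, 2), round(skew_after, 2))
```

A skewness close to zero after the transformation suggests that the feature is now closer to the bell shape the algorithm expects.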

Yesterday, I completed week 1 of the last part of the Machine Learning Specialization. Today, I explored recommender systems and learned about collaborative filtering, the regularization term and gradient descent in recommender systems.
Collaborative Filtering is a technique used in recommendation systems that predicts user preferences by analyzing past interactions and similarities between users or items. It assumes that similar users or items will have similar preferences.
The images below show a movie recommendation system and its from-scratch implementation. The mathematical formulation closely resembles linear regression.

The image above shows the notation used for movie recommendation.
The cost function includes a regularization term, the same as in ridge (L2-regularized) linear regression.
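For reference, the regularized collaborative filtering cost is commonly written as below. The notation is assumed from the course, since the image is not reproduced here: $w^{(j)}, b^{(j)}$ are the parameters of user $j$, $x^{(i)}$ is the feature vector of movie $i$, $y^{(i,j)}$ is the rating, $r(i,j)=1$ marks a known rating, $n_u$ and $n_m$ are the numbers of users and movies, $n$ is the number of features and $\lambda$ is the regularization strength.

$$
J = \frac{1}{2}\sum_{(i,j):\,r(i,j)=1}\left(w^{(j)}\cdot x^{(i)} + b^{(j)} - y^{(i,j)}\right)^2
+ \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(w_k^{(j)}\right)^2
+ \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2
$$

The first term is the squared error over known ratings only, and the last two terms are the ridge-style penalties on the user parameters and the movie features.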
In recommendation systems, Collaborative Filtering works by predicting user preferences based on the preferences of similar users. However, different users have different rating scales. Mean normalization helps remove these biases and improve similarity calculations.

Why Use Mean Normalization?
- Different users rate items on different scales.
- Some users may give high ratings to all movies, while others rate more conservatively.
- Mean normalization removes user-specific biases, making it easier to compare users and items.
- Helps compute similarity metrics (cosine similarity, Pearson correlation) more accurately.
- Useful in large-scale recommendation systems such as those of Netflix, Amazon, and YouTube.
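The points above can be sketched with a small ratings matrix. This is an illustrative example only: the matrix layout (rows are users, columns are items), the mask `R` and the function name `mean_normalize` are assumptions. The course's version subtracts the per-movie mean instead; that variant is obtained by averaging over the other axis.

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each user's mean rating, computed over rated items only.

    Y: ratings, shape (n_users, n_items); R: 1 where a rating exists, else 0.
    """
    counts = R.sum(axis=1, keepdims=True)
    means = (Y * R).sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    Y_norm = (Y - means) * R  # unrated entries stay at 0
    return Y_norm, means

# Hypothetical ratings: user 0 rates generously, user 1 conservatively.
Y = np.array([[5.0, 5.0, 4.0, 0.0],
              [2.0, 1.0, 0.0, 3.0]])
R = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1]])

Y_norm, means = mean_normalize(Y, R)
print(means.ravel())
print(Y_norm)
```

After normalization, each user's known ratings are centered on zero, so a "5" from a generous user and a "3" from a strict user become comparable.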
Collaborative filtering is a key technique in recommendation systems, aiming to predict missing user-item interactions based on observed data. Andrew Ng, in the Machine Learning Specialization, explains its implementation using matrix factorization with gradient descent in TensorFlow. This approach represents users and items as low-dimensional latent vectors, where the dot product of these vectors approximates user ratings. The model optimizes embeddings by minimizing the Mean Squared Error (MSE) loss, focusing only on known ratings to enhance learning accuracy. Training is conducted using advanced optimizers like Adam, iteratively refining embeddings for better predictions. This method is widely applied in industry, including Netflix and Amazon, to personalize user experiences. While matrix factorization is effective, deep-learning-based models such as Neural Collaborative Filtering (NCF) offer further improvements by capturing complex user-item relationships.
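The idea of learning latent vectors from known ratings can be sketched in plain NumPy. Note the substitutions: this sketch uses ordinary gradient descent rather than TensorFlow with Adam, omits the bias terms, and uses a made-up ratings matrix. The function name `factorize` and all hyperparameter values are assumptions.

```python
import numpy as np

def factorize(Y, R, n_features=2, lam=0.1, lr=0.01, n_iter=2000, seed=0):
    """Learn user vectors W and item vectors X so that W @ X.T fits the
    known ratings (R == 1), with an L2 penalty on both."""
    rng = np.random.default_rng(seed)
    n_users, n_items = Y.shape
    W = rng.normal(scale=0.1, size=(n_users, n_features))
    X = rng.normal(scale=0.1, size=(n_items, n_features))
    losses = []
    for _ in range(n_iter):
        err = (W @ X.T - Y) * R  # error on known ratings only
        losses.append(0.5 * np.sum(err ** 2)
                      + 0.5 * lam * (np.sum(W ** 2) + np.sum(X ** 2)))
        grad_W = err @ X + lam * W
        grad_X = err.T @ W + lam * X
        W -= lr * grad_W
        X -= lr * grad_X
    return W, X, losses

# Hypothetical ratings (rows: users, columns: items); 0 means "not rated".
Y = np.array([[5.0, 4.0, 0.0, 1.0, 0.0],
              [4.0, 5.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 5.0, 4.0, 5.0],
              [0.0, 1.0, 4.0, 5.0, 4.0]])
R = (Y > 0).astype(float)

W, X, losses = factorize(Y, R)
predictions = W @ X.T  # includes estimates for the missing entries
print(round(losses[0], 2), round(losses[-1], 2))
```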
In recommendation systems, finding related items is an essential task for providing personalized suggestions to users. This typically involves identifying items that are similar to a given item, based on the preferences of users or the characteristics of the items themselves.
Finding related items in recommendation systems is a combination of calculating item similarity using either item features (content-based) or user interactions (collaborative filtering). Advanced methods like matrix factorization provide a powerful way to uncover latent patterns and improve the relevance of related item recommendations.
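A small sketch of the similarity step: given one feature vector per item (hand-written or learned by matrix factorization), related items are the ones with the highest cosine similarity. The feature values and function names below are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(F):
    """Pairwise cosine similarity between the rows of F."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    unit = F / np.maximum(norms, 1e-12)
    return unit @ unit.T

def most_similar(F, item, top_n=1):
    """Indices of the top_n items most similar to `item`, excluding itself."""
    sims = cosine_similarity_matrix(F)[item].copy()
    sims[item] = -np.inf
    return np.argsort(-sims)[:top_n]

# Hypothetical item features: items 0 and 1 are alike, items 2 and 3 are alike.
item_features = np.array([[1.0, 0.0],
                          [0.9, 0.1],
                          [0.0, 1.0],
                          [0.1, 0.9]])

related_to_0 = most_similar(item_features, item=0, top_n=1)
related_to_2 = most_similar(item_features, item=2, top_n=1)
print(related_to_0, related_to_2)
```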
Today, I dived into another approach to recommender systems. Let's explore the two distinct approaches:
Collaborative Filtering
- This approach recommends items based on the interactions and preferences of other users with similar tastes.
- It assumes that if two users have similar past behavior, they will likely prefer similar items in the future.
- It is divided into:
  - User-based Collaborative Filtering: Finds similar users and recommends items liked by them.
  - Item-based Collaborative Filtering: Finds similar items based on user interactions and suggests items similar to what the user has interacted with.
- Example: If User A and User B both like Movie X, and User B also likes Movie Y, then User A is recommended Movie Y.
Content-Based Filtering
- This approach recommends items based on the characteristics of the items and the preferences of the user.
- It relies on item features (e.g., genre, keywords, descriptions) and user profiles.
- It uses techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity to measure how similar items are to what a user has liked before.
- Example: If you watch many action movies, the system recommends other action movies with similar features.
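The TF-IDF and cosine similarity step mentioned above can be sketched without any external library beyond NumPy. The item descriptions are invented, and the IDF variant used here (`log(N / df) + 1`) is one of several common definitions, chosen as an assumption.

```python
import math
from collections import Counter

import numpy as np

def tfidf_matrix(docs):
    """Return a (n_docs, n_terms) TF-IDF matrix and the vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    matrix = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(len(docs) / doc_freq[term]) + 1.0  # assumed variant
            matrix[row, index[term]] = tf * idf
    return matrix, vocab

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical item descriptions; the user liked item 0.
descriptions = [
    "action car chase explosion",
    "action hero explosion fight",
    "romantic comedy wedding",
    "romantic drama love",
]
tfidf, vocab = tfidf_matrix(descriptions)

liked = 0
scores = {i: cosine(tfidf[liked], tfidf[i])
          for i in range(len(descriptions)) if i != liked}
recommended = max(scores, key=scores.get)
print(recommended, scores)
```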
Yesterday, we finished learning about collaborative filtering. Today, I explored content-based filtering. Let's dive deep:
Content-based filtering consists of two crucial steps: retrieval and ranking.
Step 1: Retrieval
The retrieval step focuses on selecting a subset of relevant items from a large pool of available content. Since searching through an entire catalog can be computationally expensive, this step efficiently filters down the dataset to a manageable size.
Step 2: Ranking
Once the retrieval step provides candidate items, the ranking step sorts these items in order of relevance to the user. A ranking algorithm determines which items are most relevant to the user’s interests.
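The two steps can be sketched as two small functions. This is a simplified illustration, not the course's implementation: retrieval here picks the items closest to one item the user liked, ranking scores the candidates with a dot product against a user vector, and all vectors and names are hypothetical.

```python
import numpy as np

def retrieve(liked_item, item_vecs, n_candidates):
    """Step 1: cheaply select the items nearest to a liked item."""
    dists = np.sum((item_vecs - item_vecs[liked_item]) ** 2, axis=1)
    dists[liked_item] = np.inf  # do not retrieve the liked item itself
    return np.argsort(dists)[:n_candidates]

def rank(user_vec, item_vecs, candidates):
    """Step 2: order the candidates by predicted relevance to the user."""
    scores = item_vecs[candidates] @ user_vec
    return candidates[np.argsort(-scores)]

# Hypothetical item and user embeddings.
item_vecs = np.array([[1.0, 0.0],
                      [0.9, 0.2],
                      [0.7, -0.1],
                      [0.0, 1.0],
                      [0.1, 0.9]])
user_vec = np.array([1.0, -2.0])

candidates = retrieve(liked_item=0, item_vecs=item_vecs, n_candidates=2)
ranked = rank(user_vec, item_vecs, candidates)
print(candidates, ranked)
```

Retrieval keeps the expensive scoring step small, while ranking is free to reorder the candidates: here item 2 is retrieved second but ranked first.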
Ethical Use of Recommender Systems:

Some goals of recommender systems:
These are applications of recommender systems. They help companies make a profit, but they should be used ethically.
Lastly, I learnt about implementing the content-based filtering approach using the TensorFlow framework.
I will discuss the implementation further in the coming days.
Principal Component Analysis (PCA) is a dimensionality reduction technique used in data science and machine learning. It transforms a high-dimensional dataset into a lower-dimensional space while preserving as much variance (information) as possible.
If we draw a scatter plot of the data, we can look, by trial and error, for an axis that captures most of the spread of the data. PCA does this systematically: it finds the axes along which the data has the largest variance, so that a few dimensions retain most of the information in the original data.
Let's look at the images below, which show the scikit-learn implementation of the PCA algorithm:
Above, the 2-dimensional dataset is converted into 1-dimensional data that retains a 0.992 ratio of the variance of the original data.
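The same reduction from two dimensions to one can be sketched with NumPy's SVD. This is not the scikit-learn code from the images: the data is synthetic and the helper `pca` is a hypothetical name. The returned ratio corresponds to what scikit-learn's `PCA` reports as `explained_variance_ratio_`.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data, the component directions and the fraction
    of the total variance captured by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    variances = singular_values ** 2 / (len(X) - 1)
    ratio = variances / variances.sum()
    return X_centered @ components.T, components, ratio[:n_components]

# Hypothetical 2-D dataset whose points lie close to a straight line.
rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=200)
y = 0.5 * x + rng.normal(scale=0.1, size=200)
X = np.column_stack([x, y])

Z, components, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```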
Today, I revised the decision tree concept and also explored decision tree visualization using the scikit-learn library. The plot_tree function is used to visualize the tree diagram, which gives us a better intuition. I also learned to customize the tree diagram using some of its parameters.
I also explored one of the important visualization techniques, the pie chart, using the matplotlib library. Here, parameters such as startangle, explode, autopct, labels and colors are used to customize the pie chart. For plotting the pie chart, I used the built-in iris dataset.
I completed the second-to-last week of the third part of the Machine Learning Specialization. From tomorrow onwards, I will dive deeper into the last week, i.e. week 3. In this last part, I will learn the basic concepts of reinforcement learning.
Today's Notebook: Practice decision tree session
What I learnt: I explored the introduction to reinforcement learning and learned some of its terminology. I also gained insight into the MDP, which stands for Markov Decision Process. Everything is described in detail below:
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment to achieve a goal. The agent takes actions, receives feedback in the form of rewards, and updates its strategy to maximize cumulative rewards over time.
- Agent – The learner or decision-maker.
- Environment – The system the agent interacts with.
- State (s) – The current situation of the agent.
- Action (a) – The choices the agent can make.
- Reward (r) – A numerical signal given as feedback.
- Policy (π) – A strategy that maps states to actions.
Return - It refers to the total accumulated reward an agent receives from a given state until the end of an episode. It is used to evaluate how good a sequence of actions is.
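A short sketch of how a return is computed. It assumes a discount factor gamma, which the definition above does not mention but which the course uses to weight later rewards less; the reward sequence is made up.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for one episode."""
    total = 0.0
    # Work backwards so each step is: reward + gamma * (return of the rest).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: three steps with no reward, then a reward of 100.
episode_rewards = [0, 0, 0, 100]
episode_return = discounted_return(episode_rewards, gamma=0.5)
print(episode_return)  # 0 + 0.5*0 + 0.25*0 + 0.125*100 = 12.5
```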
A Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems where outcomes are partly random and partly under the control of an agent. MDPs are widely used in Reinforcement Learning (RL) to model the interaction between an agent and an environment.
The state value function, denoted as V(s), measures how good it is for an agent to be in a particular state s while following a specific policy π.
Formally, the state value function is the expected total reward an agent can obtain starting from state s and following policy π thereafter.
Q(s, a) = Return if you
- start in state s
- take action a (once)
- then behave optimally after that
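The symbols defined next belong to the Bellman equation. In the notation of this section, and with a discount factor $\gamma$ that is assumed here (it is not listed in the definitions), it is usually written as:

$$
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
$$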
where,
s: current state
a: current action
R(s): reward of current state
s': state you get to after taking action a.
a': action that you take in state s'.
The state-action value function already tells us how good an action is in a given state, so why is the Bellman equation needed?
Here are some reasons:
- Instead of estimating values from scratch, the Bellman equation breaks the problem into smaller parts.
- It tells us that the value of a state/action depends on immediate rewards + future values.
- The Bellman equation helps update values over time as the agent explores new states and so on.
At terminal state, Q(s, a) = R(s)
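The recursion can be seen at work on a tiny example. The sketch below assumes a hypothetical six-state line world with terminal states at both ends (rewards 100 and 40), two actions (left, right), deterministic moves and a discount factor of 0.5; it repeatedly applies Q(s, a) = R(s) + gamma * max over a' of Q(s', a') until the values settle.

```python
import numpy as np

def q_iteration(rewards, gamma=0.5, n_sweeps=50):
    """Repeatedly apply the Bellman equation to every (state, action) pair."""
    n_states = len(rewards)
    terminal = {0, n_states - 1}
    Q = np.zeros((n_states, 2))  # column 0: left, column 1: right
    for _ in range(n_sweeps):
        new_Q = np.zeros_like(Q)
        for s in range(n_states):
            for a, step in enumerate((-1, 1)):
                if s in terminal:
                    new_Q[s, a] = rewards[s]  # Q(s, a) = R(s) at a terminal state
                else:
                    new_Q[s, a] = rewards[s] + gamma * Q[s + step].max()
        Q = new_Q
    return Q

rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])  # hypothetical rewards
Q = q_iteration(rewards)
best_action = Q.argmax(axis=1)  # 0 = left, 1 = right
print(Q)
```

Reading the table row by row shows the trade-off: from the state next to the right end it is better to go right (20 versus 6.25), while from the middle states the larger reward on the left wins.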
Conclusion
- The Bellman equation is a recursive formula that helps in decision-making.
- It breaks down a problem into smaller sub-problems, making it easier for RL agents to learn.
- It forms the basis of many RL algorithms like Q-learning and Deep RL.
Previously, I explored finding the state-action value function one state at a time, and worked with discrete states, which only take quantized values (for example, a positive integer). Today, I dived into continuous states, which can contain the position, speed and twist of a particular object. They are written collectively as a vector containing the different attributes mentioned above.
I studied continuous states through real-world-based projects such as the autonomous helicopter and the lunar lander. Andrew Ng explains most of this section with the help of the lunar lander project.
For the lunar lander, the state is an 8-dimensional vector: x-position, y-position, x-velocity, y-velocity, angle (twist/rotation) and angular velocity, plus two boolean elements named l and r. The attributes l and r indicate whether the lander's left and right legs, respectively, are touching the surface. The lunar lander has four actions: nothing, left, main and right, depending on which thruster fires.
The state-action value function can be learned with a neural network. Feeding the state and one action as input and using a single output node works but is inefficient. A more efficient architecture uses an output layer with 4 nodes, one per action, so the input contains only the state and not the action.
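The four-output architecture can be sketched as a small forward pass. This is an untrained network with random weights, shown only to illustrate the shapes: the hidden layer size of 64, the ReLU activations and the example state values are assumptions.

```python
import numpy as np

STATE_DIM, N_ACTIONS, HIDDEN = 8, 4, 64  # HIDDEN is an assumed layer size
ACTIONS = ["nothing", "left", "main", "right"]

def init_params(rng):
    """Random weights for an 8 -> 64 -> 64 -> 4 fully connected network."""
    sizes = [(STATE_DIM, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, N_ACTIONS)]
    return [(rng.normal(scale=0.1, size=s), np.zeros(s[1])) for s in sizes]

def q_values(params, state):
    """One forward pass returns Q(s, a) for all four actions at once."""
    h = state
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers, linear output
    return h

rng = np.random.default_rng(0)
params = init_params(rng)

# Hypothetical state: [x, y, x-velocity, y-velocity, angle, angular velocity, l, r]
state = np.array([0.1, 1.2, 0.0, -0.5, 0.05, 0.0, 0.0, 0.0])
q = q_values(params, state)
best_action = ACTIONS[int(q.argmax())]
print(q, best_action)
```

With one output per action, choosing the best action needs a single forward pass instead of one pass per action.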
- The algorithm can be refined using mini-batches and soft updates.
- Much easier to get to work in a simulation than on a real robot.
- Far fewer applications than supervised and unsupervised learning.
- But ... exciting research direction with potential for future applications.
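The soft update mentioned in the first point can be sketched in one line per parameter: instead of copying the freshly trained weights into the target network, the target moves only a small fraction `tau` towards them. The value of `tau` and the toy parameter arrays are assumptions.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """Blend the online weights into the target weights:
    target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_target
            for w_target, w in zip(target_params, online_params)]

# Toy example: target weights at 0, newly trained weights at 1.
target = [np.zeros((2, 2)), np.zeros(2)]
online = [np.ones((2, 2)), np.ones(2)]

updated = soft_update(target, online, tau=0.01)
print(updated[0])
```

Because each update changes the target only slightly, one noisy training step cannot make the network abruptly worse.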
Finally, this stretch of learning comes to an end. I completed the Machine Learning Specialization after learning some of the ideas behind reinforcement learning in the last week of the course. However, I will continue working on some projects and revise all of this material for a better understanding.











































































