Getting Deep Into AI :))

Rohit Singh
8 min read · Feb 26, 2025


Ever been fascinated by the brain, by how the visual cortex effortlessly tells different visuals apart? Replicating that ability is a daunting programming task, and it's exactly what neural networks aim to solve.

What’re Neural Networks?

  • Neuron → a thing that applies a function to its inputs & spits out a value in a range; the number it ends up holding is called its 'Activation'.
  • A bunch of layers of neurons connected with each other is called a Neural Network, with the layers between the input & output layers called 'Hidden Layers'. The pattern of activations in one layer determines the pattern of activations in the next layer.

How do neural networks work intelligently?

The network builds up recognition in stages: it learns to detect simple patterns such as edges, then combinations of those patterns, and so on, until higher-level features emerge.

It does this by tweaking the parameters (weights & biases) in each layer until the network's output matches the desired patterns or results as closely as possible.

We assign a weight (a number depicting the strength of a connection) to each connection between a neuron in one layer and a single neuron in the next layer, then combine those weights with the activations (numbers) to calculate the weighted sum W1A1 + W2A2 + … + WnAn (refer to the example below).

This weighted sum is put into a function that squishes the output into a desirable range between 0 & 1. A common choice is the Sigmoid function (another is ReLU), which maps very negative inputs close to 0 & very positive inputs close to 1 whilst increasing steadily around 0.

We sometimes need a Bias when we want the weighted sum to be active/inactive only beyond a threshold value (say 10). Adding a bias of -10 means the neuron only becomes meaningfully active when the weighted sum exceeds that threshold of 10.
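
To make the picture above concrete, here is a minimal sketch (plain Python, with made-up numbers) of one neuron computing its weighted sum, adding a bias that acts like a threshold of 10, and squishing the result with a sigmoid:

```python
import math

def sigmoid(x):
    # Squishes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))

# Hypothetical activations from the previous layer and their weights
activations = [0.9, 0.2, 0.7]
weights     = [4.0, 1.5, 6.0]
bias        = -10.0  # acts as a threshold: the neuron only "lights up" when the weighted sum exceeds 10

weighted_sum = sum(w * a for w, a in zip(weights, activations))  # W1A1 + W2A2 + ... + WnAn
activation = sigmoid(weighted_sum + bias)

print(weighted_sum, activation)  # 8.1 is below the threshold, so the activation stays small (~0.13)
```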

Now, all of the above is the story of a single layer's connection with a single neuron of the next layer; each connection has a weight associated with it, and each neuron in the next layer has its own bias.

Hence, a network, especially in its hidden layers, turns out to have thousands or even millions of connections which work together like dials & knobs to achieve a particular result. Here's a better representation of the calculation in each connection:
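
As a rough stand-in for that picture, here is a NumPy sketch (the layer sizes are made up) of how one layer's activations, weights & biases determine the next layer's activations all at once, as a matrix-vector product:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)

n_in, n_out = 784, 16                   # e.g. 784 input pixels feeding a 16-neuron hidden layer
W = rng.standard_normal((n_out, n_in))  # one weight per connection: 784 * 16 of the "dials"
b = rng.standard_normal(n_out)          # one bias per neuron in the next layer

a_prev = rng.random(n_in)               # activations of the previous layer
a_next = sigmoid(W @ a_prev + b)        # pattern of activations in the next layer
print(a_next.shape)                     # (16,)
```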

So how do neural networks learn exactly?

  • An algorithm is required to show the neural network a bunch of training data (images, text, videos etc.) along with labels against them (what they're supposed to be).
  • But the results can vary a lot, so we define a cost function (essentially guiding the network on what kind of activations it should have produced). Essentially it sums up the squares of the differences between those trash output activations & the desired values.
  • The sum is high when there's a large difference (inaccuracy) and small when the output is quite accurate. Hence, the average cost over the whole training data is considered to determine how badly the network is performing. Here's a snapshot of the difference:
A neural network function takes numerical inputs in the input layer, processes them through weighted connections and activation functions across multiple layers, and outputs a set of numbers in the final layer.
The cost function sits on top of the network, evaluating its performance. It takes the model’s parameters (weights and biases) and outputs a single value (variance between the desired & actual output) — the cost — which quantifies how well or poorly the network is performing. This function is defined based on how accurately the model maps inputs to outputs across the entire training dataset.
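
To make that concrete, here is a minimal sketch of the sum-of-squared-differences cost for a single training example (the numbers and the 10-class digit setup are purely illustrative):

```python
import numpy as np

def cost(predicted, desired):
    # Sum of squared differences between the network's output activations
    # and the activations the label says it should have produced
    return np.sum((predicted - desired) ** 2)

# Hypothetical output for a digit classifier: the label says "3",
# so the desired output is 1.0 at index 3 and 0.0 everywhere else
desired = np.zeros(10)
desired[3] = 1.0
predicted = np.array([0.1, 0.3, 0.2, 0.6, 0.1, 0.0, 0.4, 0.1, 0.2, 0.1])

print(cost(predicted, desired))  # cost for this single example
# The network's overall cost is this value averaged over the whole training set
```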

Essentially, the cost can be viewed as a function of all the weights and biases (w). The goal is to find the combination of these parameters that minimizes the 'cost', i.e. the deviation from the desired results, as illustrated in the function graph below. The input to the cost function slides left & right depending on whether the slope (the derivative of the cost at that point) is positive or negative, and it discovers a minimum if tweaked with steps of the correct size (else it might overshoot).

The input moves to the left if the slope is positive, and to the right if the slope is negative. This way it may reach a local minimum of the function graph.
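
Here is a minimal sketch of that slide-left-or-right behaviour on a one-variable toy cost, where the step size decides whether we settle into the minimum or overshoot:

```python
def cost(w):
    return (w - 2) ** 2 + 1        # toy cost with its minimum at w = 2

def slope(w):
    return 2 * (w - 2)             # derivative of the cost at w

w = -3.0                           # arbitrary starting point
learning_rate = 0.1                # the "correct amount" of step; too large and we overshoot

for _ in range(50):
    w -= learning_rate * slope(w)  # positive slope -> move left, negative slope -> move right

print(w, cost(w))                  # w ends up close to 2, the local minimum
```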

So which direction should we step in order to decrease the output of the function most quickly? The gradient of a function points in the direction of steepest ascent, so the negative gradient gives the direction of steepest descent; the gradient component calculated for each weight signifies how important a change in that particular weight is.

The negative gradient of the cost function gives the direction of steepest descent over all the weights & biases in a network; stepping along it for a certain number of defined 'steps' nudges their combination towards a local minimum, where the network produces its most accurate results.
The negative gradient over these weights results in numbers depicting the amount of change required in the corresponding weights in order to reach the desired values. A change to one particular weight can have a much greater effect on the cost function than changes to others in the same layer.
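
The same idea with more than one parameter: the negative gradient is a vector with one entry per weight/bias, and the relative size of each entry says how much changing that particular parameter matters. A rough numerical sketch with two made-up weights:

```python
import numpy as np

def cost(params):
    w1, w2 = params
    return 3 * w1 ** 2 + 0.1 * w2 ** 2      # toy cost: far more sensitive to w1 than to w2

def gradient(params, eps=1e-6):
    # Numerical gradient: nudge each parameter slightly and see how the cost responds
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (cost(bumped) - cost(params)) / eps
    return grad

params = np.array([1.0, 1.0])
g = gradient(params)
print(g)                  # ~[6.0, 0.2]: changing w1 affects the cost ~30x more than w2
params -= 0.1 * g         # one step along the negative gradient (steepest descent)
```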

Back Propagation

An algorithm to calculate this gradient efficiently is known as Back-Propagation; it works out how each single training example wants to nudge the weights & biases, i.e. what relative proportion of changes results in the most rapid decrease in the cost.

An effective practice is to compute this nudge for every weight/bias and average it across training examples; the average is loosely proportional to the negative gradient of the cost function.
To speed things up, the training data is subdivided into several mini-batches, and adjustments are made batch by batch, gradually stepping towards a local minimum.
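
A skeletal sketch of that mini-batch routine (`compute_gradients` and the parameter dictionary are placeholders for illustration, not a real library API):

```python
import random

def train(examples, params, compute_gradients, learning_rate=0.01,
          batch_size=32, epochs=5):
    """Plain mini-batch gradient descent: shuffle the data, split it into
    small batches, and nudge the parameters by the averaged negative gradient."""
    for _ in range(epochs):
        random.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            grads = compute_gradients(params, batch)       # back-propagation on this batch only
            for key in params:
                params[key] -= learning_rate * grads[key]  # step towards a local minimum
    return params
```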

From the output layer, if we wish to make changes in a certain activation (which is a weighted sum of all previous activations plus a bias, all passed through an activation function like Sigmoid or ReLU), there are three main ways to do so:

  1. Changing activations in previous layers (in proportion to weight): Since each activation in the output layer is influenced by activations from the previous layer through weighted connections, modifying those activations will affect the final output. The extent of the change depends on the weights — activations that pass through larger weights will have a stronger impact on the final activation. This is why during back-propagation, error signals are propagated backward, influencing activations in earlier layers based on their contribution to the final result.
  2. Changing the weights (in proportion to activation): The connections with the most active neurons from the preceding layers have a larger effect on the output. Changing the weights of those connections alters how strongly those neurons contribute to the final activation. During training, gradient descent adjusts these weights based on how much each connection contributes to the overall error. Larger changes are made to weights connected to neurons that had higher activations, as they play a bigger role in determining the output.
  3. Changing the biases: Biases control the baseline activation of a neuron, independent of the input it receives. Increasing or decreasing the bias shifts the neuron’s activation threshold, making it more or less likely to activate. Adjusting biases helps fine-tune the model’s overall behavior, ensuring that neurons can still activate even when the weighted inputs are small.

Hebbian Theory ("neurons that fire together, wire together"): the biggest increases in weights, i.e. the greatest strengthening of connections, happen between the neurons that are most active and the ones we wish to make more active.

Hence, adding up the desired changes for each preceding layer, in proportion to how much each neuron should change, we apply the same process recursively backwards through the whole network.
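
Those three levers correspond directly to the three partial derivatives back-propagation computes for each connection. A minimal single-neuron sketch (sigmoid activation, squared-error cost, made-up numbers):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical connection: previous activation, weight, bias, and desired output
a_prev, w, b, desired = 0.8, 2.0, -1.0, 1.0

z = w * a_prev + b            # weighted sum
a = sigmoid(z)                # this neuron's activation
cost = (a - desired) ** 2     # squared-error cost for this one example

# Chain rule: how the cost reacts to each of the three "levers"
dcost_da = 2 * (a - desired)
da_dz    = a * (1 - a)                      # derivative of the sigmoid
dcost_dw      = dcost_da * da_dz * a_prev   # lever 2: proportional to the previous activation
dcost_db      = dcost_da * da_dz            # lever 3: the bias
dcost_da_prev = dcost_da * da_dz * w        # lever 1: proportional to the weight, passed further back

print(dcost_dw, dcost_db, dcost_da_prev)
```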

Large Language Models

GPT =

  • Generative: models that produce new text
  • Pre-Trained: the model has already learnt from a massive amount of data, and the "pre" hints there's room to fine-tune it with additional training
  • Transformer: a specific kind of neural network, the core invention driving the current AI race

Transformers enable a wide range of models, including audio-to-text, text-to-image, and more. The original Transformer, introduced by Google in the paper "Attention Is All You Need," was designed for language translation.

In the following section, we'll explore the underlying architecture of ChatGPT. Its model is trained on massive datasets, including text and accompanying audio & images, to predict the next sequence of words. This is achieved through probability generation: each word is predicted based on the previous context, and the process continues iteratively until the passage is complete.

  • First of all, an input sentence is broken down into pieces called Tokens; in the case of text these can be words, little pieces of words or common character combos; for an image it might be a small chunk of the image, and similarly a small patch of sound for audio.
  • Each of these tokens is then turned into a vector, and each vector should encode the meaning of its token (a toy sketch follows this list).
  • Vectors give coordinates in some high-dimensional space. Vectors with similar meanings tend to sit much closer together in that space.
  • These sequences of vectors are passed into an operation known as an 'Attention' block. Inside it, the vectors can talk to each other & update their values. It's responsible for figuring out which words in the context are relevant to updating the meanings of which other words, & how exactly those meanings should be updated.
  • NOTE: here 'meaning' means the encodings inside a vector.
  • Afterwards, these vectors go through a different set of operations known as a feed-forward layer or multi-layer perceptron. These operations happen in parallel for all the vectors.
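
Here is a toy sketch of the first two bullets: splitting text into tokens and looking each one up in an embedding table (the vocabulary and vector size are invented for illustration):

```python
import numpy as np

# Tiny invented vocabulary; real tokenizers use tens of thousands of sub-word pieces
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

rng = np.random.default_rng(0)
embedding_dim = 8                                   # real models use hundreds or thousands of dimensions
embeddings = rng.standard_normal((len(vocab), embedding_dim))

tokens  = "the cat sat on the mat".split()          # step 1: text -> tokens
ids     = [vocab[t] for t in tokens]                # step 2: tokens -> ids
vectors = embeddings[ids]                           # step 3: ids -> vectors in a high-dimensional space

print(vectors.shape)   # (6, 8): one vector per token, ready for the attention blocks
```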

Essentially, in the Multi-Layer Perceptron, each vector does some calculations, as if answering a long list of questions about itself, and gets updated based on the answers.

Now the above process keeps on repeating…

until the essential meaning of the passage is baked into the last vector in the sequence. A final operation is performed on that last vector to output a probability distribution over all the chunks of text, or tokens, that might come next in the passage.
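
That final operation is typically a projection onto every token in the vocabulary followed by a softmax, which turns the scores into a probability distribution. A rough sketch (sizes match the toy example above):

```python
import numpy as np

def softmax(scores):
    # Turns arbitrary scores into probabilities that sum to 1
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 8                  # toy sizes, for illustration only
unembedding = rng.standard_normal((embedding_dim, vocab_size))

last_vector = rng.standard_normal(embedding_dim)  # final vector after all attention/MLP blocks
logits = last_vector @ unembedding                # one score per possible next token
probs  = softmax(logits)
print(probs, probs.sum())                         # probability distribution over the next token
```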

  1. Hence, in transformer-based language models, text generation begins by converting the input text into smaller units called tokens.
  2. The model processes these tokens to understand the context and predicts the most likely next token.
  3. This predicted token is then added to the sequence, and the process repeats, generating text one token at a time until the desired output is achieved.
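
Putting those three steps together, the generation loop looks roughly like this (`predict_next_token` is a stand-in for the whole transformer forward pass plus sampling, not a real API):

```python
def generate(prompt_tokens, predict_next_token, max_new_tokens=20, end_token="<end>"):
    """Autoregressive text generation: predict one token, append it,
    and feed the longer sequence back in until we decide to stop."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)   # sample from the probability distribution
        if next_token == end_token:
            break
        tokens.append(next_token)                 # the sequence grows one token at a time
    return tokens
```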
