How Generative Pre-Trained Transformer (GPT) Works

10 min read · Oct 17, 2023

LLM: Large Language Model (GPT-3.5 is the default model for ChatGPT)

An LLM is a neural network trained on massive amounts of unlabelled text (text that has no category or classification assigned to it) to understand and generate human language. It uses this text as training data to learn statistical patterns and relationships between words, and to predict the words that come next; GPT models do this with the Transformer architecture. An LLM is characterised by its size, that is, the number of parameters it has. The largest GPT-3 model, on which the GPT-3.5 series builds, has 175 billion parameters spread across 96 layers of the neural network. A parameter is a numerical value that represents either a weight or a bias in the network. Weights define the strength of the connections between neurons across different layers of the model. In large language models like GPT, weights start as random values and are adjusted during training, so it is pre-training that gives the parameters useful values.
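
As a rough sanity check on the 175 billion figure, a common back-of-the-envelope approximation counts about 12 × d_model² weights per Transformer layer (attention plus feed-forward), ignoring embeddings and biases. The sketch below applies it to GPT-3's published configuration of 96 layers with a hidden size of 12,288. The function name and the approximation are illustrative assumptions, not how OpenAI reports the number.

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough weight count per layer: ~4*d^2 for attention (Q, K, V, output)
    plus ~8*d^2 for a feed-forward block with 4x expansion."""
    return 12 * n_layers * d_model ** 2


# Assumed configuration: 96 layers, hidden size 12288 (GPT-3 175B)
gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3_estimate / 1e9:.1f} billion parameters (approx.)")
```

The estimate lands just under 175 billion; the remainder is mostly the token and position embeddings that the approximation leaves out.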

Input and output are organised as tokens: numerical IDs that stand for words or pieces of words, which the model can process efficiently. GPT-3, the predecessor of GPT-3.5, was trained on internet text drawn from datasets totalling roughly 500 billion tokens. The model was trained to predict the next token given a sequence of tokens.

The Transformer neural network architecture, used in GPT models, employs self-attention mechanisms to process input text efficiently, capturing extensive context and improving performance in Natural Language Processing (NLP) tasks.
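
To make the self-attention step concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the learned query/key/value projections, multiple heads, and the causal mask that GPT uses are all omitted, and the toy vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this position against every position in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy 3-token sequence with 2-dimensional vectors; Q, K and V reuse the same vectors
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output vector is a blend of all the input vectors, weighted by how strongly that position attends to the others, which is what lets every token see the whole context at once.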

Encoder: In the original Transformer, the encoder turns the input text into embeddings and uses attention weights to capture how each word relates to the others in its context. Positional encodings tell the model where each word sits in the sequence, so it can distinguish sentences that use the same words in a different order and therefore mean different things.
Decoder: The decoder uses vector representations to predict the output one token at a time. Through self-attention, it focuses on the most relevant parts of the text seen so far. GPT models keep only this decoder stack: there is no separate encoder, and the prompt and the generated text are processed by the same layers.

Compared to its predecessors, like recurrent neural networks, GPT’s transformer architecture allows for parallel processing of the entire input, enhancing its efficiency. The extensive training and fine-tuning of GPT models enable them to provide coherent and accurate responses to a wide range of input prompts.

Example: Using ChatGPT for Generating Answers

GPT COMPONENTS & FLOW:

Transformers: This architecture operates as a sequence-to-sequence model (Predicting output sequence from input) → transforming one sequence into another while maintaining the defined ordering of the sequences. They utilise a mechanism called “self-attention” to process sequential input data (Self Attention enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output). Transformers can process the entire input data at once, capturing context and relevance.
- The Encoder converts words into abstract numerical representations and stores them in a memory bank.
- The Decoder generates words one by one, referring to the output generated so far and consulting the memory bank through attention (weighing the input and its influence on the output while forming the sequence of tokens).
Training and Transfer Learning: To effectively train transformers, a substantial amount of labelled data specific to the task is required. This demand is addressed through the implementation of transfer learning, combined with the existing transformer architecture.
Encoder and Decoder: Transformers consist of two main components: the encoder and the decoder, each specialising in learning representations of language. Stacking encoders bidirectionally yields BERT, while stacking decoders gives rise to GPT, each leading to distinct lines of research.
Transfer Learning and Pre-training: Implementing transfer learning in GPT involves a two-part training process. First, GPT is pre-trained on language modelling, building its understanding of language by predicting the next token in large amounts of unlabelled text. Then fine-tuning is carried out, leveraging transfer learning to improve GPT’s performance on specific tasks.
GPT-2 and Meta-learning: GPT-2 builds on the same language-modelling pre-training and adds a meta-learning perspective. Instead of relying on a fine-tuning phase, it is applied zero-shot, which requires no parameter updates after pre-training. However, the task then has to be specified through instructions at inference time, which demands a larger model, so GPT-2 expanded the GPT architecture.
GPT-3 and Learning Techniques: GPT-3, a large language model, employs the same meta-learning concepts and scales the architecture even further. It is pre-trained with a language modelling objective and then adapted to tasks at inference time, without updating its parameters, through zero-shot, one-shot, and few-shot learning.
Zero, One, and Few-Shot Learning: Zero-shot learning provides only an instruction (prompt) along with the input; the absence of worked examples is its main disadvantage. One-shot learning places a single example alongside the input in the prompt, while few-shot learning supplies multiple examples, typically 10 to 100, as many as fit in the context window.
Meta-learning and Fine-tuning: Meta-learning and fine-tuning are distinct concepts. Meta-learning is about the model learning how to adapt to new tasks (in GPT-3, from the examples given in the prompt), while fine-tuning updates the model’s parameters on task-specific data to improve performance on targeted tasks.
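
The difference between zero-shot and few-shot prompting is easiest to see in the prompt text itself. The sketch below builds both from the same helper; the function name, the "Text:/Sentiment:" layout, and the example sentences are all hypothetical choices made for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, zero or more worked examples, and the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)


instruction = "Classify the sentiment of each text."
query = "The delivery was late."

# Zero-shot: instruction only, no examples
zero_shot = build_few_shot_prompt(instruction, [], query)

# Few-shot: the same instruction plus worked examples in the context window
few_shot = build_few_shot_prompt(
    instruction,
    [("I love this phone.", "positive"), ("The screen cracked in a day.", "negative")],
    query,
)
print(few_shot)
```

In neither case are the model's weights changed; the examples only occupy space in the context window.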

Tokens:
OpenAI’s large language models (sometimes referred to as GPTs) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
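
The rule of thumb can be turned into a quick estimator. This is only a sketch of the arithmetic: the function names are made up, and real counts come from the model's tokenizer and will differ, especially for non-English text.

```python
import math


def estimate_tokens_from_chars(text: str) -> int:
    """Roughly 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Roughly 0.75 words per token for English text."""
    return math.ceil(len(text.split()) / 0.75)


sentence = "Horses are my favourite animal"
print(estimate_tokens_from_chars(sentence), estimate_tokens_from_words(sentence))
```

The two estimates usually differ slightly, which is a reminder that both are approximations useful for budgeting prompt length, not for exact billing.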

Common words like “cat” are a single token, while less common words are often broken down into multiple tokens. For example, “Butterscotch” translates to four tokens: “But”, “ters”, “cot”, and “ch”.

Given some text, the model determines which token is most likely to come next. For example, the text “Horses are my favourite” is most likely to be followed by the token “ animal”.

This is where temperature comes into play. If you submit this prompt 4 times with temperature set to 0, the model will always return “ animal” next because it has the highest probability. If you increase the temperature, it will take more risks and consider tokens with lower probabilities.
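
A minimal sketch of how temperature changes the choice of next token, assuming the model has produced a score (logit) for each candidate. The scores below are invented for illustration; only the mechanism (divide the scores by the temperature, then sample) is the point.

```python
import math
import random


def sample_next_token(logits, temperature, rng=random):
    """logits maps token -> score. Temperature 0 means always pick the top token."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the distribution
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical scores for the token that follows "Horses are my favourite"
logits = {" animal": 4.0, " animals": 2.5, " sport": 1.0}

greedy = [sample_next_token(logits, 0) for _ in range(4)]
print(greedy)  # the same token four times

riskier = sample_next_token(logits, 1.5, random.Random(0))
print(riskier)  # may be any of the three candidates
```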

The result is a model that can generate text that is grammatically correct and semantically coherent. For ChatGPT, the pre-trained model was then fine-tuned to make its answers more helpful to people and more aligned with human values. This process is called Reinforcement Learning from Human Feedback (RLHF). It is distinct from prompt engineering, which is the careful design of the prompts given to an already trained model.

Fine-tuned GPT → train a reward model on human rankings of its outputs → optimise the model against the reward model using reinforcement learning (PPO algorithm) → ChatGPT model

API Structure:
import openai  # legacy SDK (openai<1.0), which exposes openai.Completion

response = openai.Completion.create(  # Completion endpoint
    model="gpt-3.5-turbo-instruct",
    prompt=generate_prompt(animal),  # generate_prompt is a helper defined elsewhere that builds prompt text centred on the given animal
    temperature=0.6,  # amount of creativity/risk
)

The completion endpoint is flexible enough to solve virtually any language processing task, including content generation, summarization, semantic search, topic tagging, sentiment analysis, and so much more.

Clear Instructions:

The system message can be used to specify the persona used by the model in its replies.

Eg:

SYSTEM: When I ask for help to write something, you will reply with a document that contains at least one joke or playful comment in every paragraph.

USER: Write a thank you note to my steel bolt vendor for getting the delivery in on time and in short notice. This made it possible for us to deliver an important order.

Delimiters like triple quotation marks, XML tags, section titles, etc. can help demarcate sections of text to be treated differently.
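
A short sketch of how a system message and a delimited user message fit together in the role/content message list used by chat models. The helper name and the sample text are hypothetical, and no API call is made here.

```python
def build_messages(system_prompt: str, instruction: str, document: str) -> list:
    """Return chat messages with the document fenced off by triple quotes."""
    # The delimiters make clear which part is the instruction and which is the text to work on
    user_content = f'{instruction}\n\n"""\n{document}\n"""'
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    system_prompt="You reply with at least one playful comment in every paragraph.",
    instruction="Summarise the text delimited by triple quotes in one sentence.",
    document="The steel bolts arrived on time, so the important order shipped as planned.",
)
for message in messages:
    print(message["role"].upper() + ":", message["content"])
```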

Use OpenAI’s Playground for more intuitive learning:

SYSTEM: The system message sets the overall context for the conversation. It is where you describe the behaviour, persona, or constraints the model should follow (a form of prompt engineering).
ASSISTANT: Marks messages written in the model’s role. The model’s replies appear under this label, and you can also add assistant messages yourself to show an example of the response you want.
Add message: This feature allows you to input messages or instructions for the assistant to respond to within the conversation. You can add messages to continue the dialogue or to prompt specific responses from the AI model.
Mode: Indicates the mode of interaction with the AI model. In this context, it’s set to “Chat,” which suggests that the interaction with the AI will be in the form of a conversation.
Model: Specifies the version or type of the AI model being used to generate responses. In this case, it is set to “gpt-3.5-turbo,” which is the specific model used for generating text based on the input provided.
Temperature: Determines the creativity or randomness of the AI-generated responses. A higher temperature value leads to more creative but potentially less coherent or relevant responses.
Maximum length: Sets the maximum length of the AI-generated response. It ensures that the response doesn’t exceed the specified length, allowing for more concise or specific outputs.
Stop sequences: These are specific sequences of characters that, when encountered, signal the AI to stop generating the response. They are useful for controlling the length and content of the generated text.
Top P: Controls the diversity of the generated responses through nucleus (top-p) sampling. It is a filter on the candidate tokens the model considers when predicting the next one: only the most likely tokens whose combined probability reaches the top-p value are kept. If you set the “top p” value to 0.5, the model samples only from the smallest set of tokens that together account for 50% of the probability, not from a fixed number of tokens.
Frequency penalty: A penalty applied to tokens in proportion to how often they have already appeared in the text so far. It can be used to encourage the AI model to generate more diverse or less repetitive responses.
Presence penalty: A penalty applied once to any token that has already appeared in the text so far, regardless of how many times. It encourages the AI model to introduce new words and topics rather than return to earlier ones.
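
The sampling controls above can be sketched in a few lines. The code below is an illustrative approximation, not the service's actual implementation: `top_p_filter` keeps the smallest set of tokens whose cumulative probability reaches top-p, and `apply_penalties` follows the penalty formula described in OpenAI's documentation (subtract count × frequency penalty, plus the presence penalty once if the token has appeared). The probabilities and counts are invented.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalise


def apply_penalties(logits, counts, frequency_penalty, presence_penalty):
    """Lower the score of tokens that already appeared in the text so far."""
    adjusted = {}
    for token, logit in logits.items():
        c = counts.get(token, 0)
        seen = 1.0 if c > 0 else 0.0
        adjusted[token] = logit - c * frequency_penalty - seen * presence_penalty
    return adjusted


probs = {" animal": 0.6, " animals": 0.25, " sport": 0.1, " colour": 0.05}
nucleus_half = top_p_filter(probs, top_p=0.5)   # only the top token survives
nucleus_wide = top_p_filter(probs, top_p=0.8)   # the top two tokens survive

penalised = apply_penalties(
    {" animal": 2.0, " sport": 1.0}, counts={" animal": 3},
    frequency_penalty=0.5, presence_penalty=0.2,
)
```

Note that with top-p the number of surviving tokens depends on how the probability is distributed, which is why 0.5 does not translate into a fixed count of candidates.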

Few-Shot Learning in classification: bring text in → use a large language model, e.g. BERT → which gives embeddings → which you feed to a classifier (the classifier head of the network) → outcome (e.g. churn probability)

Drawbacks:

Fine-tuning requires a lot of data: after pre-training, the model has to be tuned on task-specific data and then evaluated on data it has not seen before (test data).
A lot of labelled data is required, and the amount goes up dramatically as more classes are added. Humans need only a few data points to tell classes apart because they have a large store of prior knowledge.
Learning depends on prior knowledge and on similarity to what has been seen before, so data remains essential.
When new data arrives, the model needs more nodes added, its network architecture modified, and training repeated.

The challenge in one-shot learning is to classify from just one example per class, and standard deep learning models do not work well in this setting.

What is currently done is: provide an image as input → pass it through a convolutional network → output the label y using softmax. This does not work well with a small training set, and the network has to be retrained every time new data (for example, a new class) is added.

To solve this, we introduce “learning similarity”:

Learn a function that gives the degree of difference between two objects.
If both images show the same person, this degree should be a small number; if they show different people, it should be large.
During recognition, compare the degree of difference with a threshold (a hyperparameter): at or below the threshold the two images are treated as the same person, above it as different people.
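
The threshold rule can be sketched as follows, assuming each image has already been turned into an embedding vector by some trained network. The Euclidean distance, the threshold value, and the toy embeddings are assumptions chosen for illustration.

```python
import math


def degree_of_difference(embedding_a, embedding_b):
    """Euclidean distance between two embeddings: small means similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(embedding_a, embedding_b)))


def same_person(embedding_a, embedding_b, threshold=0.5):
    """The threshold is a hyperparameter tuned on validation pairs."""
    return degree_of_difference(embedding_a, embedding_b) <= threshold


# Toy 2-dimensional embeddings; real ones come from a network and are much larger
reference = [0.10, 0.90]
same_face = [0.15, 0.85]
other_face = [0.90, 0.10]

print(same_person(reference, same_face))   # small distance
print(same_person(reference, other_face))  # large distance
```

Because only the comparison function is learned, adding a new person means storing one more reference embedding rather than retraining the network.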

Zero-Shot Learning (ZSL):

Semantic Mapping: Zero-shot learning maps the knowledge a model gained from known (seen) classes during training onto unseen classes, so it can predict those classes without any explicit training examples of them.
Auxiliary Information: It relies on auxiliary information or instructions (information about the task), such as attributes, class descriptions, or semantic embeddings, to establish connections between seen and unseen classes.
Domain Transfer: ZSL often requires transferring knowledge from a source domain with labelled data to a target domain with unseen classes, utilising semantic representations to bridge the gap between the domains.
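
One classic way to use auxiliary information is attribute matching: describe every class, including unseen ones, by a vector of attributes, and assign an input to the class whose description is closest to the attributes predicted for it. The sketch below assumes a separate attribute predictor already exists (trained on seen classes only); the attributes, classes, and predicted values are invented.

```python
def zero_shot_classify(predicted_attributes, class_descriptions):
    """Return the class whose attribute description is nearest to the prediction."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        class_descriptions,
        key=lambda name: squared_distance(predicted_attributes, class_descriptions[name]),
    )


# Attribute order: [has_stripes, has_mane, lives_in_water]
class_descriptions = {
    "zebra": [1, 1, 0],    # no zebra examples were used in training
    "tiger": [1, 0, 0],
    "dolphin": [0, 0, 1],
}

# Hypothetical output of the attribute predictor for a new image
predicted = [0.9, 0.8, 0.1]
label = zero_shot_classify(predicted, class_descriptions)
print(label)
```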

One-Shot Learning:

Learning from Singular Examples: One-shot learning trains models using just a single example per class, necessitating the extraction of discriminative features from minimal input.
Siamese Networks: It often involves the use of siamese networks, which learn embeddings to measure similarity/dissimilarity between samples, enabling the model to distinguish between classes based on learned representations.
Metric Learning: Metric learning is frequently employed to enable the model to learn a suitable distance metric in the feature space, facilitating effective discrimination even with limited training samples.
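
Siamese networks are commonly trained with a contrastive loss, which pulls embeddings of the same class together and pushes embeddings of different classes at least a margin apart. The sketch below shows one common form of that loss for a single pair; the margin value is an assumption, and some formulations add a factor of one half.

```python
import math


def contrastive_loss(embedding_a, embedding_b, same_class, margin=1.0):
    """Loss for one pair of embeddings produced by the two Siamese branches."""
    distance = math.dist(embedding_a, embedding_b)
    if same_class:
        # Same class: any distance at all is penalised
        return distance ** 2
    # Different class: penalised only when closer than the margin
    return max(0.0, margin - distance) ** 2


print(contrastive_loss([0.0, 0.0], [0.6, 0.0], same_class=True))
print(contrastive_loss([0.0, 0.0], [0.6, 0.0], same_class=False))
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False))
```

Minimising this loss over many pairs is what shapes the distance metric that one-shot classification later relies on.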

Few-Shot Learning:

Generalisation from Limited Samples: Few-shot learning extends one-shot learning to scenarios where there are a few examples per class, enabling the model to generalise effectively from a small number of labelled examples.
Meta-Learning: It often leverages meta-learning techniques, training models on various tasks to improve their ability to adapt quickly to new tasks and generalise from limited data.
Prototypical Networks: Few-shot learning models, such as prototypical networks, learn a representation for each class based on a few examples, enabling the model to classify new instances based on their similarity to the learned prototypes.
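
The prototype idea reduces to two steps: average the few example embeddings of each class into a prototype, then assign a new instance to the nearest prototype. The sketch below assumes the embeddings have already been produced by a trained network; the class names and vectors are toy values.

```python
import math


def class_prototypes(support_set):
    """support_set maps label -> list of embedding vectors; returns the mean per label."""
    prototypes = {}
    for label, vectors in support_set.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return prototypes


def classify(query, prototypes):
    """Pick the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Two examples per class (a 2-way, 2-shot task)
support_set = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(support_set)
prediction = classify([0.7, 0.3], prototypes)
print(prediction)
```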

N-Shot Learning:

Handling Slightly Larger Datasets: N-shot learning is a generalisation of few-shot learning, considering scenarios where the model is trained with ’n’ examples per class, where ’n’ is a small number greater than one.
Enhanced Pattern Recognition: It allows the model to capture more intricate patterns and variations in the data compared to few-shot learning, leading to improved generalisation performance.
Adapted Techniques: Similar to few-shot learning, techniques like meta-learning and prototype-based approaches are adapted to handle scenarios with a slightly larger number of training examples per class, further refining the model’s learning and generalisation capabilities.

These techniques address challenges related to data scarcity and enable models to learn from minimal labelled data, thus broadening their applicability to real-world scenarios with limited labelled datasets.

Designing your prompt is essentially how you “program” the model.

Embeddings

Embeddings are dense numerical representations of real-world objects and relationships, expressed as vectors. They make machine learning models more efficient and easier to work with, and the same embeddings can be reused with other models as well.

In natural language processing (NLP), a word embedding is a representation of a word that encodes its meaning in such a way that words that are closer in the vector space are expected to be similar in meaning.

An embedding is a vector representation of a piece of data (e.g. some text) that is meant to preserve aspects of its content and/or its meaning. Chunks of data that are similar in some way will tend to have embeddings that are closer together than unrelated data. OpenAI offers text embedding models that take as input a text string and produce as output an embedding vector. Embeddings are useful for search, clustering, recommendations, anomaly detection, and classification, among other uses.
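
Closeness between embeddings is usually measured with cosine similarity. The sketch below ranks a few documents against a query; the three-dimensional vectors are invented stand-ins for the much longer vectors an embedding model would return.

```python
import math


def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model would produce these from the text
query = [1.0, 0.0, 1.0]
documents = {
    "Horses are my favourite animal": [0.9, 0.1, 0.8],
    "Thank you for the steel bolts": [0.0, 1.0, 0.1],
}

ranked = sorted(documents, key=lambda text: cosine_similarity(query, documents[text]), reverse=True)
print(ranked[0])
```

The same ranking step underlies semantic search and recommendations: embed the query, embed the candidates, and sort by similarity.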

Written by Rohit Singh
