Transformer Architecture | Part 1 Encoder Architecture | CampusX

20,709 views

CampusX

A day ago

Comments: 156
@AidenDsouza-ii8rb 4 months ago
Your Deep Learning playlist is pure gold! The intuition and simplicity you bring to complex concepts are amazing. As a dedicated student who's watched it all, I can say it's top-notch quality. Thank you for this video series.
@FindingTheSuccess-w2b 4 months ago
Sir... please make videos regularly.....🙏🙏
@bhaveshshrivastava3024 2 months ago
Brother, the videos don't get uploaded regularly, and that is exactly why the tutorial quality is so top-notch, sir. Take your time and upload videos at your own pace, even once every two weeks.
@sowmyaraoch 3 months ago
This entire playlist is so intuitive and you've made all complex concepts so simple. Please continue this playlist.
@aneessohail1008 4 months ago
Sir, kindly upload the videos of this series regularly. No one else has a course like yours ❤❤
@drrabiairfan993 3 months ago
Even the author of the original paper could not explain it this well... an absolutely amazing and illustrative explanation of the Transformer... without any doubt the best explanation available anywhere.
@jooeeemusic7963 4 months ago
Sir, I'm waiting for these videos every single day. Please upload on a regular basis, sir.
@amanagrawal4198 4 months ago
Great!! Watched the full DL playlist; it's a great resource for understanding the whole of deep learning.
@ersushantkashyap 2 months ago
Nitish Sir, just as you say, "I will lay it completely open in front of you," you did exactly that in this video. Thank you so very much.
@Shisuiii69 2 months ago
Seriously bro 💯
@justrax8466 23 hours ago
43:44 My assumption, sir: when I learned about the ResNet-50 architecture, we also used skip connections there to overcome the vanishing gradient problem. In the multi-head attention mechanism there is a possibility that the values become too small, and we know that adding something to a vector changes its values but does not change the functionality of the vector, so that could be a reason they used it.
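(A minimal PyTorch-style sketch of the idea in this comment, assuming a stand-in sub-layer; it is an illustration, not the video's code. Adding the input back element-wise keeps the 512-dimensional size unchanged while giving gradients a direct path around the sub-layer.)
```python
import torch
import torch.nn as nn

d_model = 512                       # embedding size used in the video's example
x = torch.randn(1, 3, d_model)      # (batch, 3 tokens "How are you", d_model)

# stand-in sub-layer; any block that maps 512 -> 512 behaves the same way here
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
attn_out, _ = attn(x, x, x)         # self-attention output, same shape as x

y = x + attn_out                    # residual (skip) connection: adds vectors, size unchanged
print(x.shape, y.shape)             # both torch.Size([1, 3, 512])
```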
@zeeshanahmed8640 26 days ago
Hi Nitish sir, your deep learning playlist is absolutely mind-blowing. Please also upload videos on fine-tuning encoder-type, decoder-type, and encoder-decoder transformers, and also videos on LangChain, please.
@anuradhabalasubramanian9845 A month ago
How brilliant you are, Sir!!! A super guru for us!!! Great explanation, Sir.
@sukumarane2302 A month ago
So great and appreciable! You made the complex task of explaining the transformer architecture simple… Thank you, sir.
@narasimhasaladi7 2 months ago
The combination of add and norm operations in the residual connection of transformer encoders provides these benefits: improved gradient flow, preservation of information, enhanced learning, increased stability, faster convergence, and better generalization.
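(A small sketch of the "Add & Norm" step this comment refers to, assuming PyTorch's nn.LayerNorm and a dummy sub-layer output; after the addition, layer normalization brings each token's 512 values back to roughly zero mean and unit variance, which is where the stability benefits come from.)
```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)                     # normalizes each token vector over its 512 features

x = torch.randn(2, 3, d_model)                   # original (input) token vectors
sublayer_out = torch.randn(2, 3, d_model) * 5.0  # pretend sub-layer output with large values

y = norm(x + sublayer_out)                       # "Add & Norm": residual addition, then layer norm

# after normalization, every token vector has mean ~0 and std ~1
print(y.mean(dim=-1)[0, 0].item(), y.std(dim=-1)[0, 0].item())
```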
@sohaibahmed4439 4 months ago
Superb curriculum management and teaching style! Thanks!
@paragbharadia2895 2 months ago
Huge respect, and a lot more lessons to learn from all the videos you have posted! Thank you, sir!
@harshsingh7842 8 days ago
What a great explanation, man. Loved it.
@mukul3634 2 months ago
I am amazed; now I feel there is nothing easier than transformers. I am a mechanical engineer, yet I understood it quite well, sir. I even feel it is easier than linear or logistic regression. Now I can teach this concept to any 5-year-old child.
@videoediting0 4 months ago
Marvelous explanation in a very simplified way, great man.
@nikhilraj3840 3 months ago
One of the best transformer explanation playlists; you are amazing.
@ParthivShah 4 months ago
Thank You Very Much sir for continuing this playlist.
@khatiwadaAnish 3 months ago
Thank you so much. You made this topic so simple that even I feel confident enough to teach others.
@RdXDeveloper 3 months ago
Sir is maintaining quality, not quantity. That's why he takes time for every video. Thank you so much, sir. ❤️‍🩹
@laxminarayangaidhane7063 3 months ago
Wonderful explanation... I was getting bored watching the previous few videos, but after completing them I understood the current video easily 😊. You have explained it very nicely.
@SBhupendraAdhikari 3 months ago
Thanks, Sir, for such a beautiful explanation.
@SrideviSutraya 2 months ago
Very good explanation
@tirthadebnath2497 4 months ago
Your tutorials are really gold for higher studies.
@ayushparwal2210 2 months ago
Interesting video, sir; thanks to you.
@gender121 3 months ago
Waiting anxiously for the remaining videos... please bring them soon.
@trickydotworld 3 months ago
Thank you very much. Waiting for Decoder part
@ai_pie1000 3 months ago
Brother, the way you teach is exactly the way our mind finds it easiest to memorize hard concepts. ❤
@saurabhkaushik8282 4 months ago
Great explanation, sir! I watched the entire Transformer series, and you made it so easy to understand. Many thanks! Looking forward to the decoder parts.
@imteyazahmad9616 4 months ago
Amazing 🤗, please upload videos regularly. Waiting for the next video on the decoder.
@just__arif 2 months ago
Great Explanation!
@arjunsingh-dn2fo 17 days ago
Sir, as we learned in the boosting algorithm, we use residuals to capture the difference between actual and predicted values. So, sir, I think this residual connection is doing something similar here: it keeps track of the difference between the actual embedding and the contextually aware embedding. If there is a vanishing gradient problem, as you said, it passes the actual embedding on to the next feed-forward neural network. Sir, what's your opinion on this?
@princekhunt1 3 months ago
Nice explanation 👌
@nikhilgupta6803 4 months ago
as usual....awesome and simple
@manjeet4418 4 months ago
Thank You Sir For Detailed Explanation ❤
@harshmohan5411 3 months ago
Sir, I think the reason for the residual connection is so that the information about the positional encoding doesn't get lost, because, as you said, they use six encoder blocks in the original transformer; so it is there to remind the transformer about the positions, I think.
@amitbohra9283 4 months ago
Sir great video thanks, waiting eagerly for the second part.
@meetpatel8733 4 months ago
That was a great video, but I have a question on the multi-head attention part. In the previous video on multi-head attention, two self-attention blocks were used, and for the "money bank" example two vectors were generated per word (Ymoney1, Ymoney2 and Ybank1, Ybank2), so four vectors in total for the two words. But here, in the main architecture, you said a 512-dimensional vector is the input to the multi-head attention block and it gives out a vector of the same 512-dimensional size. I don't know if my question is silly or not; if you can explain that, please. All the videos were great. Thank you.
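(On the dimensions in this question: in the usual formulation, and in the original paper, each of the h heads works on a slice of the 512-dimensional vector, and the per-head outputs like Ymoney1, Ymoney2 are concatenated and linearly projected back to 512, which is why the block's input and output sizes match. A rough sketch, assuming 8 heads of size 64; the stand-in tensors are illustrative, not the video's code.)
```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
d_head = d_model // num_heads            # 64 dimensions handled by each head

x = torch.randn(1, 2, d_model)           # two tokens: "money", "bank"

# stand-ins for the per-head outputs (Ymoney_i, Ybank_i), one tensor per head
head_outputs = [torch.randn(1, 2, d_head) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1) # concatenation brings each token back to 512 values
w_o = nn.Linear(d_model, d_model)        # output projection that mixes the heads together
out = w_o(concat)

print(out.shape)                         # torch.Size([1, 2, 512]) -- same size as the input
```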
@lokeshsharma4177 3 months ago
Just like "Inputs" are there any prior operations for "output" before it goes to Decoders?
@arpittalmale6440 4 months ago
The residual connection is used around each sub-layer because, by adding back the original contextual vector after each operation, the model can maintain the meaning of the sentence we provided earlier. If it were not used, there is a 95% probability that after passing through the attention layers the context of the words with respect to each other would be lost, because at the output the model computes a loss value based on the objective, and this loss is then backpropagated to update the model weights, including the word embedding vectors. There is also the concept of "teacher forcing": during training the model is fed the actual ground-truth output (target sequence), which can help stabilize training and accelerate convergence by providing more accurate and consistent signals.
@AnshSingh-dk5uu 2 months ago
46:05 I couldn't understand the second reason... even if the transformation is bad, won't the parameters correct it over multiple backpropagation steps? So why do we need a residual connection for that?
@peace-it4rg 3 months ago
Sir, I think the ResNet-style connections might have been used so that a sparse embedding or matrix is not generated, keeping the architecture a bit more dense and stable so that the network can learn; otherwise it would overfit. What are your thoughts on this?
@KumR 4 months ago
Hi Nitish - I understand that GPT uses this transformer architecture. Do all LLMs use it?
@RdXDeveloper 4 months ago
Sir, thanks a lot for this effort ❤. You are simply awesome, sir. Your free courses are more valuable than a paid course. This is one of the best YouTube channels. ❤
@shubhamgattani5357 23 days ago
Thank you god!
@SamiUllah-ql9my 4 months ago
Sir, I have been waiting for this video for a very long time. I love your teaching style; I can't find anyone who teaches better than you.
@nomannosher8928 4 months ago
always the best explanation.
@virajkaralay8844 4 months ago
Absolute banger video on the transformer encoder; cannot wait for the decoder video to drop.
@PawanAgrawal3012 4 months ago
Good one. Please make a dedicated playlist on PyTorch dealing with neural networks.
@dataninjaa 4 months ago
I was desperately waiting for your videos; I didn't even wait this eagerly for Mirzapur 3.
@vimalshrivastava6586 4 months ago
Thank you for this wonderful video.❤
@SachinBareth-d2k 3 months ago
very helpful
@LMessii10 3 months ago
Brilliance ❤ 👏
@kunaldere-g8l 3 months ago
Sir, I remember one thing you said about the transformer when starting the topic: that the transformer architecture looks like it was dropped in from the future.
@Shubham_gupta18 4 months ago
Please continue this playlist, Nitish sir, and upload the videos regularly; just a humble request. The placement season is coming soon and we need you.
@SandeepSingh-yx2si 4 months ago
Very Good Explanation.
@ujjawalagrawal 4 months ago
Wow great sir thanks for preparing the video
@tannaprasanthkumar9119 4 months ago
It was amazing sir
@myself4024 3 months ago
🎯 Key points for quick navigation:
00:00 *📚 Introduction to Transformer Architecture*
- The video begins with an introduction to Transformer architecture, highlighting key components already covered such as self-attention, multi-head attention, positional encoding, and layer normalization.
- The focus will now shift to a detailed exploration of Transformer architecture, particularly the encoder part.
- The teaching approach involves understanding individual components first before delving into the complete architecture.
03:07 *🛠️ Prerequisites and Preparation*
- Emphasis on prerequisites for understanding Transformer architecture, including prior knowledge of self-attention, multi-head attention, positional encoding, and layer normalization.
- The presenter has created a series of videos covering these foundational topics and recommends reviewing them to grasp the upcoming content on encoder and decoder architectures.
- The current video will focus specifically on the encoder architecture, while the decoder will be covered in subsequent videos.
05:06 *📊 Detailed Explanation of Encoder Architecture*
- The video starts the detailed exploration of the Transformer encoder architecture, using a complex diagram to represent the entire Transformer model, including both encoder and decoder.
- The presenter acknowledges the complexity of the diagram and aims to break down the encoder architecture in an accessible way for better understanding.
05:49 *🗺️ Simplified Transformer Architecture*
- The video simplifies the complex Transformer architecture diagram into two main components: the encoder and the decoder.
- A basic representation shows that the Transformer consists of an encoder box and a decoder box.
- The simplified model helps in understanding that there are multiple encoder and decoder blocks within these components.
07:13 *🏗️ Multi-Block Structure*
- The simplified model is expanded to show multiple encoder and decoder blocks, with six blocks of each in the original Transformer model as per the "Attention Is All You Need" paper.
- Each block within the encoder and decoder is identical, meaning understanding one block applies to all others.
- The focus will be on understanding a single encoder block to grasp the entire architecture.
09:11 *🔍 Detailed Encoder Block Breakdown*
- The detailed view of an encoder block reveals it consists of two main components: a self-attention block and a feed-forward neural network.
- The self-attention block is described as multi-head attention, and the feed-forward neural network is a key part of the encoder block's functionality.
- Additional components such as layer normalization and residual connections are also part of the encoder block's architecture.
10:18 *📈 Actual Encoder Block Architecture*
- The actual architecture of an encoder block is shown, including self-attention (multi-head attention) and feed-forward neural network blocks.
- The diagram includes additional elements like layer normalization and residual connections, highlighting the complexity beyond the simplified model.
- The video emphasizes understanding the detailed connections and components within an encoder block.
11:48 *🔄 Sequential Processing of Encoder Blocks*
- Outputs from one encoder block serve as inputs for the next encoder block, continuing through all blocks until the final output is sent to the decoder.
- The process involves multiple encoder blocks (six in the original Transformer model) that are sequentially connected.
- The main goal is to understand the functioning of these blocks by examining the processing within each one.
12:29 *🧩 Introduction to Detailed Example*
- A new page is introduced to explain the encoder architecture with a detailed example sentence.
- The goal is to track how an example sentence (e.g., "How are you") moves through the encoder and understand the encoding process.
- The explanation will involve breaking down each step and how the input sentence is processed within the encoder.
13:40 *✍️ Initial Operations on Input*
- Before the main encoding, the input sentence undergoes three key operations: tokenization, text vectorization, and positional encoding.
- Tokenization breaks the sentence into words, and text vectorization converts these words into numerical vectors using embeddings.
- Positional encoding adds information about word positions to maintain the sequence order.
14:51 *🔢 Tokenization and Vectorization*
- Tokenization splits the sentence into individual words, creating tokens like "How," "are," and "you."
- Text vectorization converts these tokens into 512-dimensional vectors using embeddings, which represent each word numerically.
- Positional encoding is applied to integrate information about word positions into the vectors.
17:25 *📍 Positional Encoding*
- Positional encoding provides positional information by generating a vector for each position in the sentence.
- These positional vectors are added to the word vectors to ensure the model can understand the order of words.
18:30 *🧩 Positional Encoding and Input Vector Integration*
- Positional encoding adds information about word positions to the input vectors to maintain the sequence order.
- This process integrates positional vectors with word vectors to ensure that the model understands the word sequence.
19:04 *🔄 Introduction to Encoder Block Operations*
- Detailed examination of the operations within the first encoder block, focusing on multi-head attention and normalization.
- Introduction of a new diagram to explain the functioning of these operations.
20:08 *🧠 Multi-Head Attention Mechanism*
- Multi-head attention applies multiple self-attention mechanisms to capture diverse contextual information.
- This process generates contextually aware vectors for each input word by considering surrounding words.
22:01 *➕ Addition and Normalization*
- After multi-head attention, addition and normalization are applied to maintain dimensional consistency and improve stability.
- A residual connection is used, where the original input vectors are added to the output of the multi-head attention block.
25:28 *📏 Layer Normalization Explained*
- Layer normalization standardizes each vector by calculating the mean and standard deviation for its components, adjusting them to a fixed range.
- This helps stabilize the training process by ensuring that values remain within a defined range, preventing large fluctuations.
27:00 *🔄 Purpose of Residual Connections*
- Residual connections (or skip connections) are used to add the original input vectors back to the output of the multi-head attention block.
- This mechanism helps in maintaining the flow of gradients and preserving the original information during training.
28:35 *🧠 Feed-Forward Network in Encoder*
- Introduction to the feed-forward neural network within the encoder block, including its architecture and function.
- The network consists of two layers: the first with 2048 neurons using ReLU activation and the second with 512 neurons using linear activation.
32:22 *📊 Feed-Forward Network Processing*
- The feed-forward network processes vectors by increasing their dimensionality, applying transformations, and then reducing the dimensionality back.
- The first layer increases the vector size from 512 to 2048, and the second layer reduces it back to 512.
35:04 *🔄 Skip Connections and Normalization*
- Skip connections bypass the feed-forward network output, adding the original vectors to the processed output.
- After addition, layer normalization is applied again to the resulting vectors.
38:01 *🔁 Encoder Block Repetition*
- The output vectors from one encoder block become the input for the next encoder block.
- Each encoder block contains its own set of parameters for weights and biases, even though the architecture is similar across blocks.
39:18 *🔄 Summary of Encoder Processing*
- A quick summary of the transformer encoder process from input to output.
- Input sentences undergo tokenization, embedding, and positional encoding.
41:34 *❓ Questions and Residual Connections*
- Discussion of the importance of residual connections in the encoder blocks.
- Residual connections help stabilize training by allowing gradients to flow more effectively through deep networks.
45:55 *🔍 Alternative Path in Multi-Head Attention*
- Discussion on providing an alternate path in case multi-head attention fails to perform effectively.
- Residual connections allow the use of original features if transformations are detrimental.
48:00 *🧩 Feed-Forward Neural Networks in Transformers*
- Exploration of why feed-forward neural networks are used in transformers alongside multi-head attention.
52:03 *🔢 Number of Encoder Blocks in Transformers*
- Multiple encoder blocks are used in transformers to effectively understand and represent language.
- A single encoder block does not provide satisfactory results for language comprehension.
Made with HARPA AI
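(For readers who want to map the summary above onto code, here is a rough PyTorch sketch of one encoder block with the sizes mentioned in the video, d_model = 512 and feed-forward width 2048, plus 8 attention heads as in the "Attention Is All You Need" paper. It is a simplified illustration with no dropout or masking, not the video's or the paper's reference implementation.)
```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # position-wise feed-forward network
            nn.Linear(d_model, d_ff),        # 512 -> 2048, ReLU activation
            nn.ReLU(),
            nn.Linear(d_ff, d_model),        # 2048 -> 512, linear activation
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # multi-head self-attention
        x = self.norm1(x + attn_out)         # Add & Norm (residual connection)
        x = self.norm2(x + self.ffn(x))      # Add & Norm around the feed-forward network
        return x

# six identical blocks stacked, as in the original transformer encoder
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
tokens = torch.randn(1, 3, 512)              # embedded + positionally encoded "How are you"
print(encoder(tokens).shape)                 # torch.Size([1, 3, 512])
```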
@anonymousman3014 4 months ago
Sir, one more time I am requesting you to complete the deep learning playlists ASAP. Please Sir🙏.
@AllInOne-gn4ve 4 months ago
Thanks a lot ❤❤❤❤! please sir, continue this playlist🙏🙏🙏🙏
@KumR 4 months ago
Thanks a lot Nitish. Great Video. Can I make a suggestion? Can you do one live QnA session to clarify any doubts?
@darkwraith8867 3 months ago
Sir, make a video on state space models and the Mamba architecture.
@AmitBiswas-hd3js 4 months ago
Please Sir, complete this transformer series ASAP.
@koushik7604 4 months ago
Wonderful! This is too good.
@karanhanda7771 4 months ago
Bro, I like your videos and your way of teaching. But when you refer to an older topic, please add a link to it, sir 🙏. It would be a bit easier for us.
@chinmoymodakturjo5293 4 months ago
Kindly drop videos regularly and complete the series please !
@RamandeepSingh_04 4 months ago
Thank you so much sir ❤🎉
@chiragsharma1428 4 months ago
Finally the wait is over. Thanks a lot, Sir.
@ESHAANMISHRA-pr7dh 4 months ago
Thank you sir for the video. I request you to please complete the playlist 🙏🙏🙏🙏
@akashrathore1388 3 months ago
Sir, please create a supplementary video on modules and packages, because it's a little bit confusing; it's the only topic left from the Python playlist.
@vinayakbhat9530 2 months ago
excellent
@Harshh811 3 months ago
Sir, when are you dropping the next lecture(Decoder)?
@kushagrabisht9596 4 months ago
Great content sir. Please launch deep learning course fast
@not_amanullah 4 months ago
This is helpful 🖤🤗
@srinivaspadhy9821 4 months ago
Sir, please bring the decoder architecture soon; I have my interview coming up. Thank you. May you grow even faster.
@the_princekrrazz 4 months ago
Sir, please upload videos in this series regularly. A humble request, please.
@SAPTAPARNAKUNDU-g9d 4 months ago
Sir, when will the video on the BERT architecture come?
@priyam8665 2 months ago
Sir, please make the decoder video of the transformer.
@aigaurav5024 4 months ago
The connections are used so that the previous information can be passed forward.
@gender121 4 months ago
OK, we are expecting the decoder soon as well… not too long to wait.
@RashidAli-jh8zu 3 months ago
As per my research, the reasons for the residual connection are: Architectural reason - the authors introduced residual connections to mitigate the vanishing gradient problem; since ReLU appears in the block, it may produce zero values, and the residual connection helps keep non-zero values flowing. General reason - the previous information (weights) is passed along with the processed information after every operation (self-attention and the NN layer), so that the model can weigh and decide which information gets passed on.
@AnshSingh-dk5uu 2 months ago
1. Leaky ReLU could have been used for the vanishing gradient problem. 2. You are literally just adding the vectors; there is no transformation to decide how much of which information should be carried to the next step.
@RashidAli-jh8zu 2 months ago
@AnshSingh-dk5uu Adding the vectors is essentially adding the information (weights). Backpropagating that information through the residual path, which keeps the residual information available for later use, helps improve its values and may lead the model to pay attention to a particular piece of the input sentence.
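(The thread above debates whether the residual path is mainly about gradient flow. A tiny, illustrative PyTorch experiment, my own sketch rather than anything from the video, comparing the gradient that reaches the input with and without a skip connection around a ReLU sub-layer; because the derivative of the identity branch is always 1, the skip path keeps a direct gradient route back to the input.)
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # a sub-layer containing ReLU
x = torch.randn(1, 512, requires_grad=True)

# gradient reaching x when the signal must pass through the sub-layer only
block(x).sum().backward()
grad_without_skip = x.grad.norm().item()

x.grad = None
# with the residual path, d(out)/dx = I + d(block)/dx, so the identity term
# guarantees a non-vanishing gradient route back to x
(x + block(x)).sum().backward()
grad_with_skip = x.grad.norm().item()

print(grad_without_skip, grad_with_skip)
```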
@PriyanshuMohanty-k7i 4 months ago
Sir, are there any more videos after this, or is this the last one?
@saurabhchaudhary907 4 months ago
Sir, please upload Transformer part 2, part 3 and the remaining videos.
@Thebeautyoftheworld1111 4 months ago
Please make a playlist on Gen AI 🙏
@durgeshameta254 4 months ago
You're a genius.
@shivshakti_1111_ 4 months ago
Sir, in the next video please cover the Vision Transformer.
@AllInOnekh 4 months ago
Finally .. thank you sir
@Arpi457 4 months ago
Finally transformer is here.
@himanshurathod4086 4 months ago
The man with 0 haters
@sudhanshukumar3976 3 months ago
Where is the notebook for this lecture? If anyone knows, please comment.
@princekhunt1 3 months ago
Sir, please attach the notes for these last lectures with the videos, or update the final PDF on the website. 🙏🏻🙏🏻🙏🏻🙏🏻🙏🏻
@KPraj786 4 months ago
Sir, kindly try to complete this Transformers series, and transfer learning along with it, as soon as possible. 🙏🙏🙏🙏
@Deepak-ip1se 4 months ago
Very nice video!!
@not_amanullah 4 months ago
Thanks ❤️
@electricalengineer5540 4 months ago
much awaited video
@SahilSk-l7u 4 months ago
sir please upload the decoder video
@rishabhkumar4360 4 months ago
waiting for part 2