*******************************************************************

*[Papers Xplained Series]: The intention behind this series of posts is to explain the gist of famous Deep Learning Research Papers.*

*******************************************************************

The paper in discussion is **“Neural Machine Translation by Jointly Learning to Align and Translate”** by *Dzmitry Bahdanau, KyungHyun Cho & Yoshua Bengio*.

*Topics covered in this article are:*

- *Mathematical Representation of Words*
- *What is Word Embedding?*
- *Three methods of generating Word Embeddings, namely: i) Dimensionality Reduction, ii) Neural Network-based, iii) Co-occurrence or Count-based.*
- *A short introduction to Word2Vec, Skip-Gram, and Continuous Bag of Words (CBoW) models.*
- *What is GloVe Word Embedding?*
- *Mathematics behind the GloVe model from the original paper*
- *How to use GloVe in TensorFlow? (a minimal sketch follows below)*
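As a quick preview of the last topic, here is a minimal sketch of loading pre-trained GloVe vectors into a frozen tf.keras Embedding layer. The file name `glove.6B.100d.txt`, the 100-dimensional vectors, and the toy vocabulary are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: loading pre-trained GloVe vectors into a frozen
# tf.keras Embedding layer. The file "glove.6B.100d.txt" (one word
# followed by 100 floats per line) and the toy vocabulary below are
# illustrative assumptions, not part of the original article.
import numpy as np
import tensorflow as tf

EMBEDDING_DIM = 100
vocab = {"the": 1, "cat": 2, "sat": 3}  # hypothetical word -> index map; 0 is reserved for padding

# Parse the GloVe text file into a {word: vector} dictionary.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Build the embedding matrix; words missing from GloVe remain zero vectors.
embedding_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
for word, idx in vocab.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# Freeze the pre-trained vectors inside an Embedding layer.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab) + 1,
    output_dim=EMBEDDING_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
```

Freezing the layer (`trainable=False`) keeps the pre-trained geometry intact; setting it to `True` instead would fine-tune the vectors on the downstream task.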

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that deals with processing, understanding, and modeling Human Language.

The main challenge in modeling Human Language is that the language construct is in the form…

*Deep Learning, which is based on multilayer Neural Networks, has achieved state-of-the-art results in most domains today. In this post, we will look at the Universal Approximation Theorem, one of the fundamental theorems on which the entire concept of Deep Learning is based. We will make use of a Lego-blocks analogy and illustrations to understand it.*

“Neural Networks have an excellent power to represent functions: a Feed-Forward Neural Net with a single hidden layer containing a finite number of neurons can approximate any continuous function”

In order to make sense of it…
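One way to make sense of the quoted claim is to see the theorem in symbols. The following is a sketch of the classical Cybenko/Hornik formulation; the notation is assumed here and does not come from the original post.

```latex
% Sketch of the classical statement: for any continuous f on a compact
% set K and any eps > 0, some one-hidden-layer network G with N neurons
% and activation sigma is uniformly within eps of f on K.
G(x) = \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right),
\qquad
\sup_{x \in K} \bigl| G(x) - f(x) \bigr| < \varepsilon
```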

Gradient descent is → an iterative optimization algorithm

→ for finding a local minimum of a function

→ by taking small steps

→ proportional to the negative of the gradient (the opposite direction of the gradient) of the function at the current point (written out in symbols below).
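In symbols, one step of this procedure is usually written as follows (standard notation assumed here; the learning rate η is not named in the original text):

```latex
% One gradient-descent step: move from w_t against the gradient of the
% loss J, scaled by the learning rate eta (notation assumed here).
w_{t+1} = w_t - \eta \, \nabla J(w_t)
```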

Gradient Descent is performed by taking small steps from a randomly initialized point on the Loss Function J(w) until we eventually reach its minimum.

*Note: Here, J(w) denotes the Loss (Cost) Function of the weights w; the first-order derivative vector used at each step is its gradient, ∇J(w).*

We assume that the Loss Function is Convex in nature (bowl-shaped). This helps us to consider the minima computation as a Convex…
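Continuing the bowl-shaped picture, here is a minimal NumPy sketch of gradient descent on a simple convex quadratic loss. The loss function, learning rate, and step count are illustrative assumptions, not taken from the original post.

```python
# Minimal sketch: gradient descent on a simple convex quadratic loss
# J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3).
# The loss, learning rate, and step count are illustrative assumptions.
import numpy as np

def grad_J(w):
    """Gradient of J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = np.random.uniform(-10.0, 10.0)  # random initialization
eta = 0.1                           # learning rate (size of each step)

for _ in range(100):
    w -= eta * grad_J(w)  # step in the direction opposite the gradient

print(w)  # approaches the minimum at w = 3
```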

**User Story Mapping**, proposed by **Jeff Patton**, is an effective method of visualising the multiple **Minimum Viable Products (MVP)** residing within our Product Backlog.

However, it is not always easy to map User Stories simply based on the **Business Value** delivered collectively. *Inexperienced teams* and *tech-focused teams* may find it harder to visualise and agree upon the MVP way of mapping stories.

In order to overcome this limitation, in 2012, Gojko Adzic proposed a new way of splitting (not mapping) the User Stories, called the **“Hamburger Method”**, based on the technical steps involved. …
