[Papers Xplained Series]: The aim of this series of posts is to explain the gist of famous Deep Learning research papers.


Photo by Jeffrey Brandjes on Unsplash


Importance of the Paper:

The paper in discussion is “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, KyungHyun Cho & Yoshua Bengio.

Topics covered in this article are:

  1. Mathematical Representation of Words
  2. What is Word Embedding?
  3. Three methods of generating Word Embeddings namely: i) Dimensionality Reduction, ii) Neural Network-based, iii) Co-occurrence or Count based.
  4. A short introduction to Word2Vec, Skip-Gram, and Continuous Bag of Words (CBoW) models.
  5. What is GloVe Word Embedding?
  6. Mathematics behind the GloVe model from the original paper
  7. How to use GloVe in TensorFlow?

Mathematical Representation of Words:

Natural Language Processing (NLP) is a subfield of Artificial Intelligence, which deals with processing, understanding, and modeling Human Language.

The main challenge in modeling Human Language is that the language construct is in the form…

Source: CS231N Convolutional Neural Networks by Andrej Karpathy — Stanford University

Deep Learning, which is built on multilayer neural networks, has achieved state-of-the-art results across most domains today. In this post, we will look at the Universal Approximation Theorem — one of the fundamental theorems on which the entire concept of Deep Learning is based. We will use a Lego-blocks analogy and illustrations to understand it.

The Universal Approximation Theorem (Cybenko, 1989) states that:

“Neural networks have excellent representational power: a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function (on a compact domain) to arbitrary accuracy.”

In order to make sense of it…
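One way to build intuition for the theorem is a constructive sketch: a single hidden layer of ReLU units can reproduce any piecewise-linear interpolant, and a piecewise-linear interpolant can approximate a continuous function arbitrarily well as the number of pieces grows. The target function sin(x) and the knot count below are illustrative choices, not from the original article:

```python
import numpy as np

# One-hidden-layer ReLU network that interpolates f at evenly spaced
# knots: f_hat(x) = f(x0) + sum_i c_i * relu(x - x_i), where each
# coefficient c_i is the change in slope at knot x_i.

f = np.sin
knots = np.linspace(0, np.pi, 20)       # one hidden unit per knot
h = knots[1] - knots[0]
slopes = np.diff(f(knots)) / h          # slope of each linear piece
weights = np.diff(slopes, prepend=0.0)  # slope changes -> ReLU coefficients

def net(x):
    # hidden layer: relu(x - knot_i); output layer: linear combination
    hidden = np.maximum(0.0, x[:, None] - knots[:-1][None, :])
    return f(knots[0]) + hidden @ weights

x = np.linspace(0, np.pi, 1000)
err = np.max(np.abs(net(x) - f(x)))
print(err)  # the error shrinks as more hidden units (knots) are added
```

Adding more hidden units (knots) makes the approximation error as small as we like, which is exactly what the theorem guarantees for continuous functions.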

Photo by Pierre Van Crombrugghe on Unsplash

Gradient Descent Learning Algorithm:

Gradient descent is —

→ an iterative optimization algorithm

→ for finding the local minimum of a function

→ by taking small steps

→ proportional to the negative of the gradient (opposite direction of the gradient) of the function at the current point.
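The steps above can be sketched in a few lines. A minimal illustration on the convex toy loss J(w) = (w − 3)², where the minimum at w = 3, the learning rate, and the iteration count are illustrative choices:

```python
# Gradient descent on a simple convex loss J(w) = (w - 3)**2.
# The update rule is w <- w - lr * dJ/dw: a small step proportional
# to the negative of the gradient at the current point.

def grad_J(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = 10.0                      # arbitrary starting point
lr = 0.1                      # step size (learning rate)
for _ in range(100):
    w = w - lr * grad_J(w)

print(round(w, 4))            # converges toward the minimum at w = 3
```

Because the step is proportional to the gradient, the updates naturally shrink as w approaches the minimum, where the gradient goes to zero.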

Loss Function:

Gradient Descent is performed by taking small steps from a randomly initialized point on the Loss Function J(w) to eventually reach its minimum.

Note: Here, J(w) denotes the loss (cost) as a function of the weights w; its gradient ∇J(w) is the vector of first-order partial derivatives used in the update.

We assume that the Loss Function is convex (bowl-shaped). This lets us treat finding the minimum as a Convex…

User Story Mapping, proposed by Jeff Patton, is an effective method of visualising the multiple Minimum Viable Products (MVPs) residing within our Product Backlog.

However, it is not always easy to map User Stories simply based on the Business Value delivered collectively. Inexperienced teams and tech-focused teams may find it harder to visualise and agree upon the MVP way of mapping stories.

In order to overcome this limitation, in 2012 Gojko Adzic proposed a new way of splitting (not mapping) User Stories, called the “Hamburger Method”, based on the technical steps involved. …

Kovendhan Venugopal

Artificial Intelligence | Deep Learning | Passionate about Stories
