I am studying for the GCP Architect exam and have been going through many blogs and sites that offer a good amount of information. I would like to consolidate my reading and share it, so that anyone preparing for the exam in the future may find it useful.

Let’s start with some basic understanding of cloud.

What is cloud computing?

Cloud computing is an information technology (IT) paradigm that enables ubiquitous access to shared pools of configurable system resources and higher-level services that can be rapidly provisioned with minimal management effort, often over the Internet.

Benefits of cloud computing

“Pay-as-you-go” basis, use only what you need

Also known as ‘on demand’ computing

Convert capital expenses into operating expenses

Focus on rapid innovation

Productivity is enhanced because no software needs to be installed

“Vertically integrated” stacks enhance functionality, performance, reliability, and security

Cloud computing deployment models

Types of cloud computing services

Although cloud computing has changed over time, it has been divided into three broad service categories: infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS).

In the next part, we will explore the different cloud platforms available.

References:

https://searchcloudcomputing.techtarget.com/definition/cloud-computing


I am reading the book “Artificial Intelligence Now” by O’Reilly Media, and the information it provides is very useful. I thought of sharing my experiences and learning in this series.

The very first thing that excites me is the way the AI landscape is growing year by year (please see below).

Source: http://www.shivonzilis.com/. Please go through it; it is an excellent article on AI.

The journey from Vanilla Neural network to BERT architecture (NLU)


Natural language understanding (NLU) is the ability of machines to understand human language.

NLU refers to how unstructured data is rearranged so that machines may “understand” and analyze it.

On the other hand, we have BERT, one of the most path-breaking developments in the field of NLU: an NLP model that outperforms traditional NLP models on a wide range of tasks.

Below, I have tried to briefly describe the journey from Vanilla Neural network to BERT architecture to achieve real-time NLU.

The above figure shows a lot of activity. As we explore it further, we will see that every succeeding architecture has tried to overcome the problems of the previous one and to understand language more accurately.

Let’s explore them (for the examples, we have taken a basic QA system as the scenario):

Vanilla Neural Network

An artificial neural network consists of a collection of simulated neurons. Each neuron is a node that is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight, which determines the strength of one node’s influence on another.

However, a vanilla neural network takes a fixed-size vector as input, which limits its usage in situations that involve a “series” type input with no predetermined size (e.g., in a QA system, a question could only ever be 30 words long).
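To make the fixed-size limitation concrete, here is a minimal sketch of what a fixed-input model forces on us: every question has to be truncated or padded to the same length. The 30-word limit comes from the example above; the function name and the `<pad>` token are assumptions made only for illustration.

```python
MAX_WORDS = 30  # fixed input size, taken from the QA example above


def to_fixed_length(question, max_words=MAX_WORDS, pad_token="<pad>"):
    """Truncate or pad a question so it always has exactly max_words tokens."""
    tokens = question.split()[:max_words]  # anything past the limit is lost
    return tokens + [pad_token] * (max_words - len(tokens))


short = to_fixed_length("Where is the nearest station?")
long = to_fixed_length(" ".join(["word"] * 50))
print(len(short), len(long))
```

Both questions end up with the same length, but the long one has lost its last 20 words, which is exactly the kind of restriction the next architecture tries to remove.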

Recurrent Neural Network

Recurrent Neural Networks (RNNs) add an interesting twist to basic neural networks. RNNs are designed to take a series of inputs with no predetermined limit on size. (i.e.: Question can be of any length).
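As a rough sketch of why an RNN has no fixed input size, the toy recurrence below reuses the same weights at every time step and carries a hidden state forward. It uses a single scalar state and made-up weights (`w_x`, `w_h`), so it only illustrates the idea and is not a trained model.

```python
import math


def rnn_final_state(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a toy scalar RNN over a sequence of any length.

    The same two weights are applied at every step, so the number of
    inputs does not have to be known in advance.
    """
    h = h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # new state depends on input and old state
    return h


print(rnn_final_state([1.0, 0.0, -1.0]))
print(rnn_final_state([1.0] * 100))
```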

But Recurrent Neural Networks also suffer from short-term memory. If a sequence is long enough, they’ll have a hard time carrying information from earlier time steps to later ones. So if you are trying to process a paragraph of text to make predictions, RNNs may leave out important information from the beginning.


LSTMs and GRUs were created as the solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information.

Encoder-Decoder Sequence to Sequence LSTM based RNNs

Encoder-Decoder or Sequence to Sequence RNN is used a lot in translation services. The basic idea is that there are two RNNs, one an encoder that keeps updating its hidden state and produces a final single “Context” output. This is then fed to the decoder, which translates this context to a sequence of outputs. Another key difference in this arrangement is that the length of the input sequence and the length of the output sequence need not necessarily be the same.



The main issue with this encoder-decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector/context vector. This turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences.



How to overcome the above limitation:


A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”. Instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder, which can then focus on the most relevant ones at each output step.
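The idea can be sketched in a few lines: the decoder scores every encoder hidden state, turns the scores into weights with a softmax, and builds a context vector as the weighted sum. This sketch uses plain dot-product scoring as a simplification (the papers describe several scoring functions), and the vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    # Dot-product score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: weighted sum of all encoder hidden states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input word
weights, context = attend([1.0, 1.0], encoder_states)
print(weights, context)
```

The third encoder state is the most similar to the decoder state, so it receives the largest weight and dominates the context vector.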



“Transformer”: Multi-Head Self Attention

In the paper “Attention Is All You Need”, Google introduced the Transformer, a neural network architecture based on a self-attention mechanism that is believed to be particularly well suited for language understanding.

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality.

The attention mechanism in the Transformer can be interpreted as a way of computing the relevance of a set of values (information) based on some keys and queries.
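A minimal sketch of this query/key/value view, following the scaled dot-product form from the paper: each query is compared with every key, the scores are scaled by the square root of the key dimension and passed through a softmax, and the resulting weights mix the values. It is a single head written in plain Python with toy numbers, without the learned projections or multiple heads of the real model.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors (plain Python lists)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of each key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output: relevance-weighted mix of the values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0]]
result = scaled_dot_product_attention([[10.0, 0.0]], keys, values)
print(result)
```

Because the query points strongly in the direction of the first key, the output is very close to the first value.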


Unlike unidirectional language models, BERT uses a bidirectional Transformer. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering.

BERT Structure for QA System:


As the complexity of the model increases, we need more data to train it. In the case of BERT, Google has already pre-trained it on a very large text corpus (BooksCorpus and English Wikipedia) and has released the checkpoints publicly. We can use those checkpoints as a starting point that already understands the language, and then fine-tune the model on our own data to make it more accurate for our requirement.


Attention: Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

Rasa (contextual AI assistants)

Rasa is an open source machine learning framework for building AI text- and voice-based assistants. Rasa provides a set of tools to build a complete chatbot on your local desktop, completely free.

Why Rasa?

Rasa lets you build a contextual AI assistant: one that can capture the context of what the user is talking about, understand and respond to different and unexpected inputs, and gracefully handle unexpected dialogue turns.

Rasa consists of two main components:

  * Rasa NLU

  * Rasa Core

Rasa NLU:

Rasa NLU is something like an ear to your assistant which enables the assistant to understand what has been said by the user.

It takes input from the user which is unstructured and extracts structured data in the form of intents and entities (LABELS).


An intent represents the purpose of a user’s input. You define an intent for each type of user request you want your application to support.



·   Intent: searching restaurants

·  What are the non-veg restaurants present in Hyderabad?

·  I am looking for veg restaurants in Pune?

·  Are there any vegetarian restaurants in Chennai?

The above questions come under the intent “searching restaurants”. If a user asks any similar kind of question, the assistant will classify the intent as “searching restaurants”. The more training data there is, the better the bot gets trained.


Entities are the pieces of information that the assistant needs to pick out of a user’s message. The process of extracting these required pieces of information from a user’s text is called entity recognition.

From the above example, “Are there any vegetarian restaurants in Chennai?”, the entities extracted would be:

  Chennai = location, Restaurants = facility type.

By using the intent and the entities, the assistant can understand what the user is talking about.
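To show what “structured data in the form of intents and entities” can look like, here is a toy sketch that turns the example question into a small dictionary. It uses a simple keyword lookup purely for illustration; Rasa NLU itself learns this from training examples, and the names used here (`parse_message`, `searching_restaurants`, `facility_type`) are hypothetical.

```python
KNOWN_LOCATIONS = {"hyderabad", "pune", "chennai"}  # toy lookup table
FACILITY_WORDS = {"restaurant", "restaurants"}


def parse_message(text):
    """Return a toy structured result with an intent and a list of entities."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    entities = []
    for w in words:
        if w in KNOWN_LOCATIONS:
            entities.append({"entity": "location", "value": w})
        elif w in FACILITY_WORDS:
            entities.append({"entity": "facility_type", "value": w})
    has_facility = any(e["entity"] == "facility_type" for e in entities)
    intent = "searching_restaurants" if has_facility else None
    return {"text": text, "intent": intent, "entities": entities}


result = parse_message("Are there any vegetarian restaurants in Chennai?")
print(result)
```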

 Rasa NLU file:

Rasa Core:

Rasa Core is also called dialogue management. It is something like the brain of the system.

Instead of creating rules, Rasa uses machine learning to learn conversational patterns from example conversational data and predicts how an assistant should respond based on the context, the history of the conversation and other details.

·        The training data in dialogue management is called stories.

·        A story starts with a double hash (##), which marks the name of the story.

·        Messages sent by the user are shown as lines starting with the asterisk symbol (*).

The responses of the assistant are expressed as action names. There are two types of actions in Rasa: “utterances” and “custom actions”.

Utterance actions are hardcoded messages that a bot can respond with. Custom actions, on the other hand, involve custom code being executed.

The custom code can be anything, for example a back-end integration such as making an API call or connecting to a database and extracting the required information.

All actions (both utterance actions and custom actions) executed by the assistant are shown as lines starting with a dash (-) followed by the name of the action.
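The story conventions described above (## for the story name, * for user messages, - for actions) can be illustrated with a small reader. This is only a sketch of the format, not Rasa’s own loader; the sample story and its intent and action names are hypothetical.

```python
SAMPLE_STORY = """\
## search restaurant path
* greet
  - utter_greet
* searching_restaurants
  - action_search_restaurants
"""


def parse_stories(text):
    """Split story text into named stories with ordered user/action steps."""
    stories = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("##"):  # story name
            stories.append({"name": line[2:].strip(), "steps": []})
        elif line.startswith("*") and stories:  # user message (intent)
            stories[-1]["steps"].append(("user", line[1:].strip()))
        elif line.startswith("-") and stories:  # assistant action
            stories[-1]["steps"].append(("action", line[1:].strip()))
    return stories


stories = parse_stories(SAMPLE_STORY)
print(stories)
```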

Stories file:  


·  A domain is a very important part of building a dialogue management model.

·  The domain file needs to contain all the intents, entities, and actions that are mentioned in the NLU and stories files.

·  The domain file also contains “responses”.

·  This is where you define the actual response the assistant will use when a specific utterance is predicted.

·  Each utterance can have more than one template and can include things like images.

·  The custom action code should be written in an actions.py file.

Rasa workflow:

 Choosing a Pipeline:

To enable our assistant to understand the intents and extract the entities that we defined in our NLU file, we have to build a model, which is done by a processing pipeline.

There are two pre-configured pipelines in rasa:

·        Pre-trained embeddings spaCy

·        Supervised embeddings

Pre-trained embeddings spaCy:

·        It can perform well with a small amount of training data.

·        It is not available for all languages.

·        If the chatbot is domain-specific, the pre-trained embeddings spaCy pipeline is not a good choice.

Supervised embeddings:

·        The models will pick up domain-specific vocabulary.

·        It can build assistants in any language.

·        It has the advantage of handling messages with multiple intents.

·        It needs more training data.

Training the Model:

As our NLU, stories, domain files, and pipeline are ready, we are good to go and can train our model by running the scripts in the terminal. Once the training is done, the model will be saved in the models folder.

After training is done, chat with the assistant and check whether it correctly predicts the entities and intents from the user input. Take different dialogue turns to see whether it can handle them. If it cannot, you need to make the necessary changes in the required files and re-train the model.

There are some more important things, such as slots, trackers, Rasa interactive, Rasa X, fallback actions, etc. These will be covered in the next part of the article.

BERT Embeddings

I came across some questions and articles related to BERT embeddings.

There is no clear answer. However, a question that naturally arises is:

What makes BERT so effective in such a wide variety of tasks?

According to a recent set of papers, one of the main ingredients is its unique embeddings.

BERT has a fixed size vocabulary of words/subwords (wordpiece embeddings) — any input word is mapped to these words/subwords. These learned raw vectors are similar to the vector output of a word2vec model — a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.

For instance, in the task of predicting a word in a sentence (no fine-tuning is required for this prediction task since it is the same as the training objective), all the words in a sentence are transformed by the model into context-dependent representations. A word like “cell” would shed its “mobile” and “prison” sense and only retain its “biological” sense in a sentence like “There are many organelles in a biological cell”. In a sentence like “He went to a prison cell with a cell phone to capture blood cell samples from sick inmates”, the three separate vectors output by the model for the word “cell” would have the three senses separately captured in them as opposed to the original input vector for cell that had all of these senses mixed.

  • Some common words like “the”, and even uncommon ones like “quantum” or “Constantinople”, are present in the BERT vocabulary (base and large model vocab), so these words map directly. But a word like “electrodynamics” is absent, so it is broken down into four subwords that are present in the vocab: electro ##dy ##nami ##cs. As an aside, one practical advantage of a fixed-size vocab is that loading the BERT model onto a GPU is not limited by the number of unique words in a corpus: any corpus, regardless of its unique vocab size, is represented by the ~30k subword vocab.
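The splitting of “electrodynamics” can be reproduced with a greedy longest-match-first sketch, which is the general strategy WordPiece tokenizers use at inference time. The tiny vocabulary below is a made-up stand-in for BERT’s ~30k entries, and the function is a simplified illustration rather than the library’s actual implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split a word into the longest subwords found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): only the entries needed for this example.
TOY_VOCAB = {"the", "quantum", "electro", "##dy", "##nami", "##cs"}

print(wordpiece_tokenize("quantum", TOY_VOCAB))
print(wordpiece_tokenize("electrodynamics", TOY_VOCAB))
```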

Examining BERT’s learned raw vectors shows the following. (There are about 30,000 vectors: roughly 78% of them are complete words of the form “cell” or “protein”, and 22% are partial words or subwords of the form “##os”. For instance, the word “icos” is represented during input to a BERT model as two vectors, “ic” and “##os”.)

  •  They capture different forms of similarity: semantic (crown, throne, monarch, queen), syntactic (he, she, they), word inflections (carry, carries, carrying), spelling (Mexico, México, Mexican), and phonetic similarity across languages (##kan; ##カ, the Katakana letter ka; ##क, the Devanagari letter ka; this similarity perhaps in part explains the performance of transformers in machine translation). In essence, the embedding space of raw vectors is a mixed grab bag of different types of similarities illustrated by the examples above.

These raw vectors, along with a masked language model, could be used for a variety of tasks:

  •  Given a term, identify its different senses and its predominant sense (again captured using terms in BERT’s vocabulary). For instance, the term “cell” has multiple senses as described earlier. After fine-tuning a model on a domain-specific corpus, one of the senses may predominate over the others (e.g. fine-tuning a model on a biomedical corpus may make the “biological cell” sense of the word “cell” predominate over the other two senses).
  •  Given two or more terms, identify any sense in common between them (sense captured using terms in BERT vocabulary). For instance, two drugs would share common descriptors like drug, drugs, treatment.
  •  Cluster entities of a particular type (this typically works well only for entity types that are distinct from each other).
  •  Unsupervised “weak entity recognition”: we can tag a proper noun (e.g. a drug like atorvastatin) with a set of common nouns present in the BERT vocabulary like drug, drugs, treatment, therapy. These common nouns could serve as entity tagging proxies for the proper noun. There could be instances where proper nouns (see vocabulary stats of upper case terms below) in the vocabulary serve as proxies for entities (e.g. names of persons such as johnson, smith, cohen may serve as descriptors for a person’s entity type).

Such a comprehensive embedding scheme contains a lot of useful information for the model.

These combinations of preprocessing steps make BERT so versatile. This implies that without making any major change in the model’s architecture, we can easily train it on multiple kinds of NLP tasks.

Understanding Elasticsearch

What is Elasticsearch?

Elasticsearch is an open source analytics and full-text search engine built on Apache Lucene. It performs powerful text search queries using a distributed inverted index and can also query structured data for analytics platforms. It is written in Java. Elasticsearch exposes a REST API for querying documents.
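For readers new to the term, an inverted index maps each term to the set of documents that contain it, so a search looks up terms instead of scanning every document. The sketch below is a toy, single-machine version with whitespace tokenization and made-up documents; Lucene’s real index adds analysis, scoring and much more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "distributed full text search engine",
    2: "relational database engine",
    3: "full text search with typos",
}
index = build_inverted_index(docs)
print(search(index, "full text"))
print(search(index, "engine"))
```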

Why Elasticsearch?

  • Using Elasticsearch we can query all types of data, including unstructured data, geographical locations and metrics.
  • Elasticsearch can search billions of records in just a few seconds. Thanks to its distributed architecture, it is highly scalable for huge indexes and can perform multiple searches simultaneously.
  • Multilingual support is available through the ICU plugin, which is based on the Lucene implementation of the Unicode text segmentation standard.
  • Its powerful full-text search helps in building search functionality such as auto-completion, correcting typos, highlighting matches, finding relevant records, finding synonyms and much more.
  • Elasticsearch is not really a business intelligence solution, but you can get a lot of valuable information out of the data that you store in it, which is useful for analytics platforms that analyse large amounts of data.

How does Elasticsearch work?

To understand the architecture of Elasticsearch, we need to know about nodes and clusters, which are its basic units. A cluster is a collection of nodes, where each node is a single server that stores part of the data added to the cluster as a whole. Each node works in concert with the other nodes in the cluster and forwards requests from one node to another using the transport layer. The cluster and each of its nodes are uniquely identified by name.

Cluster and its Nodes

In Elasticsearch, data is stored as documents, each of which is just a unit of information. A document in Elasticsearch corresponds to a row in a relational database. A document contains fields, which correspond to columns in a relational database. A document is essentially a JSON object. An example document:


{
  "First_name": "Abel",
  "Last_name": "Tesfaye",
  "Albums": ["After Hours", "Starboy", "Beauty Behind the Madness", "House of Balloons"]
}
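As a rough sketch of how such a document meets the REST API, the snippet below serializes it to JSON and prepares (without sending) an HTTP request that would store it under an index. The host `http://localhost:9200`, the index name `artists` and the document id `1` are assumptions made for illustration; the request is only constructed, so no running cluster is needed.

```python
import json
import urllib.request

document = {
    "First_name": "Abel",
    "Last_name": "Tesfaye",
    "Albums": ["After Hours", "Starboy", "Beauty Behind the Madness", "House of Balloons"],
}


def build_index_request(index, doc_id, doc, host="http://localhost:9200"):
    """Prepare a PUT request that would store doc at /<index>/_doc/<id>."""
    url = f"{host}/{index}/_doc/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_index_request("artists", 1, document)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) would require an Elasticsearch node listening at the assumed address.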


An Index with its Documents

A collection of logically related documents is called an index. Just as nodes are organized in a cluster, documents are organized in indices. An index in Elasticsearch corresponds to a table in a relational database. Just like nodes and clusters, documents and indices are uniquely identified: each index by its name and each document by its ID.

As noted earlier, Elasticsearch is very scalable, thanks to its distributed architecture. Let’s see how sharding helps in handling large indices and makes Elasticsearch scalable. Let’s say we have an index of size 1TB and two nodes, each of size 512GB. The index cannot be stored on a single node and needs to be distributed between the two. So when an index is large enough to exceed the hardware limits of a node, we break the data in the index into pieces called shards; this process is called sharding.

An Index is sharded into 4 shards

The shards are distributed between the nodes, and if we want to add more data to the cluster we can simply do so by adjusting the distribution of shards between the nodes. This is how sharding makes Elasticsearch highly scalable. Another advantage of the distributed architecture is that we can search both nodes in parallel, which increases the performance of the search.
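A simplified sketch of the idea: each document is routed to a shard by hashing its id modulo the number of shards, and the shards are then spread over the nodes. The CRC32 hash and the round-robin placement used here are stand-ins chosen for illustration (Elasticsearch uses its own hash function and allocation logic), and the node names are made up.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document: hash of the id modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def assign_shards_to_nodes(num_shards, nodes):
    """Spread shards over nodes round-robin (illustrative placement only)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


placement = assign_shards_to_nodes(4, ["node-1", "node-2"])
shard = shard_for("doc-42", 4)
print(placement)
print(shard, placement[shard])
```

The same document id always lands on the same shard, which is why a lookup by id only needs to contact the node holding that shard.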

End Note:

Elasticsearch is used by large companies and projects, some of which are Quora, Adobe, Facebook, Firefox, Netflix, SoundCloud and Stack Exchange. There are many other users of Elasticsearch, and there is a vibrant community. As a developer, one can use Elasticsearch just by knowing how to write a search query, without needing to know all of this, but it helps shed some light on how Elasticsearch works.