I came across some questions/Articles related to BERT-embeddings:
- Why do BERT embeddings work so well for machine learning?
- Do pre-trained embeddings as ELMo and BERT work well on noisy data like tweets?
- Demystifying BERT: The Groundbreaking NLP Framework
- Understanding BERT: Is it a Game Changer in NLP, and many more…
There is no clear answer. However, a question that naturally arises is
What makes BERT so effective in such a wide variety of tasks?
As per the recent set of papers, one of the main ingredients is its unique embeddings.
BERT has a fixed size vocabulary of words/subwords (wordpiece embeddings) — any input word is mapped to these words/subwords. These learned raw vectors are similar to the vector output of a word2vec model — a single vector represents a word regardless of its different meanings or senses. For instance, all the different senses/meanings (cell phone, biological cell, prison cell) of a word like “cell” are combined into a single vector.
For instance, in the task of predicting a word in a sentence (no fine-tuning is required for this prediction task since it is the same as the training objective), all the words in a sentence are transformed by the model into context-dependent representations. A word like “cell” would shed its “mobile” and “prison” sense and only retain its “biological” sense in a sentence like “There are many organelles in a biological cell”. In a sentence like “He went to a prison cell with a cell phone to capture blood cell samples from sick inmates”, the three separate vectors output by the model for the word “cell” would have the three senses separately captured in them as opposed to the original input vector for cell that had all of these senses mixed.
- Some common words like “the” or even uncommon ones like “quantum”, “Constantinople” are present in BERT vocabulary(base and large model vocab) — so it is a direct mapping for these words. But a word like electrodynamics is absent. So it has broken down into 4 subwords- electro ##dy ##nami ##cs where these subwords are present in the vocab. As an aside, one practical advantage of a fixed size vocab is, loading the BERT model into GPU is not limited by the number of unique words in a corpus — any corpus regardless of its unique vocab size is represented by the ~30k subword vocab.
Examining BERT’s learned raw vectors (about 30,000 vectors — roughly 78% of them complete words of the form “cell”,” protein”, and 22 % partial words or subwords of the form “##os”. For instance, the word “icos” is represented during input to a BERT model as two vectors — “ic” and “##os”) show
- They capture different forms of similarity — semantic (crown, throne, monarch, queen ), syntactic (he, she, they), word inflections (carry, carries, carrying), spelling( Mexico, México, Mexican ) phonetic similarity across languages (##kan, ##カ — Katakana letter ka , ##क — Devanagiri letter ka; this similarity perhaps in part explains the performance of transformers in machine translation). In essence, the embedding space of raw vectors is a mixed grab bag of different types of similarities illustrated by the examples above.
These raw vectors along with a masked language model could be used for a variety of tasks
- Given a term, identify its different senses and its predominant sense (again captured using terms in BERT’s vocabulary). For instance, the term “cell” has multiple senses as described earlier. After fine-tuning a model on a domain-specific corpus, one of the senses may predominate the other (e.g. fine-tuning a model on a biomedical corpus may make the “biological cell” sense of the word “cell” predominate the other two senses).
- Given two or more terms, identify any sense in common between them (sense captured using terms in BERT vocabulary). For instance, two drugs would share common descriptors like drug, drugs, treatment.
- cluster entities of a particular type (this typically works well only for entity types that are distinct from each other)
- Unsupervised “weak entity recognition” — we can tag a proper noun (e.g. a drug like atorvastatin ) with a set of common nouns present in the BERT vocabulary like drug, drugs, treatment, therapy. These common nouns could serve as entity tagging proxies for the proper noun. There could be instances where proper nouns(see vocabulary stats of upper case terms below) in the vocabulary serve as proxies for entities (e.g. names of persons — johnson, smith, cohen may serve as descriptors for a person’s entity type)
Such a comprehensive embedding scheme contains a lot of useful information for the model.
These combinations of preprocessing steps make BERT so versatile. This implies that without making any major change in the model’s architecture, we can easily train it on multiple kinds of NLP tasks.