What is generative modeling?
Generative models are a class of machine learning algorithms that aim to estimate the joint probability distribution over some variables of interest. There are two general goals for developing them: improving discriminative models and generating content. Given a dataset with supervision information, a discriminative model is usually employed to decode the target variable through a deterministic map of the input variables, capturing discriminative features. For instance, an image classifier trained to distinguish between cats and dogs will probably learn the appearance differences between the two classes, and nothing more. This can diminish generalization when there are not enough training examples of each pattern, since the model is likely to encounter inputs from different classes that differ only slightly. Generative models, on the other hand, learn features rich enough to capture everything needed to generate an object. In principle, using generative models for discriminative tasks should therefore improve performance, yet they are difficult to train due to their over-parameterized nature.

The second goal of generative models, highlighted by the release of OpenAI's ChatGPT, is sampling itself, or rather content generation, known as AIGC (AI-generated content). Content generation is the process by which a machine produces digital content that is close to reality. The motivation behind sampling can be rooted in human beings' capability for imagination, which assists them in planning and simulating the dynamic world. AIGC can be utilized in a wide range of domains associated with content generation or analysis.
Types of generative models
The most common approach to categorizing generative models is based on the types of the input prompt and the generated output. If the input prompt and the generated content are of the same type, the generative model is called unimodal; text-to-text and image-to-image models are examples. If they are of different modalities, the model is multimodal, as in text-to-image and image-to-text generation. Another way to classify generative models is by their target: generative models are either PDF-based or cost-based. PDF-based generative models directly learn the probability density function itself, enabling PDF evaluation, marginalization, and sampling. Cost-based models, by contrast, only mimic the sampling process by learning a mechanism for drawing samples from the same population the model was trained on. Energy-based models and Generative Adversarial Networks (GANs) are examples of the former and the latter, respectively.
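The PDF-based versus cost-based distinction can be sketched in a few lines of code. This is a minimal toy illustration, not any particular system: the "PDF-based" model is a one-dimensional Gaussian fit by maximum likelihood (so it exposes an explicit density), while the "cost-based" model is a stand-in generator function that, like a trained GAN generator, only maps noise to samples and offers no density at all.

```python
import numpy as np

# PDF-based model: a 1-D Gaussian fit by maximum likelihood.
# An explicit density means we can evaluate log p(x) AND sample.
class GaussianModel:
    def fit(self, data):
        self.mu, self.sigma = data.mean(), data.std()
        return self

    def log_prob(self, x):  # explicit density evaluation
        return (-0.5 * np.log(2 * np.pi * self.sigma**2)
                - (x - self.mu)**2 / (2 * self.sigma**2))

    def sample(self, n, rng):
        return rng.normal(self.mu, self.sigma, size=n)

# Cost-based (implicit) model: only a sampling mechanism, no density.
# A GAN generator has this shape: noise in, sample out.
def implicit_generator(noise):
    return 2.0 * noise + 1.0  # stand-in for a trained generator network

rng = np.random.default_rng(0)
data = rng.normal(1.0, 2.0, size=10_000)

pdf_model = GaussianModel().fit(data)   # pdf_model.mu is close to 1.0
samples = implicit_generator(rng.normal(size=5))
# Sampling works for both, but there is no log_prob for the implicit model.
```

The asymmetry in the interface is the point: both objects can produce samples, but only the PDF-based one can answer "how likely is this point?".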
History of generative models
Efforts to create generative models are not limited to recent years; they date back to the 1950s, when Gaussian mixture models (GMMs) and hidden Markov models (HMMs) were developed for modeling sequential data such as time series. However, their performance was frustratingly limited: they struggled with high-dimensional spaces and could only generate sequences with short-term dependencies. It was not until the advent of deep learning, stemming from the proposal of a fast greedy strategy for training deep belief networks in 2006, that generative models witnessed great progress in performance. Traditional generative models for natural language processing learn a distribution over a dictionary of words using N-gram language modeling and then search for highly probable sentences based on it. This approach could not efficiently generate long sequences, a problem addressed by introducing recurrent neural networks for language modeling, which allow long-term dependencies to be modeled. This was followed by the development of LSTM and GRU networks, which use gating mechanisms to control the information flow. Meanwhile, in computer vision, traditional generative algorithms before deep learning relied on texture synthesis based on hand-designed features to generate images. The introduction of GANs was a milestone in this era, showing great performance in many applications. This advancement was accompanied by variational autoencoders (VAEs) and, later, diffusion-based methods such as Stable Diffusion for generating high-quality images.
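The N-gram idea mentioned above can be illustrated with a toy bigram model (a minimal sketch on a made-up corpus, not any real system): count adjacent word pairs, then score a sentence as the product of the conditional probabilities P(w_i | w_{i-1}).

```python
from collections import Counter, defaultdict

# Toy corpus; a real N-gram model would be trained on a large text corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count bigrams: how often each word follows each previous word.
bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

def prob(prev, cur):
    """Conditional probability P(cur | prev) from the counts."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return counts[cur] / total if total else 0.0

def sentence_prob(words):
    """Score a sentence as the product of bigram probabilities."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= prob(prev, cur)
    return p

# "the" is followed by cat/mat/dog/rug once each, so P(cat|the) = 0.25,
# and P(sat|cat) = 1.0, giving the sentence probability 0.25.
print(sentence_prob("the cat sat".split()))  # 0.25
```

The weakness the text describes is visible here: each word depends on only the previous one (or the previous N-1 words), so long-range structure in a sentence is invisible to the model.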
Although in the early stages of deep generative models different areas developed in separate directions without much overlap, they began to intersect after the introduction of transformers, a deep learning architecture based on the self-attention mechanism, which automatically learns which parts of a sequence are the most important to attend to. After showing great performance on NLP tasks, transformers found their way into computer vision. Vision transformers (ViT) translate the input image into a sequence of patches and process it the way transformers process a sequence of tokens. Besides transformers, the striking performance of generative models also owes much to pre-training strategies. Pre-training is usually the first step of training a deep architecture, during which models learn only to understand the data. Pre-training methodologies depend on the area, though it is possible to customize one area's methods for another. There are two general pre-training paradigms in natural language processing: autoregressive language models and masked language models. Autoregressive language models are trained to estimate the current token from all previous tokens, while masked language models are trained to predict randomly masked tokens conditioned on the unmasked ones. GPT and BERT are well-known examples of the former and the latter, respectively. Depending on the task being solved with a machine learning solution, pre-training strategies can be designed arbitrarily, provided the approach ensures generalization and strong performance. The pre-training of visual models, on the other hand, is largely based on the contrastive learning paradigm, though there is also work in other directions, such as solving jigsaw puzzles and masked autoencoders inspired by masked language models.
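The self-attention mechanism at the heart of transformers can be sketched in a few lines of NumPy. This is a minimal single-head version (no multi-head splitting, layer norm, or feed-forward block); the `causal` flag shows how autoregressive models such as GPT restrict each token to attend only to earlier positions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, causal=False):
    # Project the input sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaled dot-product similarity between every pair of positions.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        # Autoregressive (GPT-style) masking: no attending to the future.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    # Softmax over each row, so attention weights sum to 1 per position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mix of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8                       # 4 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
out_causal, w_causal = self_attention(X, Wq, Wk, Wv, causal=True)
# Rows of `weights` show how much each token attends to every other token;
# in `w_causal` the entries above the diagonal are zero.
```

A ViT follows the same recipe, except that `X` is built by flattening image patches into vectors rather than embedding text tokens.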
