1Multimedia Lab, The Chinese University of Hong Kong
2OpenGVLab, Shanghai AI Laboratory
*Indicates Equal Contribution
Unified Multimodal Learning. Meta-Transformer uses the same backbone to encode natural language, images, point clouds, audio, video, infrared, hyperspectral, and X-ray data, as well as time-series, tabular, and graph data. It shows the potential of transformer architectures for general perception.
Modalities
Abstract
Multimodal learning aims to leverage data from multiple modalities to improve model capability. Despite years of development in this field, it remains challenging to design a unified framework for processing natural language, 2D images, 3D point clouds, and audio spectrograms, due to the inherent gaps among these modalities. This work demonstrates that a network with frozen parameters can encode data from all four of these modalities and achieve favorable performance, resulting in a unified framework called Meta-Transformer. In this framework, raw input data from each modality is mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features. Composed of three main components, a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is, to the best of our knowledge, the first framework to perform unified learning across these four modalities with unpaired data. We evaluate Meta-Transformer on various benchmarks across modalities, such as ImageNet for image classification, GLUE for text understanding, ModelNet-40, S3DIS, and ShapeNetPart for point clouds, and Speech Commands V2 for speech spectrograms. These results indicate a promising direction for developing unified multimodal intelligence with transformers.
Meta-Transformer
Illustration of the Unified Multimodal Learning framework for natural language, images, point clouds, and audio spectrograms. An all-to-one tokenizer converts raw input data from different modalities into a shared token space. A modality-shared encoder with frozen parameters then extracts high-level semantic features of the input data. Finally, task-specific heads handle the downstream tasks. This framework makes it possible to perceive different modalities with one shared encoder and without paired data.
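To make the data flow concrete, below is a minimal PyTorch sketch of this three-stage design. This is not the released implementation: the class and parameter names are ours, and a plain `nn.TransformerEncoder` stands in for the pretrained transformer backbone.

```python
import torch
import torch.nn as nn

class MetaTransformerPipeline(nn.Module):
    """Sketch of the three-stage design: a per-modality tokenizer,
    a frozen modality-shared encoder, and a lightweight task head."""

    def __init__(self, tokenizer: nn.Module, embed_dim: int = 768,
                 depth: int = 12, num_classes: int = 1000):
        super().__init__()
        # Stage 1: modality-specific tokenizer (trainable).
        self.tokenizer = tokenizer
        # Stage 2: modality-shared transformer encoder (stand-in for the
        # pretrained backbone used in the paper).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=12, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Freeze the shared encoder, as described in the abstract.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Stage 3: task-specific head (trainable).
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(x)           # (B, N, D) shared token space
        feats = self.encoder(tokens)         # frozen semantic encoding
        return self.head(feats.mean(dim=1))  # pool tokens, then predict
```

In this sketch only the tokenizer and the head receive gradients, matching the claim that the shared encoder keeps its parameters frozen across modalities.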
Tokenizer
We propose the meta tokenization scheme in (a), consisting of grouping, convolution, and transformation steps. Panels (b)-(e) show the building blocks that instantiate this meta scheme for text, images, point clouds, and audio spectrograms.
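As an illustration of the grouping-convolution-transformation steps, here is a hedged sketch of what the image branch might look like. It is our own simplification following the standard patch-embedding pattern, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Sketch of the meta tokenization scheme applied to images:
    grouping (split pixels into patches), convolution (project each
    patch), and transformation (flatten patches into a token sequence)."""

    def __init__(self, patch_size: int = 16, in_chans: int = 3,
                 embed_dim: int = 768):
        super().__init__()
        # A strided convolution both groups pixels into non-overlapping
        # patches and linearly projects each patch to the token dimension.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W) -> patch grid (B, D, H/ps, W/ps)
        x = self.proj(images)
        # Transformation: flatten the grid into a token sequence (B, N, D).
        return x.flatten(2).transpose(1, 2)

# Usage: an image batch becomes tokens the shared encoder can consume.
tokens = ImageTokenizer()(torch.randn(2, 3, 224, 224))  # (2, 196, 768)
```

The other modalities would follow the same recipe with different grouping units, e.g. word pieces for text or neighborhoods of points for point clouds.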
Experiment
We evaluate Meta-Transformer on a wide range of modalities, including 2D images, natural language, 3D point clouds, audio spectrograms, time-series data, and more.
Compared with existing state-of-the-art methods, Meta-Transformer delivers strong performance.
Table 1: Experimental results for text understanding on the GLUE benchmark. We compare against existing advanced methods on paraphrasing, sentiment, duplication, inference, and question-answering tasks, and report pre-training settings and performance.
Table 2: Experimental results for image understanding. We conduct experiments on classification, object detection, and instance segmentation tasks on the ImageNet [23], MSCOCO [71], and ADE20K [74] datasets. ∗ denotes zero-shot image classification, † denotes linear probing for image classification, and ‡ indicates that the model is pre-trained on ImageNet-22K [23]. Bold and underline indicate the best and second-best results.
Table 3: Experimental results for infrared and hyperspectral image understanding. We conduct classification experiments on the SYSU-MM01 and Indian Pine datasets. We report Rank-1 (R@1) and Top-1 accuracy scores, and the number of trainable parameters (Params).
Table 4: Experimental results for point cloud understanding. We conduct experiments on the ModelNet-40 [25], S3DIS [26], and ShapeNetPart [27] datasets. We compare against existing advanced methods on classification, semantic segmentation, and object part segmentation tasks, and report the pre-training modality (Pre-train) and number of trainable parameters (Param.) for each method.
Table 6: Time-series forecasting with Meta-Transformer. Following TimesNet, we report the number of trainable parameters and the average performance over four prediction lengths: {96, 192, 336, 720}.
BibTeX
If you find our work useful, please cite our paper. The BibTeX entry is provided below:
@article{zhang2023metatransformer,
  title={Meta-Transformer: A Unified Framework for Multimodal Learning},
  author={Zhang, Yiyuan and Gong, Kaixiong and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Ouyang, Wanli and Yue, Xiangyu},
  year={2023},
  journal={arXiv preprint arXiv:2307.10802},
}