Transformers

MULTIMODAL MODELS

  • ALIGN
  • AltCLIP
  • BLIP
  • BLIP-2
  • BridgeTower
  • BROS
  • Chinese-CLIP
  • CLIP
  • CLIPSeg
  • Data2Vec
  • DePlot
  • Donut
  • FLAVA
  • GIT
  • GroupViT
  • IDEFICS
  • InstructBLIP
  • LayoutLM
  • LayoutLMV2
  • LayoutLMV3
  • LayoutXLM
  • LiLT
  • LXMERT
  • MatCha
  • MGP-STR
  • Nougat
  • OneFormer
  • OWL-ViT
  • Perceiver
  • Pix2Struct
  • Segment Anything
  • Speech Encoder Decoder Models
  • TAPAS
  • TrOCR
  • TVLT
  • ViLT
  • Vision Encoder Decoder Models
  • Vision Text Dual Encoder
  • VisualBERT
  • X-CLIP