Autoformer

Overview

The Autoformer model was proposed in Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.

This model extends the Transformer into a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.

The abstract from the paper is the following:

Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.
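
To make the decomposition idea concrete, here is a minimal sketch of the moving-average split that the paper turns into an inner building block (an illustration only, not the library's internal layer; the class name is hypothetical, and the kernel size plays the role of moving_average in AutoformerConfig):

import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Split a series into seasonal and trend parts with a moving average (sketch)."""

    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1)

    def forward(self, x: torch.Tensor):
        # x: (batch_size, sequence_length, num_channels)
        # replicate-pad both ends so the moving average preserves the sequence length
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.permute(0, 2, 1)).permute(0, 2, 1)  # AvgPool1d expects (N, C, L)
        seasonal = x - trend  # what remains once the slow-moving trend is removed
        return seasonal, trend

seasonal, trend = SeriesDecomposition(kernel_size=25)(torch.randn(2, 96, 1))

Applied repeatedly inside the network, this split is what lets Autoformer refine the trend progressively, layer by layer, instead of removing it once during preprocessing.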

This model was contributed by elisim and kashif. The original code can be found here.

Resources

A list of official BOINC AI and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

  • Check out the Autoformer blog post on the HuggingFace blog: Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)

AutoformerConfig

class transformers.AutoformerConfig

( prediction_length: typing.Optional[int] = None, context_length: typing.Optional[int] = None, distribution_output: str = 'student_t', loss: str = 'nll', input_size: int = 1, lags_sequence: typing.List[int] = [1, 2, 3, 4, 5, 6, 7], scaling: bool = True, num_time_features: int = 0, num_dynamic_real_features: int = 0, num_static_categorical_features: int = 0, num_static_real_features: int = 0, cardinality: typing.Optional[typing.List[int]] = None, embedding_dimension: typing.Optional[typing.List[int]] = None, d_model: int = 64, encoder_attention_heads: int = 2, decoder_attention_heads: int = 2, encoder_layers: int = 2, decoder_layers: int = 2, encoder_ffn_dim: int = 32, decoder_ffn_dim: int = 32, activation_function: str = 'gelu', dropout: float = 0.1, encoder_layerdrop: float = 0.1, decoder_layerdrop: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, num_parallel_samples: int = 100, init_std: float = 0.02, use_cache: bool = True, is_encoder_decoder = True, label_length: int = 10, moving_average: int = 25, autocorrelation_factor: int = 3, **kwargs )

Parameters

  • prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model.

  • context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If unset, the context length will be the same as the prediction_length.

  • distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either "student_t", "normal" or "negative_binomial".

  • loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll), which is currently the only supported one.

  • input_size (int, optional, defaults to 1) — The size of the target variable, which is 1 by default for univariate targets and > 1 for multivariate targets.

  • lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series used as covariates, often dictated by the frequency of the data.

  • scaling (bool, optional, defaults to True) — Whether to scale the input targets.

  • num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.

  • num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real valued features.

  • num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.

  • num_static_real_features (int, optional, defaults to 0) — The number of static real valued features.

  • cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers of the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.

  • embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers of the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.

  • d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.

  • encoder_layers (int, optional, defaults to 2) — Number of encoder layers.

  • decoder_layers (int, optional, defaults to 2) — Number of decoder layers.

  • encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.

  • decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.

  • encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the "intermediate" (often named feed-forward) layer in the encoder.

  • decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the "intermediate" (often named feed-forward) layer in the decoder.

  • activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.

  • dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the encoder and decoder.

  • encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers of each encoder layer.

  • decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers of each decoder layer.

  • attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.

  • activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.

  • num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.

  • init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.

  • use_cache (bool, optional, defaults to True) — Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.

  • label_length (int, optional, defaults to 10) — Start token length of the Autoformer decoder, which is used for direct multi-step prediction (i.e. non-autoregressive generation).

  • moving_average (int, defaults to 25) — The window size of the moving average. In practice, it is the kernel size of the AvgPool1d in the Decomposition Layer.

  • autocorrelation_factor (int, defaults to 3) — "Attention" (i.e. Auto-Correlation mechanism) factor used to find the top-k autocorrelation delays, as illustrated in the sketch below. The paper recommends setting it to a value between 1 and 5.
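
As a rough sketch of what this factor controls (an illustration of the mechanism described in the paper, not the library's internal implementation; the function name is hypothetical), the Auto-Correlation block estimates autocorrelation cheaply with an FFT and keeps only about autocorrelation_factor * log(sequence_length) delays:

import math

import torch

def topk_autocorrelation_delays(x: torch.Tensor, autocorrelation_factor: int = 3):
    # x: (batch_size, sequence_length)
    # Wiener-Khinchin: the autocorrelation is the inverse FFT of the power spectrum
    spectrum = torch.fft.rfft(x, dim=-1)
    autocorr = torch.fft.irfft(spectrum * torch.conj(spectrum), n=x.size(-1), dim=-1)
    k = int(autocorrelation_factor * math.log(x.size(-1)))
    # keep the k delays with the strongest batch-averaged autocorrelation
    weights, delays = torch.topk(autocorr.mean(dim=0), k)
    return weights, delays

weights, delays = topk_autocorrelation_delays(torch.randn(4, 96))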

This is the configuration class to store the configuration of an AutoformerModel. It is used to instantiate an Autoformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Autoformer huggingface/autoformer-tourism-monthly architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import AutoformerConfig, AutoformerModel

>>> # Initializing a default Autoformer configuration
>>> configuration = AutoformerConfig()

>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = AutoformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
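
The defaults mirror the huggingface/autoformer-tourism-monthly architecture; anything else is configured through the fields documented above. As a small illustrative sketch (the values below are hypothetical, chosen only to show how the parameters fit together):

>>> # hypothetical daily dataset: predict 24 steps from 48 steps of context,
>>> # with weekly lags and two extra time features (e.g. day of week, age)
>>> configuration = AutoformerConfig(
...     prediction_length=24,
...     context_length=48,
...     lags_sequence=[1, 2, 3, 7],
...     num_time_features=2,
...     moving_average=25,
...     autocorrelation_factor=3,
... )
>>> model = AutoformerModel(configuration)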

AutoformerModel

class transformers.AutoformerModel

( config: AutoformerConfig )

Parameters

  • config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Autoformer Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( past_values: Tensor, past_time_features: Tensor, past_observed_mask: Tensor, static_categorical_features: typing.Optional[torch.Tensor] = None, static_real_features: typing.Optional[torch.Tensor] = None, future_values: typing.Optional[torch.Tensor] = None, future_time_features: typing.Optional[torch.Tensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.Tensor] = None, decoder_head_mask: typing.Optional[torch.Tensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, use_cache: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)

Parameters

  • past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, which serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past that are added to serve as "extra context". The past_values is what the Transformer encoder gets as input (together with optional additional features, such as static_categorical_features, static_real_features and past_time_features).

    The sequence length here is equal to context_length + max(config.lags_sequence).

    Missing values need to be replaced with zeros.

  • past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model will internally add to past_values. These could be things like "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). They could also be so-called "age" features, which basically help the model know "at which point in life" a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional time features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,

    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).

  • static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it adds to the values of the time series.

    Static categorical features are features which have the same value for all time steps (static over time).

    A typical example of a static categorical feature is a time series ID.

  • static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model adds to the values of the time series.

    Static real features are features which have the same value for all time steps (static over time).

    A typical example of a static real feature is promotion information.

  • future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, which serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.

    See the demo notebook and code snippets for details.

    Missing values need to be replaced with zeros.

  • future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model will internally add to future_values. These could be things like "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). They could also be so-called "age" features, which basically help the model know "at which point in life" a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)

A transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • trend (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Trend tensor for each time series.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, used to give the model inputs of the same magnitude and then to shift back to the original magnitude.

  • scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, used to give the model inputs of the same magnitude and then to rescale back to the original magnitude.

  • static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.

The AutoformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerModel

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = AutoformerModel.from_pretrained("huggingface/autoformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> last_hidden_state = outputs.last_hidden_state
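
Since Autoformer is a decomposition architecture, the same output also carries the trend component listed in the returns above, next to last_hidden_state:

>>> trend = outputs.trend  # trend tensor of shape (batch_size, sequence_length, hidden_size)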

AutoformerForPrediction

class transformers.AutoformerForPrediction

( config: AutoformerConfig )

Parameters

  • config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Autoformer Model with a distribution head on top for time-series forecasting. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( past_values: Tensor, past_time_features: Tensor, past_observed_mask: Tensor, static_categorical_features: typing.Optional[torch.Tensor] = None, static_real_features: typing.Optional[torch.Tensor] = None, future_values: typing.Optional[torch.Tensor] = None, future_time_features: typing.Optional[torch.Tensor] = None, future_observed_mask: typing.Optional[torch.Tensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.Tensor] = None, decoder_head_mask: typing.Optional[torch.Tensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, use_cache: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)

Parameters

  • past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, which serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past that are added to serve as "extra context". The past_values is what the Transformer encoder gets as input (together with optional additional features, such as static_categorical_features, static_real_features and past_time_features).

    The sequence length here is equal to context_length + max(config.lags_sequence).

    Missing values need to be replaced with zeros.

  • past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model will internally add to past_values. These could be things like "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). They could also be so-called "age" features, which basically help the model know "at which point in life" a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional time features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,

    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).

  • static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it adds to the values of the time series.

    Static categorical features are features which have the same value for all time steps (static over time).

    A typical example of a static categorical feature is a time series ID.

  • static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model adds to the values of the time series.

    Static real features are features which have the same value for all time steps (static over time).

    A typical example of a static real feature is promotion information.

  • future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, which serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.

    See the demo notebook and code snippets for details.

    Missing values need to be replaced with zeros.

  • future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model will internally add to future_values. These could be things like "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). They could also be so-called "age" features, which basically help the model know "at which point in life" a time series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.

    These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires these additional features to be provided.

    The Autoformer only learns additional embeddings for static_categorical_features.

  • future_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which future_values were observed and which were missing. Mask values selected in [0, 1]:

    • 1 for values that are observed,

    • 0 for values that are missing (i.e. NaNs that were replaced by zeros).

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqTSPredictionOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when future_values is provided) — Distributional loss.

  • params (torch.FloatTensor of shape (batch_size, num_samples, num_params)) — Parameters of the chosen distribution.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series' context window, used to give the model inputs of the same magnitude and then to shift back to the original magnitude.

  • scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series' context window, used to give the model inputs of the same magnitude and then to rescale back to the original magnitude.

  • static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.

The AutoformerForPrediction forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> loss = outputs.loss
>>> loss.backward()

>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_time_features=batch["future_time_features"],
... )

>>> mean_prediction = outputs.sequences.mean(dim=1)
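
Since generate draws num_parallel_samples trajectories from the predicted distribution (100 by default), the samples support more than a point forecast. As a small follow-up sketch (assuming, as above, that the sample dimension of outputs.sequences is dim=1), empirical prediction intervals can be read off with torch.quantile:

>>> # 10%/50%/90% quantiles over the samples -> shape (3, batch_size, prediction_length)
>>> quantiles = torch.quantile(
...     outputs.sequences, torch.tensor([0.1, 0.5, 0.9]), dim=1
... )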
