A hidden Markov model is a probabilistic framework for reasoning about sequences in which the underlying causes of what we observe are not directly visible. The model assumes there is a hidden process that moves between a finite set of states according to fixed probabilities, and at each step that hidden state produces an observation through another probabilistic rule. The defining assumption is the Markov property: the next hidden state depends only on the current one, not on the full history.
This simple structure turns out to be powerful enough to model speech, handwriting, gestures, biological sequences, and many other domains where what we measure is a noisy or indirect window onto an unseen dynamic system.
The core ingredients
A hidden Markov model is fully specified by three things. The first is a set of transition probabilities that describe how the hidden state evolves from one time step to the next. The second is a set of emission probabilities that describe how each hidden state produces observations, which may be discrete symbols or continuous vectors modeled by densities such as Gaussians or mixtures. The third is an initial distribution over the first hidden state, which seeds the entire process before any observation is seen.
Why hidden states matter
The hidden states are not arbitrary mathematical machinery; they represent latent regimes or categories that we believe shape what we observe. In speech recognition, hidden states might correspond to phonetic subunits whose acoustic signatures vary; in part-of-speech tagging, they correspond to grammatical categories that emit words. The model does not require us to label these states in the training data, which is one of its most valuable properties. By treating the regime as unobserved, the model can learn structure from sequences of raw observations alone.
The Markov assumption and its limits
The first-order Markov assumption is the simplifying move that makes inference tractable: knowing the present hidden state is enough to predict the future. This assumption is rarely literally true in real data, but it often holds well enough to be useful, and it can be relaxed by enlarging the state space to encode richer context. Higher-order dependencies, durations, and hierarchical structures can be folded into the state definition at the cost of a larger and slower model. The trade-off between expressiveness and computational efficiency is one of the central design choices when applying these models.
The three canonical problems
Working with a hidden Markov model usually means addressing one of three computational tasks. The first is evaluation: given a model and a sequence of observations, compute how likely the sequence is under the model. The second is decoding: given a model and a sequence, find the most probable sequence of hidden states that could have produced it. The third is learning: given only sequences of observations, estimate the transition, emission, and initial probabilities that best explain the data.
Forward and backward computations
The evaluation problem is solved by the forward algorithm, a dynamic programming procedure that computes, at each time step, the probability of having observed the data so far and being in each possible hidden state. A complementary backward algorithm computes the probability of the remaining observations given each possible current state. Together these quantities allow efficient computation of the overall sequence likelihood and the posterior probability of being in each state at each time. Without this dynamic programming trick, summing over all exponentially many hidden state sequences would be infeasible.
The Viterbi algorithm
Decoding the single most likely hidden state sequence is handled by the Viterbi algorithm, which is structurally similar to the forward pass but replaces summation with maximization. At every time step it tracks, for each state, the probability of the best path ending in that state along with a back pointer to its predecessor. After processing the full sequence it traces these pointers backward to recover the optimal state path. This procedure is the workhorse behind tagging, alignment, and segmentation tasks in many sequence applications.
Learning the parameters
When the parameters are unknown, they are typically estimated using the Baum-Welch algorithm, which is an instance of expectation-maximization tailored to hidden Markov models. The expectation step uses the forward and backward probabilities to compute expected counts of state transitions and emissions given the current parameters. The maximization step then re-estimates the parameters from these expected counts as if they were observed. Iterating these steps increases the data likelihood monotonically until it converges to a local optimum, which means initialization and regularization matter in practice.
Discrete versus continuous observations
The form of the emission distribution depends on what is being modeled. For symbolic data such as text tokens or DNA bases, emissions are typically categorical distributions over a finite alphabet. For continuous signals such as audio features or sensor readings, emissions are often modeled as Gaussian distributions or mixtures of Gaussians associated with each hidden state. The choice of emission family directly affects how flexibly the model can capture variability within each state and how prone it is to overfitting.
Variants and extensions
Several extensions adapt the basic model to richer scenarios. Left-to-right models constrain transitions so that states can only advance, which is natural for modeling processes with a clear progression like spoken words. Hidden semi-Markov models replace the implicit geometric state duration with explicit duration distributions, giving more realistic control over how long the system stays in each regime. Factorial and coupled variants use multiple interacting chains to represent several latent processes simultaneously, while input-output versions condition transitions and emissions on external covariates.
Relation to other sequence models
A hidden Markov model can be viewed as a particular kind of dynamic Bayesian network with one discrete latent variable per time step. It is closely related to linear dynamical systems, which use continuous latent states and Gaussian dynamics rather than discrete states. Conditional random fields share the goal of labeling sequences but model the conditional distribution of labels given observations directly, avoiding the generative assumptions about how observations are produced. Recurrent and attention-based neural networks have replaced hidden Markov models in many large-scale applications, yet the underlying ideas of latent state and probabilistic inference remain influential.
Strengths in practice
The enduring appeal of hidden Markov models lies in their interpretability, their modest data requirements, and the existence of exact and efficient algorithms for the core inference problems. Because the hidden states correspond to meaningful categories, the learned model can often be inspected and reasoned about rather than treated as a black box. The forward, backward, and Viterbi algorithms scale linearly in the sequence length, which makes inference practical even on long signals. Probabilistic outputs also make these models easy to integrate into larger systems that combine multiple sources of evidence.
Limitations and constraints
The same assumptions that make hidden Markov models tractable also constrain them. The Markov property limits how much long-range context a single chain can capture, and discrete states can be a coarse approximation when the underlying dynamics are smooth or high dimensional. Emission models that assume independence across observation dimensions can miss important correlations, and the likelihood surface explored during training is generally multimodal. Practitioners address these issues with careful state design, richer emission distributions, sensible initialization, and sometimes by hybridizing the model with neural components that supply more expressive features.
A compact lens on sequential data
Despite their simplicity, hidden Markov models remain a clarifying lens for thinking about any problem in which observed sequences are driven by an unseen process that switches between modes. They provide a clean separation between dynamics and observation, a tractable toolkit for inference and learning, and a vocabulary of states and transitions that maps naturally onto how we describe many real phenomena. Understanding them deeply illuminates more elaborate sequence models that share their fundamental commitments to latent structure and probabilistic reasoning.
