Parallel Token Prediction for Language Models
What’s new (20 sec)
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models.
Why it matters (2 min)
- PTP targets the latency bottleneck of autoregressive decoding while avoiding the restrictive independence assumptions common in existing multi-token prediction methods.
- PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model (see the sketch after this list).
- Open the receipts below to verify and go deeper.
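To make "incorporating the sampling procedure into the model" concrete, here is a minimal, hypothetical sketch of the idea, not the paper's architecture: the baseline needs K sequential forward calls with sampling in between, while the parallel variant takes pre-drawn uniform noise as an input and returns K dependent tokens from a single call. `AutoregressiveToy`, `ParallelToy`, the noise projection, and all shapes are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's architecture): pre-draw the randomness
# u_1..u_K and feed it into one forward call, so the network can emit K
# mutually dependent tokens at once instead of sampling between K calls.
import torch
import torch.nn as nn

VOCAB, D, K = 100, 64, 4  # toy vocabulary size, hidden width, tokens per call


class AutoregressiveToy(nn.Module):
    """Baseline: one token per forward call; sampling happens outside the model."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, prefix_ids):                 # prefix_ids: (B, T)
        h = self.embed(prefix_ids).mean(dim=1)     # crude prefix summary
        return self.head(h)                        # (B, VOCAB) logits


class ParallelToy(nn.Module):
    """PTP-flavored sketch: sampling noise is an *input*, so K dependent
    tokens come out of a single call (reparameterized / inverse-CDF style)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.noise_proj = nn.Linear(K, D)
        self.head = nn.Linear(D, K * VOCAB)

    def forward(self, prefix_ids, noise):          # noise: (B, K) ~ U(0, 1)
        h = self.embed(prefix_ids).mean(dim=1) + self.noise_proj(noise)
        logits = self.head(h).view(-1, K, VOCAB)   # (B, K, VOCAB)
        return logits.argmax(dim=-1)               # deterministic given the noise


prefix = torch.randint(VOCAB, (1, 8))

# Baseline: K calls, sampling between them.
ar = AutoregressiveToy()
out = prefix
for _ in range(K):
    nxt = torch.multinomial(torch.softmax(ar(out), dim=-1), 1)
    out = torch.cat([out, nxt], dim=1)

# Parallel: one call, randomness passed in as an argument.
ptp = ParallelToy()
tokens = ptp(prefix, torch.rand(1, K))
print(out[:, -K:], tokens)
```

The contrast to notice is structural: the toy parallel model is called once per K tokens, and because the noise is an input rather than drawn between calls, nothing forces the K outputs to be independent of one another.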
Go deeper (8 min)
Context
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
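To ground the headline number, the sketch below shows how "tokens accepted per step" is typically counted in a speculative-decoding loop. This is the generic draft-then-verify step with greedy verification for simplicity, not the paper's implementation or the Spec-Bench harness; `draft_k_tokens` (where a PTP-style drafter would plug in) and `target_greedy_tokens` are hypothetical stand-ins.

```python
# Schematic of one speculative-decoding step: draft K tokens in one drafter
# call, verify them with the target model, and count the accepted prefix.
# Function names are hypothetical stand-ins, not APIs from the paper.
from typing import Callable, List, Tuple


def speculative_step(
    prefix: List[int],
    draft_k_tokens: Callable[[List[int], int], List[int]],
    target_greedy_tokens: Callable[[List[int], int], List[int]],
    k: int = 5,
) -> Tuple[List[int], int]:
    """One draft-and-verify step; returns the new prefix and #accepted tokens."""
    draft = draft_k_tokens(prefix, k)            # one drafter call (PTP's role)
    check = target_greedy_tokens(prefix, k)      # one target-model verification call
    accepted = 0
    for d, t in zip(draft, check):
        if d != t:                               # first mismatch ends acceptance
            break
        accepted += 1
    # Keep the accepted draft tokens plus one target token, so the sequence
    # always advances even when nothing is accepted.
    new_tokens = draft[:accepted] + check[accepted:accepted + 1]
    return prefix + new_tokens, accepted


# Toy demo: a drafter that agrees with the target on the first 4 of 5 tokens.
target = lambda p, k: [len(p) + i for i in range(k)]
drafter = lambda p, k: [len(p) + i if i < 4 else -1 for i in range(k)]
seq, acc = speculative_step([0, 1, 2], drafter, target, k=5)
print(acc, seq)   # 4 accepted -> the "over four tokens per step" regime
```

Under this accounting, "accepting over four tokens per step" means the drafter's output matches the target model's continuation for more than four positions per verification call on average, which is where the latency savings come from.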
For builders
Scan the abstract and experiments; look for released code, the two training recipes (distillation from an existing model vs. teacher-free inverse autoregressive training), and the Spec-Bench evaluation setup.
Verify
Prefer primary announcements, papers, repos, and changelogs over reposts.
Receipts
- Parallel Token Prediction for Language Models (arXiv cs.LG)