How Lyric unlocked fast, reliable time-series inference at scale, handling large payloads and rapid scale-to-zero autoscaling
Latency: 500 ms to 40 ms
Cold Start Time: 300 s to 75 s
Company
Lyric
Use case
Enterprise supply-chain platform running large-scale time-series forecasting and DB/SQL models that must handle heavy payloads, deliver stable low latency, and scale rapidly from zero under spiky workloads.
Highlights
  • Efficient large-payload handling and packing to reduce network overhead and prevent application failures.
  • High-throughput, low-latency serving for both time-series (Chronos, TimesFM) and embedding (Stella) models using efficient batching and optimized inference paths.
  • Fast scale-to-zero with rapid autoscaling tuned for spiky enterprise workloads and strict SLAs.

Company background

Lyric provides a four-layered platform for modeling, planning, and operating supply chains, built to handle massive data and deliver decision science at scale. The platform delivers rapid, scalable forecasting and planning workflows for enterprise customers across logistics, manufacturing, and distribution.

The problem

Lyric’s production stack presented two tightly coupled operational challenges:

  1. Scale-to-zero / slow scale-up:
    The environment needed idle cost savings via scale-to-zero but also required near-instant scale-up when forecasts or batch jobs started. Slow cold starts or inefficient scale logic caused missed windows for time-sensitive forecasts.
  2. Inefficient handling of large tensors:
    Large, high-dimensional time-series payloads strained network transfer and model servers, causing serialization failures, high latency, and poor GPU/CPU utilization due to inefficient tensor handling and per-request processing.

These problems manifested as request failures for large payloads, high median latency, variable tail latency, and inflated infrastructure costs when teams had to leave capacity warm to meet SLAs.

Solution 

Simplismart’s approach combined infrastructure architecture, model-server engineering, and request-level optimizations. The work was implemented across three layers: transport/ingest, model serving, and autoscaling/orchestration.

1. Payload handling: staged uploads + efficient packing

  • S3 staging for large payloads: For very large time-series or batched SQL write payloads, clients upload them to a preconfigured S3 location and pass a lightweight pointer in the request, avoiding network timeouts and app-level memory churn (a client-side sketch follows this list).
  • Binary packing and streaming: Used compact binary formats and chunked streaming so large tensor payloads are sent and processed incrementally, without loading the full payload into memory.
  • Server-side reassembly & validation: The model server accepts streamed chunks, validates, and reassembles just-in-time into batched tensors, reducing memory spikes and eliminating common breakage patterns.
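
As a concrete illustration, the client side of the staged-upload flow might look like the sketch below. This is not Lyric's or Simplismart's actual client code: the bucket name, key prefix, and inference endpoint are hypothetical, and boto3 and requests are assumed to be available.

```python
import io
import uuid

import boto3
import requests

BUCKET = "forecast-payload-staging"                     # hypothetical staging bucket
ENDPOINT = "https://inference.example.com/v1/forecast"  # hypothetical model endpoint

def submit_large_payload(payload_bytes: bytes) -> dict:
    """Stage a large time-series payload in S3, then send only a pointer to it."""
    s3 = boto3.client("s3")
    key = f"staged/{uuid.uuid4()}.bin"

    # upload_fileobj streams the object in parts, so the payload is never
    # duplicated in memory and oversized HTTP request bodies are avoided.
    s3.upload_fileobj(io.BytesIO(payload_bytes), BUCKET, key)

    # The inference request carries only a lightweight S3 pointer.
    response = requests.post(
        ENDPOINT,
        json={"payload_uri": f"s3://{BUCKET}/{key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

On the server side, the model server fetches and streams the staged object in chunks, reassembling it just-in-time into batched tensors as described above.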

2. Model server engineering for better batching & tensor support

Custom serving stack (Chronos, TimesFM, Stella):

  • Purpose-built model servers optimized for high-dimensional time-series and embedding workloads.
  • Adaptive batching and mixed-size tensor coalescing to maximize throughput without padding overhead (a simplified sketch follows this list).
  • Memory-efficient tensor layouts to speed CPU/GPU transfers and keep latency stable under load.
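
To illustrate the batching idea, the sketch below groups variable-length series into same-length buckets so batches can be stacked without padding. It is a simplification of what a production server does (which also enforces latency budgets and flushes partial batches on timeouts), and the use of PyTorch here is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

import torch

def coalesce_by_length(
    series: Iterable[torch.Tensor], max_batch: int = 32
) -> Iterator[torch.Tensor]:
    """Yield padding-free batches by grouping series of equal length."""
    buckets: Dict[int, List[torch.Tensor]] = defaultdict(list)
    for s in series:
        length = s.shape[-1]
        buckets[length].append(s)
        # Emit a batch as soon as a bucket fills up, keeping memory bounded.
        if len(buckets[length]) == max_batch:
            yield torch.stack(buckets.pop(length))
    # Flush partially filled buckets at the end of the stream.
    for remaining in buckets.values():
        yield torch.stack(remaining)
```

Because every series in a batch shares the same length, no padding values are computed or transferred, which helps keep CPU/GPU utilization high under mixed-size traffic.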

3. Autoscaling tuned for scale-to-zero with fast scale-up

  • Event-driven scaling triggers: Autoscaling driven by real traffic events (incoming S3 pointers, job start requests) rather than simple CPU utilization, enabling preemptive provisioning when jobs are scheduled (a minimal hook is sketched after this list).
  • Scale-to-zero with warm-fast-path: Lightweight warm proxies and a tiny always-on control plane keep critical model server binaries cached; this reduces container/Python cold start costs even when replica count is zero.
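
A minimal version of an event-driven scale-up hook is sketched below, assuming a Kubernetes-based deployment and the official kubernetes Python client; the deployment and namespace names are hypothetical, and the production setup layers this behind the always-on control plane described above.

```python
from kubernetes import client, config

DEPLOYMENT = "timeseries-model-server"  # hypothetical model-server deployment
NAMESPACE = "inference"                 # hypothetical namespace

def on_job_event(desired_replicas: int = 2) -> None:
    """Scale the model server up the moment a job event arrives (an incoming
    S3 pointer or a scheduled batch run), instead of waiting for CPU
    utilization to climb."""
    config.load_incluster_config()
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": desired_replicas}},
    )
```

Triggering on job events rather than resource metrics means replicas are already coming up while the payload is still being staged, which is what makes scale-to-zero viable under strict SLAs.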

Results

  • Production-ready in 2 days: Lyric moved from integration to live production in just 2 days, enabling rapid validation and rollout of optimized time-series and embedding workloads.
  • 4× faster cold starts with scale-to-zero: Cold-start time was reduced from ~300 seconds to ~75 seconds, allowing Lyric to confidently use scale-to-zero without compromising responsiveness during traffic spikes.
  • 12.5× faster embedding inference: End-to-end embedding latency dropped from ~500 ms to ~40 ms, delivering consistently fast and predictable response times under load.
  • Stable performance under large payloads: Large, high-dimensional payloads are now processed reliably via staged uploads, streaming, and efficient batching, eliminating application-level failures.

With Simplismart, Lyric transformed its time-series and analytics inference stack from a brittle, always-on system into a scalable, production-ready platform built for large payloads and spiky enterprise workloads. Heavy time-series requests are now handled reliably through staged uploads and efficient batching, while inference remains fast and stable across both forecasting and embedding models.

By enabling true scale-to-zero with rapid scale-up, Lyric reduced idle infrastructure costs while maintaining responsiveness. Cold starts are faster, embedding latency is significantly lower, and the system now scales smoothly, delivering a resilient, cost-efficient inference layer for enterprise decision intelligence.

Find out what tailor-made inference looks like for you.