Matthaus Krzykowski
Mar 10th, 2026
Data engineering for builders + their agents in Python code
dltHub
PyAI
San Francisco
This talk explores how AI agents are beginning to author entire data systems, and what kind of infrastructure is required to support them. When agents generate pipelines, the real challenge becomes understanding, validating, and improving what they produce. Many modern platforms hide execution details behind proprietary abstractions, making it difficult for both humans and agents to inspect pipelines, reason about failures, or iterate safely. The talk introduces dlt, an AI-native Python data pipeline library designed around transparent primitives (schema evolution, incremental loading, data contracts, and transformations) so that both humans and agents can build pipelines that remain inspectable and composable.
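To make three of those primitives concrete, here is a minimal plain-Python sketch of incremental loading, schema evolution, and a data contract. It deliberately does not use dlt's API: `load_incremental`, the `_state` table, and the `contract` values "evolve" and "freeze" are hypothetical names chosen for illustration, and the standard-library `sqlite3` module stands in for a real destination such as DuckDB.

```python
import sqlite3


def load_incremental(conn, table, rows, cursor_field, contract="evolve"):
    """Append rows newer than the stored cursor; return the number loaded."""
    conn.execute("CREATE TABLE IF NOT EXISTS _state (tbl TEXT PRIMARY KEY, last_value)")
    found = conn.execute("SELECT last_value FROM _state WHERE tbl = ?", (table,)).fetchone()
    last_value = found[0] if found else None

    # Incremental loading: keep only rows past the stored cursor value.
    new_rows = [r for r in rows if last_value is None or r[cursor_field] > last_value]
    if not new_rows:
        return 0

    existing = [c[1] for c in conn.execute(f'PRAGMA table_info("{table}")')]
    seen = {key for r in new_rows for key in r}
    added = sorted(seen - set(existing))

    # Data contract: "freeze" rejects columns an existing table does not have.
    if existing and added and contract == "freeze":
        raise ValueError(f"contract violation, unexpected columns: {added}")

    with conn:  # commit the inserts and the cursor update together
        if not existing:
            cols = ", ".join(f'"{c}"' for c in added)
            conn.execute(f'CREATE TABLE "{table}" ({cols})')
        else:
            # Schema evolution: widen the table instead of failing.
            for c in added:
                conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{c}"')
        for r in new_rows:
            names = ", ".join(f'"{k}"' for k in r)
            marks = ", ".join("?" for _ in r)
            conn.execute(
                f'INSERT INTO "{table}" ({names}) VALUES ({marks})', list(r.values())
            )
        high = max(r[cursor_field] for r in new_rows)
        conn.execute("INSERT OR REPLACE INTO _state VALUES (?, ?)", (table, high))
    return len(new_rows)


# Usage: the second batch repeats old rows and introduces a new "source" column.
conn = sqlite3.connect(":memory:")
batch1 = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
batch2 = batch1 + [{"id": 3, "updated_at": 30, "source": "api"}]
first = load_incremental(conn, "events", batch1, "updated_at")
second = load_incremental(conn, "events", batch2, "updated_at")
```

In this sketch the second call loads only the one row past the stored cursor and adds the `source` column on the fly. Calling it with `contract="freeze"` on an existing table raises instead of widening the schema. Everything it does can be checked with ordinary SQL against the destination, which is the kind of inspectability the abstract argues for.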

The centerpiece of the session is a live end-to-end demo that puts these design principles into practice. Data is ingested from an API, runs through a standard dlt pipeline, and is loaded into DuckDB, while developers and agents interact through Python and a notebook environment (marimo). The pipeline is instrumented with Pydantic Logfire, enabling real-time tracing of pipeline execution and agent behavior, which turns pipelines into observable systems where telemetry, code snapshots, and data lineage can be inspected live. This kind of traceable, introspectable pipeline represents a new category of data infrastructure designed for agent-generated systems.

The demo then pushes processed traces to Hugging Face using a new dlt destination that publishes datasets directly to the Hub as versioned Parquet datasets ready for training workflows. In a real case study with Distil Labs, this pipeline enabled training a 0.6B specialist model that outperformed a 120B LLM by 29 points, demonstrating how reproducible data pipelines can power the next generation of AI systems.

Ultimately, the talk argues that as LLMs commoditize writing pipelines, the real bottleneck becomes understanding and governing the data and code they produce.
Posted Mar 15th, 2026