This is a page where I dump information I learn about how data platform engineering is done inside different reputable AI companies. See also analytics in AI.

OpenAI

TLDR

OpenAI’s Data Platform organization is infrastructure-focused and is responsible for traditional data lake/data warehouse infrastructure (ingest, analytics, governance). Named technologies include Spark, Kafka, Flink, Airflow, Trino, Iceberg, and Delta, plus Chronon for ML feature engineering and Terraform for infrastructure tooling.

OpenAI has a Data Platform organization, and one of its Software Engineer postings discloses a lot about the stack. From Software Engineer, Data Infrastructure @ OpenAI:

Data Platform at OpenAI owns the foundational data stack powering critical product, research, and analytics workflows. We operate some of the largest Spark compute fleets in production; design, and build data lakes and metadata systems on Iceberg and Delta with a vision toward exabyte-scale architecture; run high throughput streaming platforms on Kafka and Flink; provide orchestration with Airflow; and support ML feature engineering tooling such as Chronon. Our mission is to deliver reliable, secure, and efficient data access at scale and accelerate intelligent, AI assisted data workflows.

Responsibilities include:

You will scale and harden big data compute and storage platforms, build and support high-throughput streaming systems, build and operate low latency data ingestions, enable secure and governed data access for ML and analytics, and design for reliability and performance at extreme scale.

Requirements include:

You’ve supported Spark, Kafka, Flink, Airflow, Trino, or Iceberg as platforms. You’re well-versed in infrastructure tooling like Terraform, experienced in debugging large-scale distributed systems, and excited about solving data infrastructure problems in the AI space.

Anthropic

TLDR

Anthropic has both a standalone Research Data Platform team and data engineers embedded within domain-specific teams (e.g., Safeguards).

The Research Data Platform team focuses on pipelines that move data out of training runs into queryable storage. It also builds dataset management, data cataloging, and provenance tooling; handles high-volume time series data (ingestion, storage, and querying); and has some visualization and frontend responsibilities for its data services. Its engineers often embed directly with research teams. Named technologies include Spark, BigQuery, DuckDB, and Parquet.

The data engineers in Safeguards are closer to enterprise data warehouse engineers and focus more on streaming data, real-time analytics, anomaly detection, and related pipelines. Named technologies include dbt/Airflow/Spark; BigQuery/Redshift/Snowflake; Looker/Tableau/Metabase.

Anthropic has the Research Data Platform team:

From Software Engineer, Research Data Platform | Anthropic | LinkedIn:

The Research Data Platform team builds the tools that Anthropic’s researchers use every day to manage, query, and analyze the data that goes into training and evaluating frontier models. We power the internal applications researchers rely on to monitor RL runs, explore finetuning datasets, and understand what’s happening inside their experiments.

We’re looking for engineers who love working directly with users and who excel at building data products — the pipelines that move data out of training runs into queryable storage, and the APIs, libraries, and services researchers use to manage and explore it. This role sits closer to the research workflow than a typical data infrastructure position: you’ll often embed with research teams, build ML-specific tooling alongside them, and leverage what our Data Infrastructure team has already built rather than reinventing it.

Responsibilities include:

  • Build and operate data pipelines that extract data from research training runs and land it in storage systems that are easy and fast to query
  • Work closely with researchers to design and build APIs, libraries, and web interfaces that support data management, exploration, and analysis
  • Develop dataset management, data cataloging, and provenance tooling that researchers use in their day-to-day work
  • Embed with research teams to understand their workflows, identify high-leverage tooling opportunities, and ship solutions quickly

Relevant experience:

  • Large-scale ETL, columnar storage formats, and query engines (e.g., Spark, BigQuery, DuckDB, Parquet)
  • High-volume time series data — ingestion, storage, and efficient querying
  • Data cataloging, lineage, or metadata management systems
  • ML experiment tracking or metrics platforms
  • Working in environments where engineers partner closely with quantitative users — research labs, trading firms, observability or analytics startups
  • Complex data visualization and full-stack web application development
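To make the "training run to queryable storage" idea concrete for myself, here is a minimal sketch of landing time series metrics and querying the latest value per run. Everything in it is my own assumption: the table layout, the function names `land_metrics` and `latest`, and the run names are hypothetical, and SQLite (from the Python standard library) stands in for the columnar engines the posting names (DuckDB, BigQuery, Parquet). It is not how Anthropic does it.

```python
import sqlite3

# Hypothetical schema and names; SQLite is a stand-in for a columnar query engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metrics (
        run_id TEXT    NOT NULL,
        step   INTEGER NOT NULL,
        name   TEXT    NOT NULL,
        value  REAL    NOT NULL,
        PRIMARY KEY (run_id, name, step)
    )
    """
)


def land_metrics(rows):
    """Land (run_id, step, name, value) rows; replaying a batch overwrites, never duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO metrics (run_id, step, name, value) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


def latest(name):
    """Return {run_id: (step, value)} for the most recent step of each run."""
    cursor = conn.execute(
        """
        SELECT m.run_id, m.step, m.value
        FROM metrics AS m
        JOIN (
            SELECT run_id, MAX(step) AS step
            FROM metrics
            WHERE name = ?
            GROUP BY run_id
        ) AS last ON last.run_id = m.run_id AND last.step = m.step
        WHERE m.name = ?
        ORDER BY m.run_id
        """,
        (name, name),
    )
    return {run_id: (step, value) for run_id, step, value in cursor}


land_metrics([
    ("run-a", 100, "reward", 0.41),
    ("run-a", 200, "reward", 0.57),
    ("run-b", 100, "reward", 0.39),
])
land_metrics([("run-a", 200, "reward", 0.57)])  # replayed batch, same primary key
snapshot = latest("reward")
row_count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```

The primary key on (run_id, name, step) is the design choice I care about here: it makes ingestion idempotent, so a retried batch from a training job cannot inflate the series.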

Anthropic also has data engineers embedded within domain-specific teams. This is from Job Application for Data Engineer, Safeguards at Anthropic:

Anthropic is looking for a Data Engineer to join the Safeguards team and build the data foundations that keep our AI systems safe. The Safeguards team works to monitor models, prevent misuse, and ensure user well-being — and doing that well requires robust, reliable data infrastructure. In this role, you’ll design and build the data pipelines, warehousing solutions, and analytical tooling that power our safety and trust efforts at scale.

Responsibilities:

  • Design, build, and maintain scalable data pipelines that support safety monitoring, abuse detection, and enforcement workflows
  • Develop and optimize data models and warehousing solutions to enable efficient analysis of large-scale usage and safety data
  • Build and maintain dashboards and reporting infrastructure that give Safeguards teams visibility into model behavior, misuse patterns, and enforcement outcomes
  • Collaborate with engineers to integrate data from multiple sources — including model outputs, user reports, and automated classifiers — into a unified analytical layer
  • Implement data quality frameworks, monitoring, and alerting to ensure the reliability of safety-critical data
  • Partner with research teams to surface data insights that inform model improvements and safety interventions
  • Develop self-service data tooling that enables stakeholders to explore safety data and generate reports independently
  • Contribute to data governance practices, including access controls, retention policies, and privacy-compliant data handling

The posting looks for candidates who:

  • Have hands-on experience with modern data stack tools such as dbt, Airflow, Spark, or similar orchestration and transformation frameworks
  • Have worked with cloud data platforms (BigQuery, Redshift, Snowflake, or similar)
  • Are comfortable building dashboards and data visualizations using tools like Looker, Tableau, or Metabase

xAI

TLDR

xAI’s Data Platform team provides infrastructure for streaming, real-time, and analytics workloads. Apart from HDFS, the posting does not indicate that the team owns other storage infrastructure. Named technologies include Kafka, HDFS, Spark, Flink, and Trino.

xAI has a Data Platform team staffed by Members of Technical Staff. From Job Application for Member of Technical Staff - Data Platform at xAI:

The Data Platform team builds and operates the infrastructure responsible for all large-scale data transport and processing across the company. We own and manage core systems including Apache Kafka, HDFS, Spark, Flink, and Trino, enabling real-time ML pipelines, feed ranking, experimentation, analytics, and observability at petabyte scale. Our team deals with latency-critical workloads, high-throughput streaming, and distributed compute systems that require fault tolerance, performance, and absolute reliability.

As a software engineer on the Data Platform team, you will design, build, and operate the distributed systems powering X’s data movement and compute. You will take ownership of infrastructure components that process trillions of events daily, driving the scalability, performance, and reliability of the systems that power product and ML workloads across the company.

Responsibilities:

  • Design and implement high-throughput, low-latency data ingestion and transport systems.
  • Scale and optimize multi-tenant Kafka infrastructure supporting real-time workloads.
  • Extend and tune Spark, Flink, and Trino for demanding production pipelines.
  • Build interfaces, APIs, and pipelines enabling teams to query, process, and move data at petabyte scale.

Relevant experience:

  • Proven expertise in distributed systems, stream processing, or large-scale data platforms.
  • Hands-on experience with Kafka, Flink, Spark, Trino, or Hadoop in production.

Together AI

TLDR

Together’s Data Platform team provides backend services for self-serve event-driven workflows (event streams, schemas, and access models), enrichment pipelines such as prompt categorization, reliability and observability, and OLTP/OLAP stores. Named technologies include Kafka/PubSub/Kinesis, Airflow/Spark/Flink/Trino, and Postgres.

Together AI has a Data Platform team which hires backend software engineers. From Job Application for Backend Software Engineer — Data Platform & AI Data Products at Together AI:

You’ll join the Data Platform team, responsible for building the backend services and “data products” that power how data moves through the company. We create the core platform primitives — high-quality event streams, reliable access layers, and developer-friendly APIs/tools — so teams across the org can self-serve what they need and ship faster. You’ll contribute to backend services that create value from our company data, and help make our data platform more self-serve so product and engineering teams can easily create and operate event-driven architectures, publish/consume streams, define access models, and ship data products end-to-end. You’ll also work on LLM-adjacent services such as prompt categorization/taxonomy, enrichment, and metadata systems that turn raw telemetry into trusted, usable products — with mentorship and support from experienced engineers.

Responsibilities:

  • Help enable DIY workflows for teams across the company:
    • Define/publish events and schemas
    • Create/consume streams and subscriptions
    • Establish access models (authz, row/field-level controls where applicable)
    • Manage dataset/catalog metadata, lineage, versioning, and retention
  • Contribute to end-to-end data products: ingestion → validation/quality → enrichment → serving (APIs/streams) → observability → adoption.
  • Work on prompt categorization and enrichment services: taxonomy design, labeling workflows, classifier/rules integration, evaluation, drift/quality monitoring, and safe rollouts.
  • Learn to own reliability: SLOs, alerting, performance/cost tuning, incident response, and postmortems.
  • Partner cross-functionally with ML/LLM, infra, security, and product teams to define crisp contracts and deliver durable platform primitives.
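The self-serve workflow in that list (define an event schema, publish and consume a stream, apply field-level access controls) can be sketched in a few lines. This is only my own illustration in plain Python with an in-memory stream: `EventSchema`, `Stream`, the `restricted` field set, and the `prompt_received` event are all hypothetical names, not Together AI's API, and a real platform would sit on something like Kafka with a schema registry.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical names throughout; an in-memory toy, not a real streaming platform.


@dataclass(frozen=True)
class EventSchema:
    name: str
    version: int
    fields: dict[str, type]                   # field name -> required Python type
    restricted: frozenset[str] = frozenset()  # fields hidden from unprivileged readers

    def validate(self, payload: dict[str, Any]) -> None:
        """Reject payloads with missing fields or wrongly typed values."""
        missing = self.fields.keys() - payload.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        for key, expected in self.fields.items():
            if not isinstance(payload[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")


@dataclass
class Stream:
    schema: EventSchema
    _subscribers: list[tuple[Callable[[dict[str, Any]], None], bool]] = field(
        default_factory=list
    )

    def subscribe(self, handler: Callable[[dict[str, Any]], None],
                  privileged: bool = False) -> None:
        self._subscribers.append((handler, privileged))

    def publish(self, payload: dict[str, Any]) -> None:
        """Validate once, then deliver a per-subscriber view of the event."""
        self.schema.validate(payload)
        for handler, privileged in self._subscribers:
            if privileged:
                view = dict(payload)
            else:
                # Field-level control: drop restricted fields for this reader.
                view = {k: v for k, v in payload.items()
                        if k not in self.schema.restricted}
            handler(view)


schema = EventSchema(
    name="prompt_received",
    version=1,
    fields={"request_id": str, "prompt": str, "tokens": int},
    restricted=frozenset({"prompt"}),
)
stream = Stream(schema)

analytics: list[dict[str, Any]] = []
audit: list[dict[str, Any]] = []
stream.subscribe(analytics.append)               # sees no raw prompt text
stream.subscribe(audit.append, privileged=True)  # sees every field

stream.publish({"request_id": "r-1", "prompt": "hello", "tokens": 2})
```

Validating at publish time rather than at consume time is the point of the sketch: a malformed event fails in the producer and never reaches any subscriber.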

Requirements include:

  • Basic data modeling and SQL skills, and some familiarity with at least one of:
    • Streaming/eventing (Kafka/PubSub/Kinesis, etc.)
    • Workflow/compute (Airflow/Spark/Flink/Trino, etc.)
    • OLTP/OLAP stores and data lakes (Postgres + warehouse/lake tech)

Fireworks AI

TLDR

Fireworks posts Member of Technical Staff (MTS) roles in which “Training Infrastructure” (training) and “Cloud Infrastructure” (inference) are separate roles, and possibly separate teams.

Fireworks combines storage and compute responsibilities into individual engineering roles, spread across titles such as “Training Infrastructure Engineer” and “Software Engineer” or “Member of Technical Staff” on the Cloud Infrastructure team.

The role described in Job Application for Member of Technical Staff, AI Training Infrastructure at Fireworks AI is very vague about its responsibilities and qualifications. The parts that pertain to storage or data infrastructure:

As a Training Infrastructure Engineer, you’ll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure.

Responsibilities:

  • Design and implement scalable infrastructure for large-scale model training workloads
  • Architect and maintain data storage solutions for large-scale training datasets

The Cloud Infrastructure team role seems more concrete. From Job Application for Member of Technical Staff, Cloud Infrastructure at Fireworks AI:

You’ll spearhead the creation of one of the world’s first virtual clouds, seamlessly serving AI workloads across the globe and every cloud provider.

Strangely, the responsibilities cover both training and inference:

  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines.
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency.
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning.