Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10

Update: 2024-05-31

Description

In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.

Summary by Section

Introduction

Anjan Banerjee, a data architect, discusses building complex AI and data systems

Explains the basics of data architecture using Lego and chat app examples

Sources and Tools

Identifying data sources is the first step in designing a data architecture

Pick the right tools to extract and store data based on the use case (e.g. block storage for images, a time-series DB for time-series data)

Use one tool for most activities if possible, but specialized tools offer benefits

Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)

Airflow and Orchestration

Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills

For less technical orgs, GUI-based tools like Talend, Alteryx may be better

AWS Step Functions and managed Airflow are improving native orchestration capabilities

For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte

AI and Data Processing

ML is key for data-intensive use cases: running models close to where data is acquired avoids storing and processing petabytes in the cloud

TinyML and edge computing enable ML inference on device (drones, manufacturing)

Cloud batch processing still dominates for user targeting, recommendations

Data Lakes and Storage

Storage choice depends on data types, use cases, cloud ecosystem

Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata

Pulling data into a separate system is often needed for advanced analytics that go beyond what the source system supports

Data Quality and Standardization

"Poka-yoke" error-proofing of input screens is vital for downstream data quality

Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion

Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
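The ingestion-time rules above can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the episode: the field names (`user_id`, `event`, `timestamp`) and the function `normalize_record` are hypothetical, and the sketch assumes timestamps arrive as ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timezone

# Hypothetical schema, chosen only for illustration.
REQUIRED_FIELDS = ("user_id", "event", "timestamp")


def normalize_record(raw: dict) -> dict:
    """Validate one incoming record and convert its timestamp to UTC."""
    missing = [f for f in REQUIRED_FIELDS if raw.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    ts = datetime.fromisoformat(raw["timestamp"])
    if ts.tzinfo is None:
        # Reject naive timestamps instead of guessing the source time zone.
        raise ValueError("timestamp must carry a UTC offset")

    return {
        "user_id": str(raw["user_id"]).strip(),
        "event": str(raw["event"]).strip().lower(),
        # Every stored timestamp uses the same zone and format.
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
    }


record = normalize_record(
    {"user_id": 42, "event": " Login ", "timestamp": "2024-05-31T14:30:00+02:00"}
)
print(record)
```

Rejecting bad records at the boundary, rather than repairing them downstream, follows the same error-proofing idea as poka-yoke on input screens.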

Hot Takes and Wishes

Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.

Automated data set joining and entity resolution across systems would be a game-changer
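To make the entity resolution problem concrete, here is a deliberately naive sketch that matches records from two systems by fuzzy name similarity, using only Python's standard-library `difflib`. The record layout (`id`, `name`), the function names, and the 0.85 threshold are all assumptions for illustration. Real entity resolution has to handle far messier cases, which is why automating it remains a wish.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_entities(left: list, right: list, threshold: float = 0.85) -> list:
    """Pair each left record with its best-scoring right record above threshold."""
    matches = []
    for rec in left:
        best = max(right, key=lambda r: similarity(rec["name"], r["name"]), default=None)
        if best is not None and similarity(rec["name"], best["name"]) >= threshold:
            matches.append((rec["id"], best["id"]))
    return matches


# Hypothetical records from two separate systems.
crm = [{"id": "c1", "name": "Acme Corp."}, {"id": "c2", "name": "Globex Inc"}]
billing = [{"id": "b1", "name": "ACME Corp"}, {"id": "b2", "name": "Initech"}]
print(match_entities(crm, billing))
```

The pairwise comparison is quadratic in the number of records, so a sketch like this only works for small data sets.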

Anjan Banerjee:

Nicolay Gerold: LinkedIn, X (Twitter)

00:00 Understanding Data Architecture

12:36 Choosing the Right Tools

20:36 The Benefits of Serverless Functions

21:34 Integrating AI in Data Acquisition

24:31 The Trend Towards Single Node Engines

26:51 Choosing the Right Database Management System and Storage

29:45 Adding Additional Storage Components

32:35 Reducing Human Errors for Better Data Quality

39:07 Overhyped and Underutilized Tools

Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
