How AI Is Built

Author: Nicolay Gerold


Description

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
26 Episodes
Text embeddings have limitations when it comes to handling long documents and out-of-domain data.

Today, we are talking to Nils Reimers. He is one of the researchers who kickstarted the field of dense embeddings, developed Sentence Transformers, started Hugging Face's neural search team, and now leads the development of search foundation models at Cohere. Tbh, he has too many accolades to count off here.

We talk about the main limitations of embeddings:
Failing out of domain
Struggling with long documents
Very hard to debug
Hard to formalize what actually is similar

Are you still not sure whether to listen? Here are some teasers:
Interpreting embeddings can be challenging, and current models are not easily explainable.
Fine-tuning is necessary to adapt embeddings to specific domains, but it requires careful consideration of the data and objectives.
Re-ranking is an effective approach to handle long documents and incorporate additional factors like recency and trustworthiness (a minimal re-ranking sketch follows these notes).
The future of embeddings lies in addressing scalability issues and exploring new research directions.

Nils Reimers: LinkedIn | X (Twitter) | Website | Cohere
Nicolay Gerold: LinkedIn | X (Twitter)

text embeddings, limitations, long documents, interpretation, fine-tuning, re-ranking, future research

Chapters
00:00 Introduction and Guest Introduction
00:43 Early Work with BERT and Argument Mining
02:24 Evolution and Innovations in Embeddings
03:39 Contrastive Learning and Hard Negatives
05:17 Training and Fine-Tuning Embedding Models
12:48 Challenges and Limitations of Embeddings
18:16 Adapting Embeddings to New Domains
22:41 Handling Long Documents and Re-Ranking
31:08 Combining Embeddings with Traditional ML
45:16 Conclusion and Upcoming Episodes
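For listeners who want to try the re-ranking idea from this episode, here is a minimal sketch of the usual two-stage pattern with the sentence-transformers library: a bi-encoder retrieves candidates, a cross-encoder re-scores them. The model names are common public checkpoints, not necessarily the ones discussed in the episode.

```python
# Two-stage retrieval: dense retrieval for recall, cross-encoder re-ranking for precision.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")                   # bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")       # cross-encoder

docs = ["Doc about embedding limitations...", "Doc about re-ranking...", "Unrelated doc..."]
query = "Why do embeddings fail on long documents?"

# Stage 1: embed the corpus and the query, retrieve a broad candidate set.
doc_emb = retriever.encode(docs, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=50)[0]

# Stage 2: re-score (query, document) pairs with the cross-encoder and sort.
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
print(reranked[0][0][1])  # best document after re-ranking
```

The cross-encoder is too slow to score the whole corpus, which is why it only sees the candidates from stage 1; extra signals like recency can be blended into the final score afterwards.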
Hey! Welcome back.

Today we look at how we can get our RAG system ready for scale. We discuss common problems and their solutions when you introduce more users and more requests to your system.

For this we are joined by Nirant Kasliwal, the author of fastembed. Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like ColPali. This episode is a must-listen for anyone looking to level up their RAG implementations.

"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."
"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."
"You do not want to give the same answer about the company's history to the co-founding CEO and the intern who has just joined."
"Embedding similarity as the signal on which you want to build your entire search is just not quite complete."

Key insights:
Naive RAG often fails due to limitations of embeddings and LLMs' sensitivity to input ordering.
Query profiling and expansion: use clustering and tools like Latent Scope to identify problematic query types; expand queries offline and use parallel searches for better results (see the sketch after these notes).
Metadata extraction: extract temporal, entity, and other relevant information from queries; use LLMs for extraction, with checks against libraries like Stanford NLP.
User personalization: include user role, access privileges, and conversation history; adapt responses based on user expertise and readability scores.
Evaluation and improvement: create synthetic datasets and use real user feedback; employ tools like DSPy for prompt engineering.
Advanced techniques: query routing based on type and urgency; use smaller models (1-3B parameters) for easier iteration and error spotting; implement error handling and cross-validation for extracted metadata.

Nirant Kasliwal: X (Twitter) | LinkedIn | Search in the LLM Era for AI Engineers (course)
Nicolay Gerold: LinkedIn | X (Twitter)

query understanding, AI-powered search, LambdaMART, e-commerce ranking, networking, experts, recommendation, search
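A rough sketch of the "expand queries offline, search in parallel" idea mentioned above. `expand_query` and `search_index` are hypothetical stand-ins for your own LLM call and search backend; the point is the fan-out/gather pattern, not any specific API.

```python
# Fan out one user query into several variants and search them concurrently.
import asyncio

def expand_query(query: str) -> list[str]:
    # Placeholder: in practice an LLM generates paraphrases / sub-queries offline.
    return [query, f"{query} definition", f"{query} examples"]

async def search_index(query: str) -> list[dict]:
    # Placeholder: call your vector store or keyword index here.
    await asyncio.sleep(0)
    return [{"query": query, "doc": f"result for {query!r}"}]

async def expanded_search(query: str) -> list[dict]:
    variants = expand_query(query)
    batches = await asyncio.gather(*(search_index(v) for v in variants))
    # Flatten; de-duplicate and re-rank downstream.
    return [hit for batch in batches for hit in batch]

print(asyncio.run(expanded_search("prompt compression")))
```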
In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author of "Relevant Search". They discuss how methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.

Key Highlights:
Defining relevance is challenging and depends heavily on user intent and context.
Combining multiple search techniques (keyword, semantic, etc.) in tiers can improve results (a fusion sketch follows these notes).
LLMs are emerging as a powerful tool for augmenting traditional search approaches.
Operational concerns often drive architectural decisions in large-scale search systems.
Underappreciated techniques like LambdaMART may see a resurgence.

Key Quotes:
"There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull
"I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture
"Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search
"LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques

Doug Turnbull: LinkedIn | X (Twitter) | Web
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Introduction and Guest Introduction
00:52 Understanding Relevant Search Results
01:18 Search Behavior on Social Media
02:14 Challenges in Defining Relevance
05:12 Query Understanding and Ranking Signals
10:57 Evolution of Search Technologies
15:15 Combining Search Techniques
21:49 Leveraging LLMs and Embeddings
25:49 Operational Considerations in Search Systems
39:09 Concluding Thoughts and Future Directions
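One concrete way to combine retrieval sources as independent tiers, as discussed above, is reciprocal rank fusion. A minimal, library-free sketch; the document ids and the constant k=60 are illustrative defaults:

```python
# Reciprocal rank fusion: merge ranked lists from independent retrieval sources.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is an ordered list of doc ids from one retrieval source."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # high ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_3", "doc_1", "doc_7"]     # keyword tier
vector_hits = ["doc_1", "doc_9", "doc_3"]   # semantic tier
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc_1 and doc_3 float to the top
```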
In this episode, we talk data-driven search optimizations with Charlie Hull.

Charlie is a search expert from Open Source Connections. He has built Flax, one of the leading open source search companies in the UK, has written "Searching the Enterprise", and is one of the main voices on data-driven search. We discuss strategies to improve search systems quantitatively and much more.

Key Points:
Relevance in search is subjective and context-dependent, making it challenging to measure consistently.
Common mistakes in assessing search systems include overemphasizing processing speed and relying solely on user complaints.
Three main methods to measure search system performance: human evaluation, user interaction data analysis, and AI-assisted judgment (with caution). A minimal scoring sketch follows these notes.
Importance of balancing business objectives with user needs when optimizing search results.
Technical components for assessing search systems: query logs analysis, source data quality examination, test queries and cases setup.

Resources mentioned:
Quepid: open-source tool for search quality testing
Haystack conference: upcoming event in Berlin (September 30 - October 1)
Relevance Slack community
OpenSource Connections

Charlie Hull: LinkedIn | X (Twitter)
Nicolay Gerold: LinkedIn | X (Twitter)

search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval

Chapters
00:00 Introduction
01:35 Challenges in Measuring Search Relevance
02:19 Common Mistakes in Search System Assessment
03:22 Methods to Measure Search System Performance
04:28 Human Evaluation in Search Systems
05:18 Leveraging User Interaction Data
06:04 Implementing AI for Search Evaluation
09:14 Technical Components for Assessing Search Systems
12:07 Improving Search Quality Through Data Analysis
17:16 Proactive Search System Monitoring
24:26 Balancing Business and User Objectives in Search
25:08 Search Metrics and KPIs: A Contract Between Teams
26:56 The Role of Recency and Popularity in Search Algorithms
28:56 Experimentation: The Key to Optimizing Search
30:57 Offline Search Labs and A/B Testing
34:05 Simple Levers to Improve Search
37:38 Data Modeling and Its Importance in Search
43:29 Combining Keyword and Vector Search
44:24 Bridging the Gap Between Machine Learning and Information Retrieval
47:13 Closing Remarks and Contact Information
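As a starting point for the quantitative evaluation Charlie advocates, here is a minimal sketch that scores test queries against human relevance judgments. The metric (precision@k) and the sample data are illustrative; tools like Quepid manage this workflow at scale.

```python
# Score a set of test queries against human relevance judgments.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k if top else 0.0

judgments = {"red shoes": {"sku_12", "sku_98"}}           # curated by human raters
results = {"red shoes": ["sku_12", "sku_44", "sku_98"]}   # what the engine returned

for query, relevant in judgments.items():
    print(query, precision_at_k(results[query], relevant, k=3))
```

Tracking a metric like this per query before and after a change turns "the search feels worse" into a measurable regression.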
Welcome back to How AI Is Built. We have got a very special episode to kick off season two. Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He worked for Canva, Algolia, Cisco, Gartner, and Handshake, to pick a few. His core focus is query understanding.

**Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and then finding, scoring, and ranking results based on it. So most of the work happens before you hit the database.

**Key Takeaways:**

- The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
- Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors (a small sketch follows these notes).
- Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
- Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
- Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.

**Daniel Tunkelang**

- [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
- [Medium](https://queryunderstanding.com/)

**Nicolay Gerold:**

- [⁠LinkedIn⁠](https://www.linkedin.com/in/nicolay-gerold/)
- [⁠X (Twitter)](https://twitter.com/nicolaygerold)
- [Substack](https://nicolaygerold.substack.com/)

Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall

[00:00:00] 1. Introduction to Query Understanding - Definition and importance in search systems; evolution of query understanding techniques
[00:05:30] 2. Query Representation Models - The "bag of documents" model for queries; the "bag of queries" model for documents; advantages of holistic query representation
[00:12:00] 3. Query Specificity and Classification - Measuring query specificity using cosine similarity; importance of query classification in search relevance; implementing and leveraging query classifiers
[00:19:30] 4. Named Entity Recognition in Query Understanding - Role of NER in query processing; challenges with unique or tail entities
[00:24:00] 5. Pre-Retrieval Query Processing - Importance of early-stage query analysis; balancing computational resources and impact
[00:28:30] 6. Performance Optimization Techniques - Caching strategies for query understanding; offline processing for document enhancement
[00:33:00] 7. Advanced Techniques: Embeddings and Language Models - Using embeddings for query similarity; role of Large Language Models (LLMs) in search; when to use simpler techniques vs. complex models
[00:39:00] 8. Practical Implementation Strategies - Starting points for engineers new to query understanding; tools and libraries for query understanding (FastText, MiniLM, etc.); balancing precision and recall in search systems
[00:44:00] 9. Visualization and Analysis of Query Spaces - Discussion on t-SNE, UMAP, and other visualization techniques; limitations and alternatives to embedding visualizations
[00:47:00] 10. Future Directions and Closing Thoughts - Emerging trends in query understanding; key takeaways for search system engineers
[00:53:00] End of Episode
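A hedged sketch of the query-specificity signal described above: embed the query and its top results and use the average cosine similarity as a specificity proxy. The sentence-transformers model name is an assumption, not necessarily the one discussed in the episode.

```python
# Query specificity: how tightly do the top results cluster around the query?
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def query_specificity(query: str, top_docs: list[str]) -> float:
    q = model.encode([query])[0]
    d = model.encode(top_docs)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return float(sims.mean())  # higher = more specific, results agree with the query

# A narrow product query should score higher than a broad, ambiguous one.
print(query_specificity("iphone 13 pro max case",
                        ["Case for iPhone 13 Pro Max", "iPhone 13 accessories"]))
print(query_specificity("gifts",
                        ["Kitchen gadgets", "Board games", "Scented candles"]))
```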
Today we are launching season 2 of How AI Is Built.

Over the last few weeks, we spoke to a lot of regular listeners and past guests, collected feedback, and analyzed our episode data. We will be applying those learnings to season 2.

This season will be all about search. We are trying to make it better, more actionable, and more in-depth. The goal is that at the end of this season, you have a full-fledged course on search in podcast form, with mini-courses on specific elements like RAG.

We will be talking to experts from information retrieval, information architecture, recommendation systems, and RAG; from academia and industry. Fields that do not really talk to each other. We will try to unify and transfer the knowledge and give you a full tour of search, so you can build your next search application or feature with confidence.

We will be talking to Charlie Hull on how to systematically improve search systems, to Nils Reimers on the fundamental flaws of embeddings and how to fix them, to Daniel Tunkelang on how to actually understand the queries of the user, and many more.

We will try to bridge the gaps: how to use decades of research and practice in iteratively improving traditional search and apply it to RAG; how to take new methods from recommendation systems and vector databases and bring them into traditional search systems; how to use all of the different methods as search signals and combine them to deliver the results your user actually wants.

We will be using two types of episodes:
Traditional deep dives, like we have done so far. Each one will dive into one specific topic within search, interviewing an expert on that topic.
Supplementary episodes, which answer one additional question; often complementary or precursory knowledge for the deep dive that we did not get to.

We will be starting with episodes next week, looking at the first, last, and overarching action in search: understanding user intent and understanding queries, with Daniel Tunkelang.

I am really excited to kick this off. I would love to hear from you:
What would you love to learn in this season?
What guest should I have on?
What topics should I make a deep dive on (try to be specific)?

Let me know in the comments or just slide into my DMs on Twitter or LinkedIn. I want to be more interactive, so anytime anything is unclear or a question pops up in one of the episodes, give me a shout and I will try to answer it for you and for everyone.

Enough of me rambling. Let's kick this off. I will see you next Thursday, when we start with query understanding.

Shoot me a message and stay up to date: LinkedIn | X (Twitter)
In this episode of "How AI is Built," host Nicolay Gerold interviews Jonathan Yarkoni, founder of Reach Latent. Jonathan shares his expertise in extracting value from unstructured data using AI, discussing challenging projects, the impact of ChatGPT, and the future of generative AI. From weather prediction to legal tech, Jonathan provides valuable insights into the practical applications of AI across various industries.Key TakeawaysGenerative AI projects often require less data cleaning due to the models' tolerance for "dirty" data, allowing for faster implementation in some cases.The success of AI projects post-delivery is ensured through monitoring, but automatic retraining of generative AI applications is not yet common due to evaluation challenges.Industries ripe for AI disruption include text-heavy fields like legal, education, software engineering, and marketing, as well as biotech and entertainment.The adoption of AI is expected to occur in waves, with 2024 likely focusing on internal use cases and 2025 potentially seeing more customer-facing applications as models improve.Synthetic data generation, using models like GPT-4, can be a valuable approach for training AI systems when real data is scarce or sensitive.Evaluation frameworks like RAGAS and custom metrics are essential for assessing the quality of synthetic data and AI model outputs.Jonathan’s ideal tech stack for generative AI projects includes tools like Instructor, Guardrails, Semantic Routing, DSPY, LangChain, and LlamaIndex, with a growing emphasis on evaluation stacks.Key Quotes"I think we're going to see another wave in 2024 and another one in 2025. And people are familiarized. That's kind of the wave of 2023. 2024 is probably still going to be a lot of internal use cases because it's a low risk environment and there was a lot of opportunity to be had.""To really get to production reliably, we have to have these tools evolve further and get more standardized so people can still use the old ways of doing production with the new technology."Jonathan YarkoniLinkedInYouTubeX (Twitter)Reach LatentNicolay Gerold:⁠LinkedIn⁠⁠X (Twitter)Chapters00:00 Introduction: Extracting Value from Unstructured Data 03:16 Flexible Tailoring Solutions to Client Needs 05:39 Monitoring and Retraining Models in the Evolving AI Landscape 09:15 Generative AI: Disrupting Industries and Unlocking New Possibilities 17:47 Balancing Immediate Results and Cutting-Edge Solutions in AI Development 28:29 Dream Tech Stack for Generative AIunstructured data, textual data, automation, weather prediction, data cleaning, chat GPT, AI disruption, legal, education, software engineering, marketing, biotech, immediate results, cutting-edge solutions, tech stack
This episode of "How AI Is Built" is all about data processing for AI. Abhishek Choudhary and Nicolay discuss Spark and alternatives to process data so it is AI-ready.Spark is a distributed system that allows for fast data processing by utilizing memory. It uses a dataframe representation "RDD" to simplify data processing.When should you use Spark to process your data for your AI Systems?→ Use Spark when:Your data exceeds terabytes in volumeYou expect unpredictable data growthYour pipeline involves multiple complex operationsYou already have a Spark cluster (e.g., Databricks)Your team has strong Spark expertiseYou need distributed computing for performanceBudget allows for Spark infrastructure costs→ Consider alternatives when:Dealing with datasets under 1TBIn early stages of AI developmentBudget constraints limit infrastructure spendingSimpler tools like Pandas or DuckDB sufficeSpark isn't always necessary. Evaluate your specific needs and resources before committing to a Spark-based solution for AI data processing.In today’s episode of How AI Is Built, Abhishek and I discuss data processing:When to use Spark vs. alternatives for data processingKey components of Spark: RDDs, DataFrames, and SQLIntegrating AI into data pipelinesChallenges with LLM latency and consistencyData storage strategies for AI workloadsOrchestration tools for data pipelinesTips for making LLMs more reliable in productionAbhishek Choudhary:LinkedInGitHubX (Twitter)Nicolay Gerold:⁠LinkedIn⁠⁠X (Twitter)
In this episode, Nicolay talks with Rahul Parundekar, founder of AI Hero, about the current state and future of AI agents. Drawing from over a decade of experience working on agent technology at companies like Toyota, Rahul emphasizes the importance of focusing on realistic, bounded use cases rather than chasing full autonomy. They dive into the key challenges, like effectively capturing expert workflows and decision processes, delivering seamless user experiences that integrate into existing routines, and managing costs through techniques like guardrails and optimized model choices. The conversation also explores potential new paradigms for agent interactions beyond just chat. Key Takeaways: Agents need to focus on realistic use cases rather than trying to be fully autonomous. Enterprises are unlikely to allow agents full autonomy anytime soon. Capturing the logic and workflows in the user's head is the key challenge. Shadowing experts and having them demonstrate workflows is more effective than asking them to document processes. User experience is crucial - agents must integrate seamlessly into existing user workflows without major disruptions. Interfaces beyond just chat may be needed. Cost control is important - techniques like guardrails, context windowing, model choice optimization, and dev vs production modes can help manage costs. New paradigms beyond just chat could be powerful - e.g. workflow specification, state/declarative definition of desired end-state. Prompt engineering and dynamic prompt improvement based on feedback remain an open challenge. Key Quotes: "Empowering users to create their own workflows is essential for effective agent usage." "Capturing workflows accurately is a significant challenge in agent development." "Preferences, right? So a lot of the work becomes like, hey, can you do preference learning for this user so that the next time the user doesn't have to enter the same information again, things like that." Rahul Parundekar: AI Hero AI Hero Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Exploring the Potential of Autonomous Agents 02:23 Challenges of Accuracy and Repeatability in Agents 08:31 Capturing User Workflows and Improving Prompts 13:37 Tech Stack for Implementing Agents in the Enterprise agent development, determinism, user experience, agent paradigms, private use, human-agent interaction, user workflows, agent deployment, human-in-the-loop, LLMs, declarative ways, scalability, AI Hero
In this conversation, Nicolay and Richmond Alake discuss various topics related to building AI agents and using MongoDB in the AI space. They cover the use of agents and multi-agents, the challenges of controlling agent behavior, and the importance of prompt compression. When you are building agents, build them iteratively: start with simple LLM calls before moving to multi-agent systems.

Main Takeaways:
Prompt Compression: Using techniques like prompt compression can significantly reduce the cost of running LLM-based applications by reducing the number of tokens sent to the model. This becomes crucial when scaling to production.
Memory Management: Effective memory management is key for building reliable agents. Consider different memory components like long-term memory (knowledge base), short-term memory (conversation history), semantic cache, and operational data (system logs). Store each in separate collections for easy access and reference (a small sketch follows these notes).
Performance Optimization: Optimize performance across multiple dimensions - output quality (by tuning context and knowledge base), latency (using semantic caching), and scalability (using auto-scaling databases like MongoDB).
Prompting Techniques: Leverage prompting techniques like ReAct (observe, plan, act) and structured prompts (JSON, pseudo-code) to improve agent predictability and output quality.
Experimentation: Continuous experimentation is crucial in this rapidly evolving field. Try different frameworks (LangChain, CrewAI, Haystack), models (Claude, Anthropic, open-source), and techniques to find the best fit for your use case.

Richmond Alake: LinkedIn | Medium | Find Richmond on MongoDB | X (Twitter) | YouTube | GenAI Showcase MongoDB | MongoDB AI Stack
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Reducing the Scope of AI Agents
01:55 Seamless Data Ingestion
03:20 Challenges and Considerations in Implementing Multi-Agents
06:05 Memory Modeling for Robust Agents with MongoDB
15:05 Performance Optimization in AI Agents
18:19 RAG Setup

AI agents, multi-agents, prompt compression, MongoDB, data storage, data ingestion, performance optimization, tooling, generative AI
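A minimal sketch of the memory layout Richmond describes, assuming a local MongoDB instance and pymongo: separate collections for long-term knowledge, conversation history, semantic cache, and operational logs. Field names and values are illustrative, not a prescribed schema.

```python
# One database per agent, one collection per memory type.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: local MongoDB instance
db = client["agent_memory"]

# Long-term memory: the knowledge base the agent retrieves from.
db["knowledge_base"].insert_one({"text": "Refund policy is 30 days.", "embedding": [0.1, 0.2]})

# Short-term memory: the running conversation.
db["conversation_history"].insert_one({
    "session_id": "abc123",
    "role": "user",
    "content": "What is the refund policy?",
    "ts": datetime.now(timezone.utc),
})

# Semantic cache: previously answered queries, keyed by embedding.
db["semantic_cache"].insert_one({"query_embedding": [0.1, 0.2], "cached_answer": "30 days."})

# Operational data: system logs for debugging and cost tracking.
db["operational_logs"].insert_one({"event": "tool_call", "latency_ms": 420})
```

Keeping the four concerns in separate collections makes it easy to index, expire, and query each one on its own terms (for example, a TTL index on the semantic cache).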
In this episode, Kirk Marple, CEO and founder of Graphlit, shares his expertise on building efficient data integrations. Kirk breaks down his approach using relatable concepts: The "Two-Sided Funnel": This model streamlines data flow by converting various data sources into a standard format before distributing it. Universal Data Streams: Kirk explains how he transforms diverse data into a single, manageable stream of information. Parallel Processing: Learn about the "competing consumer model" that allows for faster data handling. Building Blocks for Success: Discover the importance of well-defined interfaces and actor models in creating robust data systems. Tech Talk: Kirk discusses data normalization techniques and the potential shift towards a more streamlined "Kappa architecture." Reusable Patterns: Find out how Kirk's methods can speed up the integration of new data sources. Kirk Marple: LinkedIn X (Twitter) Graphlit Graphlit Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Chapters 00:00 Building Integrations into Different Tools 00:44 The Two-Sided Funnel Model for Data Flow 04:07 Using Well-Defined Interfaces for Faster Integration 04:36 Managing Feeds and State with Actor Models 06:05 The Importance of Data Normalization 10:54 Tech Stack for Data Flow 11:52 Progression towards a Kappa Architecture 13:45 Reusability of Patterns for Faster Integration data integration, data sources, data flow, two-sided funnel model, canonical format, stream of ingestible objects, competing consumer model, well-defined interfaces, actor model, data normalization, tech stack, Kappa architecture, reusability of patterns
In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs). Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns. "I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies "If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases Derek Tu: LinkedIn Carbon Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Key Takeaways: Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, Email, Google Docs, and more. Consider the unique characteristics of each source, including data format, structure, and metadata. Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that can be easily consumed by your AI models. Experiment with chunking strategies: While there's no one-size-fits-all approach to chunking, it's essential to experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks. Leverage metadata and tagging: Metadata and tagging can play a crucial role in organizing and retrieving relevant data for your AI models. Implement mechanisms to capture and store important metadata, such as document types, topics, and timestamps, and consider using AI-powered tagging to automatically categorize your data. Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider factors like model performance, dimensionality, and compatibility with your data types. Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your specific use case. 
00:00 Introduction and Optimizing Embedding Models 03:00 The Evolution of Carbon and Focus on Unstructured Data 06:19 Customer Progression and Target Group 09:43 Interesting Use Cases and Handling Different Data Representations 13:30 Chunking Strategies and Normalization 20:14 Approach to Chunking and Choosing a Vector Database 23:06 Tech Stack and Recommended Tools 28:19 Future of Carbon: Multimodal Models and Building a Platform Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises
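A hedged illustration of the embedding quantization Derek mentions: scalar-quantize float32 vectors to 8-bit integers to cut storage roughly 4x, accepting a small accuracy hit. This is a generic numpy sketch, not Carbon's implementation, and the exact savings he cites are not reproduced here.

```python
# Scalar (int8) quantization of embeddings: 4 bytes per dimension -> 1 byte per dimension.
import numpy as np

embeddings = np.random.default_rng(0).normal(size=(1000, 768)).astype(np.float32)

# Per-dimension min/max scaling into the 0..255 range.
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
scale = (hi - lo) / 255.0
quantized = np.round((embeddings - lo) / scale).astype(np.uint8)

print(embeddings.nbytes, "bytes float32 ->", quantized.nbytes, "bytes uint8")

# Dequantize approximately at query time; the small error usually costs little recall.
restored = quantized.astype(np.float32) * scale + lo
```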
In this episode, Nicolay sits down with Hugo Lu, founder and CEO of Orchestra, a modern data orchestration platform. As data pipelines and analytics workflows become increasingly complex, spanning multiple teams, tools and cloud services, the need for unified orchestration and visibility has never been greater. Orchestra is a serverless data orchestration tool that aims to provide a unified control plane for managing data pipelines, infrastructure, and analytics across an organization's modern data stack. The core architecture involves users building pipelines as code which then run on Orchestra's serverless infrastructure. It can orchestrate tasks like data ingestion, transformation, AI calls, as well as monitoring and getting analytics on data products. All with end-to-end visibility, data lineage and governance even when organizations have a scattered, modular data architecture across teams and tools. Key Quotes: Find the right level of abstraction when building data orchestration tasks/workflows. "I think the right level of abstraction is always good. I think like Prefect do this really well, right? Their big sell was, just put a decorator on a function and it becomes a task. That is a great idea. You know, just make tasks modular and have them do all the boilerplate stuff like error logging, monitoring of data, all of that stuff.” Modularize data pipeline components: "It's just around understanding what that dev workflow should look like. I think it should be a bit more modular." Having a modular architecture where different components like data ingestion, transformation, model training are decoupled allows better flexibility and scalability. Adopt a streaming/event-driven architecture for low-latency AI use cases: "If you've got an event-driven architecture, then, you know, that's not what you use an orchestration tool for...if you're having a conversation with a chatbot, like, you know, you're sending messages, you're sending events, you're getting a response back. That I would argue should be dealt with by microservices." Hugo Lu: LinkedIn Newsletter Orchestra Orchestra Docs Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Introduction to Orchestra and its Focus on Data Products 08:03 Unified Control Plane for Data Stack and End-to-End Control 14:42 Use Cases and Unique Applications of Orchestra 19:31 Retaining Existing Dev Workflows and Best Practices in Orchestra 22:23 Event-Driven Architectures and Monitoring in Orchestra 23:49 Putting Data Products First and Monitoring Health and Usage 25:40 The Future of Data Orchestration: Stream-Based and Cost-Effective data orchestration, Orchestra, serverless architecture, versatility, use cases, maturity levels, challenges, AI workloads
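The decorator pattern Hugo praises in Prefect, as a minimal sketch (Prefect 2.x API assumed): a plain function becomes a task with retries and logging by adding a decorator, and a flow wires the tasks together. Function bodies are illustrative.

```python
# Tasks stay modular; the framework handles retries, logging, and observability.
from prefect import flow, task

@task(retries=2)
def extract() -> list[dict]:
    return [{"id": 1, "text": "raw record"}]

@task
def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "text": r["text"].upper()} for r in rows]

@flow
def pipeline():
    rows = extract()
    return transform(rows)

if __name__ == "__main__":
    pipeline()  # runs locally; the same flow can be deployed to a scheduler
```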
Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hassan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers! Zain Hasan: LinkedIn X (Twitter) Weaviate Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Key Insights: Vector databases can handle not just text, but also image, audio, and video data Quantization is a powerful technique to significantly reduce costs and enable in-memory search Binary quantization allows efficient brute force search for smaller datasets Multi-vector search enables retrieval of heterogeneous data types within the same index The future lies in multimodal search and recommendations across different senses Brain-computer interfaces and EEG foundation models are exciting areas to watch Key Quotes: "Vector databases are pretty much the commercialization and the productization of representation learning." "I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality." "Going from text to multimedia in vector databases is really simple." "Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application." Chapters 00:00 - 01:24 Introduction 01:24 - 03:48 Underappreciated aspects of vector databases 03:48 - 06:06 Quantization trade-offs and techniques Various quantization techniques: binary quantization, product quantization, scalar quantization 06:06 - 08:24 Binary quantization Reducing vectors from 32-bits per dimension down to 1-bit Enables efficient in-memory brute force search for smaller datasets Requires normally distributed data between negative and positive values 08:24 - 10:44 Product quantization and other techniques Alternative to binary quantization, segments vectors and clusters each segment Scalar quantization reduces vectors to 8-bits per dimension 10:44 - 13:08 Quantization as a "superpower" to reduce costs 13:08 - 15:34 Comparing quantization approaches 15:34 - 17:51 Placing vector databases in the database landscape 17:51 - 20:12 Pruning unused vectors and nodes 20:12 - 22:37 Improving precision beyond similarity thresholds 22:37 - 25:03 Multi-vector search 25:03 - 27:11 Impact of vector databases on data interaction 27:11 - 29:35 Interesting and weird use cases 29:35 - 32:00 Future of multimodal search and recommendations 32:00 - 34:22 Extending recommendations to user data 34:22 - 36:39 What's next for Weaviate 36:39 - 38:57 Exciting technologies beyond vector databases and LLMs vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
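A minimal numpy sketch of the binary quantization Zain describes: keep one bit per dimension (the sign of the value) and compare vectors with Hamming distance, which makes brute-force search over smaller datasets cheap. This is a generic illustration, not Weaviate's implementation.

```python
# Binary quantization: 768 float32 values (3 KB) -> 96 packed bytes per vector.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768)).astype(np.float32)   # roughly zero-centered, as required
query = rng.normal(size=768).astype(np.float32)

corpus_bits = np.packbits(corpus > 0, axis=1)   # 1 bit per dimension
query_bits = np.packbits(query > 0)

# Hamming distance via XOR over the packed bytes, then popcount.
xor = np.bitwise_xor(corpus_bits, query_bits)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
print("closest ids:", np.argsort(hamming)[:5])
```

In practice the binary pass is used as a fast first filter, with the full-precision vectors re-scoring a small candidate set to recover accuracy.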
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure. Summary by Section Introduction Anjan Banerjee, a data architect, discusses building complex AI and data systems Explains the basics of data architecture using Lego and chat app examples Sources and Tools Identifying data sources is the first step in designing a data architecture Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.) Use one tool for most activities if possible, but specialized tools offer benefits Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery) Airflow and Orchestration Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills For less technical orgs, GUI-based tools like Talend, Alteryx may be better AWS Step Functions and managed Airflow are improving native orchestration capabilities For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte AI and Data Processing ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud TinyML and edge computing enable ML inference on device (drones, manufacturing) Cloud batch processing still dominates for user targeting, recommendations Data Lakes and Storage Storage choice depends on data types, use cases, cloud ecosystem Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata Pulling data into separate system often needed for advanced analytics beyond source system Data Quality and Standardization "Poka-yoke" error-proofing of input screens is vital for downstream data quality Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization Hot Takes and Wishes Snowflake is overhyped; great UX but costly at scale. Databricks is preferred. Automated data set joining and entity resolution across systems would be a game-changer Anjan Banerjee: LinkedIn Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Understanding Data Architecture 12:36 Choosing the Right Tools 20:36 The Benefits of Serverless Functions 21:34 Integrating AI in Data Acquisition 24:31 The Trend Towards Single Node Engines 26:51 Choosing the Right Database Management System and Storage 29:45 Adding Additional Storage Components 32:35 Reducing Human Errors for Better Data Quality 39:07 Overhyped and Underutilized Tools Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.

Lake house architecture: a blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
Key components and decisions: storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hudi), and query engines (Apache Spark, Polars).
Optimizations: partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance (a partitioning sketch follows these notes).
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with dlt: an open-source Python library for building data pipelines, focusing on efficient data extraction and loading.

Key Takeaways:
Lake houses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like dlt simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

Sound Bites:
"The lake house is sort of a modular setup where you decouple the storage and the compute."
"A lake house is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lake house are Delta, Iceberg, and Apache Hudi."

Jorrit Sandbrink: LinkedIn | dlt
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Introduction to the Lake House Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack

lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, dlt, data ingress, data processing, data storage
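A small sketch of the decoupled storage idea under stated assumptions: the `deltalake` (delta-rs) and Polars packages, a local path, and a partition column chosen for pruning. The table format choice here is illustrative, not a recommendation of Delta over Iceberg or Hudi.

```python
# Write a partitioned Delta table, then query it back with a lightweight engine.
import polars as pl
from deltalake import write_deltalake

df = pl.DataFrame({
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Partitioning by a low-cardinality column keeps file pruning effective at query time.
write_deltalake("./lake/events", df.to_arrow(), partition_by=["event_date"], mode="overwrite")

# Any engine that speaks the table format can read it; compute is decoupled from storage.
print(pl.read_delta("./lake/events").filter(pl.col("event_date") == "2024-05-01"))
```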
Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

Key Points:
Knowledge Graphs: Graphlit utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.

Notable Quotes:
"Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
"Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
"[For RAG]...you have to find constraints to make it workable."
"Entity resolution, deduping, I think is probably the number one thing."
"I've essentially built a connector infrastructure that would be like a Fivetran or something that Airflow would have..."
"One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
"Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

Kirk Marple: LinkedIn | X (Twitter) | Graphlit | Graphlit Docs
Nicolay Gerold: LinkedIn | X (Twitter)

Chapters
00:00 Graphlit's Hybrid Approach
02:23 Use Cases and Transition to Graphlit
04:19 Knowledge Graphs as a Filtering Mechanism
13:23 Using Gremlin for Querying the Graph
32:36 XML in Prompts for Better Segmentation
35:04 The Future of LLMs and Graphlit
36:25 Getting Started with Graphlit

Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal
From Problem to Requirements to Architecture. In this episode, Nicolay Gerold and Jon Erich Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data. "Don't overcomplicate what you're actually doing." "Getting your basic programming software development skills down is super important to becoming a good data engineer." "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore." Key Takeaways: Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this. Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use. Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts. Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management. The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success. The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated. Jon Erik Kemi Warghed: LinkedIn Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) Chapters 00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords 00:57 How to Choose the Right Tools: Considerations for startups and large companies 03:13 Evaluating Open Source Tools: Background checks and due diligence 07:52 Defining Data Governance: Transparency and understanding of data 10:15 The Importance of Data Governance: Challenges and solutions 12:21 Data Governance Tools: dbt and Dagster 17:05 The Impact of Dagster: Software-defined assets and declarative thinking 19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage 21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management 26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines 28:47 The Importance of Tool Selection: Thinking about long-term sustainability 31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
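A minimal sketch of the software-defined-assets idea discussed here, using Dagster's `@asset` decorator: you declare what data should exist, and dependencies are inferred from function arguments. Asset names and logic are illustrative.

```python
# Declarative assets: "daily_revenue depends on raw_orders" is expressed by the argument name.
from dagster import asset, materialize

@asset
def raw_orders() -> list[dict]:
    return [{"order_id": 1, "amount": 42.0}]

@asset
def daily_revenue(raw_orders: list[dict]) -> float:
    # Depends on raw_orders simply by naming it as an argument.
    return sum(o["amount"] for o in raw_orders)

if __name__ == "__main__":
    materialize([raw_orders, daily_revenue])
```

The shift from "run these steps in this order" to "these assets should exist" is what makes lineage, freshness checks, and selective re-materialization straightforward.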
In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.

Sound Bites:
"The modern era, definitely Airflow. Took the market share, a lot of people running it themselves."
"It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
"The DAG introduced two other components. It's directed acyclic graph is what DAG means, but directed is like there's a start and there's a finish and the acyclic is there's no loops."

Key Topics:
The evolution of data orchestration: from basic task scheduling to complex DAG-based solutions (a minimal DAG sketch follows these notes).
What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
The crowded market: a look at popular options like Airflow, Dagster, Prefect, and more.
Best practices: choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
Data residency and GDPR: how regulations influence tool selection, especially in Europe.
Future of the field: the need for consolidation and finding the right balance between features and usability.

John Wessel: LinkedIn | Data Stack Show | Agreeable Data
Nicolay Gerold: LinkedIn | X (Twitter)

data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI

Chapters
00:00 Introduction and Overview
00:34 The Evolution of Data Orchestration Tools
04:54 Components and Flow of Data in Orchestrators
08:24 Deployment Options: Serverless vs. Kubernetes
11:14 Considerations for Data Residency and Security
13:02 The Need for a Clear Winner in the Orchestration Space
20:47 Optimization Techniques for Memory and Time-Limited Issues
23:09 Integrating Orchestrators with Infrastructure-as-Code
24:33 Bridging the Gap Between Data and Engineering Practices
27:22 Exciting Technologies Outside of Data Orchestration
30:09 The Future of Dagster
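A minimal sketch of the DAG concept from the episode, using Airflow's TaskFlow API (Airflow 2.4+ assumed for the `schedule` argument): a directed start-to-finish chain of tasks with no cycles. Schedule and task bodies are illustrative.

```python
# A tiny DAG: extract -> load, directed and acyclic.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nightly_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(values: list[int]) -> None:
        print(f"loaded {len(values)} rows")

    load(extract())   # the directed edge: load runs only after extract succeeds

nightly_pipeline()
```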
In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness. Main Insights Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but require significant post-training optimization for specific use cases. Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process. Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient. RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment. Practical Takeaways Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process. Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases. RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge. Interesting Quotes "...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation. "Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation. "We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation. Ragas: Web Docs Jithin James: LinkedIn Shahul ES: LinkedIn X (Twitter) Nicolay Gerold: ⁠LinkedIn⁠ ⁠X (Twitter) 00:00 Introduction 02:03 Introduction to Open Assistant project 04:05 Creating Customizable and Fine-Tunable Models 06:07 Ragas and the LLM Use Case 08:09 Introduction to Language Model Metrics (LLMs) 11:12 Reducing the Cost of Data Generation 13:19 Evaluation of Components at Melvess 15:40 Combining Ragas Metrics with AutoML Providers 20:08 Improving Performance with Fine-tuning and Reranking 22:56 End-to-End Metrics and Component-Specific Metrics 25:14 The Importance of Deep Knowledge and Understanding 25:53 Robustness vs Optimization 26:32 Challenges of Evaluating Models 27:18 Creating a Dream Tech Stack 27:47 The Future Roadmap for Ragas 28:02 Doubling Down on Grid Data Generation 28:12 Open-Source Models and Expanded Support 28:20 More Metrics for Different Applications RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.