 Episodes
Can we use machine learning to detect security threats in real time? As organizations increasingly rely on distributed systems, it is becoming more important to quickly analyze the traffic that passes through those systems. Confluent Hackathon ’22 finalist Géraud Dugé de Bernonville (Data Consultant, Zenika Bordeaux) shares how his team used TensorFlow (machine learning) and Neo4j (graph database) to analyze network traffic data and detect threats in real time. What started as a research and development exercise turned into ZIEM, a full-blown internal project using ksqlDB to manipulate, export, and visualize data from Apache Kafka®.

Géraud and his team noticed that large amounts of data passed through their network, and they were curious to see if they could detect threats as they happened. As a hackathon project, they built ZIEM, a network mapping and intrusion detection platform that quickly generates network diagrams. Using Kafka, the system captures network packets, processes the data in ksqlDB, and uses a Neo4j Sink Connector to send it to a Neo4j instance. Using the Neo4j browser, users can see instant network diagrams showing who's on the network, allowing them to detect anomalies quickly in real time.

The ZIEM project was initially conceived as an experiment to explore the potential of using Kafka for data processing and manipulation. However, it soon became apparent that there was great potential for broader applications (banking, security, etc.). As a result, the focus shifted to developing a tool for exporting data from Kafka, which is helpful in transforming data for deeper analysis, moving it from one database to another, or creating powerful visualizations.

Géraud goes on to talk about how the success of this project has helped them better understand the potential of using Kafka for data processing. Zenika plans to continue working to build a pipeline that can handle more robust visualizations, expose more learning opportunities, and detect patterns.

EPISODE LINKS
ZIEM Project on GitHub
ksqlDB 101 course
ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work Together ft. Simon Aubury
Real-Time Stream Processing, Monitoring, and Analytics with Apache Kafka
Application Data Streaming with Apache Kafka and Swim
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

What happens when you need to store more than a few petabytes of data? Rittika Adhikari (Software Engineer, Confluent) discusses how her team implemented tiered storage, a method for improving the scalability and elasticity of data storage in Apache Kafka®. She also explores the motivating factors for building it in the first place: cost, performance, and manageability.

Before tiered storage, there was no real way to retain Kafka data indefinitely. Because of the tight coupling between compute and storage, users were forced to use different tools to access cold and hot data. Additionally, the cost of re-replication was prohibitive because Kafka had to process large amounts of data rather than small hot sets.

As a member of the Kafka Storage Foundations team, Rittika explains to Kris Jenkins how her team initially considered a Kafka data lake but settled on a more cost-effective method: tiered storage. With tiered storage, one tier handles elasticity and throughput for long-term storage, while the other tier is dedicated to high-cost, low-latency, short-term storage. Before, re-replication impacted all brokers, slowing down performance because it required more replication cycles. By decoupling compute and storage, they now only replicate the hot set rather than weeks of data.

Ultimately, tiered storage broke down the barrier between compute and storage by separating data into multiple tiers across the cloud. This allowed for better scalability and elasticity and reduced operational toil.

In preparation for a broader rollout to customers who heavily rely on compacted topics, Rittika’s team will be implementing tier compaction to support tiering of compacted topics. The goal is to have the partition leader perform compaction. This will substantially reduce compaction costs (CPU/disk) because the number of replicas compacting is significantly smaller. It also protects broker resources through a new compaction algorithm and throttling.

EPISODE LINKS
Jun Rao explains: What is Tiered Storage?
Enabling Tiered Storage
Infinite Storage in Confluent Platform
Kafka Storage and Processing Fundamentals
KIP-405: Kafka Tiered Storage
Optimizing Apache Kafka’s Internals with Its Co-Creator Jun Rao

In principle, data mesh architecture should liberate teams to build their systems and gather data in a distributed way, without having to explicitly coordinate. Data is the thing that can and should decouple teams, but proper implementation has its challenges.

In this episode, Kris talks to Florian Albrecht (Solution Architect, Hermes Germany) about Galapagos, an open-source DevOps software tool for Apache Kafka® that Albrecht created with his team at Hermes, a German parcel delivery company. After Hermes chose Kafka to implement a company-wide event-driven architecture, Albrecht’s team created rules and guidelines on how to use and really make the most out of Kafka. But the hands-off approach wasn’t leading to greater independence, so Albrecht’s team tried something other than documentation: they encoded the rules as software.

This method pushed the teams to stop thinking in terms of data and to start thinking in terms of events. Previously, applications copied data from one point to another, with slight changes each time. In the end, teams with conflicting data were left asking when the data changed and why, with a real impact on customers who might be left wondering when their parcel was redirected and how. Every application would then have to be checked to find out when exactly the data was changed. Event architecture ends this cycle. Events are immutable, and changes are registered as new domain-specific events. Packaged together as event envelopes, they can be safely copied to other applications and can provide significant insights. There is no need to check each application to find out when manually entered or imported data was changed, because the complete history exists in the event envelope. More importantly, there are no more time-consuming collaborations where teams help each other to interpret the data. Using Galapagos helped the teams at Hermes to switch their thinking from raw data to events.

Galapagos also empowers business teams to take charge of their own data needs by providing a protective buffer. When specific teams (providers of data or events) want to change something, Galapagos enforces a method that will not kill the production applications already reading the data. Teams can add new fields, which existing applications can ignore, but a previously required field that an application could be relying on cannot be changed. Business partners using Galapagos found they were better prepared to give answers to their developer colleagues, allowing different parts of the business to communicate in ways they hadn’t before. Through Galapagos, Hermes saw better success decoupling teams.

EPISODE LINKS
A Guide to Data Mesh
Practical Data Mesh ebook
Galapagos GitHub
Florian Albrecht GitHub

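The schema rule described in the Galapagos episode (new fields may be added, but a required field that consumers could rely on must not change) can be illustrated with a small check. This is only a sketch under stated assumptions, not how Galapagos itself is implemented: the `Field` type, the dict-based schema shape, and the function name `breaking_changes` are all hypothetical and chosen for the example.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    kind: str        # hypothetical type label, e.g. "string"
    required: bool   # whether consumers may rely on the field being present


Schema = Dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """Return reasons why `new` could break applications reading `old`."""
    problems = []
    for name, field in old.items():
        if not field.required:
            continue  # consumers cannot rely on optional fields being present
        if name not in new:
            problems.append(f"required field '{name}' was removed")
        elif new[name].kind != field.kind:
            problems.append(f"required field '{name}' changed type")
        elif not new[name].required:
            problems.append(f"required field '{name}' became optional")
    return problems


# Hypothetical parcel event schema, used for illustration only.
old = {"parcel_id": Field("string", True), "note": Field("string", False)}
added = {**old, "redirected_to": Field("string", False)}
removed = {"note": Field("string", False)}

print(breaking_changes(old, added))    # empty list: only a new field was added
print(breaking_changes(old, removed))  # reports the missing required field
```

Fields that exist only in the new schema are never inspected, which mirrors the idea that existing readers can simply ignore them.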
Is real-time data streaming the future, or will batch processing always be with us? Interest in streaming data architecture is booming, but just as many teams are still happily batching away. Batch processing is still simpler to implement than stream processing, and successfully moving from batch to streaming requires a significant change to a team’s habits and processes, as well as a meaningful upfront investment. Some are even running dbt in micro batches to simulate an effect similar to streaming, without having to make the full transition. Will streaming ever fully take over?

In this episode, Kris talks to a panel of industry experts with decades of experience building and implementing data systems. They discuss the state of streaming adoption today, whether streaming will ever fully replace batch, and whether it even could (or should). Is micro batching the natural stepping stone between batch and streaming? Will there ever be a unified understanding of how data should be processed over time? Is the lack of agreement on best practices for data streaming an insurmountable obstacle to widespread adoption? What exactly is holding teams back from fully adopting a streaming model?

Recorded live at Current 2022: The Next Generation of Kafka Summit, the panel includes Adi Polak (Vice President of Developer Experience, Treeverse), Amy Chen (Partner Engineering Manager, dbt Labs), Eric Sammer (CEO, Decodable), and Tyler Akidau (Principal Software Engineer, Snowflake).

EPISODE LINKS
dbt Labs
Decodable
lakeFS
Snowflake
View sessions and slides from Current 2022
Stream Processing vs. Batch Processing: What to Know
From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine

Streaming real-time data at scale and processing it efficiently is critical to cybersecurity organizations like SecurityScorecard. Jared Smith, Senior Director of Threat Intelligence, and Brandon Brown, Senior Staff Software Engineer, Data Platform at SecurityScorecard, discuss their journey from using RabbitMQ to open-source Apache Kafka® for stream processing, as well as why turning to fully managed Kafka on Confluent Cloud is the right choice for building real-time data pipelines at scale.

SecurityScorecard mines data from dozens of digital sources to discover security risks and flaws with the potential to expose their clients’ data. This includes scanning and ingesting data from a large number of ports to identify suspicious IP addresses, exposed servers, out-of-date endpoints, malware-infected devices, and other potential cyber threats for more than 12 million companies worldwide.

To allow real-time stream processing for the organization, the team moved away from RabbitMQ to open-source Kafka, processing a massive amount of data in a matter of milliseconds instead of weeks or months. This lets the team assess a website’s security posture quickly in the face of constantly evolving security threats. The team had relied on batch pipelines to push data to and from Amazon S3, as well as expensive REST API-based communication carrying data between systems. They also spent significant time and resources on open-source Kafka upgrades on Amazon MSK.

Self-maintaining the Kafka infrastructure increased operational overhead and escalated costs. In order to scale faster, govern data better, and ultimately lower the total cost of ownership (TCO), Brandon, lead of the organization’s Pipeline team, pivoted towards a fully managed, cloud-native approach for more scalable streaming data pipelines, and for the development of a new Automatic Vendor Detection (AVD) product.

Jared and Brandon continue to leverage the Cloud for use cases including using PostgreSQL and pushing data to downstream systems using CSC connectors, increasing data governance and security for streaming scalability, and more.

EPISODE LINKS
SecurityScorecard Case Study
Building Data Pipelines with Apache Kafka and Confluent

What are some recommendations to consider when running Apache Kafka® in production? Jun Rao, one of the original Kafka creators, as well as an ongoing committer and PMC member, shares the essential wisdom he's gained from developing Kafka and dealing with a large number of Kafka use cases.

Here are 6 recommendations for maximizing Kafka in production:

1. Nail Down the Operational Part
When setting up your cluster, in addition to dealing with the usual architectural issues, make sure to also invest time in alerting, monitoring, logging, and other operational concerns. Managing a distributed system can be tricky, and you have to make sure that all of its parts are healthy together. This will give you a chance at catching cluster problems early, rather than after they have become full-blown crises.

2. Reason Properly About Serialization and Schemas Up Front
At the Kafka API level, events are just bytes, which gives your application the flexibility to use various serialization mechanisms. Avro has the benefit of decoupling schemas from data serialization, whereas Protobuf is often preferred by those experienced with remote procedure calls; JSON Schema is user friendly but verbose. When you are choosing your serialization, it's a good time to reason about schemas, which should be well-thought-out contracts between your publishers and subscribers. You should know who owns a schema as well as the path for evolving that schema over time.

3. Use Kafka As a Central Nervous System Rather Than As a Single Cluster
Teams typically start out with a single, independent Kafka cluster, but they could benefit, even from the outset, by thinking of Kafka more as a central nervous system that they can use to connect disparate data sources. This enables data to be shared among more applications.

4. Utilize Dead Letter Queues (DLQs)
DLQs can keep service delays from blocking the processing of your messages. For example, instead of using a unique topic for each customer to which you need to send data (potentially millions of topics), you may prefer to use a shared topic, or a series of shared topics that contain all of your customers. But if you are sending to multiple customers from a shared topic and one customer's REST API is down, instead of delaying the process entirely, you can divert that customer's events into a dead letter queue. You can then process them later from that queue.

5. Understand Compacted Topics
By default in Kafka topics, data is kept by time. But there is also another type of topic, a compacted topic, which stores data by key and replaces old data with new data as it comes in. This is particularly useful for working with data that is updateable, for example, data that may be coming in through a change-data-capture log. A practical example of this would be a retailer that needs to update prices and product descriptions to send out to all of its locations.

6. Imagine New Use Cases Enabled by Kafka's Recent Evolution
The biggest recent change in Kafka's history is its migration to the cloud. By using Kafka there, you can reserve your engineering talent for business logic. The unlimited storage enabled by the cloud also means that you can truly keep data forever at reasonable cost, and thus you don't have to build a separate system for your historical data needs.

EPISODE LINKS
Kafka Internals 101
Watch in video

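Recommendations 4 and 5 above can be illustrated in plain Python, without a Kafka client. This is a minimal sketch under stated assumptions: events are modeled as `(key, value)` tuples, `fake_send` is a hypothetical stand-in for a customer's REST API, and `compact` models only the end result of log compaction (latest value per key), not how or when a broker actually performs it.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (key, value); the key is a customer or product id


def deliver_with_dlq(
    events: Iterable[Event],
    send: Callable[[str, str], None],
) -> List[Event]:
    """Try to deliver every event; collect failures instead of blocking."""
    dead_letters: List[Event] = []
    for key, value in events:
        try:
            send(key, value)
        except ConnectionError:
            dead_letters.append((key, value))  # park it and keep going
    return dead_letters


def compact(log: Iterable[Event]) -> Dict[str, str]:
    """Keep only the most recent value seen for each key."""
    latest: Dict[str, str] = {}
    for key, value in log:
        latest[key] = value  # newer records overwrite older ones
    return latest


delivered: List[Event] = []


def fake_send(customer: str, payload: str) -> None:
    # Hypothetical endpoint: the customer "acme" is currently unreachable.
    if customer == "acme":
        raise ConnectionError("acme endpoint unavailable")
    delivered.append((customer, payload))


shared_topic = [("acme", "order-1"), ("globex", "order-2"), ("acme", "order-3")]
dlq = deliver_with_dlq(shared_topic, fake_send)
print(dlq)        # the acme events, parked for later processing
print(delivered)  # the globex event was not held up

prices = [("sku-1", "9.99"), ("sku-2", "4.50"), ("sku-1", "8.99")]
print(compact(prices))  # sku-1 keeps only its latest price
```

In a real deployment the dead letter queue would typically be another topic rather than an in-memory list, so that the diverted events survive restarts.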
Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, giving users insights into the future valuations of raw materials. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.

Ralph explains that Forecasty.ai was initially built on top of batch processing; however, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability: progressing from 60 commodities on offer to their eventual plan of over 200 commodities. Ralph observed that most real-time data platforms are streaming-based and rely on stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing requires resources, such as teams of stream processing specialists.

With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They keep strictly to out-of-the-box components, such as Kafka topics, the Kafka Producer API, the Kafka Consumer API, and Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.

Additionally, Ralph shares the tool he built to handle historical data, kash.py, a Kafka shell based on Python. He also discusses issues the platform needed to overcome to succeed, and how they can make the migration from batch processing to stream processing painless for the data science team.

EPISODE LINKS
Kafka Streams 101 course
The Difference Engine for Unlocking the Kafka Black Box
GitHub repo: kash.py

Java Virtual Machines (JVMs) impact Apache Kafka® performance in production. How can you optimize your event-streaming architectures so they process more Kafka messages using the same number of JVMs? Gil Tene (CTO and Co-Founder, Azul) delves into JVM internals and how developers and architects can use Java and optimized JVMs to make real-time data pipelines more performant and more cost effective, with use cases.

Gil has deep roots in Java optimization, having started out building large data centers for parallel processing, where the goal was to get a finite set of hardware to run the largest possible number of JVMs. As the industry evolved, Gil switched his primary focus to software, and over the years he has gained particular expertise in garbage collection (the C4 collector) and JIT compilation. Zulu, the OpenJDK distribution released by Gil's company Azul, is widely used throughout the Java world, although Azul's Prime build can run Kafka up to 40 percent faster than the open version on identical hardware.

Gil relates that improvements in JVMs don't come in a single stroke or in one day, but are rather the result of many smaller incremental optimizations over time, i.e., "half-percent" improvements that accumulate. Improving a JVM starts with a good engineering team, one that has thought significantly about how to make JVMs better. The team must continuously monitor metrics, and Gil mentions that his team tests optimizations against 400-500 different workloads (one of his favorite things to get into the lab is a new customer's workload). The quality of a JVM can be measured by its response times, the consistency of those response times including outliers, and the level and number of machines needed to run it. A balance between performance and cost efficiency is usually a sweet spot for customers.

Throughout the podcast, Gil goes into depth on optimization in theory and practice, as well as Azul's use of JIT compilers, which play a key role in improving JVMs. There are always tradeoffs when using them: you want a JIT compiler to strike a balance between the work expended optimizing and the benefits that come from that work. Gil also mentions a new innovation Azul has been working on that moves JIT compilation to the cloud, where it can be applied to numerous JVMs simultaneously.

EPISODE LINKS
A Guide on Increasing Kafka Event Streaming Performance
Better Kafka Performance Without Changing Any Code

Apache Kafka® 3.3 is released! After over two years of development, KIP-833 marks KRaft as production ready for new AK 3.3 clusters only. On behalf of the Kafka community, Danica Fine (Senior Developer Advocate, Confluent) shares highlights of this release, with KIPs from Kafka Core, Kafka Streams, and Kafka Connect.

To reduce request overhead and simplify client-side code, KIP-709 extends OffsetFetch API requests to accept multiple consumer group IDs. This update consists of three changes: extending the wire protocol, changing response handling, and enhancing the AdminClient to use the new protocol.

Log recovery is an important process that is triggered whenever a broker starts up after an unclean shutdown. Since there was no way to know the log recovery progress other than checking if the broker log is busy, KIP-831 adds metrics for the log recovery progress, `RemainingLogsToRecover` and `RemainingSegmentsToRecover`, for each recovery thread. These metrics allow the admin to monitor the progress of the log recovery.

Additional updates to Kafka Core include:
KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft
KIP-835: Monitor KRaft Controller Quorum Health
KIP-859: Add metadata log processing error-related metrics

For Kafka Streams, KIP-834 adds the ability to pause and resume topologies. This feature lets you reduce resource usage when processing is not required, when modifying the logic of Kafka Streams applications, or when responding to operational issues. KIP-820 extends KStream process with the new Processor API.

Previously, KIP-98 added support for exactly-once delivery guarantees with Kafka and its Java clients. In the AK 3.3 release, KIP-618 brings exactly-once semantics support to Kafka Connect source connectors. To accomplish this, a number of new connector- and worker-level configurations have been introduced, including `exactly.once.source.support`, `transaction.boundary`, and more.

Image attribution: Apache ZooKeeper™: https://zookeeper.apache.org/ and Raft logo: https://raft.github.io/

EPISODE LINKS
See release notes for Apache Kafka 3.3.0 and Apache Kafka 3.3.1 for the full list of changes
Read the blog to learn more
Download Apache Kafka 3.3 and get started

How do you set data applications in motion by running stateful business logic on streaming data? According to Fred Patton (Developer Evangelist, Swim Inc.), capturing key stream processing events and cumulative statistics that necessitate real-time data assessment, migration, and visualization remains a gap for event-driven systems and stream processing frameworks. In this episode, Fred explains streaming applications and how they contrast with stream processing applications. Fred and Kris also discuss how you can use Apache Kafka® and Swim to build a real-time UI for streaming data.

Swim's technology facilitates relationships between streaming data from distributed sources and complex UIs, managing backpressure cumulatively, so that front ends don't get overwhelmed. They are focused on real-time, actionable insights, as opposed to those derived from historical data. Fred compares Swim's functionality to the speed layer in the Lambda architecture model, which is specifically concerned with serving real-time views. For this reason, when sending your data to Swim, it is common to also send a copy to a data warehouse that you control.

A web agent, a data entity in the Swim ecosystem, can be as small as a single cellphone or as large as a whole cellular network. Web agents communicate with one another as well as with their subscribers, and each one is a URI that can be called by a browser or the command line. Swim has been designed to instantaneously accommodate requests at widely varying levels of granularity, each of which demands a completely different volume of data. Thus, as you drill down, for example, from a city view on a map into a neighborhood view, the Swim system figures out which web agent is responsible for the view you are requesting, as well as the other web agents needed to show it.

Fred also shares an example where they work with a telephony company that requires real-time statuses for a network infrastructure with thousands of cell towers servicing millions of devices, along with a use case for a transportation company needing to transform raw edge data into actionable insights for its connected vehicle customers.

Future plans for Swim include porting more functionality to the cloud, which will enable additional automation, so that, for example, a customer just has to provide database and Kafka cluster connections, and Swim can automatically build out infrastructure.

EPISODE LINKS
Swim Cellular Network Simulator
Continuous Intelligence - Streaming Apps That Are Always in Sync
Using Swim with Apache Kafka
Swim Developer

What’s your favorite podcast? Would you like to find some new ones? In celebration of International Podcast Day, Kris Jenkins invites 12 experts from the Apache Kafka® community to talk about their favorite podcasts. Unlike other episodes where guests educate developers and tell stories about Kafka, its surrounding technological ecosystem, or the Cloud, this special episode provides a glimpse into what these guests have learned through listening to podcasts that you might also find interesting.

Through a virtual international tour, Kris chatted with Bill Bejeck (Integration Architect, Confluent), Nikoleta Verbeck (Senior Solutions Engineer, CSID, Confluent), Ben Stopford (Lead Technologist, OCTO, Confluent), Noelle Gallagher (Video Producer, Editor), Danica Fine (Senior Developer Advocate, Confluent), Tim Berglund (VP, Developer Relations, StarTree), Ben Ford (Founder and CEO, Commando Development), Jeff Bean (Group Manager, Technical Marketing, Confluent), Domenico Fioravanti (Director of Engineering, Therapie Clinic), Francesco Tisiot (Senior Developer Advocate, Aiven), Robin Moffatt (Principal, Developer Advocate, Confluent), and Simon Aubury (Principal Data Engineer, Thoughtworks). They share recommendations covering a wide range of topics such as building distributed systems, travel, data engineering, Greek mythology, data mesh, economics, and music and the arts.

EPISODE LINKS
Common Apache Kafka Mistakes to Avoid
Flink vs Kafka Streams/ksqlDB
Why Data Mesh ft. Ben Stopford
Practical Data Pipeline ft. Danica Fine
What Could Go Wrong with a Kafka JDBC Connector?
Intro to Kafka Connect: Core Components and Architecture ft. Robin Moffatt
Serverless Stream Processing with Apache Kafka ft. Bill Bejeck
Scaling an Apache Kafka-Based Architecture at Therapie Clinic
Event-Driven Systems and Agile Operations
Real-Time Stream Processing, Monitoring, and Analytics with Apache Kafka

How do you build an event-driven application that can react to real-time data streams as they happen? Kris Jenkins (Senior Developer Advocate, Confluent) will be hosting another fun, hands-on programming workshop, Coding in Motion: Watching the River Flow, to demonstrate how you can build a reactive event streaming application with Apache Kafka®, ksqlDB, and Python.

As a developer advocate, Kris often speaks at conferences, and his presentations are then available on demand through the organizer’s YouTube channel. The desire to read comments and be able to interact with the community motivated Kris to set up a real-time event streaming application that would notify him on his mobile phone.

During the workshop, Kris will demonstrate the end-to-end process of using Python to process and stream data from YouTube’s REST API into a Kafka topic, analyze the data with ksqlDB, and then stream data out via Telegram. After the workshop, you’ll be able to use the recipe to build your own event-driven data application.

EPISODE LINKS
Coding in Motion: Building a Reactive Data Streaming App

Processing real-time event streams enables countless use cases big and small. With a day job designing and building highly available distributed data systems, Simon Aubury (Principal Data Engineer, Thoughtworks) believes stream-processing thinking can be applied to any stream of events.

In this episode, Simon shares his Confluent Hackathon ’22 winning project: a wildlife monitoring system to observe population trends over time using a Raspberry Pi, along with Apache Kafka®, Kafka Connect, ksqlDB, TensorFlow Lite, and Kibana. He used the system to count animals in his Australian backyard and perform trend analysis on the results. Simon also shares ideas on how you can use these same technologies to help with other real-world challenges.

Open-source object detection models for TensorFlow, which appropriately are collected into "model zoos," meant that Simon didn't have to provide his own object identification as part of the project, which would have made it untenable. Instead, he was able to utilize the open-source models, which are essentially neural nets pretrained on relevant data sets (in his case, backyard animals).

Simon's system, which consists of around 200 lines of code, employs a Kafka producer running a while loop, which connects to a camera feed using a Python library. For each frame brought down, object masking is applied in order to crop and reduce pixel density, and then the frame is compared to the models mentioned above. A Python dictionary containing probable found objects is sent to a Kafka broker for processing; the images themselves aren't sent. (Note that Simon's system is also capable of alerting if a specific, rare animal is detected.)

On the broker, Simon uses ksqlDB and windowing to smooth the data in case the frames were inconsistent for some reason (it may look back over thirty seconds, for example, and find the highest number of animals per type). Finally, the data is sent to a Kibana dashboard for analysis, through a Kafka Connect sink connector.

Simon’s system is an extremely low-cost system that can simulate the behaviors of more expensive, proprietary systems. And the concepts can easily be applied to many other use cases. For example, you could use it to estimate traffic at a shopping mall to gauge optimal opening hours, or you could use it to monitor the queue at a coffee shop, counting both queued patrons and impatient patrons who decide to leave because the queue is too long.

EPISODE LINKS
Real-Time Wildlife Monitoring with Apache Kafka
Wildlife Monitoring GitHub
ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work Together
Event-Driven Architecture - Common Mistakes and Valuable Lessons
Motion in Motion: Building an End-to-End Motion Detection and Alerting System with Apache Kafka and ksqlDB

How do you analyze Reddit sentiment with Apache Kafka® and microservices? Bringing the fresh perspective of someone who is both new to Kafka and the industry, Shufan Liu, nascent Developer Advocate at Confluent, discusses projects he has worked on during his summer internship—a Cluster Linking extension to a conceptual data pipeline project, and a microservice-based Reddit sentiment-analysis project. Shufan demonstrates that it’s possible to quickly get up to speed with the tools in the Kafka ecosystem and to start building something productive early on in your journey.Shufan's Cluster Linking project extends a demo by Danica Fine (Senior Developer Advocate, Confluent) that uses a Kafka-based data pipeline to address the challenge of automatic houseplant watering. He discusses his contribution to the project and shares details in his blog—Data Enrichment in Existing Data Pipelines Using Confluent Cloud.The second project Shufan presents is a sentiment analysis system that gathers data from a given subreddit, then assigns the data a sentiment score. He points out that its results would be hard to duplicate manually by simply reading through a subreddit—you really need the assistance of AI. 
The project consists of four microservices: (1) a user input service that collects requests in a Kafka topic, each consisting of the desired subreddit along with the dates between which data should be collected; (2) an API polling service that fetches the requests from the user input service, collects the relevant data from the Reddit API, then appends it to a new topic; (3) a sentiment analysis service that analyzes the appended topic from the API polling service using the Python library NLTK, and calculates averages with ksqlDB; and (4) a results-displaying service that consumes from a topic with the calculations. Interesting subreddits that Shufan has analyzed for sentiment include gaming forums before and after key releases; crypto and stock trading forums at various meaningful points in time; and sports-related forums both before the season and several games into it. EPISODE LINKS: Data Enrichment in Existing Data Pipelines Using Confluent Cloud
How do you plan Apache Kafka® capacity and Kafka Streams sizing for optimal performance? When Jason Bell (Principal Engineer, Dataworks and founder of Synthetica Data), begins to plan a Kafka cluster, he starts with a deep inspection of the customer's data itself—determining its volume as well as its contents: Is it JSON, straight pieces of text, or images? He then determines if Kafka is a good fit for the project overall, a decision he bases on volume, the desired architecture, as well as potential cost.Next, the cluster is conceived in terms of some rule-of-thumb numbers. For example, Jason's minimum number of brokers for a cluster is three or four. This means he has a leader, a follower and at least one backup.  A ZooKeeper quorum is also a set of three. For other elements, he works with pairs, an active and a standby—this applies to Kafka Connect and Schema Registry. Finally, there's Prometheus monitoring and Grafana alerting to add. Jason points out that these numbers are different for multi-data-center architectures.Jason never assumes that everyone knows how Kafka works, because some software teams include specialists working on a producer or a consumer, who don't work directly with Kafka itself. They may not know how to adequately measure their Kafka volume themselves, so he often begins the collaborative process of graphing message volumes. He considers, for example, how many messages there are daily, and whether there is a peak time. Each industry is different, with some focusing on daily batch data (banking), and others fielding incredible amounts of continuous data (IoT data streaming from cars).  Extensive testing is necessary to ensure that the data patterns are adequately accommodated. Jason sets up a short-lived system that is identical to the main system. He finds that teams usually have not adequately tested across domain boundaries or the network. 
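The message-volume questions above (how many messages per day, and is there a peak?) translate into simple back-of-the-envelope arithmetic. The sketch below is one possible way to do that calculation, not Jason's own method. The peak factor, the replication factor of 3, and the sample figures are assumptions chosen for illustration, and the estimate ignores compression, retention settings, and consumer fan-out.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_throughput(messages_per_day, avg_message_bytes,
                        peak_factor=1.0, replication_factor=3):
    """Rough sizing figures derived from daily volume (all inputs assumed)."""
    avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
    # Peak rate expressed as a multiple of the daily average.
    peak_msgs_per_sec = avg_msgs_per_sec * peak_factor
    # Bytes per second written by producers at peak, before replication.
    peak_ingress_bytes_per_sec = peak_msgs_per_sec * avg_message_bytes
    # Each message is stored once per replica.
    daily_storage_bytes = messages_per_day * avg_message_bytes * replication_factor
    return {
        "avg_msgs_per_sec": avg_msgs_per_sec,
        "peak_msgs_per_sec": peak_msgs_per_sec,
        "peak_ingress_bytes_per_sec": peak_ingress_bytes_per_sec,
        "daily_storage_bytes": daily_storage_bytes,
    }


# Hypothetical workload: 86.4 million 1 KB messages per day, 5x peak.
estimate = estimate_throughput(86_400_000, 1_000, peak_factor=5)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

With these sample inputs the average is 1,000 messages per second, while the peak network load is five times that, which is why counting messages alone tends to understate the traffic a cluster must carry.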
Developers tend to think in terms of numbers of messages, but not in terms of overall network traffic or how many consumers they'll actually need. Latency must also be considered: for example, if the compression on the producer's side doesn't match the compression on the consumer's side, latency will increase. Kafka Connect sink connectors require special consideration when Jason is establishing a cluster. Failure strategies need to be well thought out, including retries and how to deal with the potentially large number of messages that can accumulate in a dead letter queue. He suggests that more attention should generally be paid to the Kafka Connect elements of a cluster, something that can actually be addressed with bash scripts. Finally, Kris and Jason cover his preference for Kafka Streams over ksqlDB from a network perspective. EPISODE LINKS: Capacity Planning and Sizing for Kafka Streams; Tales from the Frontline of Apache Kafka DevOps
Reimagining a data architecture to provide real-time data flow for sporting events can be complicated, especially for organizations with as much data as World Table Tennis (WTT). Vatsan Rama (Director of IT, ITTF Group) shares why real-time data is essential in the sporting world and how his team reengineered their data system in 18 months, moving from a solely on-premises infrastructure to a cloud-native data system that uses Confluent Cloud with Apache Kafka® as its central nervous system. World Table Tennis is a business created by the International Table Tennis Federation (ITTF) to manage the official professional table tennis series of events and its commercial rights. World Table Tennis is also leading the sport's digital transformation and commercializes its software application for real-time event scoring worldwide. Previously, ITTF scoring was processed manually with a desktop-based on-venue results (OVR) system, an on-premises solution that processed match data, calculated rankings and records, and then sent event information to other systems, such as scoreboards. To provide match status in real time, which makes the sport more engaging for fans and adds a competitive edge for players, Vatsan reengineered the OVR system to allow instant data sync between on-premises competition systems and the cloud. The redesign started by establishing an event-driven architecture with Kafka that consolidates all legacy data sources, including records in Excel along with some handwritten forms (some dating back 90 years, even including records from the 1930 World Championship). To reduce operational overhead and maintenance, the team decided to stream data through fully managed Kafka as a service on Azure, for a scalable, distributed infrastructure.
Vatsan shares that multiple table tennis events can run in parallel globally, and every time an umpire marks scores in a table, the data moves from the venue into Confluent Cloud, and then the score and rankings are sent to betting organizations and individuals on their mobile apps. EPISODE LINKSEvent Processing ApplicationFully Managed Apache Kafka on AzureWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)  
Inheriting software in the banking sector can be challenging. Perhaps the only thing harder is inheriting software built by a committee of banks. How do you keep it running, while improving it, refactoring it, and planning a bigger future for it? In this episode, Jean-Francois Garet (Technical Architect, Symphony) shares his experience at Symphony as he helps it evolve from an inherited, monolithic, single-tenant architecture to an event mesh for seamless event-streaming microservices. He talks about the journey they’ve taken so far, and the foundations they’ve laid for a modern data mesh.Symphony is the leading markets’ infrastructure and technology platform, which provides a full communication stack (chat, voice and video meetings, file and screen sharing) for the financial industry. Jean-Francois shares that its initial system was inherited from one of the founding institutions—and features the highest level of security to ensure confidentiality of business conversations, coupled with compliance with regulations covering financial transactions. However, its stacks are monolithic and single tenant. To modernize Symphony's architecture for real-time data, Jean-Francois and team have been exploring various approaches over the last four years. They started breaking down the monolith into microservices, and also made a move towards multitenancy by setting up an event mesh. However, they experienced a mix of success and failure in both attempts. To continue the evolution of the system, while maintaining business deliveries, the team started to focus on event streaming for asynchronous communications, as well as connecting the microservices for real-time data exchange. As they had prior Apache Kafka® usage in the company, the team decided to go with managed Kafka on the cloud as their streaming platform. 
The team has a set of principles in mind for the development of their event-streaming functionality: isolate product domains; reach eventual consistency with event streaming; define clear contracts for the event streams, for both producers and consumers; and support multiregion and global data sharing. Jean-Francois shares that data mesh is ultimately what they are hoping to achieve with their platform: to provide governance around data and make data available as a product for self service. As of now, though, their focus is achieving real-time event streams with event mesh. EPISODE LINKS: The Definitive Guide to Building a Data Mesh with Event Streams; Data Mesh 101; What is Data Mesh? ft. Zhamak Dehghani; Data Mesh Architecture
Security is a primary consideration for any system design, and Apache Kafka® is no exception. Out of the box, Kafka has relatively little security enabled. Rajini Sivaram (Principal Engineer, Confluent, and co-author of “Kafka: The Definitive Guide”) discusses how Kafka has gone from a system that included no security to providing an extensible and flexible platform for any business to build a secure messaging system. She shares considerations, important best practices, and features Kafka provides to help you design a secure modern data streaming system. In order to build a secure Kafka installation, you need to securely authenticate your users. Whether you are using Kerberos (SASL/GSSAPI), SASL/PLAIN, SCRAM, or OAuth, verifying that your users can authenticate, and that non-users can’t, is a primary requirement for any connected system. But authentication is only one part of the security story. We also need to address other areas. Kafka added support for fine-grained access control using ACLs with a pluggable authorizer several years ago. Over time, this was extended to support prefixed ACLs to make ACLs more manageable in large organizations. Now on its second-generation authorizer, Kafka is easily extendable to support other forms of authorization, like integrating with a corporate LDAP server to provide group or role-based access control. Even if you’ve set up your system to use secure authentication and each user is authorized using a series of ACLs, if the data is viewable by anyone listening, how secure is your system? That’s where encryption comes in. Using TLS, Kafka can encrypt your data in transit. Security has gone from a nice-to-have to being a requirement of any modern-day system. Kafka has followed a similar path from zero security to having a flexible and extensible system that helps companies of any size pick the right security path for them.
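On the client side, authentication and encryption in transit usually come down to a handful of configuration properties. The sketch below is an illustration rather than anything shown in the episode: it builds a configuration dictionary using librdkafka-style property names (the kind accepted by the `confluent-kafka` Python client), assuming SCRAM over TLS. The broker address, credentials, and CA path are placeholders.

```python
ALLOWED_MECHANISMS = {"GSSAPI", "PLAIN", "SCRAM-SHA-256", "SCRAM-SHA-512", "OAUTHBEARER"}


def build_client_config(bootstrap_servers, username, password,
                        mechanism="SCRAM-SHA-512", ca_location=None):
    """Return a client config dict that authenticates with SASL over TLS."""
    if mechanism not in ALLOWED_MECHANISMS:
        raise ValueError(f"unsupported SASL mechanism: {mechanism}")
    config = {
        "bootstrap.servers": bootstrap_servers,
        # SASL_SSL = SASL authentication carried over a TLS-encrypted connection.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": mechanism,
        "sasl.username": username,
        "sasl.password": password,
    }
    if ca_location is not None:
        # CA certificate used to verify the broker's TLS certificate.
        config["ssl.ca.location"] = ca_location
    return config


# Placeholder values; in practice these would come from a secrets store.
config = build_client_config(
    "broker.example.com:9093", "svc-orders", "change-me",
    ca_location="/etc/kafka/ca.pem",
)
print({k: ("***" if k == "sasl.password" else v) for k, v in config.items()})
```

Authorization is not part of this dictionary: ACLs are defined on the broker side and decide what the authenticated principal (here the hypothetical `svc-orders`) may read or write.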
Be sure to also check out the newest Apache Kafka Security course on Confluent Developer for an in-depth explanation along with other recommendations. EPISODE LINKSAn Introduction to Apache Kafka Security: Securing Real-Time Data StreamsKafka Security courseKafka: The Definitive Guide v2Security OverviewWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)   
Java Database Connectivity (JDBC) is the Java API used to connect to a database. Because the JDBC connector is one of the most popular Kafka connectors, it's important to prevent issues with your integrations. In this episode, we'll cover how a JDBC connection works, and common issues with your database connection. Why the Kafka JDBC Connector? When it comes to streaming database events into Apache Kafka®, the JDBC connector is usually the first choice, thanks to its flexibility and its ability to support a wide variety of databases without requiring custom code. As an experienced data analyst, Francesco Tisiot (Senior Developer Advocate, Aiven) delves into his experience of building streaming Kafka data pipelines with the JDBC source connector and explains what could go wrong. He discusses alternative options available to avoid these problems, including the Debezium source connector for real-time change data capture. The JDBC connector is a Kafka Connect connector built on the JDBC API, which streams data between databases and Kafka. If you want to stream data from a relational database into Kafka once per day or every two hours, the JDBC connector is a simple, batch-processing connector to use. You can tell the JDBC connector which query you’d like to execute against the database, and the connector will then load the results into Kafka. The connector works well out of the box with basic data types. However, database-specific data types, such as geometrical columns and array columns in PostgreSQL, are not represented well by the JDBC connector. You might not get any results in Kafka at all, because the column type is not supported by the connector.
Francesco shares other cases that would cause the JDBC connector to go wrong, such as infrequent snapshot times, out-of-order events, non-incremental sequences, and hard deletes. To help avoid these problems and set up a reliable source of events for your real-time streaming pipeline, Francesco suggests other approaches, such as the Debezium source connector for real-time change data capture. The Debezium connector offers enhanced metadata, timestamps of each operation, and access to all logs, and it provides sequence numbers, so you can speak the language of a DBA. They also talk about the governance tool that Francesco has been building, and how streaming Game of Thrones sentiment analysis with Kafka led to his current role as a developer advocate. EPISODE LINKS: Kafka Connect Deep Dive – JDBC Source Connector; JDBC Source Connector: What could go wrong?; Metadata parser; Debezium Documentation; Database Migration with Apache Kafka and Apache Kafka Connect; Francesco Tisiot’s Twitter
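Some of the failure cases listed above follow directly from how query-based polling works. The sketch below is a simplified Python model, not the connector's actual code: it assumes a poller that only asks for rows whose `id` is greater than the last one it saw, and shows that an update and a hard delete made between two polls never reach the consumer. The table contents and the function name `poll_incrementing` are hypothetical.

```python
def poll_incrementing(table, last_seen_id):
    """Return copies of rows with id > last_seen_id, plus the new offset."""
    new_rows = sorted(
        (dict(row) for row in table.values() if row["id"] > last_seen_id),
        key=lambda row: row["id"],
    )
    new_offset = new_rows[-1]["id"] if new_rows else last_seen_id
    return new_rows, new_offset


# In-memory stand-in for a database table, keyed by primary key.
table = {
    1: {"id": 1, "name": "Jon"},
    2: {"id": 2, "name": "Arya"},
}
first_poll, offset = poll_incrementing(table, last_seen_id=0)

# Changes made between two polls:
table[1]["name"] = "Aegon"              # update: id 1 is already below the offset
del table[2]                            # hard delete: no row is left to query
table[3] = {"id": 3, "name": "Sansa"}   # insert: the only change the poll can see

second_poll, offset = poll_incrementing(table, last_seen_id=offset)
print("first poll ids:", [row["id"] for row in first_poll])
print("second poll ids:", [row["id"] for row in second_poll])
```

Only the insert shows up in the second poll. A log-based change data capture connector such as Debezium reads the database's change log instead, which is why it can also report updates and deletes.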
Setting up reliable cloud networking for your Apache Kafka® infrastructure can be complex. There are many factors to consider: cost, security, scalability, and availability. With extensive experience building cloud-native Kafka solutions on Confluent Cloud, Justin Lee (Principal Solutions Engineer, Enterprise Solutions Engineering, Confluent) and Dennis Wittekind (Customer Success Technical Architect, Customer Success Engineering, Confluent) talk about the different networking options on Confluent Cloud, including AWS Transit Gateway and AWS and Azure Private Link, and discuss when and why you might choose one over the other. In order to build a secure cloud-native Kafka network, you need to consider information security and compliance requirements. These requirements may vary depending on your industry, location, and regulatory environment. For example, in financial organizations, transaction data or personally identifiable information (PII) may not be allowed to be accessible over the internet. In this case, your network architecture may require private networking, which means you have to choose between private endpoints or a peering connection between your infrastructure and your Kafka clusters in the cloud. What are the differences between different networking solutions? Dennis and Justin talk about some of the benefits and drawbacks of different network architectures. For example, Transit Gateways offered by AWS are often a good fit for organizations with large, disparate network architectures, while Private Link is sometimes preferred for its security benefits. They also discuss the management overhead involved in administering different network architectures. Dennis and Justin also highlight their recently launched course on Confluent Developer, the Confluent Cloud Networking course.
This hands-on course covers basic networking and cloud computing concepts that will offer support for you to get a clearer picture of the configurations and collaborate with the networking teams.EPISODE LINKSCloud Networking courseManage NetworkingWatch the video version of this podcastKris Jenkins’ TwitterStreaming Audio Playlist Join the Confluent CommunityLearn more with Kafka tutorials, resources, and guides at Confluent DeveloperLive demo: Intro to Event-Driven Microservices with ConfluentUse PODCAST100 to get an additional $100 of free Confluent Cloud usage (details) 