The Backend Engineering Show with Hussein Nasser
Author: Hussein Nasser
© Hussein Nasser
Description
Welcome to the Backend Engineering Show podcast with your host Hussein Nasser. If you like software engineering you’ve come to the right place. I discuss all sorts of software engineering technologies and news with specific focus on the backend. All opinions are my own.
Most of my content in the podcast is an audio version of videos I post on my youtube channel here http://www.youtube.com/c/HusseinNasser-software-engineering
Buy me a coffee
https://www.buymeacoffee.com/hnasr
🧑🏫 Courses I Teach
https://husseinnasser.com/courses
524 Episodes
Fundamentals of Operating Systems Course
https://oscourse.win
Very clever! We often call the read/recv system calls to read requests from a connection. This copies data from the kernel receive buffer to user space, which has a cost.
This new patch changes this to allow zero copy with notification.
“'Reading' data out of a socket instead becomes a 'notification' mechanism, where the kernel tells userspace where the data is.”
This kernel patch enables zero copy from the receive queue.
https://lore.kernel.org/io-uring/ZwW7_cRr_UpbEC-X@LQ3V64L9R2/T/
0:00 Intro
1:30 patch summary
7:00 Normal Connection Read (Kernel Copy)
12:40 Zero copy Read
15:30 Performance
Cloudflare built a global cache purge system that runs under 150 ms.
This is how they did it.
Using RocksDB to maintain the local CDN cache, a peer-to-peer distributed system across data centers, and clever engineering, they went from a 1.5-second purge down to 150 ms.
However, this isn’t the full picture, because that 150 ms is actually just the P50. In this video I explore how Cloudflare's CDN works, and how the old core-based, centralized Quicksilver lazy purge compares to the new coreless, decentralized active purge. I explore the pros and cons of both systems and give you my thoughts on the new system.
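To make the P50 point concrete, here is a minimal sketch of a nearest-rank percentile over a list of latencies. The sample values are invented for illustration and are not Cloudflare's numbers; the `percentile` helper is a hypothetical name.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical purge latencies in milliseconds (made up for this example).
latencies_ms = [90, 110, 120, 140, 150, 160, 200, 400, 900, 1500]

for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

With this made-up data the P50 is 150 ms while the P99 is 1500 ms, which is why a median alone says little about the slowest purges.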
0:00 Intro
4:25 From Core Base Lazy Purge to Coreless Active
12:50 CDN Basics
16:00 TTL Freshness
17:50 Purge
20:00 Core-Based Purge
24:00 Flexible Purges
26:36 Lazy Purge
30:00 Old Purge System Limitations
36:00 Coreless / Active Purge
39:00 LSM vs BTree
45:30 LSM Performance issues
48:00 How Active Purge Works
50:30 My thoughts about the new system
58:30 Summary
Cloudflare blog
https://blog.cloudflare.com/instant-purge/
Mentioned Videos
Percentile Tail Latency Explained (95%, 99%) Monitor Backend performance with this metric
https://www.youtube.com/watch?v=3JdQOExKtUY
How Discord Stores Trillions of Messages | Deep Dive
https://www.youtube.com/watch?v=xynXjChKkJc
Fundamentals of Operating Systems Course
https://os.husseinnasser.com
Backend Troubleshooting Course
https://performance.husseinnasser.com
Fundamentals of Database Engineering udemy course https://databases.win
MySQL has had a bumpy journey since 2018 with the release of version 8.0: critical crashes that made it into the final product, significant performance regressions, and tons of stability issues and bugs. In this video I explore what happened to MySQL. Are these issues getting fixed? And what is the current state of MySQL at the end of 2024?
0:00 Intro
2:00 MySQL 8.0 vs 5.7 Performance
11:00 Critical Crash in 8.0.38, 8.4.1 and 9.0.0
15:40 Is 8.4 better than 8.0.36?
16:30 More Features = More Bugs
22:30 Summary and my thoughts
resources
https://x.com/MarkCallaghanDB/status/1786428909376164263
https://www.percona.com/blog/do-not-upgrade-to-any-version-of-mysql-after-8-0-37/
http://smalldatum.blogspot.com/2024/09/mysql-innodb-vs-sysbench-on-large-server.html
https://www.percona.com/blog/mysql-8-0-vs-5-7-are-the-newer-versions-more-problematic/
Fundamentals of Operating Systems Course
https://oscourse.win
In this video I use strace, a tool that traces system calls, to measure how many system calls a process makes. We compare a simple task of reading from a file, and we run the program in different runtimes, namely NodeJS, Bun, Python and native C.
We discuss the cost of kernel mode switches, system calls and pe
0:00 Intro
5:00 Code Explanation
6:30 Python
9:30 NodeJS
12:30 BunJS
13:12 C
16:00 Summary
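As a rough illustration of why the number of system calls varies for the same file, here is a small Python sketch that reads one file with different chunk sizes and counts the `os.read` calls it issues. It only counts calls made by this code; it is not a replacement for strace and does not reproduce the programs from the video. The `count_reads` helper and the 64 KiB sample file are assumptions for the example.

```python
import os
import tempfile

def count_reads(path, chunk_size):
    """Read the whole file with os.read and return how many read calls were issued."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            data = os.read(fd, chunk_size)
            calls += 1
            if not data:  # empty bytes means end of file
                break
    finally:
        os.close(fd)
    return calls

# Create a 64 KiB sample file (size chosen arbitrarily for the example).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 64 * 1024)
    sample_path = f.name

small = count_reads(sample_path, 512)
large = count_reads(sample_path, 64 * 1024)
os.remove(sample_path)
print(f"512-byte chunks: {small} reads, 64 KiB chunks: {large} reads")
```

Each call is a switch into kernel mode, so a runtime that picks a larger buffer does the same job with far fewer switches.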
Fundamentals of Operating Systems Course
https://os.husseinnasser.com
When do you use threads?
I would say in scenarios where the task is either
1) IO blocking task
2) CPU heavy
3) Large volume of small tasks
In any of the cases above, it is favorable to offload the task to a thread.
1) IO blocking task
When you read from or write to disk, depending on how you do it and the kernel interface you used, the write might be blocking. This means the process that executes the IO will not be allowed to execute any more code until the write/read completes.
That is why most logging operations are done on a secondary thread (as with libuv, which Node uses); this way the secondary thread is blocked but the main process/thread can continue its work.
If you can do file reads/writes asynchronously with, say, io_uring, then you technically don't need threading.
Now notice how I said file IO, because it is different from socket IO, which is always done asynchronously with epoll/select etc.
2) CPU heavy
The second use case is when the task requires lots of CPU time, which then starves/blocks the rest of the process from doing its normal job. Offloading that task to a thread, so that it runs on a different core, allows the main process to continue running on its original core.
3) Large volume of small tasks
The third use case is when you have a large number of small tasks and a single process can't deliver enough throughput. An example would be accepting connections: a single process can only accept connections so fast. To increase throughput when you have a massive number of clients connecting, you would spin up multiple threads to accept those connections and, of course, read and process requests. Perhaps you would also enable port reuse so that you avoid accept mutex locking.
Keep in mind threads come with challenges and problems, so avoid them when they are not required.
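The IO blocking case can be sketched in Python as follows. This is a minimal illustration, assuming `time.sleep` as a stand-in for a blocking read or write; `blocking_io` and `run_offloaded` are hypothetical names, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_io(task_id, delay=0.1):
    """Stand-in for a blocking read/write; time.sleep blocks only the calling thread."""
    time.sleep(delay)
    return task_id

def run_offloaded(n_tasks=4):
    """Submit blocking tasks to worker threads while the main thread stays free."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(blocking_io, i) for i in range(n_tasks)]
        # The main thread is not blocked here and could keep doing other work
        # until it decides to collect the results.
        results = [f.result() for f in futures]
    return results, time.monotonic() - start

results, elapsed = run_offloaded()
print(results, f"{elapsed:.2f}s")
```

The four waits overlap, so the total time is close to one delay rather than four. Note this sketch covers only the blocking case; it says nothing about CPU heavy work.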
0:00 Intro
1:40 What are threads?
7:10 IO blocking Tasks
17:30 CPU Intensive Tasks
22:00 Large volume of small tasks
I am fascinated by how timeouts affect backend and frontend programming.
When a party is waiting on something you can place a timeout to break the wait. This is useful for freeing resources for more critical processes, detecting slow operations and even avoiding DoS attacks.
Contrary to common belief, timeouts are not exclusive to request processing; they can be applied to other parts of frontend-backend communication. Let us explore this briefly.
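As one small example of breaking a wait, here is a sketch of a client that waits for a reply that never arrives and gives up after a timeout. The local listener that never answers and the `read_with_timeout` helper are assumptions made for the example, not something from the episode.

```python
import socket

def read_with_timeout(timeout_s=0.2):
    """Connect to a local server that never replies and give up after timeout_s."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    # The timeout applies to the connect and to later operations on the socket.
    client = socket.create_connection(server.getsockname(), timeout=timeout_s)
    try:
        client.recv(1024)  # nothing is ever sent, so only the timeout ends this wait
        return "received"
    except socket.timeout:
        return "timed out"
    finally:
        client.close()
        server.close()

print(read_with_timeout())
```

Without the timeout, the `recv` call would hold the socket and the waiting thread indefinitely.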
0:00 Intro
2:30 Connection Timeout
5:00 Request Read timeout
10:00 Wait Timeout
12:00 Usage Timeout
14:00 Response Timeout
16:00 Canceling a request
19:50 Proxies and timeouts
Learn more about database and OS internals, check out my courses
Fundamentals of database engineering https://databases.win
Fundamentals of operating systems https://oscourse.win
This new PostgreSQL 17 feature is a game changer.
You see, Postgres, like most databases, works with fixed-size pages. Pretty much everything is in this format: indexes, table data, etc. Those pages are 8K in size; each page has the rows or index tuples and a fixed header. The pages are just bytes in files, and they are read and cached in the buffer pool.
To read page 0, for example, you would call read on offset 0 for 8192 bytes. To read page 1, that is another read system call from offset 8192 for 8192 bytes; page 7 is at offset 57,344 for 8192 bytes, and so on.
If a table is 100 pages stored in a file, to do a full table scan we would be making 100 system calls, and each system call has an overhead (I talk about all of that in my OS course).
The enhancement in Postgres 17 is to combine I/Os. You can specify how much I/O to combine, so technically you could scan that entire table in one system call. That doesn’t mean it is always a good idea, of course, and I’ll talk about that.
This also seems to include vectorized I/O with the preadv system call, which reads from a given file offset into an array of buffers.
The challenge becomes how not to read too much. Say I’m doing a seq scan to find something: I read page 0, find it, and quit; I don’t need to read any more pages. With this feature I might read 10 pages in one I/O, pull all their content, and put it in shared buffers, only to find my result in the first page (essentially wasting disk bandwidth, memory, etc.).
It is going to be interesting to balance this out.
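The page arithmetic and the difference between per-page and combined reads can be sketched like this. It is a minimal illustration using `os.pread` on a temporary file, not PostgreSQL's actual implementation; the helper names and the 100-page file are assumptions for the example, and `os.pread` is only available on Unix-like systems.

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default page size in bytes

def page_offset(page_no):
    """Byte offset of a page inside the file: pages are laid out back to back."""
    return page_no * PAGE_SIZE

def read_pages_one_by_one(fd, n_pages):
    """One pread call per page."""
    return [os.pread(fd, PAGE_SIZE, page_offset(i)) for i in range(n_pages)]

def read_pages_combined(fd, first_page, n_pages):
    """One pread call covering n_pages contiguous pages, then split in memory."""
    blob = os.pread(fd, PAGE_SIZE * n_pages, page_offset(first_page))
    return [blob[i:i + PAGE_SIZE] for i in range(0, len(blob), PAGE_SIZE)]

# Build a hypothetical 100-page "table" file; each page is filled with its page number.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for page_no in range(100):
        f.write(bytes([page_no]) * PAGE_SIZE)
    table_path = f.name

fd = os.open(table_path, os.O_RDONLY)
single = read_pages_one_by_one(fd, 100)     # 100 system calls
combined = read_pages_combined(fd, 0, 100)  # 1 system call
os.close(fd)
os.remove(table_path)
print(page_offset(1), page_offset(7), single == combined)
```

Both approaches return the same pages; the combined read trades fewer system calls for the risk of pulling pages the scan never needed.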
Fundamentals of Operating Systems Course
https://os.husseinnasser.com
Why Windows Kernel connects slower than Linux
I explore the behavior of the TCP/IP stack in the Windows kernel when it receives a RST from the backend server, especially when the host is available but the port we are trying to connect to is not. This behavior is exacerbated by having both IPv6 and IPv4, and when the Happy Eyeballs protocol is in place and IPv6 is favored.
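To observe the unreachable-port case on your own machine, here is a small sketch that connects to a local port with no listener and times how long the failure takes. The way a closed port is found (bind to port 0, read the number, close) and the `time_refused_connect` helper are assumptions for the example; the elapsed time will depend on the operating system.

```python
import socket
import time

def time_refused_connect(host="127.0.0.1"):
    """Connect to a local port with no listener and time how long the failure takes."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind((host, 0))          # let the OS pick a free port
    port = probe.getsockname()[1]
    probe.close()                  # the port is now (very likely) closed

    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=5).close()
        outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"        # the kernel saw a RST for our SYN
    except OSError:
        outcome = "other error"    # includes timeouts
    return outcome, time.monotonic() - start

outcome, elapsed = time_refused_connect()
print(outcome, f"{elapsed:.3f}s")
```

The elapsed time between the connect call and the refusal is the quantity this episode compares across kernels.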
0:00 Intro
0:30 Fundamentals TCP/IP
3:00 Unreachable Port Behavior
6:00 Client Kernel Behavior (Linux vs Windows)
11:40 Slow TCP Connect on Windows
15:00 localhost, IPv6 and IPv4
20:00 Happy Eyeballs
28:00 Registry keys to change the behavior
31:00 Port Unreachable vs Host Unreachable
https://daniel.haxx.se/blog/2024/08/14/slow-tcp-connect-on-windows/
In this episode of the backend engineering show I describe an interesting bug I ran into where the web server ran out of ephemeral ports causing the system to halt.
0:00 Intro
0:30 System architecture
2:20 The behavior of the bug
4:00 Backend Troubleshooting
7:00 The cause
15:30 Ephemeral ports on loopback
Fundamentals of Operating Systems Course
https://os.husseinnasser.com
Linux I/O expert and subsystem maintainer Jens Axboe has submitted all of the IO_uring feature updates ahead of the imminent Linux 6.10 merge window.
In this video I explore this with a focus on zero copy.
0:00 Intro
0:30 IO_uring gets faster
2:00 What is io_uring
7:00 How Normal Copying Work
12:00 How Zero Copy Works
13:50 ZeroCopy and TLS
https://www.phoronix.com/news/Linux-6.10-IO_uring
https://lore.kernel.org/io-uring/fef75ea0-11b4-4815-8c66-7b19555b279d@kernel.dk/?s=09
Fundamentals of Operating Systems Course
https://oscourse.win
Looks like Fedora is compiling CPython with the -O3 flag, which does aggressive function inlining among other optimizations.
This seems to improve Python benchmark performance by up to 1.16x at the cost of an extra 3 MB in binary size (text segment), although it does seem to slow down some benchmarks as well, though not significantly.
O1 - Local register allocation, subexpression elimination
O2 - Function inlining (small functions only)
O3 - Aggressive inlining, SIMD
0:00 Intro
1:00 Fedora Linux gets Fast Python
5:40 What is Compiling?
9:00 Compiling with No Optimization
12:10 Compiling with -O1
15:30 Compiling with -O2
20:00 Compiling with -O3
23:20 Showing Numbers
Backend Troubleshooting Course
https://performance.husseinnasser.com
https://oscourse.win
Allegro improved their Kafka produce tail latency by over 80% when they switched from ext4 to XFS. What I enjoyed most about this article is the detailed analysis and tweaking the team did on ext4 before considering switching to XFS. This is a classic case of what a good tech blog looks like, in my opinion.
0:00 Intro
0:30 Summary
2:35 How Kafka Works?
5:00 Producers Writes are Slow
7:10 Tracing Kafka Protocol
12:00 Tracing Kernel System Calls
16:00 Journaled File Systems
21:00 Improving ext4
26:00 Switching to XFS
Blog
https://blog.allegro.tech/2024/03/kafka-performance-analysis.html
Get my backend course https://backend.win
Google submitted a patch to Linux kernel 6.8 that improves TCP performance by 40%. This is done by rearranging the TCP structures for better use of CPU cache lines. I explore this here.
0:00 Intro
0:30 Google improves Linux Kernel TCP by 40%
1:40 How CPU Cache Line Works
6:45 Reviewing the Google Patch
https://www.phoronix.com/news/Linux-6.8-Networking
https://lore.kernel.org/netdev/20231129072756.3684495-1-lixiaoyan@google.com/
Discovering Backend Bottlenecks: Unlocking Peak Performance
https://performance.husseinnasser.com
0:00 Intro
2:00 File System Block vs Database Pages
4:00 Torn pages or partial page
7:40 How Oracle Solves torn pages
8:40 MySQL InnoDB Doublewrite buffer
10:45 Postgres Full page writes
Get my backend course https://backend.win
Cloudflare has announced they are open sourcing Pingora as a networking framework! Big news, let us discuss.
0:00 Intro
0:30 Reasons why Cloudflare built Pingora?
3:00 It is a framework!
7:30 What in Pingora?
11:50 Security in Pingora
13:45 Multi-threading in Pingora
21:00 Customization vs Configuration
25:00 Summary
https://blog.cloudflare.com/pingora-open-source/?utm_campaign=cf_blog&utm_content=20240228&utm_medium=organic_social&utm_source=twitter
https://backend.win
https://databases.win
I’m a big believer that database systems share similar core fundamentals at their storage layer, and understanding them allows one to compare different DBMSs objectively. For example, how documents are stored in MongoDB is no different from how MySQL or PostgreSQL store rows.
Everything goes to pages of fixed size and those pages are flushed to disk.
Each database defines its page size differently based on its workload; for example, MongoDB's default page size is 32KB, MySQL InnoDB's is 16KB and PostgreSQL's is 8KB.
The trick is to fetch what you need from disk efficiently with as few I/Os as possible; the rest is API.
In this video I discuss the evolution of MongoDB's internal architecture and how documents are stored and retrieved, focusing on the index storage representation. I assume the reader is well versed in the fundamentals of database engineering such as indexes, B+Trees, data files, WAL etc. You may pick up my database course to learn the skills.
Let us get started.
In this video I explore the types of languages: compiled, garbage collected, interpreted, JIT and more.
I talk about default values and how PostgreSQL 14 got slower when a default parameter has changed.
Mike's blog
https://smalldatum.blogspot.com/2024/02/it-wasnt-performance-regression-in.html
Background writing is a process that writes dirty pages in shared buffers to disk (well, they go to the OS file cache and then get flushed to disk by the OS). I go into this process in this video.
Fragmentation is a very interesting topic to me, especially when it comes to memory.
While virtual memory does solve external fragmentation (you can still allocate logically contiguous memory in non-contiguous physical memory), it does introduce performance delays as we jump all over physical memory to read what appears to us as, for example, a contiguous array in virtual memory.
You see, DDR RAM consists of banks, rows and columns. Each row has around 1024 columns and each column has 64 bits, which makes a row around 8 KiB. The cost of accessing the RAM is the cost of “opening” a row and all its columns (around 50-100 ns). Once the row is opened, all the columns are opened and the 8 KiB is cached in the row buffer in the RAM.
The CPU can ask for an address and transfer 64 bytes at a time (called bursts), so if the CPU (or the MMU to be exact) asks for the next 64 bytes next to it, it comes at no cost because the entire row is cached in the RAM. However, if the CPU sends a different address in a different row, the old row must be closed and a new row opened, taking an additional 50 ns hit. So spatial access of bytes ensures efficiency.
So fragmentation does hurt performance if the data you are accessing is not contiguous in physical memory (of course it doesn’t matter if it is contiguous in virtual memory). This kind of reminds me of the old days of HDDs and how the disk head physically travels across the disk to read one file, which prompted the need for “defragmentation”, although RAM access (and SSD NAND for that matter) isn’t as bad.
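The row arithmetic above can be checked with a toy model. This is a deliberately simplified sketch, assuming a single bank with one open row and ignoring CPU caches and the memory controller's scheduling; the constants come from the approximate figures in the text and `row_opens` is a hypothetical helper.

```python
COLUMNS_PER_ROW = 1024   # approximate figure from the text
BITS_PER_COLUMN = 64
BURST_BYTES = 64         # bytes transferred per burst

row_bytes = COLUMNS_PER_ROW * BITS_PER_COLUMN // 8
bursts_per_row = row_bytes // BURST_BYTES

def row_opens(addresses):
    """Count how many times a new row must be opened for a sequence of byte addresses."""
    opens, current_row = 0, None
    for addr in addresses:
        row = addr // row_bytes
        if row != current_row:  # different row: close the old one, open the new one
            opens += 1
            current_row = row
    return opens

# 256 bursts that walk two rows in order.
sequential = [i * BURST_BYTES for i in range(256)]
# The same 256 bursts, but alternating between the two rows on every access.
scattered = [(i % 2) * row_bytes + (i // 2) * BURST_BYTES for i in range(256)]

print(row_bytes, bursts_per_row, row_opens(sequential), row_opens(scattered))
```

In this model the same 256 bursts cost 2 row opens when accessed in order and 256 when the accesses bounce between rows, which is the penalty of physically scattered data.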
Moreover, virtual memory introduces internal fragmentation because of the use of fixed-size blocks (called pages and often 4 KiB in size), and those are mapped to frames in physical memory.
So if you want to allocate a 32-bit integer (4 bytes) you get 4 KiB worth of memory, leaving a whopping 4092 bytes allocated to the process but unused, which cannot be used by the OS. These little pockets of memory can add up across many processes. This is another reason developers should take care when allocating memory.
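The unused-bytes arithmetic can be written down in a few lines. This is a sketch of the page-granularity math only, assuming 4 KiB pages; `page_waste` is a hypothetical helper and does not model what a real allocator does on top of pages.

```python
import math

PAGE_SIZE = 4096  # 4 KiB pages, as in the text

def page_waste(requested_bytes, page_size=PAGE_SIZE):
    """Return (pages_needed, unused_bytes) when memory is handed out in whole pages."""
    pages = max(1, math.ceil(requested_bytes / page_size))
    return pages, pages * page_size - requested_bytes

print(page_waste(4))     # a 32-bit integer
print(page_waste(5000))  # slightly more than one page
```

A 4-byte request leaves 4092 bytes of its page unused, and a 5000-byte request spills into a second page and leaves 3192 bytes unused.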