 Episodes
We're back!
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On this episode, our newest host, Chris Villemez, is joined by Kemal Sanjta to discuss a BGP-related incident that took down Twitter for many users around the globe on March 28th.
00:36 Under the Hood: Chris Villemez and Kemal Sanjta leverage their extensive operations experience managing the networks of large-scale SaaS, IoT, and cloud providers to analyze this incident using the ThousandEyes platform. They examine the scope of the outage, review the specific BGP changes that resulted in the outage, and discuss what enterprises can do when they’re experiencing a similar BGP hijack or route leak.
Share links:
Single agent (Manchester) test: https://anislusvvn.share.thousandeyes.com/
Multi-agent global test showing BGP changes: https://axntbxntyk.share.thousandeyes.com/
31:00 Outro: We've been on a bit of a break for the past several months, as things were relatively quiet on the Internet front. For the foreseeable future we'll be a bit more reactive in our episodes; when something major happens, trust that we'll be here.
Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On today’s episode, our newest host and Technical Marketing Engineer, Chris Villemez, is joined by Kemal Sanjta, Principal Engineer, to dive into the details of the recent AWS outages from December 7th, 10th and 15th. They’ll walk through what ThousandEyes saw from its fleet of vantage points, as well as share some insight into what enterprises can learn from these incidents to build resilient cloud architectures.
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why.
00:15 Headlines: Today we’re going to do a thorough analysis of the major Facebook outage that took place yesterday, Monday, October 4. I’m joined by Gustavo Ramos, ThousandEyes’ in-house expert on network engineering.
ThousandEyes Blog: https://www.thousandeyes.com/blog/facebook-outage-analysis
Analysis from Facebook: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
1:17 Under the Hood: We'll walk through the sequence of events that led to this outage, understand what went wrong (and what actions may have made the situation worse), and discuss what lessons we can all learn from it.
25:40 Outro: We've been on a bit of a break for the past several months, as things were relatively quiet on the Internet front. For the foreseeable future we'll be a bit more reactive in our episodes; when something major happens, trust that we'll be here.
Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why.
00:08 Headlines: Today, Mike Hicks (Principal Solutions Analyst, ThousandEyes) and I discuss a recent BGP routing incident that had intermittent impacts on Amazon’s services, including Amazon.com and AWS compute resources, during a five-hour period on July 12.
01:04 Under the Hood: Looking at BGP routing at the time of the incident, we can see multiple BGP path changes caused by a service provider erroneously inserting itself into the path for a large number of Amazon routes. Watch this episode to see how the BGP incident led to significant packet loss, resulting in service disruption for some Amazon and AWS users. We also discuss why enterprises need continuous oversight of the paths their traffic takes over the Internet.
17:58 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
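To make the idea of continuous path oversight a bit more concrete, here is a minimal sketch (not from the episode, and not ThousandEyes code) of the kind of check a network team might run: compare the AS paths currently observed for your prefixes against a baseline of expected ASNs and alert on anything new. The prefixes, paths, and ASN values below are hypothetical placeholders.

```python
# Minimal sketch: flag observed BGP AS paths that traverse unexpected networks.
# The prefixes, AS paths, and ASN values are hypothetical placeholders, not data
# from the incident discussed in this episode.

EXPECTED_ASNS = {64500, 64501, 64510}  # origin and transit ASNs normally seen for these routes

observed_routes = {
    "203.0.113.0/24": [64500, 64501, 64510],          # matches the baseline
    "198.51.100.0/24": [64500, 64512, 64501, 64510],  # 64512 is an unexpected transit hop
}

def unexpected_hops(as_path, expected):
    """Return the ASNs in an observed AS path that are not in the expected set."""
    return [asn for asn in as_path if asn not in expected]

for prefix, as_path in observed_routes.items():
    surprises = unexpected_hops(as_path, EXPECTED_ASNS)
    if surprises:
        print(f"ALERT {prefix}: unexpected ASNs {surprises} in path {as_path}")
```

In practice a check like this would run continuously against data from route collectors or a monitoring platform rather than a hard-coded table, but the core idea of comparing observed paths to an expected baseline is the same.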
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined today by Mike Hicks, principal solutions analyst here at ThousandEyes, to cover the outage of Akamai’s DNS service. The outage, which occurred on July 22nd around 3:38 PM UTC (8:38 AM PT), struck during business hours in Europe and North America, resulting in widespread impacts to applications and services hosted on Akamai servers. The outage itself was short-lived and was resolved roughly one hour after it began. In this episode, we examine the customer impact, the relationship between DNS and CDNs, and what enterprises should take away from the incident. We also discuss the question of when it might make sense to invest in DNS or CDN redundancy—and when it is, frankly, overkill. Watch this week’s episode to hear our take, and, as always, let us know on Twitter what you think.
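As a small illustration of what DNS redundancy looks like from the outside (not something shown in the episode), the sketch below uses the third-party dnspython library to list the nameservers published for a domain; a domain with provider-level DNS redundancy will typically show nameservers operated by more than one provider. The domain is a placeholder.

```python
# Illustrative only: inspect the published NS records for a domain.
# Requires the third-party dnspython package (pip install dnspython).
import dns.resolver

def published_nameservers(domain):
    """Return the set of nameserver hostnames currently delegated for a domain."""
    answers = dns.resolver.resolve(domain, "NS")
    return {str(rr.target).rstrip(".") for rr in answers}

# Placeholder domain; substitute your own.
print(published_nameservers("example.com"))
# A redundant setup lists nameservers from more than one provider, so an outage
# at a single DNS provider does not take the whole domain offline.
```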
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why.
00:13 Headlines: Today, Kemal and I unpack an interesting BGP incident, in which a large-scale route leak briefly altered traffic patterns across the Internet.
00:58 Under the Hood: The incident began on Thursday, June 3rd at around 10:24 UTC, and resulted in a significant spike in packet loss that was noticeable in ThousandEyes tests. While this packet loss resolved within the hour (at around 10:48 UTC), we observed some interesting routing changes during this window—as traffic was diverted to a Russian telecom provider that had not previously been in the path. Watch this episode as we explore how this network provider managed to get itself into the routing paths of many major services, and why network visibility is so important for recognizing these types of incidents, in which your site may still be reachable but your traffic is being sent through an unexpected network.
20:45 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined by ThousandEyes’ BGP expert, Kemal Sanjta, to review the June 16th outage of Prolexic Routed, a DDoS mitigation service operated by Akamai. According to a statement from Akamai, the outage was not due to a DDoS attack or system update, but instead a routing table limitation that was inadvertently exceeded. In this episode, Kemal and I analyze what happened and how customers of Akamai Prolexic who had automated failover mechanisms in place were able to recover more quickly than those that had to manually switch over to other providers. Watch this episode to learn more about this outage, and how different operational processes resulted in very different service outcomes.
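To illustrate the difference automation can make, here is a generic, hedged sketch of automated failover logic. It is not Akamai- or ThousandEyes-specific; the health-check URL, thresholds, and failover_to_secondary() action are hypothetical placeholders for whatever mechanism a given provider actually exposes (a DNS update, a BGP announcement change, or an API call).

```python
# Generic illustration of automated failover: probe a primary path and switch to a
# secondary once a failure threshold is reached. All names and values are placeholders.
import time
import urllib.request

PRIMARY_CHECK_URL = "https://service.example.com/health"  # hypothetical health endpoint
FAILURE_THRESHOLD = 3    # consecutive failed probes before failing over
PROBE_INTERVAL_SEC = 30  # seconds between probes

def probe(url, timeout=5):
    """Return True if the endpoint answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def failover_to_secondary():
    # Placeholder: in practice this might update DNS records, change BGP announcements,
    # or call a provider API to redirect traffic to a secondary mitigation path.
    print("Failing over to secondary provider...")

failures = 0
while True:
    if probe(PRIMARY_CHECK_URL):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            failover_to_secondary()
            break
    time.sleep(PROBE_INTERVAL_SEC)
```

The operational point from the episode stands on its own: when the switchover decision is automated, recovery time is bounded by probe intervals and thresholds rather than by someone noticing the outage and acting manually.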
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why.
00:12 Headlines: Today, I’m joined by Hans Ashlock, Director of Technology & Innovation at ThousandEyes, to unpack today’s major outage at Fastly, a popular CDN provider.
3:46 Under the Hood: The widespread outage occurred around 9:50 UTC (about 5:50 am ET) and, due to the timing, mostly impacted users across Europe and Asia. The outage lasted approximately one hour, until 10:50 UTC, yet residual impacts were felt beyond that. Today’s outage is a good example of the importance of having outside-in visibility not just across your app, but also to your app’s edge and all its dependent services.
39:05 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined today by Mike Hicks, Principal Solutions Analyst at ThousandEyes, to cover two recent application-related outages. The first occurred on May 19th around 12:50 UTC at Coinbase—a well-known cryptocurrency exchange. Around the time that news broke saying that the Chinese government would be imposing strict regulation on cryptocurrencies, users attempting to execute transactions were unable to access the application. From the ThousandEyes platform we were able to see a drop in availability around this time as well as increased load times (which in some cases resulted in timeout errors). The second outage happened on May 20th around 17:35 UTC at Slack—an enterprise collaboration platform. While the outage was resolved within 90 minutes, it occurred during normal US business hours, making it particularly disruptive to users attempting to reach the application. These instances remind us that applications, much like the underlying networks they run on, can experience outages, and effective troubleshooting requires end-to-end visibility into both.
00:00 Welcome
00:14 Headlines: DNS and BGP and DDoS Attacks—Oh, My! This week we cover a couple of recent service degradation incidents involving DNS providers.
2:19 Under the Hood: Kemal Sanjta, ThousandEyes’ resident BGP expert, joins us to discuss the May 6th disruption to Neustar’s UltraDNS service, which lasted nearly four hours. We discuss the BGP routing changes we observed during the incident and what they can tell us about the cause of the disruption. We also cover a separate incident involving Quad9, a public recursive resolver service, which the company says was caused by a DDoS attack on May 3rd.
16:19 Expert Spotlight: Michael Batchelder (a.k.a. Binky) is here to discuss the two “Ds” of the Internet: DDoS attacks and the DNS. Questions for Binky? Contact him at binky@thousandeyes.com
31:49 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at internetreport@thousandeyes.com
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Today, we focus on an interesting outage that impacted Cloudflare Magic Transit, a relatively new offering from the CDN provider that aims to efficiently route and protect the network traffic of its customers. On May 3rd at approximately 3:00 PM PDT (10:00 PM UTC), ThousandEyes vantage points connecting to sites using Magic Transit began to detect significant packet loss at Cloudflare’s network edge, with the loss continuing at varying levels for approximately two hours. While the outage impacted some Magic Transit customers more significantly than others, we also observed mitigation actions by at least one customer to avoid the outage and restore the availability of their service to their users. This outage reminds us that no provider is immune to outages, even cloud and global CDN providers. However, with proactive visibility, you can respond quickly to reduce outage impact on your users. Watch this week’s episode to hear more about the outage from the ThousandEyes perspective.
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. We’re joined this week by Hans Ashlock, Director of Technology & Innovation at ThousandEyes, to discuss Tuesday’s Microsoft Teams outage. On Tuesday, April 27th, ThousandEyes tests began to detect an outage affecting the Teams service starting around 3 AM (PT) and lasting approximately 1.5 hours. While the outage occurred in the overnight hours for much of the Americas, the global nature of the outage resulted in service disruption for users connecting from Asia and Europe. Transaction views within the ThousandEyes platform show that Microsoft’s authentication service appeared to be available; however, the Teams application was unable to initialize, resulting in error responses. Watch this week’s episode to hear more about what ThousandEyes revealed about the nature of this outage—and what we can all learn from the incident.
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On today’s episode, we’re thrilled to be joined by Kemal Sanjta, ThousandEyes’ resident expert on BGP. This week, we’re going under the hood on the April 16th BGP leak at Vodafone India, which leaked more than 30,000 prefixes, causing a major disruption of Internet traffic to some services. While some news outlets reported that the incident lasted approximately 10 minutes (starting around 1:50 AM UTC, or 9:50 AM ET), we found that it lasted quite a bit longer—more than an hour in the case of some prefixes. Watch this week’s show to see how it impacted a major CDN provider.
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. We’re back from a short sabbatical to cover an interesting incident at Facebook: what appears to be an application outage compounded by a series of routing issues. On April 8th, for roughly 40 minutes, the Facebook application became unavailable for users around the globe who were attempting to connect to the service. Despite the short-lived nature of the outage, we observed prolonged performance degradation even after the application came back online for users. Suboptimal page load and response times, both of which can impact the user experience, were observed alongside a series of routing changes. This outage reminds us all of the importance of having visibility across network and application layers when troubleshooting and prioritizing issues that are impacting user experience. Catch this week’s episode to hear about the outage from the ThousandEyes perspective.
On today’s episode, we discuss the recent outage on Verizon’s network that had widespread impacts on users in the US. ThousandEyes Broadband Agents detected an outage starting around 11:30 AM EST that manifested as packet loss across multiple locations concentrated along the Verizon backbone on the US East Coast and in the Midwest. While the outage was resolved approximately an hour later, users connecting from the Verizon network across the US experienced varying degrees of impact, depending on the services they were connecting to. This serves as yet another reminder that the context around an outage directly affects the scope of the disruption. Watch this week’s episode to see what this outage looked like from ThousandEyes vantage points.
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Despite a quiet last couple of weeks on the Internet, we started off our new year with quite the bang. As droves of mildly caffeinated workers returned to their home offices on Monday after the holiday break, many were surprised to find that Slack was not available. On today’s episode, we go under the hood of Slack’s Monday outage to see what went wrong and how it was resolved. We’re also excited to be joined by Forrest Brazeal, a cloud architect, writer, speaker, and cartoonist, to talk about everyone’s favorite subject: cloud resiliency. Watch this week’s episode to see the interview and hear our outage analysis.
Show links:
https://forrestbrazeal.com
https://acloudguru.com
https://cloudirregular.substack.com
https://cloudirregular.substack.com/p/the-cold-reality-of-the-kinesis-incident
In this week's episode of #TheInternetReport...
00:00 Welcome
00:16 Headlines: About Monday’s Google Outage; Plus, Talking Holiday Internet Traffic Trends with Fastly
00:43 Under the Hood: This week, we go under the hood on a recent outage that took down the availability of several Google applications, including YouTube, Gmail, and Google Calendar. Yesterday morning at approximately 6:50 AM EST, users around the world were unable to access several Google services for a span of around 40 minutes. While short-lived, the outage was notable in that it occurred during business hours in Europe and toward the beginning of the school day on the US east coast—so, people noticed, to put it bluntly. Catch this week’s episode to hear about the official RCA and what we saw from a network perspective.
10:18 Expert Spotlight: We’re thrilled to be joined by David Belson, Senior Director of Data Insights at Fastly, to talk about Internet traffic trends related to holiday online shopping and charitable giving.
Cyber Five: what we saw during ecommerce's big week - https://www.fastly.com/blog/cyber-five-what-we-saw-during-ecommerces-big-week
Decoding the digital divide - https://www.fastly.com/blog/digital-divide
19:14 Outro: We're taking a break for the rest of 2020, but join us on Jan. 5, 2021, when we kick off the New Year with Forrest Brazeal:
https://forrestbrazeal.com
https://cloudirregular.substack.com
If you’re an AWS customer or rely on services that use AWS, you might have noticed the major, hours-long outage last week. On November 25th, at approximately 5:15 am PST, users of Kinesis, a real-time processor of streaming data, began to experience service interruptions. The issue was not network-related, and AWS later issued a detailed incident post-mortem analysis identifying an existing operating system configuration issue that was triggered by a maintenance event that involved adding server capacity. Over the course of the day, Amazon attempted several mitigation measures, but the outage was not completely resolved until approximately 10:23 pm PST. What was notable about this outage was its blast radius, which extended far beyond AWS’s direct customers. Several AWS services that use Kinesis, including Cognito and CloudWatch, were affected, as were users of applications consuming those services (e.g., Ring, iRobot, Adobe). This is a good reminder of the risk of hidden service dependencies, as well as the need for visibility to understand and communicate with customers when something’s gone wrong.
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. This week, we’re pleasantly surprised to say that the network did not break, and there were no major election-night outages to report. However, that’s not to say we didn’t catch performance glitches in the days and weeks around the big night. Watch this week’s episode, as we cover performance issues at a Secretary of State website as well as why CNN’s election map website was so slow to load for many.
We’ve got an election coming up here in the US, and over the last several weeks, we have been analyzing a dozen or so state election websites to take a closer look at how they’re hosted (e.g., do they use a CDN or are they self-hosted?) and to monitor them for outages. In this episode, we discuss the pros and cons of each hosting method and dive into some examples we’ve seen where election websites have had unexpected performance degradation. Catch this week’s episode to go under the hood on the websites powering the upcoming presidential election—and don’t forget to get out there and vote!