227. Facebook's DNS Woes with Sophie Creutz

In this episode of The Rabbit Hole, we unpack the recent outage of Facebook's servers, looking at why it might have happened, some of the more ludicrous theories that have been offered as explanations, lessons to take away, and why the downtime was worse than just a day of limited social media access for many people. With Facebook, WhatsApp, and Instagram offline for the better part of a day, many small businesses could not function, normal communications were halted for some, and a general air of curiosity spread across the globe. Our friend, Sophie Creutz, joins us to go through the most important points to reflect on, The Five Whys, and how Facebook and smaller companies can learn from mistakes such as this to safeguard against further issues in the future.

Key Points From This Episode:

  • The effects of the recent outage at Facebook's servers.
  • How Facebook and its connected apps are used in different ways across the world.
  • Tools for mitigating outages through root cause identification.
  • Some of the theories that were circulating about why Facebook went down.
  • Proper use of 'The Five Whys' for dealing with issues.
  • Sharing what is actually known about the outage and what might have caused it.
  • Recommendations for figuring out what went wrong and avoiding recurring problems.

Transcript for Episode 227. Facebook's DNS Woes with Sophie Creutz

[EPISODE]

[0:00:01.8] MN: Hello, and welcome to The Rabbit Hole, the definitive developers podcast, living large in New York. I’m your host, Michael Nunez. Our co-host today –

[0:00:08.8] DA: Dave Anderson.

[0:00:10.1] MN: And our guest extraordinaire.

[0:00:12.1] SC: Sophie Creutz.

[0:00:14.2] MN: Today, we’ll be talking about Facebook’s DNS Woes. Oh man!

[0:00:18.4] DA: Oh, yeah. Like it’s one for the record books. New high score.

[0:00:25.0] MN: New high score of about 24 hours. I lost a day. I don’t know. I just kept trying to refresh my Instagram, it wasn’t working.

[0:00:31.9] SC: Oh, no. You were just in a haze that whole time.

[0:00:35.2] MN: Yeah. I was trying to call my mom on WhatsApp, wasn’t working, but it’s so quiet. Everything was quiet that day. It’s pretty intense.

[0:00:42.7] DA: Right. Wow! The day the earth stood still.

[0:00:46.6] MN: Exactly. Oh man!

[0:00:48.3] SC: The day we all navel-gazed, right?

[0:00:51.4] MN: I used Twitter. That’s what happened. A lot of people flocked over to Twitter.

[0:00:53.7] SC: In fact, yes. What would we have done without Twitter?

[0:00:57.8] MN: Shout out to Twitter, you are the best. And you know, a lot of people were affected, many – I think it was like billions of users across those three applications were affected by this outage. And you know, 90% of the time when you get an outage that bad, it’s definitely DNS related, I’m sure.

[0:01:18.0] DA: So not only did it affect like the individuals who couldn’t access the service, which was everyone who uses those things, but also like businesses who are selling, and shareholders who are holding Facebook. They were really affected.

[0:01:36.1] SC: Yeah. I think we look at it like, “Oh, that’s kind of silly. It’s a social network. How foolish?” But in reality, there are businesses that use Facebook to sell their product, and they lost a lot of revenue that day, which for some small businesses might actually mean the difference between paying rent that month or not. So that’s something to keep in mind.

[0:01:57.3] DA: Yeah. So I think like when you’re trying to mitigate an outage, or like trying to retrospect on an outage, it’s definitely good to, like, consider what the impact of the outage was, like, what the outcomes truly were from that.

[0:02:11.1] SC: Yeah. From the social network perspective, maybe it’s worthwhile to note that, for a lot of folks, I think it is super useful as a point of connection and that can look like a lot of different things. But for example, if you’re a musician, you might actually be networking to find gigs through Facebook, and then that’s how you continue to pay your rent that way, and stay connected to the musician community. I’m sure there’s other examples as well.

[0:02:46.0] DA: Mm-hmm. Yeah, totally.

[0:02:47.7] MN: I joked about not being able to speak to my mom, but the idea that – I know Dominican Republic is really huge on using WhatsApp exclusively for phone calls and messaging. The fact that you can’t reach out to someone – imagine, like, “Oh! I am unable to make phone calls when I need to make phone calls,” whether it’s like you’re calling a family member, or like, you mentioned this Sophie, the idea of like, having to make phone calls for business purposes. You just cannot do that and it was really strange to like call my mom through the regular cell phone app to call her landline to say, “Hey! What’s up? How are you? Just so you know, Facebook is not working,” and like having to explain that to my mother was pretty, pretty funny.

[0:03:30.8] SC: I guess it’s a good thing you had a workaround though.

[0:03:34.2] MN: Yeah, exactly. Most people may not.

[0:03:37.2] DA: Definitely have a plan B. Yeah. William told us a story about like a motorcyclist in Bali who went off the road. His first instinct was to post on Facebook where he was and like what happened, and then he passed out. Somebody like sent an ambulance to go pick him up.

[0:04:01.1] SC: Yeah.

[0:04:02.1] MN: Oh my God! That’s amazing. Good on that person. That’s pretty cool that he was able to use Facebook to do that. I would – it would suck like if you were to send a post like that, and then like, people just liked it and then that was it. That’d be so horrible, but I’m glad that it was used for good. That he was saved in this particular situation.

[0:04:21.9] DA: Right? Yeah. I mean, I guess like when we’re trying to like mitigate outages, it’s good to think about like what the real impact was. But then like, to try to reduce that impact in the future, there are a bunch of tools that we have that we can use to try to identify the root cause. We can use the five whys technique. That one’s pretty helpful. Have you ever used that before, Mike?

[0:04:48.3] MN: Yes, I have. I believe – correct me if I’m wrong, but the idea is that, there is a situation that happened. I’ll use “Facebook is down.” The idea is that one would ask, “Well, why did Facebook go down?” And then the answer to that question will be the next line of questioning. So like, suppose, “Oh, it was a DNS issue.” “Well, why was it a DNS issue?” “It’s because the readme that we use and the playbook was not updated to support all these different countries.” “Well, why wasn’t the playbook updated to support those other countries?” “Well, it was because the previous outage that was only seven seconds, we were in a scramble, and we forgot to do that.” “Well, why wasn’t it prioritized for us to –” like you go down the list of the whys that it happened. I’m not 100% sure those are the whys. Facebook, please don’t cancel my account. I’m not 100% sure. But the idea is that you go down the list to find out why, so that you get a better solution than just, “Hey! Don’t have DNS issues.” Right? Like that’s not the proper solution.
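
Note: the chain Mike walks through can be written down as a small data structure, which makes it easier to see that each answer becomes the subject of the next question. What follows is a minimal Python sketch, not a standard tool. The class name `FiveWhys` and its methods are made up for illustration, and the answers are Mike’s hypothetical ones from the conversation, not Facebook’s actual findings.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """A chain of why-questions that starts from one observed incident."""

    incident: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Record an answer; it becomes the subject of the next "why".
        self.answers.append(answer)

    def steps(self) -> list[tuple[str, str]]:
        # Pair each subject (the incident, then every earlier answer)
        # with the answer that was given for it.
        subjects = [self.incident] + self.answers[:-1]
        return list(zip(subjects, self.answers))

    def deepest_cause(self) -> str:
        # The last recorded answer is the deepest cause found so far.
        return self.answers[-1] if self.answers else self.incident


if __name__ == "__main__":
    # Hypothetical answers taken from the conversation above.
    chain = FiveWhys("Facebook is down")
    chain.ask_why("there was a DNS issue")
    chain.ask_why("the playbook was not updated")
    chain.ask_why("the team was scrambling after a previous outage")
    for subject, answer in chain.steps():
        print(f"Why: {subject}? Because {answer}.")
```

Keeping the steps as pairs makes it hard to skip a link in the chain, which is where the exercise tends to drift into speculation.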

[0:05:56.3] DA: Yeah, exactly. It’s like – it’s not a perfect tool because you can go kind of off the rails pretty quickly and people kind of like practice this to a degree, like inherently, where there were so many weird theories going around the Internet. Like, “Wait! Why is Facebook down?” It’s like –

[0:06:17.4] SC: People have some interesting ideas. Yeah, for instance, Facebook went down because someone deleted the Facebook master code. Now, why would someone do that?

[0:06:30.3] DA: Oh, well, obviously, because there was a whistleblower that week, and they wanted to get rid of the evidence.

[0:06:40.7] SC: Right. Yeah. Why was there a whistleblower that week, whistleblower Frances Haugen?

[0:06:48.6] DA: Maybe because of the 10 days of darkness.

[0:06:52.7] SC: Whoa!

[0:06:53.2] MN: Oh! 

[0:06:54.0] DA: And the coming political destruction of the world. That’s why, yeah. That’s it.

[0:07:00.4] SC: Oh wow! Why are there going to be 10 days of darkness?

[0:07:05.8] DA: Because this guy said so on a forum. I think he’s like Internet Jesus or something. I’m not really sure.

[0:07:13.4] SC: Oh wow! Why is there a need for such a figure? How did this figure arise?

[0:07:23.0] DA: Because five whys is really complicated. It’s hard to do. It’s just really hard to find causality in the universe, I guess.

[0:07:31.8] SC: Right. It seems like you could just keep asking why, truly without end. We might have gone beyond five there.

[0:07:42.3] MN: I mean, for the five whys, if I had to give pointers for it, obviously keep it grounded in reality to make sure that – the idea is that, keep it close to the incident, I guess. But one of the things that I’ve seen in my experience, and this is pretty horrible to say. I think the very first why in our given example was, Bobby over here, Bobby Facebook deleted the master code, right? Say that’s the thing that happened. Well, the solution isn’t to fire Bobby immediately, but I’ve seen five whys work in a way where the solution for that particular, I guess, exercise was actually that the person got fired. I don’t think that that was like – I don’t think it called for that, but it was like, “Oh! If Bobby had taken the chance to update the readme, that would have prevented the DNS issues from happening. Then maybe it was Bobby’s incompetence that he wasn’t able to do that, so we’re firing him.”

I’m like, “Whoa! Did you just use the five whys to fire, or somewhat to justify firing, someone?” I would suggest that people who use it not do that, because you want to make sure that you have enough trust in your team to explore the whys and then be able to come up with the proper solution rather than, “Oh! We weren’t working out.”

[0:09:01.2] DA: I mean, I think that goes back to Project Aristotle, right? Like just the baseline psychological safety.

[0:09:07.5] SC: Absolutely.

[0:09:09.1] DA: And the ability to like raise issues early and often.

[0:09:13.4] SC: Yes. Yeah, it’s the five whys after all, not the five whos.

[0:09:21.7] DA: Oh! Who hired this guy? Okay. This guy is fired too.

[0:09:25.9] SC: What’s the chain of blame here?

[0:09:28.5] MN: Yeah. Exactly. Yeah, we can’t use the five whys as the chain of blame. Don’t do that. Definitely not useful.

[0:09:35.2] DA: Yeah. I mean, we don’t have like a very clear picture of like what actually happened, but it’s kind of interesting to think about it in that way. But it’s like, from the outside world, like everyone was seeing that Facebook was no longer available on DNS. We’re like – some registrars were even like, “Would you be interested in buying the domain facebook.com?”

[0:10:00.2] MN: That’s crazy.

[0:10:01.8] DA: So it can be, like, easy to just be like, “Okay. Like, obviously, it’s a DNS problem. This is so clear.” But there were like little bits of information that leaked out.

[0:10:14.8] SC: Yeah. Do we have any like more detailed information about what exactly, probably actually did happen?

[0:10:21.4] DA: There’s an interesting snippet on ZDNet [inaudible 0:10:24.1] posted to Reddit, before, like, it was scrubbed clean from the internet, but they were saying that like the DNS unavailability was just a result of an even more arcane network problem, where someone like updated BGP peer routings incorrectly. Then, as a result of that, the DNS couldn’t get updated.

[0:10:49.7] SC: But why did they update the configuration incorrectly?

[0:10:54.4] DA: I think that’s where we’re off the map. Here there be monsters for us. Like, we don’t have much visibility into it. Like maybe it was a training problem, maybe they used a wildcard instead of like something [inaudible 0:11:10.7]. They like update star. Delete where star, something.

[0:11:18.4] SC: Well, I thought it was sort of interesting, too, how the article mentions that there were two issues. One was these actual routing problems and then there was the fact that physical access, apparently, was the barrier. Folks couldn’t even get into the location they needed to get into.

[0:11:37.5] DA: Yeah. Why didn’t people have access to the location to fix it?

[0:11:40.9] SC: Yeah.

[0:11:42.2] MN: I thought it was because the access to get into the building was also used by this Facebook server, which was down.

[0:11:48.0] DA: But why did they do that? Why did they set up the Facebook server to like use the same thing?

[0:11:53.7] MN: That is a good question. I wouldn’t – I guess it’s a note to keep. Don’t keep all your things under the same roof. Make sure that people have access to your application even if the server’s down.
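
Note: the situation Mike describes, where the way into the building depends on the thing that is broken, is a circular dependency. The following is a minimal Python sketch of checking a recovery plan for such a loop. The function `find_cycle` and every entry in the dependency map are hypothetical and based only on the hosts’ guess about what happened, not on any published details.

```python
def find_cycle(depends_on: dict[str, list[str]], start: str) -> list[str]:
    """Return a dependency loop reachable from start, or [] if there is none."""
    path: list[str] = []

    def visit(node: str) -> list[str]:
        if node in path:
            # We came back to something we are already waiting on.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in depends_on.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        return []

    return visit(start)


if __name__ == "__main__":
    # Hypothetical recovery plan: each step lists what it needs first.
    plan = {
        "fix network config": ["building access"],
        "building access": ["badge server"],
        "badge server": ["working network"],
        "working network": ["fix network config"],
    }
    print(" -> ".join(find_cycle(plan, "fix network config")))

    # Same plan with an offline fallback for getting through the door.
    plan["building access"] = ["physical key"]
    print(find_cycle(plan, "fix network config"))
```

The second plan breaks the loop by giving building access a dependency that does not need the network, which is one reading of Mike’s advice not to keep everything under the same roof.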

[0:12:06.3] SC: Yes. And why would you have folks who need to access a thing, but don’t actually have the ability to themselves get access to that thing? That seems like a strange, maybe silo thing, or an odd separation of concerns?

[0:12:22.5] DA: Yeah. I mean, sometimes like that separation of concerns is like deliberate. You have like responsibility matrices, and like, “Okay. He who can touch this thing cannot touch this other thing. She who can have the keys cannot open the door.” I’ve definitely worked in regulated environments that are like that. But it can lead to these kinds of weird situations. Like, a classic thing was always database production access. Okay. If you’re writing the code, you can’t get production access to the database. You’ve got to like send it over to the DBA, who will execute the code, but then if you miss like a semicolon or something in your SQL script, then, okay, it has to go all –

[0:13:15.3] SC: It’s the whole thing, yeah. Do you think there are advantages to that kind of system?

[0:13:22.3] DA: I mean, the DBA doesn’t care about the data in the database. So like, they’re not going to do anything, like untoward with it, I guess. They’re unlikely to be like invested in like manipulating it or –

[0:13:34.9] SC: I see. Yeah, something to ponder.

[0:13:38.0] DA: Yeah. You know, you give me access to production database, and you know what’s going to happen. It’s just the Wild West.

[0:13:45.5] MN: I’ll have to say, during the outage, as I was trying to figure it out, and as a software engineer, I thought, “Yeah, I might have an inkling and an understanding as to what is actually happening,” when you go and try and search through Google as to what’s going on. But it was the first time that I ever saw the BGP acronym being used and I realized how far removed I am from the ops side of getting a website up and running. When I first read it, I was confused. It was like, “What does it have to do with the Great Britain pound? What’s going on? A GBP? What is the GBP? How does that deal with Facebook?” But then it’s BGP and it’s like, “Oh! I have to do the flipperoo myself in my head.” I knew what a DNS was and my first instinct was, “Oh! It was probably some DNS issue. They’ll get over it. That’s fine.” But then it was like, “Oh! This BGP thing,” and I had to go and do some more research on that to kind of get a sense of what that is in the first place.

[0:14:49.0] DA: I guess like those kinds of separations of concerns kind of naturally happen where you’re focused on application code. The DNS routing, like it’s not changing every day. Even if you were responsible for setting it up, like you might forget about it. I had to work on a project where we had to like learn a lot about DNS, and that just kind of made me realize how little I actually knew. I had a similar reaction when BGP came up. I was like, “Okay. Well, how does the internet work again?”

[0:15:24.4] SC: Packets. It’s all packets.

[0:15:27.0] DA: Right. Yeah. But like –

[0:15:29.4] SC: I oversimplify, obviously.

[0:15:33.9] DA: Yeah, just little tiny packets. So DNS, that’s like a lookup of like the hostname to the IP address. The packet that you’re sending to Facebook like has to be addressed to like the IP and not the hostname. Like the hostname is just for people to feel better about computers and not have to remember really long numbers.

[0:16:01.9] MN: Right. Because you can go into Facebook through an IP address, if you really wanted to, but that would be a really weird way to remember different websites.
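
Note: the hostname-to-IP lookup Dave describes can be tried from Python’s standard library. The sketch below is a minimal illustration that assumes the machine has a working resolver. The helper name `resolve` is made up, and `example.com` is only a placeholder hostname.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the distinct IP addresses the system resolver gives for hostname."""
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    try:
        print(resolve("example.com"))
    except OSError as error:
        # socket.gaierror is a subclass of OSError and is raised when
        # the name cannot be resolved.
        print(f"lookup failed: {error}")
```

When a lookup fails, `getaddrinfo` raises `socket.gaierror`, and the application never learns which address to send its packets to.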

[0:16:12.9] DA: Yeah, exactly. DNS, like, is a way to give you that mapping and some additional metadata. If you’ve ever set up like some analytics on your domain, like you sometimes need to add like meta information to your DNS entry, so that you can prove that you own it or all that stuff. But then BGP is even more low level or like kind of a different thing, where it’s like when you have those servers, how do you actually get between them in the best possible way? It’s not completely random.

[0:16:53.8] SC: Seems like there’s an approved list of neighbors, so to speak, that each BGP speaker can talk to.

[0:17:02.4] DA: Right. Yeah. We’re going to say, these are the ones that you should be sending all of your packets through, like, this is the best way that you can get through our network without packets just randomly going wherever packets want to go. So yeah, if you screw up that BGP routing, then you might not even be able to get to your server that is hosting your DNS, and knowing where the addresses of all those packets should go. So you might not be able to figure out like where facebook.com is or where like doorunlock.facebook.com is or anything like that.
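
Note: the failure Dave describes, where a routing mistake makes the DNS server itself unreachable, can be shown with a toy model. The following Python sketch is not how BGP is implemented. `RouteTable` and `lookup` are hypothetical names, the prefixes come from the address ranges reserved for documentation, and the hostnames are stand-ins.

```python
import ipaddress


class RouteTable:
    """Toy stand-in for the prefixes that neighbors have been told about."""

    def __init__(self) -> None:
        self.routes = {}  # maps a network prefix to a next-hop label

    def announce(self, prefix: str, next_hop: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop(self, address: str):
        # Pick the most specific announced prefix that contains the address.
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


def lookup(hostname: str, dns_server_ip: str, zone: dict, table: RouteTable) -> str:
    """Answer from the zone only if the DNS server itself is reachable."""
    if table.next_hop(dns_server_ip) is None:
        raise LookupError(f"no route to DNS server {dns_server_ip}")
    return zone[hostname]


if __name__ == "__main__":
    table = RouteTable()
    table.announce("192.0.2.0/24", "peer-a")  # hypothetical prefix of the DNS server
    zone = {"www.example": "198.51.100.10", "doorunlock.example": "198.51.100.20"}

    print(lookup("www.example", "192.0.2.53", zone, table))
    table.withdraw("192.0.2.0/24")  # the mistaken routing update
    try:
        lookup("doorunlock.example", "192.0.2.53", zone, table)
    except LookupError as error:
        print(error)
```

The zone data never changes in this sketch. Only the route to the server holding it disappears, which is why the outage looked like a DNS problem from the outside.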

[0:17:45.2] SC: Yeah. If, in your application’s XML file, in your neighbor address tag inside that, you put an address that, I don’t know, maybe doesn’t exist, or is on the moon, that could –

[0:18:00.3] MN: It’s on the moon.

[0:18:02.2] DA: Moon right here.

[0:18:04.0] SC: Oh! I just can’t wait until we have IP addresses on the moon.

[0:18:08.2] MN: That’s insane.

[0:18:11.9] DA: Yeah. Completely nuts.

[0:18:15.6] MN: Whatever you do for Facebook, be extra careful. If you’re dealing with this BGP, I imagine one should be very, very careful before you wipe your application off the internet address list. It’s something you do not want to do. I’m unsure how easy or difficult it is for someone to make this mistake. But I would implore that you be very careful, regardless of whether it’s very difficult or easy to make the mistake of deleting your entire application off the face of the Internet, which is kind of a scary thing. [inaudible 0:18:48.8] vast, it could go everywhere. But if it makes it that easy, that’s kind of scary. Be very, very careful of that.

If you do run into this issue, you should be able to use some kind of root cause analysis exercise such as the five whys to get down to the nitty gritty and figure out what went wrong, and what we can do as a software development team to not have such an incident like that again. But I will say, thank you Facebook, because my mom couldn’t call me many, many times, but I’m sure a lot of people were affected by this. I imagine that they’ve got their own little postmortem happening as to what’s going on and what they can do to fix it and that kind of stuff.

[0:19:33.4] SC: I’m sure they learned a ton.

[0:19:36.3] MN: I really hope there’s some kind of developer blog that comes out of Facebook to talk more in depth about it, but I’m sure this is stuff that they want to keep secret, especially if it was embarrassing, like Bobby rm -rf’d some system and then the whole thing just went to [inaudible 0:19:50.1].

[0:19:51.1] DA: Classic Bobby. We can’t fire that guy. He’s always doing that kind of stuff. It’s real. It’s really fine.

[0:19:58.0] SC: It’s fine.

[0:19:59.0] MN: Yeah. That guy is around so that he runs into this problem so that we could do something about it. That’s what it is.

[0:20:05.1] DA: Right. It’s a chaos monkey.

[0:20:07.0] MN: Yeah, exactly. Bobby drops tables. Bobby rm -rf. That’s just – this is what Bobby does best. Tearing it up.

[0:20:16.9] SC: Keep at it. Yeah.

[END OF EPISODE]

[0:20:18.7] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five-star review and help developers just like you find their way into The Rabbit Hole. Never miss an episode. Subscribe now however you listen to your favorite podcast.

On behalf of our producer extraordinaire, William Jeffries, and my amazing co-host, Dave Anderson, and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.

[END]

Links and Resources:

Sophie Creutz on LinkedIn

Sophie Creutz on Twitter

The Rabbit Hole on Twitter

Stride

Michael Nunez on LinkedIn

Michael Nunez on Twitter

David Anderson on LinkedIn

David Anderson on Twitter

William Jeffries on LinkedIn

William Jeffries on Twitter