Canada Was Offline Today

I went to work at the office before I realized that the network was down; we couldn't run our normal operations in production at first, and all our business cell phones were cut off too. But I had work to do and international video meetings to join, so I came home, used my Bell internet, VPN'd in through our US network, and all was good. Of course, I thought it was a cyber attack on Rogers 😀
 
No service for a day still means 99.7% uptime. Not that long ago we'd have dreamed of uptime like that. The fact that 99.7% is now so bad it flattens society for a day should give you pause.
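For the pedants, the arithmetic, as a quick back-of-envelope sketch in Python (assuming one full day out in a 365-day year):

# One full day of downtime in a year
downtime_hours = 24
hours_per_year = 365 * 24
uptime = 1 - downtime_hours / hours_per_year
print(f"{uptime:.2%}")  # 99.73%, i.e. the 99.7% above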

Tom

We went from using it for entertainment, to transmitting occasional business information, to depending on it for day-to-day business functioning. So what was once an annoyance is now crippling.
With so much in 'the cloud' now, a lot of businesses don't have local servers to run their software, so without the internet they are dead in the water. Our work software just stops without the internet; I couldn't tell you whether we stock an item, its price, or where in the warehouse it sits, even if you knew what you wanted. Anyone remember shipping UPS packages with a handwritten ledger?
There was some movie or book whose plot was to knock out the internet. At first it seemed silly, but after thinking about how many things wouldn't happen without the internet, it was scary.
 
Depends where you want the redundancy; at the telco level** it can be an expensive pain*. From what little has leaked out, it was a BGP snafu that took out the whole Rogers IP range. As everything is IP now, basically everything in their network stopped being able to talk (there's a toy sketch of the failure below, after the footnotes). Cloudflare went down a couple of weeks ago (and took this website off the air for a while) with a BGP issue. The BIG difference is that they fixed it quickly, and it's the ability to resolve things quickly that sorts the men from the boys.

*I'm sure everyone on here who's worked in or around datacentres has had to deal with $$$ redundant links where it turns out both pairs of fibre leave the building in the same conduit, so one badly placed backhoe knocks both out at once.

**Disclaimer: I currently work with stuff that IS partially redundant at the cellular level, so it can handle a full radio network failure. It sometimes works, but it still ends up backhauling out of the primary telco, so it would not survive this sort of outage.
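To make the snafu concrete, here's a toy sketch in Python (purely illustrative, not router config; the prefixes are made-up documentation ranges). Per Rogers' own explanation quoted downthread, a routing filter got deleted; a filter is basically a predicate over advertised prefixes, and with no predicate everything gets installed.

from ipaddress import ip_network

# Toy model: a routing filter decides which advertised prefixes
# are allowed into the routing table.
ALLOWED = [ip_network("203.0.113.0/24"), ip_network("198.51.100.0/24")]

def route_filter(prefix: str) -> bool:
    net = ip_network(prefix)
    return any(net.subnet_of(allowed) for allowed in ALLOWED)

advertised = ["203.0.113.0/25", "8.8.8.0/24", "198.51.100.128/25"]

# Normal operation: only permitted routes get installed.
table = [p for p in advertised if route_filter(p)]  # two routes

# The snafu: filter deleted, so every route gets installed and the
# table balloons toward the full Internet table until the gear chokes.
table_after_snafu = list(advertised)  # everything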
Apparently, Interac has learned something from this. They should have written "Strengthen our non-existent network redundancy"... Also, "adding a supplier" sounds a lot like Bell to me, as there aren't many other providers large enough.
 
Re: Reasons for Rogers outage:

"If anything can fail it will, and at the worst possible time" -- Peter Principle

Tsujigiri
"To test a new sword on a unsuspecting passer-by" This was an actual practice in medieval Japan. Now it can be used for black humour when a company releases a new product or revision (to get it out the door quickly) after shallow testing.

So Rogers. how are those blades working now?
 
We once bought fiber links to two sites 50 or so miles apart that were supposed to take different paths for redundancy. Come to find out, they wound up in the same manhole on the east side of a large city; we discovered that when an outage took out both at once.

One of the bigger failure modes with BGP configuration problems is that you may not be able to reach the big router in a datacenter once you've pushed a bad config. Having serial links and dial-up to log into that switch is sort of critical. It's often not done because it's costly. Sort of the "work" part of networking.
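For anyone who hasn't had the pleasure: the point of the serial link is that it doesn't care what you just did to BGP. A minimal sketch using the third-party pyserial library (the device path is an assumption, and 9600 8N1 is just the common default on network gear):

import serial  # third-party: pip install pyserial

# Out-of-band console: plain RS-232, still usable when IP routing is toast.
# /dev/ttyUSB0 and 9600 8N1 are assumptions; check your gear's defaults.
console = serial.Serial("/dev/ttyUSB0", baudrate=9600, bytesize=8,
                        parity="N", stopbits=1, timeout=2)
console.write(b"\r\n")                             # nudge the console
print(console.read(256).decode(errors="replace"))  # hopefully a login prompt
console.close()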
 
Yep. The internet is like a public utility. A power outage is only fun for the first few hours – especially if it's winter and a bit nippy out.

Tom
Haven't some states in the mob underneath you passed, or tried to pass, legislation making the internet legally a utility? I'm sure I read something about it a few years back.
Having serial links and dial-up to log into that switch is sort of critical. It's often not done because it's costly.
I find modern IP-based ILO very scary as a concept, but I figured I was a dinosaur for still finding comfort in a 9600-baud serial connection into the back for when all else fails 😀
 
I have used a lot of IP-based ILO to restore downed servers. I have reset Dell servers that wanted you to hit F2 on boot for some stupid reason. I've even driven 25 miles to the datacenter because some device was stuck on boot for no good reason and had no ILO.
You go and disable the "stop on keyboard error" BIOS setting on all of the servers, then find it reset when you update the firmware.

I have configured dial-up access to switches and routers. It can be a lifesaver when you are miles away.
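To illustrate the remote-reset part (a sketch, not anyone's actual setup): most of these management controllers also speak IPMI, so the usual last resort is a power cycle over the management LAN with ipmitool. The hostname and credentials below are placeholders, and it assumes ipmitool is installed and IPMI-over-LAN is enabled on the BMC.

import subprocess

# Power-cycle a wedged server via its BMC (IPMI v2.0 over LAN).
# Hostname, user, and password are placeholders.
subprocess.run(
    ["ipmitool", "-I", "lanplus", "-H", "bmc.example.com",
     "-U", "admin", "-P", "changeme", "chassis", "power", "cycle"],
    check=True,
)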
 
Re: Reasons for Rogers outage:

"If anything can fail it will, and at the worst possible time" -- Peter Principle
Isn't that Murphy's Law?
The Peter Principle is something else: the tendency in most organizational hierarchies, such as that of a corporation, is for every employee to rise in the hierarchy through promotion until they reach a level of respective incompetence.
 
Looks like Rogers has fessed up to the problem. From https://www.theregister.com/2022/07/25/canadian_isp_rogers_outage/

The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. As a result, the Rogers network lost connectivity to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.

Yeah, that'll do it. At least they've had the decency to sack the CTO (probably with a diamond-encrusted golden parachute, though).

Edit: I actually know the outgoing CTO from a previous gig, though back then he wasn't the CTO; he just appeared to be having an affair with the CTO.
 
It's not just Rogers to blame.

Why did the captains of industry not wake up to the vulnerability of THEIR decisions before this? Should they not have thought of that and lined up redundant suppliers even BEFORE the outage happened? Seems like a lot more CTOs need to be sacked, not just Rogers'.
Think about it: if you are running a business that depends on connectivity, would you rely solely on one supplier?

For example, I need the internet every day. So when I change my ISP, I at least allow a sizeable overlap between the two subscriptions to make sure everything is OK before I end the old one. If I absolutely needed the internet each day, I would pay for two subscriptions: an extra $100 a month would be nothing if saving it meant thousands of dollars of business going down the drain for a day.

Then there is the vulnerability of connected systems IF subsets cannot operate without full connectivity. What were they thinking?

I was not affected, because three months ago I left Rogers after being with them for over six years running. They treated a loyal customer like me as garbage. AFTER I left, the calls asking me to come back started. I had given them ample opportunities to retain me as a customer. Thank goodness their imbecilic marketing policies made me wake up.
 
The Rogers fess-up has too much blah-blah for the general public. Short story: their team did not have the discipline to carefully plan and check the changes made to critical system components.
 
The Rogers fess-up has too much blah-blah for the general public. Short story: their team did not have the discipline to carefully plan and check the changes made to critical system components.
Hey, even Twitter needed an edit button for a reason. Apple will introduce the ability to undo message errors. Accuracy, planning, and perfect execution no longer matter. We even want self-driving cars and collision avoidance systems. See a pattern?
 
The Rogers fess-up has too much blah-blah for the general public. Short story: their team did not have the discipline to carefully plan and check the changes made to critical system components.
And not enough for anyone even vaguely involved with this sort of stuff. ISPs go down regularly; things are too complex now to fully test. The measure of a good company is how quickly they can react and fix. For example, Cloudflare went down in June. The North American continent was asleep, but I lost my morning binge of DIYaudio. They fixed it in 90 minutes (and admitted that was too slow and that they need to work on coordinating rollbacks). In the Cloudflare case, the upgrades worked in their older, smaller Type A datacentres and borked their newer, larger Type B ones.

Planning for recovery from these sorts of outages is hard and time-consuming, which is why it gets skipped in organisations that, for example, have little competition, or a C-suite whose bonuses are based on getting rid of pesky overheads like expensive techies with 30 years of experience 😀
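The classic defence against the "pushed a bad config, now I can't reach the box" trap is a commit-confirmed pattern, as on gear that supports something like Junos' commit confirmed: apply the change, then roll it back automatically unless an operator confirms within a window. A rough sketch of the idea in Python, where apply_config and rollback are hypothetical callables standing in for whatever your gear actually does:

import threading

def commit_confirmed(apply_config, rollback, timeout_s=600):
    # Apply a config change, but undo it automatically unless a human
    # confirms within timeout_s. Hypothetical names, illustrative only.
    apply_config()
    timer = threading.Timer(timeout_s, rollback)
    timer.start()

    def confirm():
        timer.cancel()  # operator still has connectivity: keep the change
    return confirm

If the change locks you out, you never call confirm() and the old config comes back on its own, which is precisely the moment you need it.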