Today we had a big production release. And guess what, everyone is on holiday! Happy summer, everyone!!!
The release was at 2 pm, and everything on my side had already been tested and validated all the way up to our staging environment, the last step before production, where everything is supposed to be strictly identical to production (according to the great theory of perfect software development). But around 2:30 pm, I got a phone call, an email, an instant message, AND a visit in person, simultaneously, all to tell me the same thing: “THE SERVER IS BLEEDING!!! THE WEBSITE IS DYING, DON’T YOU SEE!!! LET’S ALL PANIC TOGETHER, JOIN US – NOW.”
Crisis meeting
Basically, the website was very, very slow and sometimes blank. So I did indeed join both the chatroom and the conference call. With one hand I held the phone, and with the other I tried to find out what had gone wrong in the code, savagely emptying the cache and sending emails crying for help, all while trying to listen to the managers saying, “So what exactly is your plan of action?!!!! We have to deal with tons of customer phone calls here, I don’t hear a lot of solutions from you!!!” “You” being us, the technical team: me (alone for the web backend and frontend), the system guys (admin/servers/database), the Java backend team, and probably also the network guys, but I didn’t hear much from them. Our collective answer was “(….)” followed by the sound of clicks and keyboards typing. Why talk on the phone when you have an urgent problem to solve????
15 minutes later, the Java backend team spotted something that could be fixed on their end. 20 minutes later, the database guys found that we had reached the maximum number of connections we could handle. 30 minutes later, we got the explanation from the managers: “Huh, by the way, we made a huge announcement yesterday telling everyone that it would be online at 2 pm exactly, did you know? Is it related?” Damn. Why do they always do that?! No, we had no idea, because you didn’t communicate it to >>>>> US!!! Manager: an entity living in a fairy-tale environment where things become reality exactly as imagined, sometimes even with butterflies.
[Update: later we were informed that it was the highest number of connections the company had recorded in its history]
Who announces the exact release “hour” to customers??? There is always a period of adaptation and bug fixing before you communicate, no? Where I work, the teams are all separate anyway, so it’s rather difficult to keep information flowing. I kept interacting with people I don’t know; I have never even met those managers, big-company style. But that’s not the end of the story. What was my part in all this? Was I just blogging about it in the middle of the crisis?
Debugging the Himalayas
Noooo! I was completely stressed out. To give you some context, I am new on the job: I’ve only been there a little more than a month. I know the basics, but when it comes to debugging why something works one time out of two, I go blank (like the website itself). It could be the network, the server... Since the code works on the other platforms, the code base should be OK. So I focused instead on a more concrete issue: the new content was not displaying.
I first thought it was a cache problem, so I emptied all of it. That still didn’t fix the problem. So I wanted to look directly on the production server, to check whether the files were there or not. But getting onto the production server is like hiking the Himalayas barefoot in winter. Jeez, first you need to connect to the stage machine with a physical token, pass 3 connection portals, install remote desktop software, and then use a terminal on the remote machine, which runs Windows, to connect to the production server via ssh. And to do this, you need to follow a document that is stored on SharePoint, in a folder called “old docs”, and remember that the document is not up to date, but recall from your biological memory that the password is not wel0{3k% but instead bol5k%}!!! And then you can start searching for the files on the production server, provided of course that you know the filesystem, because it’s not the same as on your dev environment, and since it hosts all the websites, you need to know where yours is. Well, anyway, I got blocked at the first connection portal. Of course, it was not loading.
While waiting for the first portal to load, I decided to debug from the front end; maybe I could find something there that would tell me whether the right file was being served. Fortunately, there was some debugging information there. Before I reveal the big mystery, let me unroll the history of the project for you.
The mystery revealed
We have spent nearly 3 weeks preparing this big release. We tested it on our 3 testing environments (local, sandbox, and stage), and it was valid. Of course, we made the crazy assumption that the production environment would be the same as the testing platforms. Elementary mistake, my friend. The mystery bug came from a reference that we used on all platforms to identify the result of a webservice from the Java backend. We had all agreed from the start that this reference would be the same everywhere, but on the production platform it was different. Ughhh.
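For the curious, here is a minimal sketch of the kind of check that could have caught this before release day. It assumes the reference can be read as a plain setting for each environment, which was not exactly our setup, and every name and value in it (`webservice_ref`, `cache_ttl`, `catalog-v2`, and so on) is made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-environment settings. The keys and values are invented
# for this example and do not come from the real project.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "local":      {"webservice_ref": "catalog-v2", "cache_ttl": "300"},
    "sandbox":    {"webservice_ref": "catalog-v2", "cache_ttl": "300"},
    "stage":      {"webservice_ref": "catalog-v2", "cache_ttl": "300"},
    "production": {"webservice_ref": "catalog_v2", "cache_ttl": "300"},
}

Mismatch = Tuple[str, str, Optional[str], Optional[str]]


def find_mismatches(
    environments: Dict[str, Dict[str, str]],
    baseline: str = "production",
) -> List[Mismatch]:
    """Return (key, environment, value, baseline_value) for every setting
    whose value differs from the baseline environment."""
    reference = environments[baseline]
    mismatches: List[Mismatch] = []
    for name, settings in environments.items():
        if name == baseline:
            continue
        # Look at keys from both sides so a missing key also counts.
        for key in sorted(set(settings) | set(reference)):
            value = settings.get(key)
            expected = reference.get(key)
            if value != expected:
                mismatches.append((key, name, value, expected))
    return mismatches


if __name__ == "__main__":
    for key, env, value, expected in find_mismatches(ENVIRONMENTS):
        print(f"{key}: {env} has {value!r}, production has {expected!r}")
```

The point is not the code itself but the habit: compare what each environment actually has against production, instead of trusting that everyone agreed it would be the same.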
When I found the anomaly, I did a little victory dance in my head and informed everyone that I, too, had found something to fix. Once I had finally calmed down, I couldn’t help laughing at the situation, thinking how silly this bug was. We spent so much time testing and validating things that, in the end, failed miserably because, well, we didn’t test them on the right platform. It’s like testing that a car runs well on the street, then sending it into the ocean hoping it will behave the same, and doing so super confidently because you have tested it. It’s not the first time I have encountered this problem. I have never found a place where production === stage; it only exists in software development paradise.
Anyway, I spent the day on the grill. I was left completely alone for the web part: the only person who could have helped me was always in meetings, and everyone else was on holiday. Not to mention that I was fixing someone else’s part, so I had to quickly understand what he was trying to do before debugging it. And I have very little experience debugging in times of extreme tension on a platform that I barely know. I can’t even begin to imagine what it would be like on a spaceship beeping all over the place. Well, it was stressful, but it ended with the sweet realization that, daaah, I can do it!
Also, it’s not finished. I am now investigating a really tough bug where the server sometimes, not always, pauses for a minute or more, and we don’t know what it’s doing (enjoying life, calling friends maybe, I don’t know), and this happens only in staging. I will unveil more details as we get more logs. Stay tuned.