Rings, bells and victory

A debugging story Ep02

This is the story of how I investigated a cryptic bug that caused a very mysterious session reset. It involves https, cookie privacy (httponly, secure), an understanding of sessions and of the infrastructure, teamwork, and the obsessive need to understand why, why and why.

In this episode I will require you to keep your eyes and sockets open. Follow me !

About a month ago a very weird bug happened on our website : when a connected user clicked on a given link, a new page opened in a new tab and when the current page was refreshed, the user was disconnected. Every time we would click on that link we were disconnected from the website. This page, let’s call it page X, displayed external content and was served under our website’s domain, it didn’t require the user to be connected. Here is the initial picture :

Cryptic session reset

Here’s how it was so mysterious :

#1 If we kept page X open, reconnected to the website, and refreshed page X, we were not disconnected.

#2 On some machines it worked perfectly. Sometimes on some browsers it worked and on others it didn’t. Sometimes it didn’t work for one browser whereas it worked for the same browser on a different machine.

#3 Of course we could not reproduce the bug on the testing platform. The testing platform was having a great time as always, all bright and shiny.

Chapter I : Investigation

Init

My first instinct was of course to check the sessions. It was not difficult to establish why we were disconnected: page X was resetting the session (it was sending a new session id), but only when we reached it by clicking the link. If the page was simply reloaded, the session was preserved, which explained weird fact #1. The remaining question was: why was page X resetting the session, and only on click? WHY? …

My second instinct was to search Google for “why would a page reset a session?” – no luck. I tried “reasons for session reset” – nothing. Finally, I surrendered and searched for “how does a session work”, resolved to really read through all of it. Some of it I knew already, but knowing it in more detail is always helpful. Here’s the gist:

How does a session work ?

So basically, to answer my first Google query “why would a page reset a session?”: the server that serves the page doesn’t receive the current session id, and therefore creates a new one. Or the session is reset from the code, which is always a possibility.
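To make that rule concrete, here is a minimal Python sketch of what a server does with the session cookie. It is only an illustration: the in-memory `SESSIONS` store and the `handle_request` function are hypothetical, and the cookie name `PHPSESSID` assumes PHP's default session name.

```python
import secrets

# Hypothetical in-memory session store (a real site would use a database or files).
SESSIONS = {}

def handle_request(cookies):
    """Return (session_id, is_new) for a request carrying the given cookies."""
    session_id = cookies.get("PHPSESSID")
    if session_id in SESSIONS:
        # The browser sent a known id: the existing session is kept.
        return session_id, False
    # No id (or an unknown one) was received: a brand new session is created.
    new_id = secrets.token_hex(16)
    SESSIONS[new_id] = {}
    return new_id, True

first_id, created = handle_request({})                         # first visit: new session
kept_id, created_again = handle_request({"PHPSESSID": first_id})  # cookie sent: same session
lost_id, was_reset = handle_request({})                        # cookie missing: session "reset"
```

From the user's point of view, the third call is the disconnection: nothing was deleted, the server simply never saw the id.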

Since my scope of action is the code base, I had to review the code before handing the problem to the system team who manages platforms (servers, framework etc.). It was rather complicated to review the code since page X was part of a totally different project managed by a different team (team B), but that we just happened to host under our domain. So I interrogated team B to know if they had any knowledge of some session related changes lately. But they all denied, no session reset on their side. I was suspicious so I still reviewed all their code. And I was forced to admit, indeed no session reset. The only other possibility was something going wrong with the server.

So you understand what’s coming next here’s an illustration of the context :

Infrastructure

In depth investigation

Since nothing was found in the code that could explain the session reset, we tried to find some pattern for the bug, to be able to reproduce it systematically. But we couldn’t find one, on some machines it worked well. I temporarily teamed with the person who built the website (but who has other responsibilities now), and he couldn’t find any explanation either for the session reset. We tried many things.

CDN problem

I suggested that it could be a CDN problem: a configuration change might have made some proxy on the network strip cookies. Page X might be cached on that CDN, and some CDNs remove cookies. So we tried on different networks, but the result was just as erratic: some worked, some didn’t. No luck.

Server problem

Then I teamed up with the system team. I formulated some hypotheses:

– Since we work with 4 servers on the front, maybe one of them doesn’t read and write session ids correctly, which would explain maybe why on some machines it doesn’t work since we are load balanced on the defective server.

– Maybe the sessions were not correctly replicated across the 4 servers. Since the sessions have to be the same, they are kept in a database that the servers query to get the session id. Maybe something is going wrong with that. It’s a more specific version of the first point.

They restarted all servers, no change. We checked the operations in the database, nothing to declare. Both hypotheses were false.

What more then? The bug was becoming more and more cryptic and unsolvable. Yet it was critical that we solve it, because page X is often accessed by customers and is therefore crucial to the business. We couldn’t afford to have sessions blown away every time this page was opened. And we couldn’t just say “the problem is your browser”, or if it was, we obviously had to fix it in some way. People were starting to put pressure on us, emails were sent with everyone in copy, clients were complaining more and more. But at this point we had no clue.

Cookie privacy problem

Finally, I remembered and mentioned a small change that was made right before I arrived. These lines had been added to the .htaccess:

php_value session.cookie_httponly 1
php_value session.cookie_secure 1

The first line tells the browser to hide the session cookie from client-side scripts, so that it is only exchanged through http requests, and the second one instructs the browser to send the cookie only if the protocol used is https. But I had already checked that: page X was in https and our website is in https too, so the cookies should be correctly transmitted. As for httpOnly, it only affects reading the cookie, not transmitting it. I didn’t quite see how those changes could produce the bug that we had.
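For reference, both settings end up as attributes on the `Set-Cookie` header the server sends. The Python sketch below builds such a header with the standard library; the cookie name `PHPSESSID` (PHP's default) and the value `abc123` are assumptions made for the example, not values taken from our site.

```python
from http.cookies import SimpleCookie

# Build a session cookie carrying the two attributes enabled in the .htaccess.
cookie = SimpleCookie()
cookie["PHPSESSID"] = "abc123"          # hypothetical session id
cookie["PHPSESSID"]["httponly"] = True  # not readable from client-side scripts
cookie["PHPSESSID"]["secure"] = True    # only sent back over https

header = cookie.output()
print(header)
```

The printed header contains both the `HttpOnly` and `Secure` flags. Neither flag changes the session id itself, they only change what the browser is allowed to do with the cookie afterwards.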

Solved

Since any lead was worth exploring now, we deactivated both attributes on the production servers live from the file system. And BAM, it just worked.

Following that, the system team closed the case, the bug was resolved. However, the bug was no less mysterious. From the system team’s standpoint though, it was done and done. Yes, but the website was no longer secure… Suspense !

— Intermission —

Chapter II : Obsession

httpOnly and secure suspected

As a code guardian I could not let that happen. And to be more honest, I wanted to understand what was going on, this bug was driving me crazy. So I took all the classified files with all the facts with me and continued my investigation.

Why would the httpOnly or the secure attribute cause this bug? What were their motives? I asked the system team to deactivate only one of them to check which one was the culprit. But the system team didn’t want to risk reproducing the bug in production, they had already moved on. I had to reproduce the bug on the testing platform to find out what was wrong on my own, as a singleton. As expected, the testing platform was nowhere near cooperative. It was always working, with or without httpOnly or secure.

New clue

That’s when I decided to review all the steps calmly and come back to the crime scene. I left the debug console open, and repeated every step. I wanted to catch page X in the act. As I was clicking on the suspicious link, I quickly switched to page X, and that’s when I saw it.

There was a 302 redirect of 559 ms happening before landing on page X !

302 redirect

You will ask, yes so? I looked closer, it was weird because it was a redirect on the same exact url : page X was redirecting to page X. And this was only in production. On the testing platform, there was no redirect. This was very suspicious. I knew I was on the right track.

Redirection, httpOnly, secure, what is the link between all this? The problem became :

Cryptic session reset

I suspected a session flag to prevent redirect looping but I couldn’t detect what actually caused the redirection. Since copying and pasting page X’s url didn’t trigger any redirect, I tried to replace the culprit link from the console so that it would not open in a new tab, maybe this was the cause of the redirect (for some reason).

That’s when I had an epiphany. How could I have missed this? In an instant everything became clear, the mystery was solved. All that we had done to solve this cryptic bug flashed before my eyes in the space of a few milliseconds.

— Aha moment —

All this time, the solution was right there smiling at us. And now I was the one who found it.

Rings, bells and victory !

Salvation

I warned you when we started, now open your sockets and bear with me.

All this time I had focused on page X when the real culprit was its goddam link. Indeed, I thought the links were all relative like all the other links of the website, it was a fair assumption. Yes, except that links to page X are special, remember it’s an external page hosted under our domain. So for these links, the href is constructed by a webservice (a completely different java application – a discovery for me) that sends it back to me and I just print it.

That’s right, page X’s url was absolute, and when I was about to change it, I realized that it was in http, not in httpS like the rest of the website. I never suspected that could be possible. However, when I checked page X’s url in the browser, it was in httpS, not http… So it must have been… redirected! Do you see it too now?

I quickly checked the redirected url, it was indeed from page X to page X but from http to httpS. That nasty little S ! And when I checked the cookies, the http request sent no cookie to the server because the cookies were set to secure, that’s why the session was reset right before redirecting to https.
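Here is a small Python sketch of the rule a browser applies when deciding which cookies to attach to a request. It is a simplified model under stated assumptions: `cookies_to_send`, the `jar` dictionary and the `www.example.com` urls are made up for the illustration, and the domain and path matching that real browsers also perform is ignored.

```python
from urllib.parse import urlsplit

def cookies_to_send(url, jar):
    """Return the cookies a standards-following browser would attach to a request for url."""
    scheme = urlsplit(url).scheme
    return {
        name: cookie["value"]
        for name, cookie in jar.items()
        # A cookie flagged secure is only ever sent over https.
        if scheme == "https" or not cookie["secure"]
    }

# Hypothetical cookie jar holding a secure session cookie.
jar = {"PHPSESSID": {"value": "abc123", "secure": True}}

print(cookies_to_send("http://www.example.com/page-x", jar))   # {}
print(cookies_to_send("https://www.example.com/page-x", jar))  # {'PHPSESSID': 'abc123'}
```

The first call is the click on the http link: the request leaves without the session cookie, so the server has nothing to recognise the user by.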

Explanation

The server received an http request that had no cookie in it, so it could not read any session id, which forced it to create a new one. After resetting the session, it executed the page, which redirected to https. Indeed, the https redirect was done in the code in this manner (another discovery I made during the investigation):

// Redirect to https from inside the application code, once the page is already executing
if ( !$this->fw->isSSL() ) {
    header( "Location: https://".$_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI'] );
    exit(0);
}

Which is a wtf practice, because it’s like letting someone arrive at the destination before telling him he got it all wrong. It would have been better to show him the right direction before he departed. For redirection it’s the same: it’s really better to do it in the server configuration, so that the right url is served even before any code is executed. Especially for a redirection that needs to be systematic from http to https, it has to be in the configuration.

Since we don’t have access to the server configuration (it is managed by yet another team, the hosting team), it was done like that at the time to go faster. But now I asked the hosting team to do the https redirection directly in the Apache configuration, to avoid any more surprises.
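To see why the place of the redirect matters, here is a toy Python model of the two orderings. It is a sketch, not our actual stack: the `respond` function, its `redirect_in_server_config` switch and the `www.example.com` url are hypothetical, and it assumes, as in the story, that the application handles the session before it reaches the https check.

```python
import secrets
from urllib.parse import urlsplit

def respond(url, cookies, redirect_in_server_config):
    """Return (status, headers) for a toy server handling one request."""
    parts = urlsplit(url)
    https_url = parts._replace(scheme="https").geturl()
    headers = {}
    if parts.scheme != "https" and redirect_in_server_config:
        # The web server redirects before any application code runs: no session is touched.
        headers["Location"] = https_url
        return 302, headers
    # Application code starts here, and session handling comes first.
    if "PHPSESSID" not in cookies:
        headers["Set-Cookie"] = "PHPSESSID=" + secrets.token_hex(16) + "; HttpOnly; Secure"
    if parts.scheme != "https":
        # In-code redirect, like the isSSL() check: the session cookie was already replaced.
        headers["Location"] = https_url
        return 302, headers
    return 200, headers

# The secure cookie is not sent over http, so the server sees no cookies at all.
print(respond("http://www.example.com/page-x", {}, redirect_in_server_config=False))
print(respond("http://www.example.com/page-x", {}, redirect_in_server_config=True))
```

In the first case the 302 response also carries a fresh session cookie, which is the reset we observed. In the second case the response is a bare redirect, and the follow-up https request arrives with the original cookie.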

Of course it was working perfectly on the testing platform because the webservice was sending only production links. And when I formed the links for the testing platform on my own I wrote them in https, I never thought they were in fact in http.

So if I sum up, here’s the final picture after the great illumination :

That nasty little S

But what about the weird behaviour that made it work on some machines then ? Well, when I tested with the system team, we noticed in the console that httpOnly was understood only by some browsers, so I think it was the same with the secure attribute. It’s not well documented, but it makes sense. The browsers that didn’t understand the secure attribute transmitted the cookie, so everything was working well. This explains why the machines on which it worked were not so many, such browsers are flawed so there are not a lot of them.

Chapter III : Lessons

Understanding sessions

My lesson here is that the main reason for a page to reset a session is that the server did not receive the session id, and you then need to check where it got lost. Following this investigation I filed a report, additional to this one, that gives all the technical details of the case. You can access it here: How does a web session work ?

Tech investigator

This episode made me think of a new possible role in addition to the software therapist. I thought I could be promoted to “technical investigator” (Hello, I’m the tech investigator – I like the idea but I need to find the right hat first). After a few more cases to practice on, I’ll be ready, it’s the 3rd unsolvable critical case I solved since I arrived. But being a tech investigator requires a vast knowledge about the whole infrastructure. Starting from just the code base, the areas that were involved expanded very far, look :

Complex infrastructure

I must say, I discovered a certain taste for complex infrastructures. I’d like to know and see more.

Epilogue

Since then, I asked the webservice team to change their url from http to https, so it’s cleaner. Now the https redirect is automatic so I can delete the ugly code redirect. I reactivated the security on the cookies : everybody calms down, the website is secure again ! We do banking operations, it’s a bit important.

— Intermission —

Chapter IV : The silent joy of the engineer

I used to hate this part of engineering, “debugging”, so boring, uninteresting and a waste of time... but my view has now changed. Debugging is like flying above and inside your whole application: it makes you understand better how it works, so you can improve it. On the side, you also do a lot of research to understand in more depth the concepts you need, so you learn a lot. It also sharpens your skills, so you narrow down more quickly where the problem is.

And above all, at the end you have the immense and absolutely huge satisfaction of understanding what is going on, it wipes out all the crap you had to go through to understand what was wrong. Everything becomes clear and rational – a bit like in the scooby doo cartoon (oops, I didn’t mean to write that publicly) – there is no such thing as esoteric computing, I love it.

No need of applause, or praise, just the inner satisfaction of being able to explain how and why it works and make it finally work, it is enough, it is the silent joy of the engineer.
