The NoCloud Organization — Part 2
In Part 1 of the NoCloud Organization series I discussed how complex today’s IT systems can be, and I described an organization that believed that if it purchased everything from a single vendor, it would be easier to plug “monkeys” into the system and have them pull the levers of the IT machinery. In this post I’ll discuss the many things I think this organization approached incorrectly, and in Part 3 (and possibly 4) I’ll examine these points in more detail, in the context of what the organization SHOULD have been doing to become more efficient and effective.
Our NoCloud organization recently went through some major changes. The parent company had just purchased another company in the US that had been losing money. The North American headquarters were in New Jersey, but the datacenter there needed cooling upgrades, so a decision was made to put IT under the control of the newly purchased company and to move the entire datacenter down to Georgia, where more space was available. Over the following weeks and months we learned a lot about the new organization; a few things we saw in meetings raised our collective eyebrows, and over time many of our fears would be validated.
The new organization, it turned out, was not terribly evolved. They followed the traditional view of IT as a cost center, and as far as technology went, they were well behind us in every area: clustered SQL, Active Directory, and VMware, where they were running ESX 3.0 on captive disk (DAS). When I got around to finding the right people to talk to, I learned that they were running captive disk because they had never been able to get shared storage to work reliably. They had several production web servers and other systems in this environment, which took them about 3 days to restore when the third disk in a RAID-6 array failed because no one was monitoring it (speaking of monitoring, several production LUNs were lost in a separate incident because all of the parity and hot-spare drives had failed over 18 months and no one had noticed). The VMs were being backed up with traditional agents to tape. As of August 2011 this environment was still running the same workloads on ESX 3.0 (unsupported), and possibly still is.
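To spell out why that third failure was fatal: RAID-6 keeps two independent parity blocks per stripe, so an array stays readable through any two concurrent drive failures, and data is lost on the third. A minimal sketch of that failure-tolerance rule (the 8-drive array size is a hypothetical example, not from the story):

```python
# RAID-6 stores two parity blocks per stripe, so an array remains
# readable with up to 2 failed (unreplaced) drives. A 3rd failure
# before rebuild means data loss and a restore from backup -- which
# in this story took about 3 days.

def raid6_status(failed_drives: int) -> str:
    """Classify the health of a RAID-6 array by count of unreplaced failures."""
    if failed_drives == 0:
        return "healthy"
    if failed_drives <= 2:
        # Degraded but still serving I/O: exactly the window in which an
        # unmonitored array silently drifts toward disaster.
        return "degraded"
    return "failed"  # data loss; nothing left but the tape backups

# Failures accumulating unnoticed on a hypothetical 8-drive array:
for failures in range(4):
    print(failures, raid6_status(failures))
```

The point of the sketch is that the “degraded” state is invisible without monitoring, which is how both the 3-day restore and the 18-month parity-drive decay happened.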
Our virtualization infrastructure was significantly more evolved, but the new organization didn’t have a high opinion of the concept, based no doubt on their own results. I developed a plan to migrate about 350 physical and virtual servers from New Jersey to Georgia, using virtualization and then replication for about 90% of them. It got a lot of pushback initially, but the concept was eventually approved once it could be shown to carry less risk than any of the other methods, not to mention that it met the aggressive timetable (to which a promotion to VP would be linked).
Within about a month we designed and ordered new hardware to support the new vSphere environment. I worked 80-100 hour weeks over the entire summer interviewing application owners, designing the project plan and the migration schedule, and doing most of the heavy lifting. Many believed the technology (VMware) would not be able to handle such an ambitious project, and many believed our process would fail or that we would be crushed by logistics. In the end we exceeded everyone’s expectations. Most application owners said their workloads performed better, and we even virtualized applications that spanned over a dozen servers, including Novell NetWare and Windows NT machines working with HP-UX systems.
Because the timetable was so aggressive, there was no opportunity to develop best practices and operational procedures, so we concentrated on setting up a sustainable environment (including core monitoring and backups), and we were told, promised even, that we would be given the opportunity to circle back later and build an operational framework. Well, after 6 months of working nearly every weekend (with no comp days), guess what happened!
For getting the datacenter move done in a ridiculous time frame, the new IT director got his promotion to Vice President, which was both celebrated and demonstrated by a shiny new Hummer appearing in the newly designated executive parking space. Of course no thank-yous could be afforded to anyone else (certainly not those who sacrificed every single weekend over the summer), lest they think they played a significant role and try to work outside management’s intended boundaries. One manager began collecting information to secure recognition and monetary rewards for those involved in the huge undertaking, but that effort was quickly shut down by the managers above him.
As for operations in the new datacenter, suddenly that wasn’t important. I was assigned a new role in which I would work on projects assigned by the CIO, with many technical parameters (including “don’t use the risky virtualization stuff”) mandated from the start. So what happened to the virtual infrastructure that now housed nearly 400 servers, about 70% of them production? The new team had no experience with virtualization, and I would find out that their provisioning process was to install the OS manually from CD every time. They were instructed that servers had to be built this way because SOX compliance audits required it. I called several managers and explained that this was an incorrect interpretation and that they needed to challenge their auditors to let us do what our competitors were doing. Nothing changed. So when I needed VMs (for test/dev work) it would usually take more than a week, and at times up to 3 weeks, before a VM was turned over to me. If I tried to fix anything I would get yelled at (literally) for daring to touch the VMware environment: “that belongs to operations and you are not operations!” Never mind that systems were broken and they had asked for my help. Some of this inspired a previous post entitled “Let Your Fast Zebras Run Free”. Others had similar experiences in other disciplines, as the best employees were, ironically, the ones new management most wanted “corralled” within their respective pens.
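The audit argument cuts the other way, in fact: an automated, template-based build leaves a far better compliance trail than a manual CD install, because every deployment produces a consistent, verifiable record. A minimal sketch of the idea, under stated assumptions (the function, field names, and template name are all hypothetical, not any real vSphere API; the deploy step itself is a placeholder):

```python
import hashlib
import json
from datetime import datetime, timezone

def provision_from_template(vm_name: str, template: str, requested_by: str) -> dict:
    """Sketch: deploy a VM from a golden template and emit an audit record.

    In practice the deploy step would call the virtualization platform's
    clone API (omitted here). The point for auditors is the consistent,
    tamper-evident record of who built what, from which template, and when --
    something a manual CD install never produces.
    """
    record = {
        "vm": vm_name,
        "template": template,
        "requested_by": requested_by,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Hash the record so any later edit to the audit log is detectable.
    record["checksum"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record

rec = provision_from_template("test-web01", "win2003-golden-v7", "jdoe")
print(rec["vm"], rec["checksum"][:8])
```

Identical builds from a versioned template are also exactly what makes week-long manual turnarounds unnecessary: the slow process was a policy choice, not a compliance requirement.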
Snapshots were used as backups and left open for months, and then hours were lost troubleshooting the performance and backup issues that spilled over into production. Resource limits were set on VMs for no apparent reason, effectively limiting systems with as much as 32GB of RAM to 2GB. When I delivered the environment, every single VM was backed up by default using the vStorage API, but now VMs were no longer automatically (or, in many cases, efficiently) backed up. The organization was told to disable DRS because the movement of VMs across different hosts made it confusing to match asset tags to servers; a manual vMotion now required a help desk ticket.
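Both of the first two problems (stale snapshots and memory limits set far below configured RAM) are trivially detectable with a periodic audit over an inventory export. A minimal sketch, assuming a hypothetical nightly export from vCenter; the record fields, VM names, and threshold are all illustrative assumptions, not a real vSphere API:

```python
from datetime import datetime, timedelta

# Hypothetical inventory records, e.g. exported nightly from vCenter.
vms = [
    {"name": "erp-db01", "mem_gb": 32, "mem_limit_gb": 2,
     "oldest_snapshot": datetime(2011, 3, 1)},
    {"name": "web01", "mem_gb": 8, "mem_limit_gb": None,
     "oldest_snapshot": None},
]

def audit(vms, now, max_snapshot_age=timedelta(days=3)):
    """Flag VMs with stale snapshots or memory limits below configured RAM."""
    findings = []
    for vm in vms:
        snap = vm["oldest_snapshot"]
        if snap is not None and now - snap > max_snapshot_age:
            # Snapshot delta files grow with every write; months-old
            # snapshots are exactly what caused the performance spillover.
            findings.append((vm["name"], "snapshot open too long"))
        limit = vm["mem_limit_gb"]
        if limit is not None and limit < vm["mem_gb"]:
            # A limit below configured RAM forces constant ballooning/swap.
            findings.append((vm["name"], "memory limit below configured RAM"))
    return findings

for name, issue in audit(vms, datetime(2011, 8, 1)):
    print(name, "-", issue)
```

A check like this run on a schedule would have surfaced both misconfigurations in minutes rather than after hours of production troubleshooting.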
Monitoring? I had led an initiative to build a service map covering key elements of our ERP system (Oracle) and critical infrastructure (SAN, AD, SQL, vSphere, backups, and more) and to expose it to operations. We were instructed to abandon our investment in these solutions, as they would be replaced with “whatever the new IT org uses”, which we eventually learned was essentially nothing. Well, they did have SCOM, but we ended up learning it and then teaching them how to use it, reaching a fraction of the competency we had previously.
What about the strategic direction of the IT organization? After asking for nearly every week to attend the strategy meetings (to which engineers were never invited), I eventually got the call-in information, and we didn’t know whether to laugh or cry. It became a great source of comedy in our office, a counter to the low morale. Management didn’t want to be questioned by engineers, and soon the meetings were “cancelled” but continued privately. Management couldn’t avoid being questioned entirely, however, as our parent company in Japan quietly sent over a team of more than a dozen people to try to understand why, ever since the takeover, IT projects were taking so much longer and costing so much more. I wonder why!
What about agility? You might have gotten some hints already, but to get anything done (poke a hole in a firewall, for example) you had to reach out to all the required teams, and after the obligatory “I need to set up a meeting with my manager” and the calendar logistics, we might make progress after several weeks and several meetings (“what was this for again?”), at which point we could finally collect the approvals and submit to management. Of course we would then hope that senior management actually approved the ticket in time, or else we’d have to get everyone together to pick a new change window. Several of us became convinced that managers created processes just to slow things down to a pace they felt comfortable with.
Morale was terrible among the employees. Training was universally rejected. I obtained a voucher to attend VMworld and offered to cover my own transportation; I just needed vacation time approved for that week (a month in advance). The request was denied with no explanation given. When my daughter was recovering from surgery in intensive care, I checked my mail to make sure everything was going smoothly and responded to a colleague’s question about logistics for an upcoming vendor meeting. This earned me a phone call from my manager, who screamed at me for the next 20 minutes for “working” during paid time off and asked for the hospital room information so he could come by and confiscate my phone and laptop. The nurse (and other visitors) heard the screaming on the other end and expressed concern about the negative energy affecting the patient, my daughter. I would later learn that others had to go so far as to seek professional help for how they were treated at the workplace.
In many different examples, it became clear to several of us that the new “regime” operated under Theory X: yell at employees (including managers) when the rainbows and unicorns are not aligned as you have envisioned. We saw fear and intimidation used to “motivate” employees, along with efforts to inhibit communication with management and keep employees in the dark about future plans and direction. Empowering employees with information could present a threat, so just treat them like pawns who won’t deviate from the boundaries designed by management, and keep pulling the assembly-line levers.
I don’t want to offer too much commentary here; for now, let the scenario speak for itself. In a couple of weeks I’ll post Part 3, in which I’ll attempt to look back at this post from a different perspective, examining some principles that should be inherent in a well-functioning Cloud Organization and how this org specifically failed to meet most (all?) of them.