Netflix VP of IT on the Future of Infrastructure


We recently conducted an interview with Mike D. Kail (@mdkail). Mike currently serves as the VP of IT Operations at Netflix, where he’s leading the organization from an on-premises IT stack to one that sits primarily in the cloud, in addition to a number of other meaningful changes around networking, security and data.

To start, could you tell us a bit about your background? How’d you come to Netflix?

I have 23+ years of experience as a Unix System/Network Architect, during which I spent time in and around DevOps, Big Data, etc. before those became the buzzwords that they are today. Most recently, in my previous role at Attensity, I ran a large Hadoop and HBase cluster and managed other big data components. I joined Netflix in the summer of 2011 and currently serve as VP of IT Operations. For a bit more detail on how I joined Netflix, I captured some of the story in my blog.

Can you walk me through broadly what Netflix’s infrastructure looked like before you joined and where you see it headed over the next few years?

Before I arrived, the internal IT infrastructure was traditional: on-premises, in two data centers. Our HR system was homegrown Apache + PHP, the financials were in Oracle, and there was lots of big iron and incumbent storage. Our 10x goal for 2014 is 100% Public Cloud and/or SaaS. For even more detail, I’ve made our roadmap public here: Netflix IT 2014 Roadmap.

I’ve heard about Netflix’s 100% mobile culture. Can you expand on that? What are the kinds of folks you like to bring onto your team? What are you looking for?

The best way to describe our culture is “everything is mobile.”  Why should one only be able to work from within the Corporate perimeter? We’re working on removing the perimeter and moving the Campus Network to a “Zero Trust Network Architecture.”

The goal of this transformation is to enable the same user experience, whether someone is at Netflix HQ, a coffee shop, or at home. Even more than that, quite a few people at Netflix (myself included) don’t have an office. All I need is my iPhone 5S, Nexus 5, MacBook Air, and Chromebook, and I can work from wherever I need to be.

What does the move to all these cloud services entail from an infrastructure and security perspective?

Public Cloud (IaaS) and/or SaaS becomes your infrastructure, forcing you to think differently with respect to securing a non-existent or elastic perimeter.  As a result, Identity and Data Access become the new Security Perimeter.  One then thinks about how to provide security within this construct, which requires a new way of approaching the problem.

What’s your view on measurement and analytics? What are some areas where you’ve found good analytics to be particularly helpful in managing the IT org?

I think useful, actionable metrics and analytics are something that many IT organizations unfortunately either don’t think about or don’t know how to track in a scalable manner. For example, we look at helpdesk ticket trends, not just ticket volume, which lets us take a somewhat predictive approach. On the network side, we don’t just analyze overall capacity and throughput; we also measure the “QoS” of things such as Round Trip Time (RTT) to SaaS providers (e.g., Google, Workday, Box).
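
To make the two metrics in this answer concrete, here is a minimal Python sketch. It is an illustration only, not Netflix's tooling: the function names are hypothetical, the ticket trend is approximated with a plain least-squares slope, and TCP connect time is used as a rough stand-in for RTT to a SaaS endpoint.

```python
import socket
import statistics
import time


def ticket_trend(counts_per_period):
    """Least-squares slope of ticket counts: tickets gained (or lost) per period."""
    n = len(counts_per_period)
    if n < 2:
        raise ValueError("need at least two periods to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts_per_period) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts_per_period))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def tcp_connect_ms(host, port=443, timeout=3.0):
    """Time one TCP handshake in milliseconds (an approximation of RTT, not ICMP ping)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize_rtt(samples_ms):
    """Reduce raw samples to the figures a dashboard would typically plot."""
    ordered = sorted(samples_ms)
    return {"median_ms": statistics.median(ordered), "max_ms": ordered[-1]}
```

A positive slope from `ticket_trend` flags a growing problem even when the absolute volume still looks small, which is the predictive angle described above. Collecting several `tcp_connect_ms` samples per provider and passing them to `summarize_rtt` gives a per-provider latency view rather than a single aggregate throughput number.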

You’ve spoken about the zero-trust network architecture before. Can you explain what that means at Netflix? How will you manage Network Access Control (NAC)?

We’re taking a layered approach to Zero-Trust. The first layer will be rolled out in conjunction with the upgrade of our Aruba WAPs to 802.11ac: we are implementing “certificate-based authentication” instead of the standard username/password auth against Active Directory.

In addition, we are logging all activity from an audit/analytics perspective to Sumo Logic and have created dashboards to perform some deeper analytics and correlation.

NAC presents a unique challenge given the device labs we have here and not wanting to add friction. One way to approach NAC is to simply allow TCP ports 80/443 by default to anything connected to a port and then partner with teams that require extensible access.
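
The default-allow rule described in this answer can be sketched as a small policy check. This is a hypothetical model in Python, not an actual NAC product configuration: the `PortPolicy` class, the team name, and the example port are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Ports open to anything plugged into a switch port, per the answer above.
DEFAULT_TCP_PORTS = frozenset({80, 443})


@dataclass
class PortPolicy:
    # Hypothetical mapping of team name -> extra TCP ports granted on request.
    exceptions: dict = field(default_factory=dict)

    def grant(self, team, ports):
        """Record additional TCP ports agreed with a specific team."""
        self.exceptions.setdefault(team, set()).update(ports)

    def is_allowed(self, protocol, port, team=None):
        """Allow 80/443 for everyone; anything else needs a team exception."""
        if protocol.lower() != "tcp":
            return False
        if port in DEFAULT_TCP_PORTS:
            return True
        return port in self.exceptions.get(team, set())
```

For example, after `policy.grant("device-lab", {5555})`, a device-lab connection on TCP 5555 passes while the same port stays closed for everyone else, so the default stays low-friction and exceptions stay explicit.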

You’ve talked about the slow-death of Mobile Device Management (MDM). Can you share your thoughts on why that is and what to expect instead?

To me, MDM (and Mobile Application Management [MAM]) were never viable solutions if you think about application and data access. Putting more policy and controls on the device is the wrong way to think about security. We should instead focus on strong auth to applications and data, along with some ability to encrypt the cached data at rest on the device. For some additional context, I previously wrote about this subject in my blog.

You’ve also observed that the legacy Enterprise Data Warehouse (EDW)/Extract, Transform, Load (ETL) tools will make way for a next generation of solutions. I’m very curious to hear where you see this space going.

Today’s EDW (incumbent) solutions are neither elastic nor multi-tenant.  Once you outgrow your GB/TB/PB appliance, you need to invest a substantial amount of CapEx to upgrade, forklift the data, etc. During the course of that process, you lose precious time and data insight. The next-gen of EDWs will be cloud-centric, fully elastic and multi-tenant, with the ability to ingest data at “user speed.”

In conjunction with that shift, there will be new ETL pipelines that could potentially leverage other database technology as part of the pipeline. One could envision running the data through a Hadoop cluster to Map/Reduce it, then transforming it using an in-memory store, and finally loading and analyzing it in the cloud EDW.
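
The three-stage pipeline envisioned here can be sketched in miniature. The Python below is a toy under stated assumptions, not a real deployment: a dictionary-based aggregation stands in for the Hadoop Map/Reduce step, a plain list comprehension stands in for the in-memory store, and an in-memory SQLite database stands in for the cloud EDW. The table, column, and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def map_reduce(records):
    """Stage 1: emit (account, minutes) pairs and sum them by key."""
    totals = defaultdict(int)
    for account, minutes in records:
        totals[account] += minutes
    return dict(totals)


def transform(totals):
    """Stage 2: reshape the reduced data in memory (minutes -> hours)."""
    return [(account, round(minutes / 60, 2)) for account, minutes in sorted(totals.items())]


def load(rows, conn):
    """Stage 3: load the transformed rows into the warehouse stand-in."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage_hours (account TEXT PRIMARY KEY, hours REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO usage_hours VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(records):
    """Chain the three stages and return the connection for analysis queries."""
    conn = sqlite3.connect(":memory:")
    load(transform(map_reduce(records)), conn)
    return conn
```

Keeping each stage as its own function mirrors the point of the answer: any one stage can be swapped for a different technology without touching the others.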

As new categories of engineering/analytics start to become more prominent (e.g., data engineer, DevOps, data scientist), are you noticing a cultural shift in the Netflix IT organization?

My feeling is that IT should always have been both engineering/dev and data focused, and I believe the industry shift over time will make that a stricter requirement for every company. Not everyone needs to be a full-stack developer, but having the skills to glue together services via APIs will be necessary. Fundamentally, understanding the power of data and knowing how to unlock and expose it will be a key trait.

Are there any positions you’re looking to hire for right now that we can share?

You can see our list of openings in the IT Operations section on our Jobs Site.

What have been some of the challenges in migrating so much of the Netflix organization to the cloud? Are there services or components of the enterprise IT stack you think really ought to remain on-premise?

The main challenge is the aggressive timeline that I’ve set. The notion that something needs to remain on-premise is really an Old World way of thinking and feels more like someone wanting control as opposed to there being a valid argument.

Where do you find the balance between open and proprietary software? How do you see that balance playing out more broadly over the coming years?

We talk a fair amount about “build vs. buy”, and it often comes down to a hybrid solution where we augment the “buy” with pieces that need to be more agile and under our control, with the ability to add features. On the Open front, I think we will see a steady move to more and more Open Source across all technology areas. You will also see Open Source contributions coming from members of my organization over the coming months.
