MTA - Metropolitan Transportation Authority

10/29/2025 | News release | Distributed by Public on 10/29/2025 15:49

Lessons learned in starting a central data team

I came to the MTA in 2014 to help lead an internal strategy team. Over the years, a couple of other groups were consolidated into mine, including one that reported several key performance metrics such as subway on-time performance and bus fare evasion. A common problem across all our projects was how difficult it was to work with data. Digging into a question, even an urgent one from the CEO, involved clearing a lot of internal hurdles. What data does the MTA even have on this topic? How do I get access to it? Is there any documentation that can help? And since the tables are almost always large, do we have tools that are powerful enough to work with this data?

In the fall of 2021, I decided it would be more valuable to fix these problems than to continue trying to develop insights with the infrastructure we had. The MTA Data & Analytics team was officially announced in January 2022, and I have led the team since that time. In the three years since, we have built a general-purpose, cloud-based platform for data analytics. Our first task was to clean up our own shop, so all the reports and metrics I inherited in 2021 have been rebuilt and automated. We also offered our platform more broadly to the rest of the company to use for all other reporting pipelines. This resulted in a large and growing set of analytical tables available to a wide community of analysts, as well as to the public through our Open Data program. So when D&A is asked to develop data insights into a problem, either by ourselves or in partnership with analysts in other departments, we can now do the work faster, with more detail and a lot less frustration.

For anyone who is trying to create a team like this, especially in the public sector where it can be particularly difficult, below are some lessons learned over the past few years. In some cases, we thoughtfully approached a problem in a way that was correct in hindsight. In others, we took the right approach only after trying alternatives that did not pan out. But I'll skip over the mistakes and just share what ended up working for us.

Have a clear understanding of the problem

Why exactly is it hard to work with data and make data-driven decisions at your organization? You need to be able to answer this question very clearly and with a lot of examples that are relevant for a broad audience of your colleagues.

You will need to define these specifically for your company, but for us, the key problems were:

  1. No shared data facilities: All data was in its raw source (usually an application database) or a series of siloed data warehouses with small and distinct sets of users. As a result, data discovery was a slow and informal process of emails and meetings.
  2. Poor data access: Even if you knew the data existed, there was no agreed-upon process for receiving access. There were no formally defined data owners, and there were plenty of other veto points besides. When data owners in the business were reluctant to grant access, this could only be overcome through slow and frustrating escalations to executives.
  3. Very limited tools: Some groups had Power BI for reporting, but no infrastructure for automating data pipelines. As a result, many groups made do with using Power BI to do much more than it is intended for, resulting in dashboards that performed poorly and were hard to maintain.

Build executive support for a central data team

If you have a clear understanding of the problem at your organization, you need to make the case to executives for fixing it. I spent a lot of time in the fall of 2021 walking a PowerPoint document around to different executives at MTA, trying to convince them of the value of this team.

The challenge with explaining the problem to executives is that they do not always experience the frustrations themselves, since they typically do get answers to the questions they ask, even if it takes a lot of work with inadequate tools behind the scenes to get them that answer. So, you should describe the parts of the problem that execs would see: the long delay in getting an answer back on seemingly simple questions, the frequent need to resolve disputes between departments regarding data access, and the disagreements between subordinates who come to meetings with conflicting numbers. As in our case, they likely already sense that something is wrong.

This new function will require some budget, mostly for headcount but also for the new platform. To get executive support, let it be known that you are trying something new and asking for their patience to let this investment pay off. This is not 'incremental headcount' just to maintain the existing operation, but rather a new group that will broadly make the company more efficient.

Build this function centrally within your agency

In our case, this meant changing quite a bit about our team. We had been an internal consulting and reporting group at NYCT, which is one of five MTA agencies that are all managed under MTA Headquarters (HQ). If we were to be a central data team for the entire company, I needed to move the team to HQ so it could work with all MTA agencies. I also renamed the group 'MTA Data & Analytics,' which is usually shortened to D&A, to emphasize what our new mission was. The position of the team within an agency should be chosen to allow it to work on an equal footing with all parts of the organization. In our case, we found a good spot at MTA HQ in the Strategic Initiatives department under Chief Strategy Officer Jon Kaufman.

Partner with your IT department

One key executive that we needed to have on board was our head of IT, Raf Portnoy. To his credit, he was willing to let us build this platform as a proof-of-concept, despite the overlap with traditional IT functions. MTA IT got us started by granting access to several key data sources as well as providing resources on a Microsoft Azure subscription. Our needs have grown a lot since then, but throughout we have continued to work closely with our IT colleagues. As we bring new tools and software to the company, it has been particularly critical to work closely with IT Security so they can be vetted appropriately. Establishing good working relationships broadly within IT will result in much faster progress.

Define clear roles: Separate data engineering from data science

Like most companies, the MTA has a lot of analysts who are great at working with Excel and some who have built expertise in Power BI. But we did not have a function for getting the data from the raw sources and reliably storing it in tables that were tractable for our analysts. We needed data engineers. It's tempting, however, to skip this step. In an analytics group staffed with analysts and data scientists, it is hard to justify hiring staff who will not work directly on new business-facing projects. Again, it's an investment: the data engineers will make your data scientists and analysts much more productive.

In particular, it is hard to justify these hires as many analysts can (kind of) do some basic data engineering. Out of frustration and a desire to avoid tedious work, some analysts with more technical skills will start setting up rudimentary pipelines. This arrangement develops as teams try to improve workflows using the skills and tools they have. And it is what my team had pre-2022. That approach may be quickest, but it is very hard to maintain, as it usually relies on a single expert and runs on whatever hardware was available at the time. Adding new datasets, scaling for bigger data, and maintaining old hardware are huge headaches for analysts that should be handled by engineers.

It is critical to find the right person and then empower them to build their own team and implement their plan. Trust is critical as the platform that your data scientists depend on truly belongs to the engineers. We were lucky to hire Mike Kutzma, a talented data engineer who had done this work previously at other companies. He quickly hired four data engineers for his team, and they had some initial pipelines reading and writing to the data lake in a few months. That relatively modest investment of five people unlocked everything else we have been able to do in the past few years, including analyzing Congestion Pricing data (as featured in Wired); releasing over 200 open datasets on topics such as bus Automated Camera Enforcement violations, LIRR train occupancy, and subway origin-destination ridership; and building a Metrics website based on our own open data. Mike's team also built this without help from external consultants, so both the platform and the expertise to maintain it are in-house with the MTA.

Select initial projects that demonstrate the value of the platform

The MTA maintains a lot of external metrics, but there are a few topics that get a lot of attention: ridership, subway and bus performance, and fare evasion. Despite their importance to the company, the underlying data in these areas had been very hard to work with. Neither the data nor the process for calculating the metrics was shared outside of a small group, so more detailed questions about changes in trends were hard to answer.

It took a lot of work, but our team built automated pipelines to ingest the raw data behind these metrics onto the new data lake platform, learned from the subject matter experts how to calculate the metrics, and then built automated pipelines to parallel the old processes, which were often time-intensive and prone to manual error. The resulting datasets had all the detail of the raw data, but were cleaned up to support both complex analyses and the creation of the familiar, aggregate numbers. Having a wide range of MTA analysts able to work with this data for the first time, through SQL queries and Power BI, made clear how the data lake worked. This created a positive cycle through which each pipeline would bring a bit more attention to the platform, thus creating more requests to ingest data in different areas. To kick this process off in your organization, select a good example to start with: one that is both highly visible and detailed enough to show off the platform's capabilities.
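The last step described above, turning cleaned detail rows into a familiar aggregate number, can be sketched roughly as follows. This is an illustration only and not the MTA's actual pipeline or metric definition: the TripRecord schema, its field names, and the assumption that an on_time flag has already been computed upstream are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TripRecord:
    """One cleaned, trip-level row (hypothetical schema, for illustration)."""
    service_date: str  # ISO date string, e.g. "2025-01-06"
    line: str          # hypothetical line identifier
    on_time: bool      # assumed to be computed upstream by the cleaning step


def on_time_rate_by_line(trips):
    """Roll trip-level rows up to an on-time rate per (service_date, line)."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for trip in trips:
        key = (trip.service_date, trip.line)
        totals[key] += 1
        if trip.on_time:
            on_time[key] += 1
    # Every key in totals has at least one trip, so the division is safe.
    return {key: on_time[key] / totals[key] for key in totals}


# Made-up sample rows, not real MTA data.
sample = [
    TripRecord("2025-01-06", "A", True),
    TripRecord("2025-01-06", "A", False),
    TripRecord("2025-01-06", "B", True),
]
rates = on_time_rate_by_line(sample)
```

Keeping the detail rows and deriving the aggregate from them, rather than storing only the aggregate, is what lets analysts answer follow-up questions about a change in the headline number.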

Build an '80/20' solution for the easiest and most common data needs

As the platform got more positive attention among analysts at MTA, we started to get requests for features that we did not yet have. Two were common: real-time data streaming and the ability to work with sensitive data (particularly in Power BI). Both could be done but would require a lot of effort from the data engineering team to build. So, instead, we defined a core product (non-sensitive data delivered in clean, easy-to-use tables that refresh next-day) and focused on making that work well. Working out a data governance policy for a large organization is complicated. Focusing first on non-sensitive data allows you to build your platform and processes while you gain experience in how to protect and govern data.
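One way to make a core product like this checkable is a small daily test that every non-sensitive table has data through the previous day. The sketch below is a hypothetical illustration, not a description of our platform: the TableStatus fields, the sensitive flag, and the reading of "next-day" as "loaded through yesterday" are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TableStatus:
    """Catalog entry for one table (hypothetical fields, for illustration)."""
    name: str
    sensitive: bool            # assumed governance flag
    latest_service_date: date  # most recent day of data loaded


def stale_core_tables(tables, today):
    """Return names of non-sensitive tables not loaded through yesterday."""
    cutoff = today - timedelta(days=1)
    return [
        t.name
        for t in tables
        if not t.sensitive and t.latest_service_date < cutoff
    ]


# Made-up catalog entries with hypothetical table names.
catalog = [
    TableStatus("ridership_daily", False, date(2025, 1, 6)),
    TableStatus("bus_performance", False, date(2025, 1, 4)),
    TableStatus("restricted_example", True, date(2025, 1, 1)),
]
late = stale_core_tables(catalog, today=date(2025, 1, 7))
```

Sensitive tables are skipped here because they fall outside the core product as defined above, not because their freshness does not matter.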

Be strategic about what projects you take on

As the platform became more widely accepted, we soon found ourselves with more requests for data ingests than we could handle. That is its own challenge, but it is also an opportunity to think carefully about which ones to prioritize. Doing that requires judgement and knowledge of your company, but here are a few of the considerations we used:

  1. Balance more visible projects that bring attention to the platform against 'tech debt' reduction work that will keep everything running smoothly. It is important to do new work that increases visibility, but not at the cost of frequent failures that undermine confidence in the overall approach.
  2. Balance working in familiar areas of the company, where you know your colleagues well and can go deep into complex problems, against venturing into new areas that will demonstrate the breadth of the platform to new departments and incorporate new data sources. A central data team should be available to help all users in the organization.
  3. Avoid areas of the company that will present a lot of hurdles or where data owners do not want to provide access. There will likely be enough other work to do without pushing against dead ends.

For anyone who is interested in talking further about setting up a central data function 'from scratch,' especially in the public sector, you can reach out to us at MTA Open Data [email protected]. We are happy to share what we have learned!

Andrew Kuziemko leads the MTA's central data team as the Deputy Chief of MTA Data & Analytics.

MTA - Metropolitan Transportation Authority published this content on October 29, 2025, and is solely responsible for the information contained herein. Distributed via Public Technologies (PUBT), unedited and unaltered, on October 29, 2025 at 21:49 UTC.