Ed Simmons: Hello. I'm Ed Simmons from Branch Brook Advisors. We have an exciting panel discussion about capital markets data transformation for you today. I'd like to introduce our panelists here.
Sourabh Dhawan is a Senior Vice President at Arcesium, where he leads a team of technical solution architects specializing in data platforms and enterprise data management. Sourabh has worked closely with clients across capital markets and is focused on helping investment firms turn data into a competitive advantage through scalable, resilient solutions.
Oleg Komissarov is a Principal Consultant in Data and Product Architecture. He is the data lead for the new AI Lake accelerator framework and brings over 20 years of hands-on experience in data and transformation.
So let's get started. Well, everybody wants to do AI these days, and people have a bunch of disparate systems that don't communicate well. Everybody's nervous about making these systems talk to each other and about how to provide data for AI. How would we think about that now, Sourabh?
Sourabh Dhawan: From my viewpoint, things have changed. There was a time, many years back, when getting your data platform up and running meant one monolithic build. Times have definitely changed. It's no longer a multimillion-dollar investment; at a much smaller scale, you can develop a premium data platform that serves your needs.
Cloud technologies have really matured in this space. They offer tools and capabilities to design a data platform that suits the needs of your organization. It has gone from a multimillion-dollar, multi-year project just to produce something that comes close to meeting your needs, to a months-long effort at a much smaller dollar cost.
The technology cloud providers offer today has changed a lot. You need to pick the right tools based on your needs. There are many on the market, and once you stitch them together, you get first-class data lineage, data cataloging, and data quality, which become the baseline that lets you build AI and the other things you're talking about on top.
Ed Simmons: Oleg, can you talk about the fact that nowadays you don't necessarily have to replace your legacy infrastructure to use one of these modern platforms?
Oleg Komissarov: Right. Many of the modern solutions and infrastructures are serverless. So you can enable your new data infrastructure and AI capabilities by utilizing serverless frameworks and solutions and using them in parallel with your existing infrastructure. Instead of replacement, you can integrate and get data using different methods to reduce your technical debt.
Ed Simmons: One of the things I would add is that the other key to serverless platforms is operational costs. While they can be expensive to run when you're actually using them, they cost next to nothing when you're not using them. This is a big advantage in terms of delivering value before you spend money.
We talked about cloud technologies providing solutions via services, but all these different products don't necessarily work together. How do you view these from an overall, holistic platform view?
Sourabh Dhawan: That's a good question. The cloud providers offer many tools that do similar things, and you need to choose among them wisely. That's important. When I say choose wisely, I mean you need to consider certain aspects to define what is best suited for you.
In my view, the user base is the most important. The group that will eventually use the product gets to define how they will set up and use the tool. That is one of the most important considerations when defining how you stitch these things together.
The user journey is really important. The users need to be brought in, and their views need to be considered well in advance. Organizations often tend to have more of an architectural view of design and engineering and lose focus on the user group. That is, in my view, where you might tend to fail in some of these initiatives.
The tools are in the market, but knowing which one to use and how to knit them together can deliver different value. Some are designed to work at scale; others handle lower volumes but tend to provide richer output. There are different classes of these tools, and you choose based on the organization's needs, how the different groups are structured, and who will ultimately consume the overall product.
Ed Simmons: What's really interesting here is, once again, we go back to the pay-by-the-drink services model, where in some areas it may not be worth fighting over. You can give two groups the tools they want. In other places, we may want to standardize on one tool, but for some of this stuff, you can basically say if you want to do Tableau or Power BI, we'll support both because once again, the cost for a second one may not be that much more in terms of the cloud.
Oleg, can you talk about how you put this stuff together? What's your methodology for integration?
Oleg Komissarov: Essentially, nobody likes expensive and long projects where you have to integrate software because, from a business standpoint, that does not add value – it only adds cost and delays your business deliverables. We always recommend looking at technologies that support open standards, such as OpenLineage for lineage, or open standards for data consumption and storage. Using open standards and protocols will make it easier for you to integrate without creating custom code.
The second principle is to consider native integrations. If you're building your solution partially or fully on any cloud provider, just look at the integrations that are supported by this provider out-of-the-box. For example, data warehouses are natively integrated with storage and other tools, which makes your integration job basically a configuration as opposed to custom development.
Also, keep scale in mind. You can put together small prototypes quickly, but focus on how your workloads and integrations will work with real production workloads in the future – six months from now, one year from now.
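To make the open-standards point concrete, here is a minimal sketch of how an open storage format lets several engines read the same data without custom integration code. The bucket, path, and tools shown are assumptions for illustration, not anything mentioned by the panel.

```python
# Sketch: one Parquet dataset on object storage, read by different tools
# with no bespoke integration code. The bucket and path are placeholders.
import pandas as pd
import pyarrow.parquet as pq

path = "s3://firm-datalake/gold/positions/2024-06-28.parquet"

analyst_view = pd.read_parquet(path)   # an analyst's notebook (needs s3fs)
pipeline_view = pq.read_table(path)    # a batch job reading via Arrow

# A warehouse with native object-storage integration can expose the same
# files as an external table, so "integration" becomes configuration
# rather than custom development.
print(analyst_view.head(), pipeline_view.num_rows)
```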
Ed Simmons: One of the things I really like about this type of architecture is, for example, something like Glue. If you're using Glue from AWS, and as long as you don't over-customize it, your system will improve when Glue improves, and they will have a new version of Glue. This is kind of the reverse of the old technical debt problem, where your stuff deteriorates over time. This platform is alive and breathing and, in some ways, maintained by other companies, particularly hyperscalers.
To me, it's very interesting. With these serverless and SaaS-based solutions, you can really improve your platform over time without maintaining a big central team. When I built these things in the past, they would cut the budget of the central team after you built it, and you would struggle to maintain internal products. Here, that is not as big an issue. Sourabh, do you want to comment on that?
Sourabh Dhawan: That's true. It's more about keeping yourself updated on what's happening in the market and what these tools offer. If you look at it, there's a race to stay ahead among the cloud providers themselves and among the companies that offer these capabilities. As you said, more and more capabilities are being added to these tools, and you basically inherit them for free.
You basically just need to be up to date with what's happening, what you can achieve in the near future, and whether your system is scalable to adapt to these upcoming enhancements in these technologies. You don't need a large team of internal engineers to benefit from these technological advancements. You just need to have a bunch of smart engineers who make the right decisions and stay on par with what's happening in the industry.
Ed Simmons: We've talked about the revolution that this technology has brought about on the technical front. What about data governance? When I built data platforms in the past, I would spend a lot of time dealing with different business units regarding data definitions and things like that. What changes do we see now, Sourabh, regarding how to manage that?
Sourabh Dhawan: I think one thing that is happening, and one of the areas my team focuses on daily, is defining standardization. There is a lot of focus on how you bring customers to one unified view.
Now, I understand that different groups have their own nuances. However, defining a core standard, meaning the core data model, the data lineage around it, and the data catalog, has become an important element and the foundation for setting up a scalable data platform. Once you have that, you get all the benefits of downstream consumption and downstream usage that come with it.
I would say the most important aspect is data modeling—one unified data model with possible extensions to service the different needs of the user groups or business lines that you have. However, the foundation needs to be really solid, which comes with the core standard data models.
Ed Simmons: Oleg, I think part of the change with modern ETL tools is that I don't necessarily have to change my record systems. Can you talk to that?
Oleg Komissarov: If you think about your data as data products, you can have baseline data models and data products. We suggest, for example, organizing your data products and sharing them through data domains. Then, any organization—for example, a financial enterprise—can have a reference data domain that is shared with other organizations or other domains like investment management, which can add attributes.
So you use this data model, expand it, and then share investment data based on this domain. It all comes down to data catalogs, sharing data between different organizational units, and creating even more derived products without reimplementing the same things or redefining the underlying data models and data domains.
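As a rough illustration of the domain idea described here, the sketch below shows a reference-data domain publishing a core instrument model and an investment-management domain extending it rather than redefining it. The columns, identifiers, and values are invented for the example.

```python
import pandas as pd

# Published by the reference-data domain: the shared core model.
instruments = pd.DataFrame({
    "instrument_id": ["US0378331005", "US5949181045"],
    "name": ["Apple Inc", "Microsoft Corp"],
    "asset_class": ["Equity", "Equity"],
})

# The investment-management domain adds its own attributes on top of the
# shared model instead of rebuilding it.
im_attributes = pd.DataFrame({
    "instrument_id": ["US0378331005", "US5949181045"],
    "strategy": ["Growth", "Core"],
    "internal_rating": ["A", "A+"],
})

# The derived data product the investment domain shares onward.
investment_view = instruments.merge(im_attributes, on="instrument_id")
print(investment_view)
```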
Ed Simmons: The analogy I think of is the government. The government is a great analogy here: we have the federal interstate highway system as the model, and then each state runs its local roads and handles its local business quickly. Rather than unifying the whole country, we allow autonomy within the domain.
I think that's really good because it allows you to do a couple of things. You don't have to boil the ocean; you can do one domain at a time, and it makes a lot of sense. What kind of organizational structures are people using for governance in this new world? What do you see?
Sourabh Dhawan: I think they are structuring governance to align with their business units. Complex organizations have many business groups, so governance is aligned with the business. Once you have the core data models, your governance can align with your business lines.
Groups that need access to specific data within the organization are set up to have that access. Today, some tools do this really well out of the box; they can structure your governance needs on top of your core underlying data models. As your organization is set up, you can define those access patterns so that the right groups have the right access, and you don't have concerns about people seeing the wrong data or abusing their user rights to see data they're not supposed to.
Ed Simmons: Let's think about integration. How are you guys seeing integration between legacy systems and bringing the legacy data forward? What are the techniques to do that, Oleg?
Oleg Komissarov: Different technologies allow you to implement data virtualization or sourcing through APIs or organized private tunnels to source your data. As with any decision on designing data platforms or workloads, modern and existing technologies allow you to pick the right technology for your business use case.
For example, if your business case requires frequent access to data and large data volumes, and this legacy data source is still locked in your data center, you would probably need to initially copy and then synchronize your data to cloud storage with CDC for faster access.
However, that integration adds cost and complexity. For infrequent access, you can just use something like federated queries, a lightweight integration, to query the data in your legacy data store and bring it back to your business applications or workflows.
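For readers who want to see what a lightweight integration might look like, here is a sketch of federation done at the application level: query only the slice you need from the legacy store and join it with cloud-resident data. The connection string, table, and paths are assumptions for illustration, not a specific product's federated-query feature.

```python
import pandas as pd
import sqlalchemy as sa

# Query the legacy database in place for just the slice needed today.
legacy = sa.create_engine("postgresql://user:pass@legacy-host/positions_db")
legacy_slice = pd.read_sql(
    "SELECT account_id, instrument_id, quantity "
    "FROM positions WHERE as_of_date = DATE '2024-06-28'",
    legacy,
)

# Join it with reference data that already lives in cloud storage.
reference = pd.read_parquet("s3://firm-datalake/reference/instruments/")
enriched = legacy_slice.merge(reference, on="instrument_id", how="left")
print(enriched.head())
```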
Ed Simmons: A couple of things for the uninitiated: CDC is change data capture. The other thing is that we always have to be conscious of the speed of light. If you want your data to perform, your queries have to be close to your data. Although virtualization and caching offer ways around this, the speed of light is often a limiting factor. "Data gravity" is the industry term that's very popular now when people talk about this. Obviously, when we do this in the cloud, we have ingress and egress costs to model that keep us honest.
We have data coming from all over. How do we keep track of lineage in one of these really complex environments?
Sourabh Dhawan: Again, some tools handle this really well today. When you have pipes moving data from one place to another, every step in the process can capture lineage information. Then, when you need to go back and see how the data was transformed from one source to another, or what normalizations led to a particular value, you have a robust lineage that can trace you from point one to point two and so on.
The tools that exist today can even embed data visualization or BI capabilities within the lineage itself. From the raw data, to the gold layer or data product, to the eventual semantic layer where you see a given aspect or element of the data, you can trace back to how the raw data or the source looked.
You just need the right tools to capture this information. Once you have it, you can use these smart tools to capture the record-level or even the attribute-level lineage.
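As one concrete way a pipeline step can record its own lineage, here is a minimal sketch that emits an OpenLineage-style JSON event over HTTP. The endpoint, namespace, and dataset names are placeholders for whatever lineage backend you run, and a real deployment would typically rely on an OpenLineage client or a tool's built-in integration instead.

```python
import json
import uuid
import datetime
import urllib.request

# One run event describing a single pipeline step: which job ran, what it
# read, and what it wrote. Stored in the open format, the event remains
# usable later by AI or any viewer, independent of the tool that produced it.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/pipelines/position-normalizer",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "capital-markets", "name": "normalize_positions"},
    "inputs": [{"namespace": "capital-markets", "name": "raw.positions"}],
    "outputs": [{"namespace": "capital-markets", "name": "gold.positions"}],
}

req = urllib.request.Request(
    "http://lineage-backend:5000/api/v1/lineage",  # placeholder endpoint
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```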
Ed Simmons: Can you discuss time travel and bi-temporal data, whatever we're calling it these days, and explain why people need it?
Sourabh Dhawan: I think bi-temporal data is the need of the hour. It's basically the "as of" and "as at"; people call it different things. But it captures two dimensions. One is the effective date, the date for which the data changes, and the other is the knowledge time, when the system learned that the data changed.
When you talk about auditing and solving for what happened to your data, these two dimensions play a crucial role. We are in a world where it's all append-only—you don't overwrite the data. You always need to capture these two aspects of how your data changes so that you can go back and see the state of your system ten days or a month back, and you can recreate the same state.
That has become an important element because you need to be able to see how your system looked at the time. You also need to see when a particular value became effective: someone can change it today, but it may have been effective for ten days already. Those two dimensions play a crucial role, and all your data models should be designed to capture both aspects of the time dimension so that, at any point, you have a view of how the data is changing.
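To make the two time dimensions concrete, here is a small sketch of an append-only, bi-temporal table and an "as of / as at" lookup. The data and the helper function are invented for illustration.

```python
import pandas as pd

# Append-only bi-temporal price history: corrections are new rows, never
# overwrites. effective_date = the date the value applies to ("as of");
# knowledge_time = when the system learned it ("as at").
prices = pd.DataFrame(
    [
        ("ABC", "2024-03-01", "2024-03-01 18:00", 100.0),
        ("ABC", "2024-03-01", "2024-03-05 09:30", 101.5),  # later restatement
        ("ABC", "2024-03-02", "2024-03-02 18:00", 102.0),
    ],
    columns=["security", "effective_date", "knowledge_time", "price"],
)
prices["effective_date"] = pd.to_datetime(prices["effective_date"])
prices["knowledge_time"] = pd.to_datetime(prices["knowledge_time"])

def price_as_of_as_at(df, security, as_of, as_at):
    """Value in effect on `as_of`, as the system knew it at `as_at`."""
    visible = df[
        (df.security == security)
        & (df.effective_date <= pd.Timestamp(as_of))
        & (df.knowledge_time <= pd.Timestamp(as_at))
    ]
    latest = visible.sort_values(["effective_date", "knowledge_time"]).iloc[-1]
    return latest.price

print(price_as_of_as_at(prices, "ABC", "2024-03-01", "2024-03-02"))  # 100.0
print(price_as_of_as_at(prices, "ABC", "2024-03-01", "2024-03-06"))  # 101.5
```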
Ed Simmons: Oleg, I'm going to use the AI word. Can you talk about lineage, time travel, and their importance for AI?
Oleg Komissarov: With time travel, you keep a record of how your data was changing. With lineage, you have an explanation of all the transformations that produced that data. Both play a crucial role in building business users' trust in data, because normally, when you look at a number, you may have doubts. A rating analyst or investment analyst must understand how that number was produced.
All these bi-temporal and lineage systems are still very technical. You cannot just give business users access and expect them to look at a big diagram and understand the origins of the data. But when you feed the same data to AI and ask specific business questions, AI can understand that data and greatly shorten the time it takes to understand where the data came from and how it was produced.
Both capabilities are very important because, in the end, they create business trust in data. I would also add that when we talk about lineage, there are two aspects to consider. Lineage data must be stored in open formats, like the OpenLineage format, which can be understood without any data visualization tool. And then there are visualization systems that display the lineage.
The idea is that they should be separate. You should make sure that you, as an organization, have access to the lineage data itself, to enable your AI capabilities the way you want, and, separately, provide users with a visual interface.
Ed Simmons: We started out talking about AI. What kinds of use cases are the hedge funds and asset managers you work with pursuing with AI and this type of data?
Sourabh Dhawan: I think AI is taking over many of the things that used to start with software engineering. If you talk about hedge funds, they are trying to do correlation analysis. That's a classic use case that AI can do more efficiently.
They're trying to analyze their position changes year over year, how their portfolios are performing, the trends in their profits, and what's leading to a better portfolio structure. Basically, all the trends around portfolio monitoring are the focus of asset managers and hedge funds, so that they can invest in the areas where they are performing better and where the portfolio is giving them better returns.
That's the classic use case I see portfolio managers and hedge funds focus on when it comes to utilizing AI. As Oleg said, you can give the entire lineage and history of your data to these LLMs, and they can generate insights into how to use it to invest in the right areas. That's what I see as one of the main focus areas for hedge funds.
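As a toy version of the correlation analysis mentioned here, the sketch below computes pairwise correlations of daily returns across positions. The data path and layout are assumptions for illustration.

```python
import pandas as pd

# Daily returns, one column per position, indexed by date. The path is a
# placeholder for wherever your gold-layer data product lives.
returns = pd.read_parquet("s3://firm-datalake/gold/daily_returns.parquet")

# Pairwise correlation matrix: highly correlated positions concentrate risk.
correlations = returns.corr()

# An LLM-based assistant could be asked to summarize the clusters in this
# matrix for the portfolio manager, rather than having them scan it by eye.
print(correlations.round(2))
```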
Ed Simmons: Oleg, can you discuss your work with the AI Lake accelerator framework and rating agencies and give an example?
Oleg Komissarov: In addition to improving human productivity and providing natural-language interfaces for data integration and data processing, it also unlocks access to data that has never been in your data pipeline.
For example, with RAG, you can just upload documents to common storage, have them indexed, and put a simple agent on top. If you are a rating agency, you can ask questions about supplemental data that was never extracted from the filings and get data to support your analysts' judgment. You can reach data your ETL tools never extracted, so it genuinely unlocks access to data.
At the same time, through agents, you can also query your structured, well-processed, validated data sources, the data in your official models, with lineage and traceability. Your judgments are then based on stamped data that meets quality standards.
Ed Simmons: You mentioned RAG. Can you explain RAG to the audience?
Oleg Komissarov: RAG is retrieval-augmented generation, and it's a simple technique. Everyone can access ChatGPT and ask questions, but ChatGPT does not know anything about your data. You can implement a lightweight indexing process where data is automatically extracted from your unstructured sources, like PDF or text documents, and becomes searchable. Then you integrate it with an agent.
Now, when you ask the model a question, the agent, or a chain of agents, searches this specific data source or set of data sources and answers based on that data. The data source does not have to be unstructured; it could be your data warehouse or a relational database.
These engines can query the data and genuinely unlock it, because business users no longer have to write specific queries and reports. AI agents are capable of generating such queries, with complex joins, from user prompts and bringing the data back in a secure, managed way.
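A minimal sketch of the retrieval step described here is below. TF-IDF stands in for a real vector index, the documents are invented, and `ask_llm` is a hypothetical stub for whatever model endpoint a firm has approved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "document store" built from unstructured sources (filings, memos).
documents = [
    "Q2 filing: net leverage fell to 2.1x after the refinancing.",
    "Rating committee memo: covenant headroom remains adequate.",
    "Supplemental schedule: unfunded commitments total 450 million.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def ask_llm(prompt: str) -> str:
    # Hypothetical stub: in practice, call your approved LLM endpoint here.
    return f"[model response based on {len(prompt)} characters of prompt]"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)

print(answer("What happened to leverage after the refinancing?"))
```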
Ed Simmons: Sourabh, can you touch on your thoughts on that?
Sourabh Dhawan: As Oleg said, this is one of the ways to get better results from these AI tools. You provide your reference data, the entire history of data in the organization, as input. Instead of responding only to generic prompts, these agent workflows can refer to that data. That's basically the technique: you give these tools reference access so they can respond based on the underlying data you provide.
It basically improves accuracy. They can do more, and using this capability helps increase the organization's productivity.
Ed Simmons: A couple of AI comments. There are two classes of AI work we're doing here. One is using AI for things like sentiment analysis, where if 74% of people like something instead of 78%, it's not a problem that's going to kill us. The other is something like a drug or a financial transaction, where it has to be right. It can't be wrong, and there can't be any hallucinations.
We've seen both types of use cases here, and we just have to keep track of accuracy. Let me ask each of you for a final recommendation for people to consider and take away from today.
Sourabh Dhawan: What I would say is that times have changed. Data is power, but making use of it used to mean investing years and millions of dollars before you could use the data in your organization efficiently. Those times have gone by; it's a much easier project now with the technology that exists.
To use the technology efficiently, you need to choose the right tools. The user base is important; you must bring them in at the right stage, not too late in the game. Even more important is setting up the foundational data models correctly.
Once you do these things right, you can use everything we have been talking about, which can really increase your productivity. With the foundations in place, it becomes much easier to benefit.
I missed one important thing, which is data quality. Having clean data, again a capability that many tools provide now, is important. When you're running your data through various pipes, at every step you need to ensure the required data quality, which, again, AI can help you do. As long as your foundation is strong, you can benefit from all these capabilities on top of it. That's my advice to people trying to solve this problem in their organization.
Ed Simmons: Yeah, data quality. We really didn't touch on that enough. Oleg, why don't you talk a little bit about data quality and then give your takeaway for this talk?
Oleg Komissarov: Data quality is a must-have component of your solution. It should be very easy to set up data quality checks based on human-readable expressions. And, coming back to open standards and data availability, you should treat your data quality outputs, the results of executing the checks and the resulting metrics, as another valuable data asset, available to AI for explanations and also for visual exploration.
These components have to be in place. This is crucial because, as we said before, garbage in, garbage out. Without reliable data quality, you cannot get reliable results from AI.
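As an illustration of human-readable checks whose results become a data asset in their own right, here is a small sketch. The rules, columns, and thresholds are invented, and in practice a dedicated data quality tool would supply this.

```python
import pandas as pd

positions = pd.DataFrame({
    "security_id": ["ABC", "XYZ", None],
    "quantity": [100, -5000, 250],
    "price": [101.5, 98.2, -1.0],
})

# Declarative, human-readable rules evaluated against the dataset.
rules = {
    "security_id is never null": positions["security_id"].notna(),
    "price is positive": positions["price"] > 0,
    "quantity is within position limits": positions["quantity"].abs() <= 1_000_000,
}

# The results themselves are stored as another data asset, available for
# dashboards and for AI-generated explanations of data reliability.
quality_report = pd.DataFrame(
    [
        {"rule": name, "passed": int(mask.sum()), "failed": int((~mask).sum())}
        for name, mask in rules.items()
    ]
)
print(quality_report)
```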
Ed Simmons: Oleg, can you give people one takeaway?
Oleg Komissarov: I agree with Sourabh, but we also see many requests today that sound the same as in the past: "Hey, we want to build a data warehouse. Which technology should we use?" I would say think about the future. When you're building or enhancing your data platform, you are not building a data warehouse anymore. You're building the platform for your future virtual workers, research, and AI.
Ask how your request today will align with the architecture and capabilities you'll need in the future. With that, I would say don't try to implement a monolithic platform that satisfies all requirements upfront, because you never know what requirements you will have in the future. Build it modular, build it as much as possible on open standards, and be ready to retire certain components and add new ones. That should be your guiding principle.
Ed Simmons: I think that's a great one. The AI stuff is moving really fast, and all sorts of new things are coming out of it, but by building a sound data platform, you can take advantage of what's coming out of the AI platforms very quickly, along with all the work being done by so many people.
Once again, the other point I would make here is that we used to create these kinds of data platforms to gain a competitive advantage. I think now it's gone the other way. If you don't, you'll be at a competitive disadvantage because the cost of entry has gotten so low. That's my last takeaway.
We do have time for a couple of questions. Both of you, what is the common mistake you see? What should you avoid doing? We've talked about what you should be doing here.
Sourabh Dhawan: One common mistake I have seen, and I touched on this before, is bringing the user group in too late in the game. Oleg's point is also really important: building a system that only becomes usable two years later. Those are two common mistakes that put you at real risk.
You need much smaller milestones and a quick time to market. Something should come out every month or so, so that you can start to see what's working and what's not. Especially in big organizations, the tendency is toward much larger milestones.
You try to define how your setup should look a year out, which is a recipe for disaster. You don't bring in the right people, or nobody tries to use the platform you've been building for a year, and then you realize some core components won't fit properly. Those are the areas where organizations tend to make mistakes with projects like this.
Ed Simmons: Oleg, same question.
Oleg Komissarov: It's very easy to put together a prototype and decide you can move it to production. But without foundational capabilities such as data lineage and data quality, even lightweight, not-yet-mature versions of them, you will not get reliable outputs.
Have a foundational platform in place and start from there. Don't just go with quick prototypes without that foundation, because they will increase your technical debt and you will have to fix it down the road.
Ed Simmons: One last question. We have a bunch of legacy systems—we all do. If you had to pick one technique that you like best—change data capture, APIs, or virtualization—which is your favorite? Obviously, you have them all in your toolbox, but do you have a favorite, Sourabh?
Sourabh Dhawan: It's a very tricky question. No one thing just works; it really depends on your use case and your usage pattern. Oleg mentioned this previously as well. Say virtualization is possible and you go with it; with a legacy system, there can be times when it really doesn't work. The same goes for CDC: if these monolithic or legacy systems aren't capable of CDC, that becomes a challenge for you.
Federated queries have their good parts as well as bad parts. Imagine pulling back a lot of data and only then trying to join everything you've gathered from the underlying legacy systems; it can take far longer than you'd expect the queries to take.
An API, again, is a scalable and faster way to do this, but APIs have their own challenges. I don't think there's one good answer that says "you need to go with this." It really depends on what you're trying to build, what your end result is, and what you're trying to solve for; that determines the right solution.
Ed Simmons: I think you're saying that, in some ways, it's as much art and analysis as science; it comes down to choices. Oleg, any last thoughts?
Oleg Komissarov: I will ask another question: why must I choose one solution? The business does not care whether it's one, two, or three. They care about the cost of implementation and how easy it is to do. Any of these solutions is relatively easy to set up, but then the question is how much technical debt you will be building on top of it.
What are your plans to enhance functionality and increase or decrease the dependency? From this perspective, I think this is very important. As a tactical approach, you can do anything; CDC or federated queries, for example, are both easy to set up. You can quickly set up read-only replicas and start accessing the data. Just think one step ahead and project the things I mentioned: your future migration strategy, how much effort and money it will cost, and when you will retire these technologies. I think that should help answer the question.
Ed Simmons: Great. Well, thanks, everybody. I think this was really a fun discussion, and we'd love to hear from you with your questions and comments. We'll try to do another one in a few weeks. Thank you so much.
Oleg Komissarov: Thank you.
Sourabh Dhawan: Thank you.