This work was done collaboratively by several incredible people (alphabetical order): Amarilis Campos, Beca Maia, Caio Sousa, Daniel Cristian, Jade Costa, Maria Duarte, Otavio Valadares, Robert Cristiam and many teams inside Nu.
Why did we do it?
We believe that behind all technical solutions there must be a clear business problem to be solved. In this case, we were facing challenges related to the stability and efficiency of our monitoring ecosystem. As Nubank scaled rapidly, our existing log infrastructure began to show signs of pressure, especially in terms of cost predictability and scalability.
Considering the logging platform is key to supporting every engineering team during troubleshooting and incident mitigation, not having full control and visibility over your monitoring data is a serious problem. There is nothing worse than trying to debug a production problem and discovering you can’t see your application’s logs. In our case, we used to rely on an external solution to ingest and store our logs, and we had poor observability around it (ironic). Once we created metrics to understand the real situation, our analysis showed that a significant portion of logs weren’t being retained end-to-end, which limited our ability to act quickly in incident response scenarios.
On top of that, our contract was getting expensive (really expensive). The only way to mitigate our problems was buying more licenses (paying more), and there was no clear pricing model for us to plan our spending. If we had problems, we had to add more money; no predictability was possible here. It got to the point where the team calculated that we could hire Lionel Messi as a software engineer for the same amount we were paying for the external solution.
With this complex and exciting problem at hand, we decided to explore alternatives, and the most efficient one seemed to be creating our own platform. This way we would have total control over our data, ingestion pipeline, storage strategy and querying runtime.
What Nubank’s Log Infrastructure Looked Like
Before moving to an in-house solution, Nubank’s log infrastructure was very simple and tightly coupled to the vendor.
In short, every application sent its logs directly to the vendor’s platform through the vendor’s own forwarder, and many internal sources we didn’t even know about were sending data directly to the vendor’s API.

This architecture served Nubank well for many years, but with our massive hyper-scale growth we started to hit its limitations a few years ago, and the future with it became a concern.
The primary problems the team identified with this architecture were:
- Lack of observability: We had no observability over the ingestion and storage flow; if something went wrong, we had no trustworthy metrics about it.
- High coupling: At the time, many of our alerts and dashboards were defined directly in the vendor’s interface, all our data was stored within it, and we had no easy way to change solutions or migrate away.
- Lack of control: We had no way to filter, aggregate, route, or apply logic to incoming data.
- High costs: The increasing cost of the logging stack was a constant concern among stakeholders, and the trend was that it would keep growing if we didn’t take action.
- Coupled ingestion and querying processes: High load on ingestion directly impacted querying performance, and vice versa.
Divide and conquer
Developing an entire log platform from scratch is hard; at the time, we had nothing built!
To be able to solve this problem, we divided the entire project into two major steps:
- The Observability Stream: A complete streaming platform capable of ingesting and processing observability signals reliably and efficiently, decoupling us from the previous solution and giving us full control over our data.
- Query & Storage Platform: The platform that would store logs and make them searchable, so that engineers could use it in daily troubleshooting tasks.
Each project had its own set of requirements and features to build, but three requirements were common to both:
- Reliable: The platform needed to be reliable even under high load or in unexpected scenarios, to support Nubank’s operations.
- Scalable: Able to scale quickly when facing spikes in ingestion and usage, and over the long term as Nubank’s hyper-growth continues.
- Cost Efficient: Cost efficiency is always important at Nubank, and we needed a platform that would be cost efficient over the long term, able to ingest and store all the data we generate more cheaply than any vendor.
With a clear list of requirements and expectations, we started the project, focusing first on ingestion and processing, and then on the querying and storage platform.
The Observability Stream
We decided to build the ingestion platform first: it allowed us to start the migration without any major disruption to the developer experience, while already decoupling the transactional environment from the observability environment. It also allowed us to gather metrics about our data to support better decision-making, especially during the storage platform’s development.
The observability stream was built with simplicity in mind, with a mix of open source projects and in-house developed systems.
To summarize, the ingestion architecture is composed of three distinct systems:
- Fluent Bit: We opted for a lightweight, easily configurable, and efficient data collector and forwarder. This open-source, CNCF-backed project is a reliable industry standard for the task.
- Data Buffer Service: The service responsible for receiving all incoming data from the forwarders and accumulating it into large chunks, which then move through the pipeline in a micro-batching architecture.
- Filter & Process Service: An in-house, highly scalable system that filters and processes incoming data efficiently. This system is the core of our ingestion platform: it is easily extensible with new filtering and processing logic as needed, and it also collects metrics from incoming data. A minimal sketch of this micro-batching flow follows below.
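To make the micro-batching idea more concrete, here is a minimal sketch in Python. Everything in it, from the flush thresholds to the metric names and the DEBUG-dropping filter rule, is an illustrative assumption rather than Nubank’s actual implementation; in the real platform, the buffer and the filter/process step are separate, independently scaled services.

```python
import time

# Illustrative sketch only: thresholds, names, and the filter rule are
# assumptions, not Nubank's actual implementation.

MAX_BATCH_BYTES = 8 * 1024 * 1024  # flush once the chunk reaches ~8 MiB...
MAX_BATCH_AGE_S = 5.0              # ...or once it is 5 seconds old

metrics = {"records_in": 0, "records_dropped": 0}

class MicroBatchBuffer:
    """Accumulates incoming records into large chunks before they move downstream."""

    def __init__(self) -> None:
        self.records: list[bytes] = []
        self.size = 0
        self.created_at = time.monotonic()

    def add(self, record: bytes) -> None:
        self.records.append(record)
        self.size += len(record)

    def should_flush(self) -> bool:
        age = time.monotonic() - self.created_at
        return self.size >= MAX_BATCH_BYTES or age >= MAX_BATCH_AGE_S

def filter_and_process(batch: list[bytes]) -> list[bytes]:
    """Filter step: drop unwanted records and collect metrics about the flow."""
    metrics["records_in"] += len(batch)
    kept = [r for r in batch if b'"level":"DEBUG"' not in r]  # hypothetical rule
    metrics["records_dropped"] += len(batch) - len(kept)
    return kept
```

Flushing on size or age is what keeps a pipeline like this efficient under high load without letting records sit in the buffer too long during quiet periods.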

With the observability stream fully operational, we established a foundation of reliability and scalability for our log ingestion processes. This comprehensive system not only resolved our immediate needs for quality data intake but also provided us with invaluable insights into our logging activities. Furthermore, it decoupled our ingestion processes from the querying process, allowing for greater flexibility and the ability to easily change components when needed, a capability we lacked previously due to tight coupling.
Query & Storage Platform
With a robust ingestion platform ensuring reliability and scalability, our next challenge was to develop a query and storage solution capable of effectively handling and retrieving this massive volume of log data.
We then needed to choose a query engine to search all this data, and Trino was the choice for several reasons:
- Partitioning: Trino’s partitioning support was a crucial factor in our decision to adopt it. By segmenting data into manageable chunks, it allows queries to target only the relevant subsets of data, improving response times and reducing resource usage.
- AWS S3 as storage: By storing all our data on AWS S3, we get high durability in a cost-effective way, and S3’s proven scalability can absorb this massive amount of data while continuing to scale over the long term as Nubank grows. A sketch of what a partition-pruned query looks like follows below.
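To illustrate how partitioning shapes day-to-day usage, here is a minimal sketch using the open-source Trino Python client. The host, catalog, schema, table, and column names are all hypothetical placeholders, not our real ones; the point is that filtering on partition columns lets Trino prune partitions and scan only the relevant files on S3.

```python
import trino  # open-source trino-python-client

# All connection details and names below are hypothetical placeholders.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="oncall-engineer",
    catalog="hive",
    schema="logs",
)

cur = conn.cursor()
# Filtering on the (hypothetical) partition columns `dt` and `service`
# lets Trino skip every partition outside this day and service.
cur.execute(
    """
    SELECT ts, level, message
    FROM app_logs
    WHERE dt = DATE '2024-06-01'
      AND service = 'payments'
      AND level = 'ERROR'
    LIMIT 100
    """
)
for row in cur.fetchall():
    print(row)
```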

To store the logs, we chose the Parquet format. Its columnar storage gives us the best search performance, while also achieving an average compression rate of 95%. This helps us achieve the goal of storing all our data in the most efficient way.
To generate all this Parquet, we built a highly scalable and extensible Parquet generator app, capable of transforming the massive volume of data coming from the ingestion platform into Parquet files; a minimal sketch of this conversion is shown below. The choice to build this infrastructure in-house also reinforces our goal of having a cost-effective alternative that we can extend and adapt to Nubank’s needs.
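As a rough illustration of that conversion step, here is a minimal sketch using pyarrow. The schema, the sample records, and the choice of the zstd codec are assumptions made for the example; the real generator handles vastly larger batches and writes the resulting files to S3.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical log records; the real schema and batch sizes differ.
records = [
    {"ts": 1717200000000, "service": "payments", "level": "ERROR",
     "message": "charge failed"},
    {"ts": 1717200000123, "service": "payments", "level": "INFO",
     "message": "retry scheduled"},
]

table = pa.Table.from_pylist(records)

# Columnar layout plus a codec like zstd is what makes high compression
# rates on repetitive log data possible.
pq.write_table(table, "app_logs.parquet", compression="zstd")
```

Because each column is stored and compressed together, repetitive fields such as service names and log levels compress extremely well, which is where compression rates like the one above come from.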

With our query and storage platform fully integrated and operational, we have successfully redefined how Nubank manages its log data. The strategic choice of Trino for querying, S3 for storage, and Parquet as the data format ensures that our logs are not only efficiently stored but also readily accessible for analysis and troubleshooting. These innovations have not only resolved our initial challenges but also equipped Nubank with a powerful tool for future growth.
Final Thoughts
Since mid-2024, Nubank’s in-house logging platform has been the default for log storage and querying. It currently ingests 1 trillion logs daily, totaling 1 PB of data. With a 45-day retention period, it stores 45 PB of searchable data. The platform handles almost 15,000 queries daily, scanning 150 PB of data each day.
Nubank developed this in-house logging platform to achieve significant cost savings and operational efficiency, moving away from reliance on external vendors. This platform is designed to support all current and future operations, scaling efficiently while costing 50% less than market solutions, according to our benchmarks.
This approach also provides Nubank with unparalleled control and flexibility. It enables rapid iteration, custom feature development, and a deeper understanding of data flows, leading to improved analytics, troubleshooting, and security.
Challenging the status quo is a core Nubank value, and this ambition drove the creation of an entire log platform from scratch, leveraging a combination of open-source projects and in-house software development.