[00:00:06.000] - Robby Russell
Welcome to On Rails, the podcast where we dig into the technical decisions behind building and maintaining production Ruby on Rails apps. I'm your host, Robby Russell. In this episode I'm joined by Florent Beaurain, a longtime Rails engineer at Doctolib, home to one of the largest Rails monoliths in Europe. Doctolib runs on over 3 million lines of Rails code, with hundreds of engineers contributing daily. Their test suite? More than 80,000 tests per commit, which takes 130-plus CPU hours. Florent shares how his team revisited Rails defaults to improve developer experience and cut infrastructure costs, like dropping one engine's test time from seven minutes to just under one. We talk about what slows big test suites down, how to fix it, the hidden cost of using factories, why Packwerk didn't quite live up to the dream, how they route read traffic across Postgres replicas, and lessons from years of Rails upgrades in a fast-moving organization. Florent joins us from the north of France. Alright, check for your belongings. All aboard. Florent, welcome to On Rails.
[00:01:08.180] - Florent Beaurain
Thanks Robby, thanks for inviting me.
[00:01:10.420] - Robby Russell
So a question I like to start with is: what keeps you on Rails?
[00:01:14.980] - Florent Beaurain
When I was at school for development and engineering, we did a lot of C and C++, and I basically hated it. I almost quit school to do something else, which would have completely changed my life. I think I was in my third year or so, and we had this big two-year project to complete our studies, and it was the only project where we were free to choose the technology. One of the guys at my school was doing Ruby on Rails, and he talked a lot about how amazing it was. So my group decided to pick it and try. That was the first time we touched Ruby and Rails, and I was so impressed by how easy everything was. Everything I needed was already there: either in the standard library, or in Active Support, or there was a gem for it. The experience was amazing; it was a revelation. I thought, that's what I need. I'm not smart enough for the C and C++ world where you always rebuild everything yourself, but this is the kind of thing I like. That's what brought me to Ruby, and I've never changed since. It's been almost 10 years now, and I've done nothing but Ruby and Ruby on Rails.
[00:02:28.510] - Robby Russell
Do you remember approximately what version of Rails that was? I'm trying to remember off the top of my head.
[00:02:33.550] - Florent Beaurain
Five, I think, or four. Yeah, the transition between four and five.
[00:02:38.830] - Robby Russell
You know, for me, having been in the Ruby on Rails ecosystem for over 20 years now, it's always interesting to talk to people who were introduced to Rails at a different point in its life cycle. You mentioned that there were all these gems available to do a lot of what you wanted to accomplish, so there was this huge ecosystem that had already been around for a decade. Comparing that to your experience with the other languages you were learning in school, do you feel like Rails is what kept you interested in computer science?
[00:03:08.570] - Florent Beaurain
In many ways, yeah, 100%. When I say I was on my way to quitting, it was the truth. It was clearly not for me; my grades were not good enough at school, and so on, because it didn't keep me interested. I thought, okay, it's too hard for me.
[00:03:21.310] - Robby Russell
Were you thinking about software development at that point in terms of how you would use the technology? When you were learning computer science, were you thinking, I want to build web application tools, or backend, or something closer to the hardware? Was there something drawing you specifically toward a more web-centric area of development?
[00:03:40.830] - Florent Beaurain
Not at the beginning. It was my internship that brought me to the web, because when you try to find an internship, at least in France, most of what you'll find is basically web work. So that's how I started web development, and my final project was in Ruby and Rails, for the web. That's where I discovered what an amazing platform it is. You can build so many things so easily, and you can push them to millions of people. The experience was amazing, so I kept pushing on that. And that's also where most of the jobs are.
[00:04:14.760] - Robby Russell
That's true. Being able to deploy an application to the Internet and have anybody access it anywhere with their web browser is very different from shipping some physical product and hoping people will buy it so your code might end up running on someone's device somewhere. It's interesting. I learned a lot about the organization you work for, Doctolib, and I think it's one of the largest companies using Ruby on Rails in Europe. Is that correct?
[00:04:40.550] - Florent Beaurain
Yes, probably. At least in France.
[00:04:42.710] - Robby Russell
We had a brief conversation before this, but I know that Doctolib's CI suite currently runs 84,000 tests, which consumes over 130 CPU hours on a full run.
[00:04:56.380] - Florent Beaurain
Exactly.
[00:04:56.980] - Robby Russell
I'm just trying to wrap my head around that a little bit. How did it get to that point, first of all?
[00:05:00.780] - Florent Beaurain
So I joined eight years ago, and we had like 5,000 tests. It was running on Jenkins at the time, and it was working pretty well. When I joined, we were about 15 engineers, and that was the point where the company started growing a lot. We were hiring 100 people per month, with 10 to 15 engineers per month joining the company. So it was pretty fast. Quickly we had a massive number of engineers, and we started to write a lot of tests, so this number grew quickly, quickly, quickly. The time on Jenkins was no longer acceptable, so we migrated to, I think it was Heroku CI at the time. We had the parallelization of Heroku CI with 15 workers, if I'm not wrong, which kept the CI time a bit lower. Then we outgrew that too, and we ended up in this situation where the infrastructure we run CI on is currently bigger than our production platform.
[00:06:05.680] - Robby Russell
In that context where you were onboarding a lot of new developers: you mentioned you started there with around 5,000 tests, and now there are 84,000 tests in your test suite. When you joined, how was the code-to-test ratio? Was it pretty consistent with what it is now and the application has just grown that much more, or was there a considerable amount of time invested in writing tests that needed to already be there? Was there a lack of test coverage when you joined?
[00:06:31.290] - Florent Beaurain
If your question is whether we had a lack of coverage when I joined, that was not the case. It's really been linear growth based on the capacity to ship more. It was already very much in the culture to write a lot of tests when I joined, and mostly end-to-end. We can talk about that later. So that explains the number of tests and the big number we have now.
[00:06:53.820] - Robby Russell
I think for people listening: well, if you have a bunch of automated CI running and it's taking that long, is that really a problem? Is that something you needed to address or focus on improving?
[00:07:05.330] - Florent Beaurain
The duration we have on pull requests? Yes, we've had to work on that a lot. You have basically several options on the table. I think it's like scaling a web application: either you throw money at it, or you throw manpower at it to try to make it faster, or at least more performant. We actually do both. We throw a lot of money at it to have more parallelization, so we basically add more servers to treat one commit. At some point we were launching like 315 servers for one commit.
[00:07:37.330] - Robby Russell
Oh my gosh.
[00:07:38.370] - Florent Beaurain
That's basically one way of fixing it, and the other way is putting some people on it to try to make it more performant. The third way we have is finding a way to launch fewer tests: basically, based on the pull request, select the tests you have to run. That's another way to reduce it, so we do that too. And yeah, that's how we have to scale it. At some point the duration becomes a problem for the velocity of the teams, and we have to find a way.
[00:08:07.610] - Robby Russell
Yeah, yeah, I could see that potentially causing a lot of bottlenecks, I would imagine, if it takes that long and, say, a couple of tests break or you have a couple of flaky tests. For context, how many engineers do you have right now? Is it over 700 engineers?
[00:08:21.410] - Florent Beaurain
We have somewhere between 400 and 500 engineers working on the monolith.
[00:08:25.660] - Robby Russell
Four to five hundred, my apologies, not 700. Given that, if there's a scenario where people are pushing things, I could imagine a lot of branches getting stuck in a merge queue because of failing tests, when it takes that long. When someone's trying to finish a project or a task, that feedback cycle... I mean, they must not be able to run the whole test suite reliably on their own local development machine, can they?
[00:08:51.980] - Florent Beaurain
No, locally it's impossible. That's currently an issue that we have. We don't have a merge queue, and the stability of the main branch is crucial, because if for any reason the main branch is red, basically everyone is kind of shut down. Currently, I think the pipeline takes 40 or 45 minutes. So if you rebase on main and, unlucky you, the CI was red, or becomes red a few minutes later, you launch it, come back one hour later, and it's red. You have to rebase, relaunch, and you've lost one hour. So yeah, clearly. That's where test selection is really important to try to reduce the time. If you don't touch many things and you work in your own engine, you can expect to run a few hundred tests, so it should be pretty fast. Not that fast, because there are incompressible things in the workflow, like building the Docker image and so on, but that's what you can expect before starting.
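The diff-based test selection Florent describes can be sketched in a few lines of plain Ruby. The `engines/` directory layout and the mapping rule below are assumptions for illustration, not Doctolib's actual conventions:

```ruby
# Hypothetical sketch of diff-based test selection: map the files changed in a
# pull request to the engine test directories worth running. The engines/
# layout is an assumed convention, not Doctolib's real one.
def tests_for(changed_files)
  changed_files
    .filter_map { |path| path[%r{\Aengines/([^/]+)/}, 1] } # extract engine name
    .uniq
    .map { |engine| "engines/#{engine}/test" }
end

tests_for(["engines/booking/app/models/slot.rb", "app/models/user.rb"])
# => ["engines/booking/test"]
```

In a real system, files that fall outside any engine (like `app/models/user.rb` above) would trigger a fallback to the full suite rather than being silently dropped.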
[00:09:47.610] - Robby Russell
How long ago did you start focusing on improving the performance and speed of your test suite?
[00:09:55.810] - Florent Beaurain
I have been at Doctolib for eight years, and I think it has been the work of a full feature team for those eight years.
[00:10:03.570] - Robby Russell
And so over the last eight years, you've been keeping an eye on these things as the team grows and the code base grows. I know you mentioned some slowdown that came from, I think you specifically talked about, the database in particular. With a typical Ruby on Rails test suite, one approach is that you run your tests and the framework is constantly resetting your test database between tests. What's potentially wrong with that, or what doesn't work about it?
[00:10:31.730] - Florent Beaurain
Okay, so I was tasked with working on improving the developer experience, so it was not specifically CI-oriented. It was really: okay, we want to improve the developer experience locally. Like Shopify, we have split the monolith into what we call engines; at Shopify, I think they name them components. They're small boxes where people put their code, and when I say their code, it's their applicative code but also the tests that test it. We were like, okay, if we want to improve the developer experience, one way to do that is to have people be able to launch their tests locally. So, get out of this CI-driven development workflow we had, where we just do stuff locally, test it in development in the UI, then push it and hope that CI is green; and if it's not, take the failing tests and start from there, but you've already lost 45 minutes. Our take was: if the engine is isolated enough, just launching the engine tests should be enough to get the big picture of whether CI will fail or pass.
[00:11:40.910] - Florent Beaurain
But even just launching this subset of tests locally was pretty slow. When I say pretty slow: for something around 300 to 400 tests, it took more than six to seven minutes on an M4 Mac. So, pretty slow, and we wanted to improve that. Most of my teammates were like, yeah, but you know, it's just a bit slow for now. So I made this small video to back the project, where I took a vanilla Rails app, created thousands of tests, and launched them locally with the same setup as ours: Docker running a PG database, et cetera. And it was incredibly fast. It was a matter of seconds to launch thousands and thousands of tests. So I was like, okay, Ruby is not the problem, Rails is not the problem. And if you follow David on Twitter, you see he does a lot of benchmarks with test suites where he launches thousands of tests, and it's fast. So why is it not fast on our side? I made a couple of flame graphs and such to see what the bottleneck was in our Rails application.
[00:12:47.410] - Florent Beaurain
And I came to several conclusions. One of the main bottlenecks is the database. Several things happen at the database level: we were resetting the database between each test. That's a common pattern, but it is very slow, and it's even slower in our application because we have multiple databases. For each test we don't reset only one database, we reset 10 databases, and every time we add a new database it gets slower and slower. So we had to change that. It accounted for almost, I would say, something like 30% of the test execution time for unit tests; not for end-to-end tests, of course, but for unit tests it was a massive amount. The second conclusion was factories. Factories were pretty slow because they interact with the database, a database which runs in Docker on macOS, so not a perfect world: the database is basically slow by itself. And not all factories are well built. Some crunch a lot of Ruby, trigger events that enqueue stuff, that create other objects, et cetera, et cetera. So we have a cascade creating a huge number of objects just to build, for example, one account.
[00:14:03.970] - Florent Beaurain
So when I measured it, we were spending like 50% of the time in factories for tests.
[00:14:09.170] - Robby Russell
One of the things you mentioned is that it's a common pattern to reset your test database between tests. Do you have a good sense of why that's important? Is it to maintain consistency, or just to avoid any weird things when you're trying to run a quick test?
[00:14:23.700] - Florent Beaurain
So, I would not advise anybody to skip resetting the database between tests.
[00:14:28.180] - Florent Beaurain
Okay.
[00:14:28.340] - Florent Beaurain
It's really important to avoid having test data that leaks into other tests, so you keep your tests consistent and avoid flakiness, for example. But there are other ways of doing it than just taking all the tables and truncating them. That's what we were doing: basically a slightly more optimized version of truncating all the tables, but that was essentially it.
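The truncation-style reset he describes looks roughly like this. The connection list and method names are assumptions, a sketch rather than Doctolib's actual code; the key point is that the sweep repeats for every table of every database, on every single test:

```ruby
# Sketch only: a truncation-style reset, run between every test.
# `test_databases` is a hypothetical list of connections; with 10 databases,
# every test pays this full sweep 10 times over.
test_databases.each do |connection|
  connection.tables.each do |table|
    connection.execute("TRUNCATE #{table} CASCADE")
  end
end
```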
[00:14:50.600] - Robby Russell
You also mentioned having, say, 10 different databases that you're connecting to. In what context would that happen? Is this a multi-tenant type of application, or are there just 10 different databases for different types of data that need to be very distinct?
[00:15:05.880] - Florent Beaurain
Yeah, it's for scalability reasons. I think we can get to that later.
[00:15:09.960] - Robby Russell
Okay.
[00:15:10.760] - Florent Beaurain
But that's how we scale up basically.
[00:15:13.240] - Robby Russell
To circle back: you talked about using flame graph tools, and that led you down the path of looking at how the test suite was resetting the databases between every test it ran. You identified factories, and the concern that they weren't entirely efficient in some respects. Were there other things you noticed during that research process?
[00:15:33.860] - Florent Beaurain
So: factories, the resetting of the database, and then what turned out to be bloat we have added over the years. The monolith is now 12 years old, so we have added a lot of small things here and there, because someone wanted to fix something, because they had a red build, or because we wanted to put in a safeguard, you know, to keep someone from falling into the same trap again. All those things are done with good intentions at first, but without always monitoring the impact. You put it there, but you don't really know the global effect; all you know is that it fixes your problem. And each of those things adds 1 ms here, 2 ms here, 3 ms here, but at the end of the day, when you have 80,000 tests, it becomes a massive problem. So that was basically it. The three problems were the database resets, the factories, and the bloat we have added over the years.
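The "milliseconds here and there" framing is easy to check with back-of-the-envelope arithmetic. The 5 ms figure below is an illustrative assumption, not a measured number from Doctolib:

```ruby
# Illustrative arithmetic: a few milliseconds of accumulated per-test overhead,
# multiplied across 80,000 tests. The 5 ms overhead is an assumption.
tests = 80_000
overhead_ms = 5

total_seconds = tests * overhead_ms / 1000.0 # => 400.0
total_minutes = total_seconds / 60           # roughly 6.7 minutes of pure overhead
```

And that is per CI run, before any parallelization; every extra millisecond of per-test setup costs another minute and twenty seconds of cumulative CPU time.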
[00:16:34.470] - Robby Russell
So you identified these different areas. Then as an organization, or maybe as your team thinking about the developer experience, how did you begin to prioritize finding solutions to improve that situation?
[00:16:49.750] - Florent Beaurain
I presented my research, basically, and I said: look, this is what people get with a vanilla Rails app, this is our current experience, and here are the things that differ between the two, so these are the three things we can work on. They were pretty convinced that we had a case, so they allowed us to work on it a bit and build a proof of concept. We had a couple of weeks to migrate one of our engines to a new testing setup that was fast enough, or at least faster than what we had. So that's what we did. To be clear, we don't usually do that; we don't usually rewrite things, we're more into small incremental improvements. But in this case it was too much. We had too much bloat; if you touch one thing, everything falls apart. So we should probably restart. And I was pretty convinced there wasn't that much of it, in fact, and we could migrate a big part without having to reconstruct everything. So that's basically what we did: we created our own test classes, based on the vanilla Rails test classes, and we put nothing more in them.
[00:18:04.190] - Florent Beaurain
We started with one engine, so of course we had a lot of red tests. We brought back the code needed to make them green and went back and forth like that: okay, this helper is missing; this helper is missing; we had this configuration before, do we need it or not? We even changed our tests a bit to make them pass. Thanks to that, we built these new classes for the engine, and it was a win. When we started, this engine took something around seven minutes to run, and we got down to two minutes. We added a bit of parallel testing to get the extra juice from the Mac M4, and we got below one minute.
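The parallel-testing piece has been built into Rails since 6.0; a minimal sketch of what enabling it looks like in a `test_helper.rb` (this is the stock Rails mechanism, not necessarily Doctolib's exact configuration):

```ruby
# test_helper.rb (sketch): Rails' built-in parallel testing, forking one
# worker per core to use the whole machine.
class ActiveSupport::TestCase
  parallelize(workers: :number_of_processors)
end
```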
[00:18:44.760] - Robby Russell
Oh wow.
[00:18:46.040] - Florent Beaurain
So then we shared that, and people were like, okay, we need that. And so we got the time to work on the full code base migration.
[00:18:54.680] - Robby Russell
So you were able to identify one of your engines. How did you go about selecting it? Was it one you felt pretty confident about because your team had a lot of exposure to it? You also mentioned you already had a pretty reliable test suite, so I'm trying to imagine you in an editor looking at your existing tests. Out of curiosity, what testing setup did you move from, and what did you move to?
[00:19:16.760] - Florent Beaurain
For the engine selection, it was a trivial choice. We took one where we have code ownership, so it's easier to modify things since we don't rely on someone else's approval process. We also tried to take one that's not too difficult to migrate, without a lot of end-to-end tests or mobile tests. So we tried to pick an easy one, but not too easy, so it's kind of representative: an average engine, a recent one without too much legacy. That's what we targeted. It was a trivial choice, but...
[00:19:55.200] - Robby Russell
What test suite framework are you using there?
[00:19:58.160] - Florent Beaurain
So we are using Minitest for the whole test suite. We just have minitest-spec-rails, I think, to include a bit of spec-style syntax in ActiveSupport::TestCase, but that's it. It's Minitest, and we haven't changed that.
[00:20:14.080] - Robby Russell
You know, in one of our previous conversations, you described this as a hard reboot of your test architecture, leaning back into Rails defaults. Did you feel like there were a lot of defaults that had been changed over those years? You mentioned people with good intentions whose changes had global impacts that made things a little slower. What were some of the patterns you noticed? Were there configuration things where you realized: we don't actually even need this?
[00:20:41.450] - Florent Beaurain
Yeah, I think the two big things were, first, not using fixtures and using factories instead. Ten years ago, that was the de facto default: you'd run rails new and basically put factory_bot in it, and that was the way to go. I think Doctolib started like that. So that's one thing we really wanted to change: go back to fixtures. The second thing is transactional testing. Instead of resetting our database with custom truncation code, we started to use the transactional tests that exist in vanilla Rails. The idea behind transactional tests is that when your test starts, it takes a transaction on your database; everything you do is done inside the transaction, and at the end of the test the transaction is rolled back, which is way faster than truncating all the tables. Those are basically the two things we've done to go back to Rails defaults. When I say we have done them: the transactional tests are done; the fixtures work is ongoing, it's a bit more complicated, but we have started to convert some factories to fixtures.
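Both defaults he mentions amount to a couple of lines in a vanilla Rails `test_helper.rb`. This is a sketch of the stock configuration, not Doctolib's actual setup:

```ruby
# test_helper.rb (sketch of the Rails defaults):
class ActiveSupport::TestCase
  # Wrap each test in a transaction and roll it back afterwards,
  # instead of truncating every table between tests.
  self.use_transactional_tests = true

  # Load the YAML fixtures up front; the rollback restores them after each test.
  fixtures :all
end
```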
[00:21:50.420] - Robby Russell
So the application started being developed a couple of years before you were introduced to Rails. Do you have a sense of why the Ruby on Rails community started to embrace patterns like factories over fixtures?
[00:22:02.630] - Florent Beaurain
I guess fixtures are one of those things where, when you're introduced to them, you don't like them. It's YAML files; it feels inflexible. You also feel like you start your test with tons of data, some you need, some you don't. It doesn't feel like the right way. So yeah, I think they're not flexible enough, and that's why people preferred factories: because they're flexible. But that's also, I think, the drawback of factories. It's kind of like: we don't like fixtures because they're inflexible, but that's their strength; and we like factories because they're flexible, but that's also their main drawback.
[00:22:43.000] - Robby Russell
It's such an interesting thing, because it's been such a long time now since that began to permeate the community. But I think at the time it was a developer ergonomics thing. Working with and interacting with factories felt a little friendlier to us as developers versus handcrafting and maintaining YAML files that are...
[00:23:09.980] - Florent Beaurain
Yeah.
[00:23:10.780] - Robby Russell
...well-defined and strict. It's kind of like working in Python versus Ruby, where you've got to think about formatting and such. So we were like, oh, this is so much better, I can express myself. But then you fast forward several years, and all of a sudden that's the thing slowing down your developer experience when running your test suite, and causing other weird side effects. So I wonder: do you think there's a world where there's something in between that would make the developer experience of writing tests a little nicer, so we're not thinking about fixtures in the same way, but we still get that fixtures are faster? Where is the balance there?
[00:23:43.000] - Florent Beaurain
So yes. In our case we're still keeping both: we have factory_bot and some fixtures, so we're in the middle ground. If I had to start fresh, one of the things I would do, and that's what I do on my pet project, is use fixtures. Shopify has a tool named Factory Fixture or Fixture Factory, I never know which, and basically it's a tool you can add to your app where you can take a fixture and say: I want this fixture, but a bit different. So it's kind of a middle ground where you have fixtures, and thanks to the fixtures you can test 80% of your app, and then you use the fixture factories to create some objects to test edge cases, et cetera. Just some helpers on top of it, basically. I kind of like this pattern.
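The "this fixture, but a bit different" idea can be sketched in plain Ruby. The helper name and data below are hypothetical and deliberately simplified; this is not the actual API of the Shopify gem he mentions:

```ruby
# Hypothetical sketch of the fixtures-plus-overrides pattern: shared baseline
# data, with per-test tweaks layered on top.
FIXTURES = {
  bob: { name: "Bob", admin: false }
}.freeze

def build_from_fixture(fixture_name, **overrides)
  FIXTURES.fetch(fixture_name).merge(overrides)
end

build_from_fixture(:bob)               # the shared fixture as-is
build_from_fixture(:bob, admin: true)  # same fixture, one override
```

The appeal is that the fixture stays the single source of truth for the common case, while each edge-case test states only the attributes that make it different.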
[00:24:38.010] - Robby Russell
I'll definitely include links to that in the show notes for everybody once we track it down. Is that one of Shopify's gems, you think?
[00:24:44.170] - Florent Beaurain
Yes, it's a Shopify gem, Factory Fixture, I think. And it's really nice: when you use it, you declare a factory and say, I want a user like Bob, but I want it to be an admin, for example. You just put admin: true, and you have something like your fixture Bob, but with admin.
[00:25:05.740] - Robby Russell
I would imagine that must add a little latency, take a little more time to process, but if you use those sparingly, you might be able to keep your test suite running as fast as you possibly can, I suppose. So going back to your story: as an organization, you identified an engine, went through it, and got that engine's test suite from approximately seven minutes down to less than a minute. Then all of a sudden the rest of your team gets it. Were you able to get more buy-in to start tackling more engines, or did you start going toward your larger applications? And was your team responsible for rewriting tests in this new approach, or was that spread out to the rest of the organization?
[00:25:50.310] - Florent Beaurain
Very good question. We had some disagreement around that. I was pushing more to, you know, take the opportunity of a great reset to rebuild everything, meaning we build the framework and give teams a deadline to migrate to it. We'd do everything we could to make the migration easy for them, but let them handle it, so we'd get some smarts out of it: they migrate, they see things, they report to us, we fix, we improve the solution, and we can also say, no, these kinds of things we don't want to see anymore. There were a lot of patterns in the code base that were no longer acceptable from a performance standpoint. Overall, that didn't convince people, so we went for another solution where we migrate everything ourselves. We had to make some trade-offs there, since we couldn't rewrite every test for everybody, so we had to keep maximum compatibility with what we had in the new framework. I think we lost a bit of the performance gain, but overall, we migrated.
[00:27:00.740] - Florent Beaurain
I think we're now at 90% of the code base on the new framework. In three months, I think that's an achievement, and we're pretty happy about it.
[00:27:11.750] - Robby Russell
This episode of On Rails is brought to you by Concerns, the lightweight supplement for bloated models and scattered logic. Are your controllers overworked? Models doing too much? You might be a candidate for Concerns. Just one include a day can help extract shared code across your app, whether or not that code actually belongs there. Concerns are modular, reusable, and questionably named. Side effects may include unclear ownership, callback confusion, and saying "we'll refactor this later" at least once a week. Ask your tech lead if Concerns are right for you. Concerns: because everything has to go somewhere.

Can you tell us a little more about introducing the new test classes while you still had legacy test cases to support in the transition? Did you use any tools to automate much of that, or was it primarily copy-pasting between files? What did that look like?
[00:28:02.250] - Florent Beaurain
It was one of my teammates, François, who did it. A lot of it was just search and replace, basically, swapping one class name for another and launching it on CI. I think he split it by engine, migrating one engine, then another, et cetera. He also used a bit of AI, nothing too complex, to migrate it. We had a Confluence page explaining the breaking changes; the point was that if users wanted to migrate themselves, they could. So we maintained this list: if you have this error, you have to do this; if you have that error, you need to include that. And I think he fed that document to the AI to migrate some of the tests.
[00:28:42.140] - Robby Russell
One of the things we didn't touch on: is Doctolib's platform primarily a monolith with a bunch of engines, or are there a bunch of other external services? Or is it a hybrid situation?
[00:28:54.700] - Florent Beaurain
Ten years ago, we had this monolith. Over the last six or seven years, we migrated from that monolith to a monolith with engines inside it. And for the last two years, we've also had some external services. So we have this big monolith, and 80% of the traffic still goes through it, but we also have some new services.
[00:29:15.350] - Robby Russell
Are those also built with Ruby on Rails, or are you using other technologies and frameworks for those?
[00:29:20.470] - Florent Beaurain
No, most of them are in Java. We have a bit of Node, a bit of Elixir, some in Rust, but most of them are in Java. The strategy is to build the external services in Java for now, and because of acquisitions and things like that, we also have some services in other languages.
[00:29:39.010] - Robby Russell
I see, I see. And then another thing: I believe you folks are using Packwerk to modularize parts of your code base.
[00:29:46.930] - Florent Beaurain
So yes.
[00:29:47.890] - Robby Russell
Were you around when the decision was made to start using Packwerk?
[00:29:51.810] - Florent Beaurain
Yes. In fact, it was my team that decided to do it. I was on the architecture team we had at the time, and that's when we had this big monolith code base and we were like, okay, it cannot fit in one head. We had to find a way for a team to be able to work in these small boxes and only have to keep those boxes in their head; to have several small monoliths, if you like. The strategy was to use Rails engines to build these small boxes, and that's where we started, exactly like Shopify with components. I'm not sure they use Rails engines to do it, but anyway, it's the same idea. Shortly after, they released Packwerk, and we were like, okay, it's a no-brainer for us, because it's basically exactly what we are doing and what we want: the idea of small packages that have their own public API, so if someone wants to talk to one, they have to go through that public API, and we have this declaration of dependencies, et cetera. So it was a no-brainer for us. We adopted it and we are still using it.
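For reference, a Packwerk package is declared with a small package.yml at the package root. The engine names here are made up, and the exact keys available vary by Packwerk version (privacy enforcement moved into an extension gem in later releases):

```yaml
# engines/booking/package.yml (sketch; names are hypothetical)
enforce_dependencies: true   # cross-package references must be declared below
enforce_privacy: true        # callers must go through this package's public API
dependencies:
  - engines/scheduling
```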
[00:31:03.180] - Florent Beaurain
But I think that's what you want to discuss: afterwards, we had a bit of a disillusion with it.
[00:31:09.820] - Robby Russell
Oh, interesting. Initially, as you were thinking as an organization about how to let teams focus on the area they're going to own, exposing a public API using something like Packwerk: were your teams already separated by different areas of your monolith? Do people wrap their heads around one part and focus there, or do people jump around quite a bit between different areas of your platform?
[00:31:38.070] - Florent Beaurain
So we have what we call domains. A domain is responsible for a big chunk of the product, and inside a domain we have feature teams, and each feature team owns a bit of the scope. That's how it's shared. Everything is really well divided inside the organization around that, and of course the codebase follows this pattern. Every file is owned by a team, each engine is owned by a team. It's really well defined which part belongs to whom. It's not about who can work on it; everyone can contribute. But everyone has their own part of the codebase.
[00:32:17.920] - Robby Russell
Did adopting Packwerk help the developer experience in terms of things like local dev performance or test reliability, or was it more about letting people work in their domain and focus there?
[00:32:31.760] - Florent Beaurain
I don't think so. That's my take, it's personal. But one of the arguments was: okay, we want to modularize the monolith, we want to do it for the developer experience, but it's also a great way to improve test performance. As I said a bit earlier, we have this test selection system that, based on the diff, runs just the tests we need. So if we decouple the application and the engines from the other parts, we should normally be able to run just a subset of the tests. If you work on your engine, you should be able to run your engine's tests and be pretty confident. In real life, I don't think it has improved the situation much.
[00:33:12.340] - Robby Russell
Can you speak to that a little more? If one of your teams works on a set of features, and maybe they're responsible for an engine or two, and they're running the test suite for their area, what doesn't work about that in, air quotes, the real world? Are they a lot more tightly coupled to other areas? Tell us more.
[00:33:30.660] - Florent Beaurain
So it's a big topic, but I think getting to a zero-dependency package is a chimera. We have not been able to do it, and if you want to do it in a big application after several years of work, and I'm not talking about a greenfield app, you will have to invest a massive amount of time to reach it. And by doing so you will probably make the code worse in terms of readability: you will have to invert a lot of dependencies, and because of that you will use events, which make everything worse for debugging and traceability. So yeah, it's a complicated topic, but I don't think zero dependencies is reachable. Also, Packwerk has its own limits. It's static analysis in Ruby. So even if you reach zero violations in Packwerk and you try to run your CI with just one engine, for example, chances are it doesn't work and you have other stuff to fix. And over time you will have to do it again and again, because someone will introduce a dynamic reference and it doesn't work anymore. So it's really a lot of work and a big investment to reach that.
[00:34:43.690] - Florent Beaurain
So it's a nice tool. We still use it, and I still think people should use it. But don't fall for the chimera that you will just work in your engine, run your engine's tests, and it will be amazing.
[00:34:57.140] - Robby Russell
You and your coworkers have seen a lot of benefits from it, but it didn't necessarily deliver on that promise, the illusion that you're going to be immune to a lot of the issues you already have in your codebase.
[00:35:11.300] - Florent Beaurain
Yeah, exactly. That's exactly it. You won't be immune to that. Also, Packwerk tells you there's an issue here, but it doesn't tell you how to fix it. It's not always that easy to fix a dependency. That's what I was saying: sometimes we make it worse by trying to fix it. So yes, we are not immune. The tool is not perfect. I mean, it's just a tool.
[00:35:32.850] - Robby Russell
When you get to that number of engineers, I can only imagine. I've never worked in that type of space myself, so I have no concept of just how much is happening on a day-to-day basis. Trying to protect the codebase as much as you can, and protect the individual developers, those are a lot of competing concerns. You want to be able to move fast, you don't want to slow down the velocity of your dev team, but you also don't want everybody just throwing code wherever, and there isn't someone there to make a decision on every single thing. So it's an interesting volume of code coming in. Out of curiosity, are you also using any AI tooling now, doing code generation with AI and taking advantage of that? Is that helping at all?
[00:36:14.980] - Florent Beaurain
We have basically full access and we can leverage it as much as we want. We have access inside the IDE, we have access inside GitHub Actions, so we can automate a lot of stuff, and we are encouraged to. I think they're great tools. I don't use it much myself for code generation; I don't really like it, you know, like Copilot autocomplete, et cetera. I prefer to type the code myself. I use it more like a pair programmer, or when I want a second opinion on something, or to shape my thinking on a draft, that kind of stuff. But I think it's a really cool tool for onboarding into a new codebase. I know a lot of new joiners use it to learn how we do things in the codebase, because of course at our size you try to stick to the defaults and to how Rails works, but there is some stuff where that's not enough and you want more, so you build your own. People need to know how you do that, so they use the tools a lot for it.
[00:37:15.690] - Florent Beaurain
Myself, I use it a lot to build workflows to mitigate things. For example, we have feature switches inside the codebase, and every feature switch has an expiration date. Your team wants to build a feature, they create a feature switch, and they set the expiration to, say, three or six months out. Often teams forget to do the cleanup: the feature switch is enabled in production, but it's still in the codebase and there's a dead branch that's useless now. So I have this workflow that tries to clean them up: take the expired feature switches, open a pull request, ping the team, and they just have to review and make a few adjustments if needed. For that kind of thing I find it pretty useful. This kind of work was not that easy to automate before, because you need a bit of logic. It's not just search-and-replace: if the switch is enabled, I have to remove the check and refactor the file, so it's a bit more complex. The LLM brings this bit of intelligence, if I can say that, which now lets us automate this kind of stuff.
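The expiration-date convention can be sketched in a few lines of Ruby. The switch names and fields here are hypothetical, not Doctolib's actual schema; the point is only that expired switches are the ones the cleanup workflow opens pull requests for:

```ruby
require "date"

# A feature switch with an expiry date set at creation time.
FeatureSwitch = Struct.new(:name, :expires_on, :enabled)

# Switches past their expiry are candidates for automated cleanup.
def expired_switches(switches, today: Date.today)
  switches.select { |s| s.expires_on < today }
end
```

A cleanup job would then iterate over `expired_switches`, remove the dead branches, and ping the owning team for review.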
[00:38:21.740] - Robby Russell
That's pretty interesting. Actually, I'm really curious about your feature switches. Is that your way of deprecating a set of features that you're going to remove at some point, or something that just needs to turn on at a certain point? Or is it, like you mentioned, kind of like a feature flag? In what context, what pattern does Doctolib use with your teams for "we need to remove this at some point"?
[00:38:39.240] - Florent Beaurain
We use it for basically everything. It's a complete system now, with factor feature switches, country feature switches, et cetera. It's how we deal with removing a feature as much as introducing a new feature to a cohort of clients. It's also how we change a query: for example, sometimes we just put a feature switch on a query change because we're not sure the new query will be performant enough in production. So it's really a way to release any sort of code, or remove any sort of code, while protecting the stability of the platform.
[00:39:12.200] - Robby Russell
When does a team need to make that sort of call at that granular a level? A team might have a hunch or a suspicion that this query could be less performant once it rolls out to production, so they want to test it. Does that speak to not being able to run a test against production-like data in a staging or QA environment? Your local development is never going to have that amount of data, right?
[00:39:36.330] - Florent Beaurain
Yeah, we don't have that amount of data in staging and test environments, so we can't really test it there. Even if we had the data in staging, it's not the same, you know, so the query plan won't be the same. It's really hard to replicate that in other environments. So we really encourage people to use a feature switch for any kind of change; we've made it really trivial to add them. If things go wrong, we prefer that it's behind a feature switch. So yeah, we put basically everything behind one.
[00:40:05.690] - Robby Russell
In those situations where you've got these feature switches, and I'm going to ask kind of a dumb question here, but doesn't that introduce a bit of a performance implication itself, having to check the current status of these feature flags as you're executing the code? Does any of this ever get cached on the server that's running?
[00:40:26.490] - Florent Beaurain
So the feature switches are stored in the database, so of course there is a certain overhead. But we've been using them for so long that we've optimized it: there is caching at the request level, there is caching at the worker level. It's not like every time you check a feature switch we retrieve it from the database. It's now almost transparent.
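A toy sketch of the request-level caching idea, assuming a hypothetical store callable (the real system also caches at the worker level):

```ruby
# Request-level feature-switch cache: the backing store (e.g. a
# database lookup) is consulted at most once per switch per request.
class FeatureSwitchCache
  def initialize(store)
    @store = store          # any callable answering enabled?/disabled?
    @request_cache = {}     # cleared at the start of each request
  end

  def enabled?(name)
    # fetch-with-block also caches `false` correctly
    @request_cache.fetch(name) { @request_cache[name] = @store.call(name) }
  end

  def reset!                # e.g. called by a Rack middleware per request
    @request_cache.clear
  end
end
```

With this shape, a request that checks the same switch in ten places still costs one store lookup.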
[00:40:46.470] - Robby Russell
Is there a pattern there? If one of your teams ships something, say this example of a query where you want to see how it performs in production: do you tend to default the switch to off or on when you ship, or does it depend? Then it gets deployed to production, someone's looking at the metrics coming in, and they say, all right, I'm going to flip the switch and see what happens for the next 10 minutes. Oh, that didn't work out, turn it off. Is that how it works?
[00:41:15.860] - Florent Beaurain
We roll out with it false by default, and then we have a UI to activate it. That's basically what people use. As I said before, there are different types. There's the on/off, basically a boolean, but we also have a factor feature switch where we can release to a percentage of traffic, you know, 1%, 10%, et cetera. Depending on the criticality of what you're doing, you can use that; you can't use it in all cases, since there is no stickiness to it. People activate it, they monitor, and they can quickly turn it back off. We're also currently investigating auto-rollback on feature switches, to have an automated process. We use Datadog as an APM and we also have Sentry for error reporting. Actually, that's probably something for AI: we had the idea of using Datadog with an alert to auto-rollback, but I also have in mind to try something else. There's a Datadog MCP server, and Sentry has one too. I kind of want to try letting an LLM make the call on whether we should roll back or not. Basically, when a feature switch is turned on or off, I have an event, and then I can trigger the LLM to monitor Sentry and Datadog for the next 10 or 15 minutes and make the call on whether we should roll back.
[00:42:28.520] - Florent Beaurain
That's something I want to try.
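The percentage-based "factor" switch with no stickiness can be sketched like this (hypothetical class name; each check samples independently, so the same user may get different answers on different requests):

```ruby
# A "factor" feature switch released to a percentage of traffic.
# No stickiness: every check independently samples the rollout rate.
class FactorSwitch
  def initialize(percentage, rng: Random.new)
    @percentage = percentage  # 0..100
    @rng = rng
  end

  def enabled?
    @rng.rand(100) < @percentage
  end
end
```

At 10%, roughly one request in ten takes the new code path, which is enough to watch error rates and query timings before going wider.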
[00:42:29.800] - Robby Russell
Sounds interesting. I've been following a bit of what Sentry is doing in particular, because I follow one of their folks on social media, and I know he's been talking a lot about that. I'm curious to see how it pans out, because I've seen people talking about self-healing things, like having the error-reporting tools send you PRs with a potential fix for something. I'm like, that's fascinating.
[00:42:51.820] - Florent Beaurain
I think there is something like that in Sentry. I haven't tried it, but I believe there's something that investigates your stack trace. It will probably be fine in most cases. I mean, a lot of Sentry issues are trivial fixes, so I'm pretty sure an LLM can be good at it.
[00:43:09.660] - Robby Russell
Yeah, yeah, I think that could be quite interesting. Let's talk a little about scaling your database. I know that Doctolib is... I think you're using Aurora Postgres, was that right?
[00:43:20.540] - Florent Beaurain
Exactly. Yes.
[00:43:21.580] - Robby Russell
My understanding is that you've hit AWS's limits, and you're not the first team I've talked to recently saying this. You're running on the largest instances, right?
[00:43:30.620] - Florent Beaurain
Yep, we're running on the largest instances. We have 10 writers today, and each writer can have up to 15 readers, and we already have some writers that have reached that reader limit. So we have to remove stuff from those writers and put it elsewhere.
[00:43:49.750] - Robby Russell
Why aren't they just making larger systems? I'm being a little facetious there, but are you just storing too much data? Tell me more. When you get to that scale, where you're hitting the ceiling in that way, you'll eventually have to spin up the 11th and the 12th server. What's the situation you need to navigate there? You mentioned maybe needing to move data somewhere else. Tell us more.
[00:44:14.710] - Florent Beaurain
Yes, that's basically it. The bottleneck for scaling, for probably almost everyone, is the database. When I joined, we had one writer and several readers, and we pushed that as far as we could. With readers, you can have them, but you need to send queries to them, so we had some manual code inside the codebase to route queries to the readers, and we pushed that where we could. At some point the writer had too much load, so we needed to figure things out. When we looked into the writer, one of the issues was that it was handling so many reads, so we needed to find a way to offload those reads to the readers. So we built a solution to automatically send reads to the readers by analyzing the query. We parse the query and check whether it's a SELECT. If it is, we can probably send it to a reader. Then we check whether, in the same request, we've written to any of the tables the SELECT is reading from. If not, we send it to the reader.
[00:45:19.630] - Florent Beaurain
If yes, we keep it on the writer, because there is replication lag between the writer and the readers, so to avoid serving stale data we keep it there. We've done that and improved it over the years to offload the maximum number of reads onto the readers. And it was not enough: one writer, at some point, was not enough. Not because of the amount of data. I think the main writer holds something like 40 terabytes; that's not really the issue. The issue is more the number of operations we do on it every second; the I/O is the problem. So we needed a second writer. We added it, but once you have it, you need to migrate data from your main writer to the second one. So you need to select which tables to move, and you have to do it with care: you won't move one table, you'll move a group of tables that are closely related to each other, to keep the joins and the foreign keys, et cetera.
[00:46:25.660] - Florent Beaurain
And to do it safely, you first have to ensure there are no joins between the tables you won't move and the tables you will move, and you need to build tooling around all of that. And because you know this is not the last time you'll do it, you need to automate it. That's how we went from one writer to two, then two to three, and now we're at 10.
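A minimal sketch of the routing rule Florent describes, in plain Ruby with hypothetical names. Doctolib's actual implementation hooks into the query path inside Rails, but the decision logic is the same idea: only SELECTs go to a replica, and only when none of their tables was written earlier in the same request.

```ruby
# Routes queries to :reader or :writer for one request's lifetime.
class ReplicaRouter
  def initialize
    @written_tables = []   # tables written during the current request
  end

  def record_write(table)
    @written_tables << table
  end

  # sql:         the raw query text
  # tables_read: tables the query touches (from parsing the SQL)
  def route(sql, tables_read)
    return :writer unless sql.strip.match?(/\Aselect\b/i)   # writes stay put
    return :writer if (tables_read & @written_tables).any?  # avoid stale reads
    :reader   # safe to tolerate replication lag
  end
end
```

The second guard is what protects against replication lag: a read-after-write within one request would otherwise observe data the replica hasn't received yet.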
[00:46:44.010] - Robby Russell
It's interesting. I don't have it in front of me, but I think I saw someone at Intercom post something the other day about evaluating their reads, specifically how much data they were sending to their readers by specifying the column names they wanted to select. They found it was actually a pretty big performance issue: sending "select" plus 30 column names versus "select *", just the query text being sent from your app to the database. These are the types of issues you have at that scale. You don't want to bring everything back on every query, because that doesn't seem performant either, but you're also sending a bigger chunk of text to the server to get less data back. It's an interesting compromise I hadn't really thought about. Have you encountered anything like that yourself?
[00:47:43.620] - Florent Beaurain
Yes and no. We don't really measure that, and we don't really bother to select only what we need. We're still on the Rails defaults, if I can say, where we just select everything. But we have a few models where we know there are columns holding big chunks of data, and for those we have specific concerns to avoid selecting those fields, because we know they're too big. We also have tons of constraints inside the app for everything related to the database. When you write a migration, there are a lot of things you're forbidden to do, or must do, to be able to merge it. For every field we ensure there's a proper size limit; you cannot go bigger than that, because we've been burned before. And we have a lot of targets teams must meet: a table must stay under a certain amount of data, a certain number of columns. We have a lot of constraints on the database.
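The "specific concerns to avoid selecting big fields" idea can be sketched like this, with hypothetical table and column names. (In Rails itself, `ActiveRecord::Base.ignored_columns` is a related built-in for skipping columns entirely; the sketch below just shows the column-filtering logic in plain Ruby.)

```ruby
# Columns known to hold big payloads, keyed by table — hypothetical data.
HEAVY_COLUMNS = { "documents" => ["raw_content"] }.freeze

# Builds a SELECT that leaves out the heavy columns by default,
# so ordinary reads never fetch multi-megabyte fields.
def default_select(table, all_columns)
  columns = all_columns - HEAVY_COLUMNS.fetch(table, [])
  "SELECT #{columns.join(', ')} FROM #{table}"
end
```

Code paths that genuinely need the heavy column then opt in explicitly, rather than every list view paying for it.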
[00:48:38.130] - Robby Russell
I'm thinking about scenarios where you're introducing a new feature and you might need to modify existing data as part of that rollout, something that needs to go through all the existing data in your database because you're changing the nature of it. Is the default that you set up new columns and let the old ones exist for a while, phasing them out at some point? Or do you ever modify existing data as part of rolling out a new feature?
[00:49:04.460] - Florent Beaurain
So we don't modify data during a rollout. People have to do that in several phases, with several rollouts. It really depends on the operation you want to do. We have our own tooling; we have something called safe-pg-migrations, which I think is open source, so we use that. We also have our own checks on migrations, so we can't, for example, rename a column or rename a table directly; we have to follow a multi-step process for those kinds of changes. It's the same if you want to change the shape of some data: you'll probably keep the old column, have a task running in the background to write the new format, and then enable the feature that uses the new one. It's always a multi-step process. The database is a problem on multiple fronts. It's a problem for performance; it's a problem because you're always short on capacity, and every year we have to spin up a new database and migrate data, always in a rush. But it's also a big problem for rollouts. When we run migrations in production, there are a lot of kinds of migrations we can't do anymore because of the load on the database.
[00:50:15.930] - Florent Beaurain
We can't take a lock for that. There are a lot of operations that are forbidden, so we have to do them in multiple steps, or there's stuff we simply can't use anymore. I don't have one in mind as an example.
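As a hedged illustration of the multi-step process: a zero-downtime column rename typically means adding the new column, dual-writing to both, backfilling in the background, flipping reads behind a feature switch, and only then dropping the old column. A toy sketch of the dual-write phase, with hypothetical column names:

```ruby
# Simulates a record during the dual-write phase of a rename
# from "name" (legacy) to "full_name" (new).
class DualWriteRecord
  attr_reader :attributes

  def initialize
    @attributes = {}
  end

  # Writes go to both columns while the rename is in flight;
  # old rows get the new column via a background backfill task.
  def write_name(value, dual_write: true)
    @attributes["name"] = value
    @attributes["full_name"] = value if dual_write
  end

  # Reads flip to the new column behind a feature switch.
  def read_name(use_new_column: false)
    use_new_column ? @attributes["full_name"] : @attributes["name"]
  end
end
```

Only once the switch has been on and stable does a final migration drop the legacy column.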
[00:50:26.690] - Robby Russell
But yeah, those are the types of issues that teams at your scale have. They're not necessarily things a small Rails shop with a new application has to even think about. At some point, though, that might need to change as your app grows and gets more complicated, and it takes a lot longer to add a new table, or rename one, or you just never get to do that ever again.
[00:50:46.570] - Florent Beaurain
I think this kind of thing, if you have users, you shouldn't do at all, because it's not zero-downtime. If you have even a few users, renaming a table or that kind of operation will bring your website down for a few seconds. Even at a small scale there are some actions you should not do.
[00:51:07.930] - Robby Russell
I can appreciate that. One of the other things I wanted to talk with you about: you've been working at Doctolib for a while now, and you've led several Rails upgrades there, right?
[00:51:15.690] - Florent Beaurain
Yes.
[00:51:16.330] - Robby Russell
I think you said you started using Rails around Rails 5. Was that approximately the version when you started working at Doctolib? Was that what they were running as well?
[00:51:25.270] - Florent Beaurain
Yes. Basically, when I joined, there were some freelancers upgrading from Rails 4 to Rails 5. They were doing that the day I joined, and then I took over for the rest.
[00:51:39.190] - Robby Russell
Has that historically, while you've been there, been a one-person responsibility, primarily you, or is it more of a team effort?
[00:51:48.390] - Florent Beaurain
It was personal at the beginning, and for the last year and a half or two years I've had a dedicated team, so now we do it as a team. But at the beginning there was no ownership over this kind of thing. We had only feature teams specialized on the product, and nothing related to the platform and non-applicative code. So it was more of a Boy Scout rule. That was my thing.
[00:52:11.980] - Robby Russell
What's something you learned the hard way about upgrading Rails apps?
[00:52:16.620] - Florent Beaurain
I think you can't just change something until your tests pass and leave it at that. I've been bitten by that too many times. You know: I have this test, it doesn't pass, I change something, the test is green, I don't really know why, but now it's green and I'm happy, and I just go with that. And that's what will fail in production, for whatever reason, because I haven't deeply understood the change behind it. I just adapted the test a bit, or changed one line of config, without properly understanding the big picture. So for every Rails upgrade we do, for every change we add to the codebase, we include an explanation: this is because they changed that in the framework, with a link. We make sure we've understood the upstream change. And yes, doing it that way takes time.
[00:53:10.450] - Robby Russell
Are you able to keep relatively up to date these days, or have you gotten to the point where you're within a major version? Are you up to date right now?
[00:53:21.450] - Florent Beaurain
We are up to date. We don't spend much time on it, to be honest. Over the years we've improved, so it's easier: we've removed patches, we've tried to upstream things, and we're sticking to the defaults on some configuration, which makes everything easier. Overall we now have a well-oiled process and it doesn't take too much time. I think Rails has also improved a lot at shipping fewer breaking changes. Sometimes they don't even know something is a breaking change; sometimes it's your fault; sometimes it's a gray area. In Rails it's easy to plug yourself into things and forget about it. We've taken care over the years to remove all those gray things, the patches and the places where we were using private APIs. So now, I won't say it's trivial, but almost.
[00:54:12.190] - Robby Russell
Have you needed to remove certain kinds of dependencies, certain gems used in the past that would have prevented you from upgrading because they weren't going to work with the next version of Rails? Or is there a philosophy about how you approach bringing some in?
[00:54:27.710] - Florent Beaurain
So I listened to the podcast episode with Jean where he was talking about gems that pin the framework, like "I don't want ActiveRecord 9," for example. And yeah, we removed that kind of thing. I'm thinking, for example, of the gem Bullet, for N+1 detection. It was really complicated with that gem, because it takes a while before they roll out support for the next Rails version. Not blaming them. But yes, for that specific gem it was complicated: it had a hard-coded Rails ActiveRecord version inside it. So we removed that kind of stuff. Overall, I think that was about the extent of it.
[00:55:07.060] - Robby Russell
We talk with a lot of teams at my company, because we get brought in to help them with their upgrades. They have a bunch of people working on features, and nobody on the team has much experience finishing an upgrade. There are plenty of people who have started an upgrade project: they created a branch, worked on it for a couple of weeks, got stuck, had to switch back to a feature, and then six months later they're like, where was I on this upgrade? They can't figure out how to build momentum, so they call a company like us to come in and help. But I always think: ideally, you as a team need to figure this out and make it a regular part of your process. At Doctolib's scale you're able to have a team that's just thinking about the developer experience. Do you have any advice for smaller teams on how they can start to mitigate that themselves?
[00:55:52.500] - Florent Beaurain
Like a lot of things, to get better at it you have to do it and get your hands on it. It's not that hard. You have to try to understand the changes, and for that your best friend is often `bundle open`. You open the gem, you open the Rails codebase, and you read. We're lucky to work with a language that's easy; it's so easy to read, easy to understand. Most of the Rails codebase and most gems are very easy to grasp, and I think we should take advantage of that to understand what changed. You look at the diff, then you find the pull request. Once you have the pull request, you have the context and the why: why they changed that. And then you can easily make the change needed in your own context.
[00:56:37.900] - Robby Russell
It's so easy: just use `bundle open` and look at the codebase and you'll figure it out. You can read the Rails source code. It's not that scary.
[00:56:45.900] - Florent Beaurain
Yes. I haven't contributed to Rails much; you could count my contributions on one hand, maybe less. But I have the repo open all the time on my computer, and I spend days in it. It's amazing. I've tried to do the same with React, which we use for the frontend at Doctolib.
[00:57:10.230] - Robby Russell
No comment.
[00:57:10.750] - Florent Beaurain
It's a nightmare. I mean.
[00:57:14.230] - Robby Russell
It's also my understanding that Doctolib provides a CLI tool that helps your new engineers get set up. Could you tell us a little bit about that?
[00:57:23.160] - Florent Beaurain
So we have what we call dctl. It's a CLI written in Go, I think, something like that, and we use it as an entry point for many things. We have a lot of commands in it. There's a team dedicated to it, and we also have community-contributed plugins, so teams can add their own tasks. You get a new computer, you just clone the repository, and then you're able to install dctl. Once you've done that, you can just run the dev environment command and it will set up your laptop end to end: Ruby, Node, all the tools, and then you can launch the application. And there are a lot of commands in it. If you want to connect to staging, for example, to try some performance stuff with a bit more data, you can connect to the staging database. Everything goes through it. It's a great tool for onboarding: you run one command and you're up and running.
[00:58:13.600] - Robby Russell
That's interesting. So you can allow that to make it easy. Are you using like Docker locally as well?
[00:58:19.600] - Florent Beaurain
Yes, we have Docker, but only for the data stores.
[00:58:23.680] - Robby Russell
Okay.
[00:58:24.240] - Florent Beaurain
So we have Redis, Elasticsearch, and Postgres in it.
[00:58:28.160] - Robby Russell
So you've got your databases there, your data stores, and Rails is just running on your Mac machines or whatever. And then if you want to connect to a staging database environment, it'll just automatically connect that for you?
[00:58:42.090] - Florent Beaurain
Yeah, I just run the staging connect command and it connects me to the staging database.
[00:58:46.810] - Robby Russell
I can see how that could be really helpful. What about things like seeding your local database with enough data, given how large a platform you have?
[00:58:54.410] - Florent Beaurain
That's a big topic we haven't talked about. I think that's one of the best things about fixtures: once you have fixtures, and if you use fixtures for your tests, you basically have your seed environment. We don't quite do that: we have a bunch of fixtures that we use, but we don't use them in tests. We use them to seed our environment, and that's basically how we do it. I know some teams also have rake tasks for specific cases; they seed when they want, and they don't add those to the shared fixture set because it would be too slow. But yeah, we don't have anything magic around it. We have some fixtures in YAML that we load in seeds.rb.
[00:59:35.670] - Robby Russell
Do you think this is a thing where it would be nice if the Rails framework itself provided more functionality in this space? Or do you feel like it's only an issue for certain organizations once the engineering team reaches a certain size? I'm primarily talking on this podcast with a lot of people working at really large companies, so I'm like, wow, the seed situation we might have on a brand-new MVP is very different from an application 12 years later. How do we connect the dots so we're not all having to invent these little creative things, with every organization trying to solve this problem itself?
[01:00:09.520] - Florent Beaurain
To be honest, I haven't thought a lot about this topic. I've seen some people try to push their own DSLs and so on around it. I don't know if Rails should push for something. Maybe they should double down on fixtures and make seeds.rb load the test fixtures by default, something like that. I'd be pretty happy about it, because in every toy project I do, that's the first thing I set up. I'm not sure we need more tooling around it. Often bigger companies will want more, like real data from staging, and that's a bit too much for the framework to go in that direction. If you're at the stage where you need that, you probably have the manpower to build it, and anyway it would be too specific to your situation.
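A minimal sketch of the fixtures-as-seeds idea, assuming plain YAML fixture files like Rails test fixtures (a label per record, attributes underneath). In an actual Rails seeds.rb, the same files can be loaded with `ActiveRecord::FixtureSet.create_fixtures`; the helper below just shows the parsing step in plain Ruby:

```ruby
require "yaml"

# Parses one fixture file's YAML text into an array of attribute
# hashes, dropping the record labels (alice:, bob:, ...).
def fixture_rows(yaml_text)
  YAML.safe_load(yaml_text).values
end
```

Seeding then becomes: for each fixture file, insert `fixture_rows` into the matching table, so dev environments and tests share one canonical data set.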
[01:00:54.830] - Robby Russell
It's interesting, because I work in the consulting space, so we come into different clients, and a lot of smaller teams don't have really efficient ways to do it. Plenty of clients will find their team needing to pull a production database snapshot into a staging environment. They might be scrubbing the data and then using that so they can try to fix an issue that only shows up in production, because they can't reproduce it without more realistic data. And they're not going to give their developers direct access to connect to the production database. So they're trying to find all these interesting workarounds, but they don't have that big of a team to figure it all out. And so they're like, well, what do we do in the meantime?
[01:01:31.390] - Robby Russell
And there are not a lot of good patterns, I think, that are easy to find or commonly shared, because everybody's like, I don't really know the best approach. They ask us for advice and I'm like, well, you can do what these big companies are doing when they have someone specially focused on that problem, but you can't afford that. So I don't know what to say. Yeah, be more successful as a company.
[01:01:53.020] - Florent Beaurain
I don't know. No, I don't know. But for this kind of stuff we connect to staging, which is refreshed from production with the data scrubbed, and that's how we do it.
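As a rough illustration of the scrubbing step Florent mentions (the field names and approach here are invented, not Doctolib's actual pipeline), the idea is to overwrite personally identifiable columns with deterministic fake values before a snapshot reaches staging:

```ruby
require "digest"

# Scrub PII fields in a row from a production snapshot (sketch).
# Deterministic hashing keeps rows distinguishable (joins still work)
# without exposing the real values.
PII_FIELDS = %w[email phone last_name].freeze

def scrub_row(row)
  row.to_h do |key, value|
    if PII_FIELDS.include?(key) && value
      [key, "scrubbed-#{Digest::SHA256.hexdigest(value.to_s)[0, 8]}"]
    else
      [key, value]
    end
  end
end
```

In practice this kind of transform usually runs as part of the snapshot-restore job, before developers ever see the data.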
[01:02:05.220] - Robby Russell
I want to circle back to upgrades in particular. What's the strategy there? Are you doing anything like dual booting? Are you running a branch against the latest code on Rails main?
[01:02:20.290] - Florent Beaurain
Yeah. So when a new version of Rails ships, I mean a major one, it depends on the time frame we have. But if we have the time early, we start with the alpha; often there is an alpha. So we open the pull request, run the bin/rails app:update command to update the configuration, et cetera. And the first thing we try to get is the CI running. Often the CI will crash, so we fix those crashes, and once the CI is running we know the number of tests that will be failing. Then we do the back and forth, the CI-driven development, where we try to get the CI green. We will fix everything ourselves: every test, every piece of code, every change needed. We won't ask the teams to do it; we do it ourselves. Sometimes we consult them to better understand a feature, but we basically do everything ourselves. And once we have something close to green, we backport changes. By that I mean: everything that is needed for the next version but can already be merged to main, we do it.
[01:03:24.170] - Florent Beaurain
So at the end of the day we want to have the smallest change possible, so we do pull requests to offload that and offload that. I didn't mention it, but of course, as I said just before, for every change we have made inside the pull request we include a link to the upstream change that caused it. And once the CI is green and we have the smallest change possible, we merge it behind what we call the dual boot. So once it's merged, everything runs on the next version of Rails: locally, all the CI, all our staging and pre-prod environments, this kind of stuff runs on the next version, but not production and the production CI. Okay, so everything is on next except production and the production CI. And then we run like that for an amount of time. It depends on the confidence we have in the change, and it depends on the agenda of the organization; as it's a big change, we often have to pick when we can do it. And once we have the green light, we remove the dual boot and we roll it out. That's basically the process.
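One common way to implement this kind of dual boot (Doctolib's exact mechanism isn't specified in the conversation; the env var name below is made up) is to switch the Rails constraint in the Gemfile on an environment variable, with a second lockfile for the "next" side:

```ruby
# Gemfile (sketch): dual boot between the current and the next Rails.
# CI runs the suite twice: once normally, once with RAILS_NEXT=1 and
# BUNDLE_GEMFILE pointing at a Gemfile that uses Gemfile_next.lock.

def next_rails?
  ENV["RAILS_NEXT"] == "1"
end

# In the real Gemfile you would then write something like:
#   if next_rails?
#     gem "rails", github: "rails/rails", branch: "main"  # or the alpha
#   else
#     gem "rails", "~> 8.0"                               # current version
#   end
```

Gems like Shopify's `bootboot` automate exactly this two-lockfile setup, which is worth a look before hand-rolling it.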
[01:04:26.620] - Robby Russell
When you talk about backporting, do you organize those commits in a way that you can cherry-pick them back into main? How granular do you try to keep those changes? Because you have this branch with a bunch of work where you've gotten the test suite passing again.
[01:04:46.990] - Florent Beaurain
Yeah, we group them by Rails change, you know. Like, okay, this configuration has changed, so we need to change this code, this code, this code; we group those together. Okay, that's a change in Rails that has led to this diff. That's basically how we group them.
[01:05:02.560] - Robby Russell
Do you feel like you've gotten to a good pattern as you're navigating one of those upgrade branches? You're working through getting your test suite running in the first place, then working through the broken tests, and at that point you have good enough pattern recognition to spot, oh, these things seem to be clustered, so I'm going to focus on that area for a while. Because sometimes there's just a bunch of things going on, and how do you prioritize where you go? There are different rabbit holes you could go down, which I think is what makes people a little nervous about doing it: it feels like you can go in any direction and there's a bunch of fires popping up in parallel. How do I approach this?
[01:05:42.690] - Florent Beaurain
Yeah. So in our case, our CI reports the most common patterns of failure across our tests. I run the whole test suite and it will tell me, say, 10,000 tests are failing because of this one thing. I focus on those first, the big ones. You're pretty happy because you see the number of green tests grow a lot. And yes, the remaining 100 are the slowest to fix, because each one is a different case. But that's basically how we do it.
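A toy version of that failure clustering (not Doctolib's actual tooling) just normalizes the error messages and tallies them, so you can attack the biggest cluster first:

```ruby
# Group failing-test messages into patterns and rank by frequency (sketch).
# Normalizing object addresses and numbers collapses near-identical failures
# ("#<User:0x00007f8>" vs "#<User:0x00009a1>") into one pattern.
def failure_patterns(messages)
  messages
    .map { |m| m.gsub(/0x[0-9a-f]+/i, "ADDR").gsub(/\d+/, "N") }
    .tally
    .sort_by { |_pattern, count| -count }
end
```

The output is a list of `[pattern, count]` pairs, largest cluster first, which is essentially what a CI failure report gives you.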
[01:06:12.090] - Robby Russell
That makes sense. When thinking about tests, could you tell us a little bit about how your team addresses and avoids flaky tests? Is it safe to assume that you still have flaky tests appear at times?
[01:06:23.990] - Florent Beaurain
Yes, it's a massive topic. We have something around almost 20,000 end-to-end tests. That's a number that doesn't grow a lot anymore; we completely changed the strategy. We haven't stopped, but we try to avoid adding too many end-to-end tests because of the flakiness. The cost also, but a big part of it is the flakiness. We have invested a lot in solving the flakiness problem over the years; I think it's a seven-year project. And so we have a lot of things. I don't know if you have this in the US, but in Europe we have the Nutri-Score. Basically, when you buy food there is a score, you know, A, B, C, D, for whether it's good for your health or not.
[01:07:10.690] - Robby Russell
Oh yeah, yeah, yeah, yeah.
[01:07:11.930] - Florent Beaurain
So we have basically the same thing for tests. We have metrics coming from the main branch and we compute a score per test: this test is flaky this percentage of the time, so here is the list with the worst scores. The teams get it, and then we push bugs to them and they have to fix them, et cetera. We have a lot of strategies around retrying. If a test fails twice in a row on the CI, we automatically skip it and create a bug ticket for the team, so they have to fix it. So yeah, we have a bunch of processes to try to de-flake them. With the new framework we have introduced capybara-lockstep; it's from a German company that has open-sourced it. Basically, most of the flakiness comes from the React front end. The Capybara state is not in sync with the front-end state: it tries to click on a button but the JavaScript is not ready, and it fails. Or the dropdown is appearing but not finished yet, the button is moving, and the click lands in the wrong place. It's often this race condition with the front end.
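The cutoffs below are invented, but a flakiness "Nutri-Score" like the one Florent describes can be computed from a test's failure stats on the main branch roughly like this:

```ruby
# Grade a test's flakiness from its history on main (sketch; thresholds
# are made up for illustration).
def flakiness_grade(failures, runs)
  return "A" if runs.zero? || failures.zero?

  rate = failures.fdiv(runs)
  if    rate < 0.01 then "B"   # fails less than 1% of runs
  elsif rate < 0.05 then "C"   # fails less than 5% of runs
  else                   "D"   # chronically flaky: skip and file a bug
  end
end
```

Grading per test, publishing the worst offenders, and auto-skipping "D" tests with a ticket is what keeps the noise down while still forcing fixes.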
[01:08:15.520] - Florent Beaurain
It's like 70 to 75% of our flakiness. So capybara-lockstep tries to address that by synchronizing Capybara with the state of the front end. Capybara will basically hold a mutex and do nothing until the page is loaded, until Ajax queries have finished, until the network is idle, et cetera. That fixed a lot of the problems we had, I think. But there is still a lot of flakiness coming from the front end, and that's why we have all these processes. And I think the best strategy we had is basically to make the flaky tests less noisy and force people to work on them by skipping them.
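Getting started with capybara-lockstep is small (per its README; double-check the current docs for your versions): add the gem to the test group and render its helper snippet in the layout's head, so the synchronization script loads before your app's JavaScript:

```ruby
# Gemfile
group :test do
  gem "capybara-lockstep"
end

# app/views/layouts/application.html.erb, inside <head>,
# before your own javascript tags:
#
#   <%= capybara_lockstep %>

# Optionally, e.g. in spec/support/capybara_lockstep.rb, log what the
# synchronization is waiting on when debugging a stubborn flaky test:
#
#   Capybara::Lockstep.debug = true
```

With that in place, Capybara commands wait until the page has settled (pending requests finished, DOM stable) before interacting, which removes most click-before-ready races.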
[01:08:55.230] - Robby Russell
Just pulled that up, so I see it: capybara-lockstep. Okay, I'll include links to this in the show notes as well. It's such an interesting thing.
[01:09:03.930] - Florent Beaurain
Yeah, it's really nice.
[01:09:05.290] - Robby Russell
Why can't Capybara just do this from the get-go? Why is JavaScript so complicated?
[01:09:13.530] - Florent Beaurain
I think it will improve. To do this, capybara-lockstep has some hacks and some JavaScript snippets you need to include, and so on. But Selenium is migrating to BiDi, which is a new protocol for communicating with the browser. It talks directly to the browser instead of going through ChromeDriver. And thanks to this new protocol, Selenium will have more information about the browser state, so it should improve, I hope. Or at least we will be able to build more reliable tools to synchronize the state.
[01:09:46.290] - Robby Russell
You mentioned your team uses React. Is that code within the Rails app? Are you using something like React on Rails, or is there a separate repository where all the front end lives? This is opening up a can of worms.
[01:09:57.170] - Florent Beaurain
It's a good question. So when I joined, I had never used React with Rails, but there is this Ruby helper where you can render React components, and it will mount a component into your Slim page, or I mean your ERB page, et cetera. So we have this pattern in the code base, and that's one of the ways we did it in the past for some parts of the application, like admin pages. We did that on the patient website as well. But now we don't do that. We have a proper SPA. It's still the monolith that renders the first HTML layout, but then it directly mounts the SPA into it.
[01:10:35.260] - Robby Russell
I see. And so in a local environment, people are spinning up the React SPA, and that has the Rails app running alongside it?
[01:10:44.940] - Florent Beaurain
Yep. So we have a webpack dev server running and the Rails application running, which is not ideal today. The webpack dev server consumes a lot of RAM and puts our machines under memory pressure. It's one of the biggest drivers of bad dev experience currently.
[01:11:04.190] - Robby Russell
Are there any non-Rails patterns or ideas that have been especially successful in your code base?
[01:11:12.750] - Florent Beaurain
To be honest, I don't know. We mostly stick to what Rails has. Maybe in three or four years I'll be able to tell you, because with all this modularization topic and new services going on, we're starting to introduce Kafka, so we will probably see some patterns you don't find in a usual Rails app. But yeah, nothing comes to mind. I mean, the Rails engines stuff, et cetera, I think gave us a good runway. It's not perfect, but it gave us a good runway.
[01:11:45.020] - Robby Russell
Is there anything your team does differently from what you think most Rails teams do?
[01:11:50.720] - Florent Beaurain
I don't think so, no. I mean, with the scale we have, some things are different. We have 10 databases, this kind of stuff. But at our scale, I think people would do the same as us. I did my best to try to stick to the defaults, you know. And if I can say, we have been pretty lucky in that direction, because Shopify has paved the way for many things and they have put them in the framework, so we have benefited a lot from that. They gave us the direction, and they also gave us the tools. When we needed multi-database support, Eileen had just merged it into Rails; she was at GitHub at the time. So we just had to upgrade and we had multi-database. There are a lot of things we do that we simply get from either Shopify or GitHub, because they have done the work and then merged it.
[01:12:41.340] - Robby Russell
Seems to me that Rails has definitely been part of Doctolib's success, maybe early on. I don't actually know a lot about how the organization started, but was one of the founders a software developer themselves, out of curiosity?
[01:12:53.940] - Florent Beaurain
So there are three founders, and of those three, two were technical.
[01:12:58.420] - Robby Russell
And they started that first Rails app?
[01:13:00.340] - Florent Beaurain
Yeah, exactly.
[01:13:01.380] - Robby Russell
It seems to be a common theme in a lot of the conversations I've been having with different companies. So do you think that Rails is still one of Doctolib's, I'm air-quoting, secret weapons?
[01:13:13.390] - Florent Beaurain
I think so, yes. Pretty sure some people at the company would say no, but I'm one of the people who thinks yes. Onboarding is a good example. Every Rails app I know looks the same. There are tiny differences, you know: they use service objects, we don't; they use query objects, we don't. But most of them look the same, and it's trivial to navigate. I have an account, I have an account test, I have an accounts controller, I have an accounts controller test. So it allows us to onboard people really fast, because if they have a bit of experience with Rails, it's really easy for them to catch up. And if they don't, the structure is so self-explanatory, and Ruby helps a lot here too: at the code level it's so easy to read. So it's trivial to grasp, and even in a big code base I think the simplicity is a big advantage.
[01:14:13.890] - Robby Russell
I can appreciate that. I think about when teams are growing and people come in at different points with different experiences. At a certain scale, people come to an organization having worked with different tech stacks, and they're like, well, we did this differently at this other company I worked at. At that larger scale, Rails seems different in certain ways, and they wish they had some of the things they used to lean on. And there are a lot of different people in different roles. I sometimes talk with people at large organizations who say, well, not everybody here loves Rails as much as I do. You get different people with different ideas, and not everybody is living and breathing Ruby on Rails either, because you've got your React developers and different people focusing on different things. So there's this interesting tension between competing forces around what's slowing different people down, based on where they spend the majority of their time.
[01:15:12.220] - Robby Russell
And so it's good to hear that Doctolib is able to keep up on Rails at that large of an organization, that it's been able to scale with you, and that you're able to look at Shopify and see how they're paving the way to make that possible. You can keep working with Ruby on Rails and follow the breadcrumbs they're leaving as they go: follow us this way, it's working, this is going to work out. Those are things we didn't have 10 years ago, when we used to have these conversations like, well, Rails might not work in the sort of larger situation companies get to. And you hit other issues anyway, like AWS database sizes and stuff like that, and that's not necessarily Rails' fault. That's not a Rails issue.
[01:15:51.070] - Florent Beaurain
Yeah, as with most web scalability things, it's always the database in the end.
[01:15:57.070] - Robby Russell
A couple of last quick questions for you. Is there a technical book you find yourself recommending to teammates or peers over and over again?
[01:16:05.150] - Florent Beaurain
To be honest, I don't read that many books, but I have The Software Engineer's Guidebook by Gergely Orosz that I have to read. I've heard good things about it, so maybe if I have to recommend one, it would be that one. It's not technical per se; it's more about how you tech-lead, how you lead things, how you lead change. And I think when you work at a big corp, that's one of the biggest challenges you have. It's always about humans, how you bring change to the table and how you get things moving forward instead of just changing code. You know, pushing code and changing stuff is the easy part.
[01:16:41.640] - Robby Russell
I think refactoring your code is one thing. Refactoring how your team communicates and makes decisions is a whole other big challenge, and I don't know that Rails solves that itself. But I'll definitely include links to that book in the show notes. I'm curious, where can listeners best follow your thoughts or ruminations about software engineering or Rails? Does Doctolib have an engineering blog? Do you publicly talk much about this stuff? You mentioned gems and things your organization has released as well, right? Where can I direct people?
[01:17:14.320] - Florent Beaurain
Good question. I think we had an engineering blog at some point, but nobody takes care of it anymore. We had a Twitter account too, but I don't think anyone takes care of that anymore either. So to be honest, I don't have much. The open source work is on our GitHub organization.
[01:17:30.240] - Robby Russell
I'll track those down for everybody, and maybe when this episode comes out, you'll have an excuse to write a blog post on the engineering blog: hey, check out that episode of On Rails. There you go, I've given you a free blog post; get that rebooted. Thanks so much, Florent, for stopping by to talk shop with us On Rails today.
[01:17:46.060] - Florent Beaurain
Thanks.
[01:17:50.620] - Robby Russell
That's it for this episode of On Rails. This podcast is produced by the Rails Foundation with support from its core and contributing members. If you enjoyed the ride, leave a quick review on Apple Podcasts, Spotify, or YouTube. It helps more folks find the show. Again, I'm Robby Russell. Thanks for riding along. See you next time.