Data Science Weekly – Issue 600


Hello!

Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

And now…let's dive into some interesting links from this week.

  • Reflecting on my tenure at the City of Boston
    I consider my role as a data engineer on the City of Boston Analytics Team to be my first “real” data job - first time working with a data warehouse, first time working collaboratively with other data engineers (and first time creating & reviewing Pull Requests), first time creating production ETL pipelines with an orchestration platform (and maintaining workflows & pipelines other people created)…Looking back, I can divide my tenure into 2 halves:

    (1) the first year, during which I learned how to become an effective data engineer on the Analytics team - learning the tools, processes, and standard practices; and

    (2) the second year, during which I was able to implement improvements to our tools, processes, and standard practices - and ended up re-architecting our data warehouse & ELT pipelines along the way…So, let’s meander in (mostly) chronological order…

  • People use AI more than you think
    The Google I/O keynote yesterday was a great “State of the Union” for AI that highlighted this across modalities, form factors, and tasks. It is highly recommended viewing. Google is trying to compete on every front. They’re positioned to win a couple of use cases and be in the top 3 of the rest. No other AI company is close to this…The slide that best captured this was one showing AI tokens processed across all of Google’s AI surfaces (i.e. this includes all modalities), which has skyrocketed in the last few months…

  • The Lost Decade of Small Data?
    We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?…


  • Computational Public Space (talk by Bret Victor)
    A values-driven approach to integrating computation into cities…

  • What the hell is MCP?
    A new protocol for a new internet…A lot of people were under the impression that MCP meant that the big problems with AI agents were ‘solved’ and we were about to see large scale deployments of them across the Internet…While MCP does solve some important problems with respect to AI agents, we’ve yet to see any large scale deployments of web agents. Just like with pretty much everything else, the pattern has been slow but steady progress. MCP does, however, establish a standard for how AI agents might navigate the web. Let’s motivate the need for such a standard…

  • Regression Discontinuity Design: How It Works and When to Use It
    I’d love for you to tag along in this deep dive into Regression Discontinuity Design (RDD)…In this post, I’ll give you a crisp view of how and why RDD works…Inevitably, this will involve a bit of math — a pleasant sight for some — but I’ll do my best to keep it accessible with classic examples from the literature…We’ll also see how RDD can tackle a thorny causal inference challenge in e-commerce and online marketplaces: the impact of listing position on listing performance. In this practical section we’ll cover key modeling considerations that practitioners often face: parametric versus non-parametric RDD, choosing the right bandwidth parameter, and more. So, grab yourself a cup of coffee and let’s jump in!…
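    To make the core RDD idea concrete: with a sharp cutoff, the treatment effect is the jump in the outcome at the cutoff, estimated by fitting a line on each side within a bandwidth and differencing the intercepts. A minimal sketch on simulated data (my own illustration, not code from the post; the cutoff, bandwidth, and true effect are made-up values):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated running variable and outcome with a true jump of tau = 2.0 at x = 0
    x = rng.uniform(-1, 1, 5000)
    tau = 2.0
    y = 1.0 + 0.5 * x + tau * (x >= 0) + rng.normal(0, 0.3, x.size)

    h = 0.25  # bandwidth: only use observations within h of the cutoff
    left = (x < 0) & (x > -h)
    right = (x >= 0) & (x < h)

    # Fit a line on each side; each fitted intercept is that side's limit at x = 0
    _, a_left = np.polyfit(x[left], y[left], 1)
    _, a_right = np.polyfit(x[right], y[right], 1)

    effect = a_right - a_left  # estimated jump at the cutoff, should be near 2.0
    print(round(effect, 2))
    ```

    This is the parametric (local linear) flavor; the bandwidth `h` is the knob the post discusses — too wide and the linear fit is biased, too narrow and the estimate gets noisy.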

  • Buckaroo - The Data Table for Jupyter
    Buckaroo is a modern data table for Jupyter that expedites the most common exploratory data analysis tasks. The most basic data analysis task - looking at the raw data - is cumbersome with the existing pandas tooling. Buckaroo starts with a modern, performant data table that is sortable, has value formatting, and scrolls infinitely. On top of the core table experience, extra features like summary stats, histograms, smart sampling, auto-cleaning, and a low-code UI are added. All of the functionality has sensible defaults that can be overridden to customize the experience for your workflow…

  • Convolutions, Polynomials and Flipped Kernels
    This is a post about multiplying polynomials, convolution sums and the connection between them…
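    The connection the post explores is easy to check for yourself: the coefficient sequence of a polynomial product is exactly the discrete convolution of the two coefficient sequences. A quick sketch (my own example, not from the post):

    ```python
    import numpy as np

    # Coefficients in low-order-first form: p(x) = 1 + 2x + 3x^2, q(x) = 4 + 5x
    p = [1, 2, 3]
    q = [4, 5]

    # Convolving the coefficient sequences gives the product's coefficients:
    # (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    product = np.convolve(p, q)
    print(product)  # [ 4 13 22 15]
    ```

    Each output coefficient c[k] = sum of p[i] * q[k - i], which is precisely how you collect like terms when multiplying polynomials by hand.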

  • Getting AI to write good SQL: Text-to-SQL techniques explained
    In this blog post, the first entry in a series, we explore the technical internals of Google Cloud's text-to-SQL agents. We will cover state-of-the-art approaches to context building and table retrieval, how to do effective evaluation of text-to-SQL quality with LLM-as-a-judge techniques, the best approaches to LLM prompting and post-processing, and how we approach techniques that allow the system to offer virtually certified correct answers…

  • Boltzmann Machines
    Here we introduce Boltzmann machines and present a Tiny Restricted Boltzmann Machine that runs in the browser…

  • attention is logarithmic, actually
    Time complexity is the default model brought up when discussing whether an algorithm is “fast” or “slow”…my expertise is mostly in performance engineering of ml systems, so the focus of this article will mostly relate to algorithms that apply to tensors…this model is not perfect, and i will detail why in a later section, but to start off, the best question to ask is: what is the time complexity of element wise multiplication?…from which we will eventually work up to my thesis, which is that vanilla attention as it is implemented in transformers, should be considered logarithmic in computational complexity….
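    The work-vs-depth distinction the article builds on can be sketched in a few lines: a reduction over n elements does O(n) total work, but a pairwise tree reduction needs only O(log n) sequential steps given enough parallel lanes. A toy illustration (mine, not the author's code):

    ```python
    def tree_reduce_depth(n):
        """Number of parallel steps in a pairwise tree reduction of n values."""
        depth = 0
        while n > 1:
            n = (n + 1) // 2  # each step halves the number of partial results
            depth += 1
        return depth

    # O(n) work, but only O(log n) depth on parallel hardware -- the sense
    # in which reductions (and, the article argues, attention's softmax and
    # weighted sums) are "logarithmic" rather than quadratic.
    print(tree_reduce_depth(1024))  # 10 steps for 1024 elements
    ```

    Elementwise multiplication, by contrast, has no dependencies between outputs at all, so its parallel depth is O(1) — the starting point of the article's argument.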

  • Fonts in R
    The purpose of this document is to give you a thorough overview of fonts in R. However, for this to be possible, you’ll first need a basic understanding of fonts in general…

  • Is python no longer a prerequisite to call yourself a data engineer? [Reddit]
    I am a little over 4 years into my first job as a DE and would call myself solid in python. Over the last week, I've been helping conduct interviews to fill another DE role in my company - and I kid you not, not a single candidate has known how to write python - despite it very clearly being part of our job description. Other than python, most of them (except for one exceptionally bad candidate) could talk the talk regarding tech stack, ELT vs ETL, tools like dbt, Glue, SQL Server, etc. but not a single one could actually write python…

  • Linear Algebra 101 for AI/ML (Part 1)
    You don't need to be an expert in linear algebra to get started in AI, but you do need to know the basics. This is part 1 of my Linear Algebra 101 for AI/ML series, which is my attempt to compress the 6+ months I spent learning linear algebra before I started my career in AI. With the benefit of hindsight, I know now that you don't need to spend 6+ months or even 6 weeks brushing up on linear algebra to dive into AI. Instead, you can quickly ramp up on the basics and get started coding in AI much faster. As you make progress in AI/ML, you can continue your math studies…

  • Five simple things that will immediately improve your diagrams
    Just like writing, drawing good diagrams is a skill that takes practice to get good at. Artistic ability certainly helps, but we are here to tell you the good news that it's not an absolute requirement. If you've ever looked in dismay at your diagrams or figures, wondering why they don't look as good as you'd hoped, this is the article for you…

  • Your guide to AI: May 2025
    Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month…

Find last week's issue #599 here.


  1. Want to get better at Data Science / Machine Learning Math? I have one weekly tutoring slot open. Hit reply to this email and let me know what you want to learn.

  2. Looking to get a job? Check out our “Get A Data Science Job” Course
    It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume.

  3. Promote yourself or your organization to ~68,400 subscribers by sponsoring this newsletter. 30-40% weekly open rate.

Thank you for joining us this week! :)

Stay Data Science-y!

All our best,
Hannah & Sebastian
