The Flash Fill Feature in Excel (2021)

4 months ago 19

My POPL 2011 paper, Automating string processing in spreadsheets using input-output examples, which describes the technology behind the popular Flash Fill feature in Excel, and named at POPL 2021 as the Most Influential Paper for POPL that year, was the most important turning point in my research philosophy and career. I went from searching for the hardest problem I can solve to searching for the simplest problem that will have the most impact. I am grateful to various actors who played a role in this part of my journey and this article describes those behind-the-curtain stories.

I am indebted in particular to one woman, whose name I will never know, who sparked in me the inspiration for Flash Fill. I met her on a flight from Frankfurt to Seattle, returning home after a workshop on program synthesis in late 2009. She was impressed to know that I had a Phd in computer science and that I worked for Microsoft Research. She wondered if I could help her with a problem that she was struggling with. She opened up her laptop, fired up Excel, and pointed to a column of names, which were listed in the format: first name, space, last name. She then asked me, “How can I reformat this column so that it lists the last name first, instead of the first name?” Unfortunately, at the time, I had no idea about the programming model underneath Excel. So I had to sheepishly excuse myself out of the situation! After returning home, I searched for a solution to her problem on Excel Help forums, and it is then that I realized that there were many, many people who struggled with simple repetitive tasks like that of the woman on the airplane. They would solicit the help of an expert on a help forum, while communicating their intent using a few input-output examples. And that provided me with the inspiration for the problem addressed in this paper.

I also thank the Excel product team for shipping this technology without which this work wouldn’t be as famous. Flash Fill turned out to be a popular feature with around a million invocations each week, not least because 99% of people who use spreadsheets do not know programming. The Excel team had insisted “We can’t ship this unless you make it work in a fraction of a second (to enable an interactive user experience) and make it work with one example in most cases (because otherwise users would make fun and lose trust if the feature comes up with unintended interpretations).” Their strict requirements around efficiency and ambiguity handling inspired not only the technical solution in the paper but also my research journey that followed.

  • A first idea was to restrict the program search to an appropriate domain-specific language (DSL) that contains the right abstractions for the underlying task domain, which in this case was string transformations—I designed this DSL after a careful study of Excel help forums, and this exercise has since then sensitized me to the importance of customer connection, which I have pursued in various forms including user studies, interviews, and telemetry analysis.
  • However, even this DSL was big enough and ruled out use of constraint solvers or enumerative methods for search that previous synthesis techniques had explored, including some of my own recent work at the time, From Program Verification to Program Synthesis (a POPL 2010 paper that I had co-authored with Jeff Foster and Saurabh Srivastava, which also received the most-influential-paper award at POPL 2020). This inspired a new deductive search methodology to efficiently search over the underlying DSL to identify many programs that satisfy the user-provided examples—the key new idea was to treat the grammar of the DSL as a non-deterministic program and push the example-based constraints down the structure of the grammar, leveraging the inverse semantics of the operators in the underlying DSL (akin to weakest precondition computation in program verification, or the backpropagation step in training ML models).
  • The resulting sub-grammar identifies the many programs, each of which satisfies the user-provided specification, but most of which are not a program that the user intends. With my training as a formal methods researcher, I would have picked any program from this set preaching “Garbage in, garbage out”, or if goaded to be more constructive, I would have been inspired by my own recent work at the time, Oracle-guided component-based program synthesis (an ICSE 2010 paper that I had co-authored with Susmit Jha, Sanjit Seshia, and Ashish Tiwari, which also received the most-influential-paper award at ICSE 2020), to guide an active-learning session with the user to resolve the ambiguity by seeking more examples. However, since that was not an option, it inspired me to develop program ranking techniques, based on both syntactic and semantic properties (over the underlying dataset) of programs, for user’s intent prediction. This was a big mindshift that has since then set me up on my journey to explore combination of logical reasoning techniques with machine learning techniques, an area that I believe shall lead to foundational advances in AI over the coming years.

While it took me 3 weeks to develop a first prototype implementation of Flash Fill, shipping the feature in Excel involved the collective work and support of many fantastic colleagues and interns at Microsoft Research, spanning a period of 2 years. The then lab director Rico Malvar sponsored this project as “Impossible Things Initiative”, which brought me sufficient intern funding and other resourcing. My colleague Ben Zorn helped found connections with the Excel team even before a first prototype could be built, and program manager Shobana Balakrishnan nurtured relationships with the Excel team. MSR Engineers Dany Rouhana ported my C# prototype code to C++ code, and Piali Choudhary built a prototype Excel addin around it. My repeat interns took the project to new heights: Rishabh Singh improved the expressiveness and ranking of Flash Fill, Vu Le generalized the ideas to other application domains including FlashExtract, and Alex Polozov generalized the ideas into an even more foundational framework, FlashMeta.

I am grateful to the program committee and the program chair for accepting this paper after defying various norms. Co-incidentally, I was part of the PC and initially POPL 2011 had a rule stating that no single-authored PC submissions would be entertained. I asked the PC chair Mooly Sagiv if I should add my wife’s name to the paper and he immediately retracted this requirement, which allowed me to submit the paper in the first place. This paper didn’t have as much theoretical rigor as many of the other POPL papers that I have co-authored, as an anonymous paper reviewer remarked, “disappointed not to see more discussion of general principles… Are there ideas here that can be applied to other domains, and if so, how do the techniques presented here generalize?”. Also, at the time, this paper may have felt a bit out of scope of POPL given its ML and HCI components, as another paper reviewer remarked, “this is an AI paper, with only a tenuous connection to POPL topics… the paper does risk being out of scope”. Both these observations set the tone for what followed, with the community driving tremendous foundational and cross-disciplinary advances in program synthesis over the last ten years, leading to synthesis competitions and several courses (ColoradoMITPurdueUCSD, UW, etc).

This paper has been the most satisfying paper amongst the 100+ papers that I have co-authored. My dad, who retired as a civil engineer, has cheered every research paper I got accepted. But it was after this Flash Fill paper that he said “Son, for the first time, I understand what your research is about.” My child Sumay, who is currently an elementary grade student, has often asked me what kind of scientist I am and what inventions have I done. A surreal moment arrived last year when I was able to answer his question, “Papa, what have you invented as a scientist?”. His class had been using many digital applications, each with their own usernames and passwords that had been set to some function of the students’ name. These credentials were communicated by the teacher by sharing an example of the rule that is to be used to derive these credentials from the student’s name and ID. I impressed Sumay by showing him how Flash Fill is smart enough to infer these credentials for any given name. A few days ago, my sister, who is a medical doctor, called me up and told me that she proudly explained Flash Fill to my nephew when she found its mention in his middle-school computing textbook prescribed as part of an Indian national curriculum, and used it to impress upon him how computer science does cool stuff. And this connection with my loved ones, that’s bliss—as much as the recognition from the academic community and the award committee. Thank you!

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.

Read Entire Article