The Science of Language in the Era of Generative AI


Within the last few years, millions of people have experienced interacting with artificial systems such as ChatGPT (released in late 2022) that on the fly produce fluent text in response to user prompts, text that is sensitive to conversational context and often appears to be meaningful and useful to the user. While the societal impact is undoubtedly immense, the impact on scientific research in the cognitive science of language is less obvious.

Systems such as ChatGPT are examples of large language models (LLMs), developed by using large-scale computational resources to train massive neural network models on enormous quantities of text—orders of magnitude larger than a human lifetime’s worth of linguistic experience. LLMs, an example of generative artificial intelligence (AI; called generative because it can generate new instances of the type of data that it is trained on), are an engineering breakthrough with wide-ranging potential applications. This technological achievement should naturally motivate us to explore potential ramifications for language science, but evaluating these ramifications, and in fact whether or not there are any ramifications, is far from trivial. We are accustomed to thinking of science and technology developing in tandem with puzzles and problems in each area reinforcing developments in the other (see, e.g., Stokes, 1997). However, the current state of affairs is arguably different, as LLMs’ practical success does not necessarily rely on scientific understanding of the modeled subject domain, at least not in the science of language. 

The three of us are language researchers spanning diverse foci and orientations: a generative linguist (Fox), a computational psycholinguist and cognitive scientist (Levy), and a natural language processing and machine learning researcher (Kim). This paper is the result of dialogue among us aimed at offering perspective both on the achievements of LLMs and their limitations in the context of traditional goals of scientific research on language, specifically scientific research within the broad framework of generative grammar (Chomsky, 1957, 1965, 1980, 1995; Pollard & Sag, 1994; Bresnan, 2001, inter alia). We begin by reviewing the fundamentally different research strategies of, on the one hand, generative grammar-based theory development in the pursuit of scientific explanation and, on the other hand, contemporary LLM development that is focused on predicting and producing human-like linguistic behavior, largely by capitalizing on the statistical regularities in large language datasets (corpora). With this as background, we ask whether and how LLMs can serve as valuable or useful tools for research questions in the science of language and, conversely, whether the science of language might contribute towards improvements in the technology. We further speculate on potentially valuable directions for the interplay between generative AI and language science in the future. It is important to note from the outset that the three of us come to the topic from different perspectives and continue to have different opinions about the core questions asked here and that the results of our deliberations would have been reported very differently had they been written by any one of us on his own. We hope that they will nevertheless be useful in directing future engagement and collaboration.

1.1. Central Goals of Language Science

The central goal of linguistics—the science of human language—is understanding the following: why linguistic expressions are structured the way they are, how these structures represent meaning and externalized signal (sound for spoken languages, manual gestures for signed languages), how humans comprehend and produce these structures, and how humans acquire the mental capacities that make all of this possible (see, among many others, Chomsky, 1965 and Baker, 2002). By discovering the core underlying principles governing language, linguistics aims to ‘carve nature at its joints.’ Some of these principles may appear to be surface false in that they do not directly capture superficial patterns but instead reveal and explain deeper systematic phenomena. Consider, for example, the principle that allows language users to add relative clauses as modifiers of nouns, for example, the bracketed relative clause in the boy [the girl met]. The principle, which appears to hold in almost every language, allows us—when stated properly—to replace the girl with any other expression of the same syntactic category, any other noun phrase, for example, the woman. However, this prediction, in turn, is not surface true, as observed when considering the psychological difficulties associated with certain so-called center-embedding constructions. So, for example, replacing the girl with a more complex noun phrase such as the girl the woman likes yields a result that a speaker will never use and an addressee will find very difficult to understand: the boy [the girl the woman likes met]. So, we have a fairly general principle with a host of considerations arguing that it is correctly stated, yet it yields results that appear to be false on the surface. The consensus has been that, despite this surface falsity, the principle is, nevertheless, correct in its idealized form (as part of the grammar internalized by a speaker of the language, the speaker’s competence). The principle sometimes appears to be false, according to this consensus, due to interfering factors (performance factors) pertaining to how speakers make use of the principles of grammar (of their competence) in producing and comprehending sentences in real time. This distinction between competence and performance is none other than the distinction between idealized theory and noise, familiar from high school physics (e.g., the distinction between the theory of motion and friction). And a consequence of this distinction is that there is a notion of ‘being correct’ for natural languages (being grammatical) that is importantly distinct from the notion of being ‘probable’ or ‘expected.’ No such competence–performance distinction (and no distinction between grammaticality and expectedness) is designed into current generative AI systems for language (see Yngve, 1960; Miller & Chomsky, 1963; and Chomsky, 1965; and in the context of LLMs, see Fox & Katzir, 2023; Katzir, 2023; and further in section 2.1).

In contrast to some of the earlier works on generative modeling, which were driven by scientific understanding (Rumelhart & McClelland, 1986; Elman, 1993; Hinton, 2007), the development of modern generative AI technologies is largely use or performance driven: generative models of molecules can help scientists discover new drugs, image generation models allow nonexperts to create high-quality graphics at low cost, and LLMs enable computers to perform useful tasks involving language—such as answering a question given a user prompt. These models are generally trained on ‘raw’ data (e.g., pixels, words) and prioritize prediction over scientific understanding of the data domain. As an example, LLMs make use of enormous neural networks—computational models loosely inspired by the human brain that process data through layers of connected nodes (neurons)—trained over raw text to predict the next word as accurately as possible. And although it is possible in principle that a model with the right inductive biases could learn surface-false (but true) generalizations despite being trained only on raw surface form—figuring out the correct theory/idealization and understanding the various sources of noise that mask this theory—this is generally not the primary goal of modern generative AI.

Figure 1

An LLM takes the input context the boy that the girl met and passes it through a series of neural network layers (typically based on the transformer architecture; Vaswani et al., 2017) to obtain internal embedding representations of each word in context. These embedding representations are used to predict the next-word probabilities. Generation proceeds by sampling a word (or taking the most likely word) and then appending it to the previous input to obtain a new context.

1.2. The Architecture of LLMs

LLMs—which power popular AI applications such as ChatGPT—are fundamentally next-word prediction systems that process a sequence of words to output a probability distribution over the next word. For example, given the input the boy that the girl met, an LLM might assign a probability of 0.34 to was, 0.25 to in, 0.1 to at, 0.005 to were, and so on as the possible next word. To obtain these probabilities, LLMs pass the input through a series of neural network layers, each of which modifies the input representation into a new representation by applying a learnable transformation that is determined by the weights of the neural network. While the resulting numeric embedding representations—the ‘hidden layers’ of a neural network—are not humanly interpretable in their raw form, they have been shown to correlate with linguistic representations such as parse trees that within language science are taken to underlie human language (section 2). See Figure 1 for an overview.
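
To make this concrete, the following is a minimal sketch of how one can read off a next-word distribution from an off-the-shelf model. It assumes the Hugging Face transformers library and the publicly released GPT-2 weights; the specific probabilities given above are purely illustrative and will not be reproduced exactly by any particular model.

```python
# A hedged sketch, assuming Hugging Face `transformers` and GPT-2 weights.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("the boy that the girl met", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # one score per vocabulary item, per position
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next word

values, indices = probs.topk(5)                # five most probable continuations
for p, tok in zip(values, indices):
    print(f"{tokenizer.decode([int(tok)])!r}: {p.item():.3f}")
```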

We can generate continuations from an LLM by sampling a word from this distribution (or taking the most likely word), appending the newly sampled word to the input, and then feeding the updated input back into the LLM. An LLM is trained by iteratively updating its weights to optimize an objective function, namely, the probability the model assigns to each word in the training dataset given its preceding context. This method of optimization means that LLM training is not based on the principles of language structure posited by linguistic theory. And because LLMs work with probabilities, they can (and often do) assign nonzero probabilities to continuations that are grammatically incorrect (e.g., were in the above example). However, as we describe later in this paper, empirical analysis of LLM probabilities suggests that they often correlate with the patterns implied by the grammatical structures of linguistic theory—not necessarily an unexpected result if linguistic theory plays an important causal role in deriving the surface probabilities that LLMs attempt to uncover.
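
The sample-append-repeat loop just described can likewise be sketched in a few lines under the same assumptions (Hugging Face transformers, GPT-2); production systems add refinements such as temperature, top-k/top-p truncation, and batching, which are omitted here.

```python
# A hedged sketch of the generation loop, assuming Hugging Face `transformers` and GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("the boy that the girl met", return_tensors="pt").input_ids
for _ in range(10):                                        # generate ten more tokens
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)      # sample from the distribution
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)      # append and feed back in
print(tokenizer.decode(ids[0]))
```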

A key goal of the enterprise of generative grammar is to understand human linguistic abilities through the method of the sciences: uncovering regularities, discovering generalizations, proposing principles that might account for what is observed, and, throughout, applying so-called ‘inference to the best explanation.’ Here, we briefly illustrate how this goal has been pursued in the context of one of the most celebrated features of natural language structure, namely that units of language are hierarchically organized in a format that is directly relevant for the structure of thought. Sentences do not involve simply linear concatenation of words: words are grouped into phrases, and those phrases are themselves grouped into larger phrases, much like the units that appear within parentheses in various formal languages (e.g., the language of basic arithmetic). Grammatical principles and processes are sensitive to this hierarchical organization, as famously illustrated, for example, by subject–verb agreement. In 1a below, the main verb, is, agrees in number with the singular noun dog, not with the sentence’s first noun, days, or with the noun linearly closest to the verb, neighbors. If we try to let the verb agree with either of these alternative nouns (as in 1b), the result would be ungrammatical, conventionally indicated with the asterisk.

1a. Most days the dog that belongs to the neighbors is barking at the cat.

1b. *Most days the dog that belongs to the neighbors are barking at the cat.

The rule of subject–verb agreement is picked up by the child acquiring the language, even though in most English sentences the element that agrees with the verb is both the linearly first noun and the noun linearly closest to the verb. In other words, the rule holds, even though the predominant data available to the child is compatible with alternative hypotheses that make reference to linear order (first or linearly closest). The rule itself is actually rather simple but makes reference not to linear order but to a hierarchical structure in which the subject is the noun phrase that is structurally closest to the verb phrase, with the result that the structurally closest noun dog is the noun that agrees with the verb regardless of its linear proximity.

Figure 2

English subject–verb agreement illustrates the central role of hierarchical sentence structure in language. The main verb of the sentence, is, must agree in number with the subject of the sentence, the noun phrase that is itself singular (as its head word dog is singular). A subject cannot be defined solely in terms of linear ordering: rather, it is the noun that is structurally closest to the main verb. In this example, dog is the head word of the entire subject phrase, which means that it determines the number marking required on the main verb. In this example and others in this paper, the set of phrases that a word heads is indicated by the coloring of the branches of the tree structure.

The hierarchical sentence structure needed for stating the rule can be made explicit as a tree whose parts are licensed by a grammar specifying what structural components are possible in a sentence of the language. Once the structure is in place, we can state the relevant rule as demanding agreement between the verb and a noun that bears a particular structural relation to it (the structurally closest noun). A successful and active area of mathematical and empirical research in linguistics involves formally characterizing natural language grammars, the tree structures they license, and the set of relevant properties that might enter into rules of grammar, for example, the notion of closeness that enters into agreement rules and the notion of command that we illustrate below.
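
To make the contrast between structural and linear formulations concrete, here is a toy sketch that encodes a deliberately simplified constituency tree for example 1a and compares the structurally defined subject head with two linear-order heuristics. The tree shape and category labels are our own simplifications for illustration, not a serious grammatical analysis.

```python
# A toy, simplified constituency tree for example 1a. Each internal node is
# (label, children...); each leaf is (category, word).
TREE = (
    "S",
    ("AdjunctNP", ("Det", "most"), ("N", "days")),
    ("NP", ("Det", "the"), ("N", "dog"),
           ("RelClause", ("C", "that"),
               ("VP", ("V", "belongs"),
                   ("PP", ("P", "to"),
                       ("NP", ("Det", "the"), ("N", "neighbors")))))),
    ("VP", ("Aux", "is"), ("V", "barking"),
           ("PP", ("P", "at"), ("NP", ("Det", "the"), ("N", "cat")))),
)

def leaves(node):
    """Yield (category, word) leaves from left to right."""
    if len(node) == 2 and isinstance(node[1], str):
        yield node
    else:
        for child in node[1:]:
            yield from leaves(child)

def structural_subject_head(tree):
    """Head noun of the NP daughter of S: the noun structurally closest to the main verb phrase."""
    subject = next(child for child in tree[1:] if child[0] == "NP")
    return next(word for cat, word in leaves(subject) if cat == "N")

def first_noun(tree):
    return next(word for cat, word in leaves(tree) if cat == "N")

def noun_linearly_closest_to_verb(tree):
    ws = list(leaves(tree))
    verb_position = next(i for i, (cat, _) in enumerate(ws) if cat == "Aux")
    return next(word for cat, word in reversed(ws[:verb_position]) if cat == "N")

print(structural_subject_head(TREE))         # dog       -> the noun that agrees with "is"
print(first_noun(TREE))                      # days      -> wrong, linear heuristic
print(noun_linearly_closest_to_verb(TREE))   # neighbors -> wrong, linear heuristic
```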

Tree structures like that in Figure 2 play an important role throughout the grammar of any language. One way to see this is by considering their centrality to the expression of thought. Consider, for example, the ambiguity of a sentence such as Mary poked the man with a stick. This sentence can express the thought that Mary used a stick to poke the man. However, it can also express a different thought, namely that Mary poked a particular man from a set of men: the man who himself had a stick. This ambiguity is often referred to as a structural ambiguity: the two thoughts are distinguished by the way words are combined with each other to form larger structures in which different modes of combination yield different thoughts. Moreover, there are several independent ways to probe the structure of a sentence, all of which support this structural analysis. For example, word order can be permuted in certain ways, and permutations of this sort are directly constrained by the syntactic structure. Consider, for example, the following two questions:

2a. Which man did Mary poke with a stick?

2b. Which man with a stick did Mary poke?

Question 2b, in contrast to 2a, cannot be understood as a question about the identity of the man that Mary used a stick to poke. This is because the sentence involves displacement of a constituent, which man with a stick, and when this is a constituent, the thought expressed must be about a man who has a stick with him (rather than about a method of poking that involves a stick).

Looking across a wider range of hierarchically stated grammatical generalizations reveals deeper principles. To give a sense of this, we describe two additional phenomena that illustrate a single deeper principle involving the relative positioning of elements in a hierarchical tree structure. It turns out that numerous phenomena in language are sensitive to two fundamental relationships that can hold between nodes of a tree: sisterhood and dominance (and to their combination). Here, we review two such phenomena: generalizations involving negative polarity items such as the word any (similar elements include ever, in years, and lift a finger) and related generalizations involving reflexive pronouns like herself.

Consider example 3a below. You may have the intuition that the word any in this sentence somehow relates to the word no, which appears earlier in the sentence. This intuition is correct: if no is replaced with other elements of the same category, for example, the, some, or every, the sentence becomes anomalous, as illustrated in 3b.

3a. No teacher that the students like receives any praise.

3b. *The teacher that the students like receives any praise.

3c. *The teacher that no students like receives any praise.

No is not the only word that can license any here: certain other words or phrases, like few (fewer than seven), can do so as well. When their meanings are analyzed with the tools of logic, the words/phrases that can license any turn out to form a natural class. (The class is often called the class of downward entailing or entailment reversing phrases. A phrase X is entailment reversing if for every phrase Y and Y′, if Y entails Y′, then [X Y′] entails [X Y]). However, these words/phrases cannot just appear anywhere in the sentence to license any. Example 3c is anomalous because no is in the wrong place to license any. The condition turns out to be that any must be contained in (or dominated by) the sister of a licensing phrase with the specified semantic property, as illustrated in Figure 3. In generative linguistics, when the sister of a phrase X contains/dominates an element Y, X is said to command (or c-command) Y. This command relationship turns out to be relevant to a wide range of linguistic phenomena, including the two phenomena we describe here.

Figure 3

A negative polarity item like any requires a licensor that commands it—here, the sentence-initial no.

The second example involves reflexive pronouns like herself (see Charnavel, 2019 and references therein). The sentences in 4a through d share the same beginning: the woman felt the children had embarrassed, and in each case, this string of words is followed by an ordinary pronoun her/them or a reflexive pronoun herself/themselves. Interestingly, the structural configuration involved in the sentence determines the type of pronoun that can be used for every choice of referent: if the pronoun refers to the children, then a reflexive pronoun must be used; if it refers to the woman (or to any other individual), then a reflexive pronoun cannot be used (matching subscripts indicate coreference).

4a. The womani felt the childrenj had embarrassed themselvesj.

4b. *The womani felt the childrenj had embarrassed themj.

4c. *The womani felt the childrenj had embarrassed herselfi.

4d. The womani felt the childrenj had embarrassed heri.

It turns out that just as in the case of negative polarity items like any, the distribution of reflexive pronouns is characterized by the command relation: specifically, a reflexive pronoun must co-refer with the closest referring phrase that ‘commands’ it (Figure 4).

Figure 4

A reflexive pronoun like themselves must co-refer with the closest phrase that commands it.

The relevance of the command relation becomes obvious when we consider examples parallel to 4a in which the co-referring expression is linearly closest to the reflexive pronoun but does not command it. The result is unacceptable, just like 4c:

4e. *The woman felt that the brother of the childrenj criticized themselvesj.

The above two examples illustrate how a single simply stated property identified on the basis of tree structures, that of command, can achieve considerable explanatory generality. 
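
To make the definition concrete, the following toy sketch implements a c-command check over a simplified tree for example 3a, with nodes addressed by their paths from the root. The tree is our own simplification for illustration only; the same check underlies both the negative polarity and reflexive patterns described above.

```python
# A simplified tree for 3a, "No teacher that the students like receives any praise."
# Each node is (label, children...); nodes are addressed by paths (tuples of child indices).
TREE_3A = ("S",
           ("NP", ("Det", "no"), ("N", "teacher"),
                  ("RelClause", ("C", "that"),
                     ("S", ("NP", ("Det", "the"), ("N", "students")),
                           ("VP", ("V", "like"))))),
           ("VP", ("V", "receives"),
                  ("NP", ("Det", "any"), ("N", "praise"))))

def c_commands(x_path, y_path):
    """X c-commands Y iff a sister of X dominates (or is) Y, i.e., Y's path
    runs through X's parent but branches off to a different child than X."""
    if not x_path:                       # the root has no sisters
        return False
    parent, x_child = x_path[:-1], x_path[-1]
    return (y_path[:len(parent)] == parent
            and len(y_path) > len(parent)
            and y_path[len(parent)] != x_child)

SUBJECT_NP  = (0,)          # "no teacher that the students like"
EMBEDDED_NP = (0, 2, 1, 0)  # "the students", inside the relative clause
ANY         = (1, 1, 0)     # "any", inside the object NP

print(c_commands(SUBJECT_NP, ANY))   # True: the licensing configuration in 3a
print(c_commands(EMBEDDED_NP, ANY))  # False: analogous to the ill-formed 3c
```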

2.1. The Competence–Performance Distinction

As we mentioned in section 1.1, a key component of many scientific theories of language is the competence–performance distinction. Whereas knowledge of language—a system that derives the tree-structured grammatical descriptions discussed in the previous section—is relatively stable in the mind of an adult native speaker, using that knowledge in comprehension or production requires interface with memory, perceptual or motor mechanisms, and potentially other cognitive and physical capabilities or contingencies, collectively termed performance systems and factors. Performance is subject to constraints, which limit the use of knowledge (very roughly speaking, analogous to how executing a computer program requires control by the operating system and hardware resources such as memory and microprocessors). The underlying knowledge constitutes the speaker’s linguistic competence, whereas capturing the observable manifestations of that knowledge additionally requires understanding of performance systems. The distinction is needed to provide an accurate characterization of the factors that enter into an account of linguistic behavior. For example, it allows apparent deviations from grammatical knowledge to be explained as effects of performance constraints. In addition to the case of center-embedding discussed above, such deviations also arise in the case of subject–verb agreement. Although the rule of agreement is clearly internalized by native English speakers, they often make mistakes, as in example 5 below:

5. The key to the cabinets are on the table.

These mistakes are common enough in written English (and often not so easy to spot) that for a number of years, the New York Times ran a periodic column called Subject, Meet Verb to explain and highlight them. Theories with the competence–performance distinction can preserve the simple (and arguably correct) characterization of subject–verb agreement we described in this paper, while attributing apparent deviations to ‘performance errors,’ the topic of considerable psycholinguistic research (Bock & Miller, 1991). One behavioral consequence, particularly relevant to understanding the difference between language in humans versus LLMs, is the relationship between the notions of grammaticality and likelihood. For humans, the two are distinct: for example, when prompted with the start of the sentence, The key to the cabinets, example 5 may well be a likelier continuation than example 6 below:

6. The key to the cabinets was so thoroughly rusted that it was impossible to fit into the keyhole.

However, if the rule we described for subject–verb agreement in section 2 is correct, only the latter is grammatical. One reason to believe that the rule is indeed correct (and that when a speaker produces example 5, it is a performance error) comes from the observation that upon reflection (e.g., when the relationship between the subject and the verb is pointed out), native English speakers will generally agree that example 5 has an error but example 6 does not.

This is in sharp contrast to LLMs, whose behavior is exhaustively determined by their architecture together with their weights learned during training. LLM string probability distributions do not intrinsically involve a distinction between grammaticality and likelihood; LLMs can and do put higher probabilities on ungrammatical strings than on grammatical strings—for example, GPT-2 scores example 5 as 146 trillion times more likely than example 6. More broadly, LLMs do not have a built-in competence–performance distinction.

In an earlier era of connectionist research, the idea was advanced that neural network architectures would automatically converge on human linguistic performance limitations (e.g., Christiansen & Chater, 1999). However, this seems not to be the case: even GPT-2 shows ‘superhuman’ behavior in some respects on core instances of human performance limitations such as the multiple center-embedding discussed in section 1.1 (unlike the case of subject–verb agreement just discussed). Such cases offer a potential scientific opportunity: if we view apparent superhuman LLM behavior as converging toward what would be expected under human competence unconstrained by performance factors, then one can ask what additional apparatus is required to recapitulate human performance constraints (see Hahn et al., 2022 as an example of this approach). Even if it turns out to be unrealistic to view LLM behavior in this way, what we learn about the additional required apparatus might help advance our understanding of the interaction between competence and performance.

Two additional points regarding the competence–performance and grammaticality–likelihood distinctions might be useful to mention. First, some recent LLMs have been shown to exhibit ‘metalinguistic’ behavior (Beguš et al., 2023); for instance, when explicitly prompted with queries about whether each of examples 5 and 6 is grammatical, ChatGPT-4o stated that example 5 is ungrammatical due to subject–verb mismatch, whereas it stated that example 6 is ‘mostly grammatical’ with minor clarity issues. These metalinguistic abilities, and whether they would emerge without the prescriptive information about grammar that is plentiful in the training data, remain poorly understood (see also Hu & Levy, 2023). Second, there could be generalizable boundaries within LLMs’ embedding space corresponding to the grammaticality distinction (cf. the work of Warstadt et al., 2019, who learn supervised classifiers based on textbook syntax examples). The competence–performance and grammaticality–likelihood distinctions remain crucial for understanding language in the human mind and may be fruitful considerations for LLM research going forward.

3.1. Evaluating the Linguistic Capabilities of LLMs

As the above case studies illustrate, work in linguistics attempts to reveal deep underlying principles governing human language. LLMs are not designed to reveal underlying principles. Still, we might expect what is discovered in linguistics to be helpful in improving the performance of LLMs. Within linguistics there exists a rich suite of sentences (or minimally contrasting pairs, as we have seen) that have been instrumental in revealing or illustrating various properties (or laws) of human language. One important contribution of linguistics to LLM development is constructing rigorous unit tests that can reveal the successes and failures of LLMs on intricate linguistic phenomena. For example, if an LLM assigns higher probability to sentences such as Most days the dog that belongs to the neighbors is barking at the cat than to ungrammatical sentences like Most days the dog that belongs to the neighbors are barking at the cat, this would be consistent with the hypothesis that the LLM has acquired a generalization about subject–verb agreement that resembles the one predicted by the tree-based proposal described in section 2. In fact, in the case of syntax, there is already a substantial body of work on using the above framework to investigate the syntactic generalizations learned by language models (Marvin & Linzen, 2018; Futrell et al., 2019). Linguistics can enrich this research program further by developing more sophisticated evaluation suites. The successes and failures of LLMs on such evaluation suites can inform LLM development. For example, an LLM that is found to fail on a particular linguistic task could be ‘patched’ by training on relevant examples that might illustrate the governing principles and might subsequently be more useful for the purposes for which it is deployed.
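
As a concrete illustration of what such a unit test looks like in practice, the sketch below scores the minimal pair from example 1 with an off-the-shelf model. It assumes the Hugging Face transformers library and GPT-2 weights, and it is meant only to show the shape of a single test item, not a full evaluation suite.

```python
# A hedged sketch of a minimal-pair test, assuming Hugging Face `transformers` and GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    """Sum of per-token conditional log-probabilities of the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

good = "Most days the dog that belongs to the neighbors is barking at the cat."
bad  = "Most days the dog that belongs to the neighbors are barking at the cat."
print(sentence_logprob(good) > sentence_logprob(bad))
# True is consistent with (though not proof of) a hierarchy-sensitive agreement generalization.
```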

3.2. Informing New Model Architectures and Objectives

Current approaches for training LLMs incorporate remarkably little of what we know about how human language works, instead adopting a ‘tabula rasa’ approach wherein a flexible learner (i.e., a neural network) is trained on unimaginably large amounts of text. While this brute force approach to language learning has resulted in systems that are useful for many applications of interest, it requires training on datasets that are many orders of magnitude larger than are needed for human language acquisition, in part due to a model’s need to learn from scratch what is already known about human language (e.g., that hierarchical structures of a particular sort underlie human language). Addressing such sample inefficiency is especially important for extending LLMs to languages with a limited amount of digitized text as well as for capturing long-tailed language phenomena (such as sentences with deeply nested center embeddings, as noted previously). One way in which linguistics could contribute to generative AI, then, is by providing a set of linguistic laws that can potentially be incorporated into LLMs. One might hope that these laws could inform the development of new model architectures and training objectives in order to enable more sample-efficient learning. However, we should of course remember that the history of AI is full of examples of human knowledge–based approaches that were ultimately replaced by more general methods that turned out to be better placed to leverage computation and data (Sutton, 2019). But this does not mean that there cannot be ways of incorporating linguistic laws into LLMs in a way that is compatible with flexible and scalable learning. As an example, past work has shown that linguistic structures such as parse trees can be used to guide a language model’s intermediate activations (Strubell et al., 2018; Qian et al., 2021; Sartran et al., 2022); these works incorporate linguistic knowledge into the learning process but still make use of flexible learners that can learn from data.
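
As one illustration of what ‘guiding’ a model with structure could look like, the sketch below builds an attention mask from dependency head indices so that each token may attend only to itself and its syntactic ancestors. It is loosely inspired by syntactically informed self-attention rather than a reproduction of any particular published architecture, and the head indices would in practice come from a parser.

```python
# A hedged sketch of one way parse structure could constrain attention.
import torch

def ancestor_attention_mask(heads):
    """Boolean mask allowing each token to attend to itself and its syntactic
    ancestors. `heads` gives each token's 0-indexed head; the root points to itself."""
    n = len(heads)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j, seen = i, set()
        while j not in seen:          # climb from the token to the root
            mask[i, j] = True
            seen.add(j)
            j = heads[j]
    return mask

# "the dog barks": "the" -> "dog" (1), "dog" -> "barks" (2), "barks" is the root (2).
print(ancestor_attention_mask([1, 2, 2]))
# Disallowed positions could then be masked out (e.g., with a large negative bias)
# before the softmax in a transformer's attention layers.
```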

3.3. Expanding the Reasoning Capabilities of LLMs Through Structured Representations

LLMs have been shown to possess some degree of ‘reasoning’ abilities, especially when coupled with prompting techniques that encourage the LLM to generate the intermediate reasoning steps (in natural language) before generating an answer (Kojima et al., 2022; Wei et al., 2022; see Mahowald et al., 2024 for some cautionary remarks). However, this type of reasoning in ‘pure language space’ is unlikely to be optimal given that LLMs are trained primarily to predict the next word as accurately as possible. Linguistics (and cognitive science more broadly) provides a rich set of semantic/logical representations that can computationally operationalize certain aspects of reasoning. For example, a logical representation of a sentence (e.g., ∀x(Man(x) → Mortal(x)) for All men are mortal)—which is related to a sentence’s hierarchical representation (and, at least in principle, algorithmically derivable, as in Montague, 1970)—is a natural representation with which to perform deductive reasoning. The goal of obtaining a symbolic representation of a sentence that captures core aspects of its meaning and moreover enables computations (such as logical inference) to be performed on top of it was the motivation behind the classic task of semantic parsing in natural language processing. However, existing representations for semantic parsing are generally tailored to particular domains (e.g., a domain-specific language for querying databases) and thus do not scale well. Given the promise of more general-purpose representations that arguably come from formal semantics, a hybrid approach in which the LLM first predicts these representations (given a natural language sentence) and then offloads the reasoning process to another system that performs symbolic computations on top of the predicted semantic representations might expand the reasoning capabilities of generative AI systems in addition to making them more interpretable and robust. This type of hybrid approach is general and can (for example) be extended to cases that require probabilistic reasoning (Wong et al., 2023).
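
The following toy sketch illustrates the division of labor in such a hybrid: a stand-in ‘semantic parser’ (here, a hard-coded lookup playing the role an LLM would play) supplies logical forms, and a small symbolic engine performs the deduction. The logic fragment covers only atomic facts and simple universally quantified rules, which is enough to show the idea.

```python
# A hypothetical, hard-coded "semantic parser" standing in for an LLM.
PARSE = {
    "All men are mortal": ("rule", ("Man", "Mortal")),    # ∀x(Man(x) → Mortal(x))
    "Socrates is a man":  ("fact", ("Man", "socrates")),  # Man(socrates)
}

def forward_chain(facts, rules):
    """Apply ∀x P(x) → Q(x) rules until no new atomic facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for p, q in rules:
            for pred, arg in list(derived):
                if pred == p and (q, arg) not in derived:
                    derived.add((q, arg))
                    changed = True
    return derived

premises = ["All men are mortal", "Socrates is a man"]
facts = {body for kind, body in map(PARSE.get, premises) if kind == "fact"}
rules = [body for kind, body in map(PARSE.get, premises) if kind == "rule"]

print(("Mortal", "socrates") in forward_chain(facts, rules))   # True: Mortal(socrates) follows
```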

3.4. Characterizing the Computational Power and Limits of LLMs

One goal of mathematical linguistics is to precisely characterize the expressivity of different computational models over strings, for example, whether a particular model can recognize that a string has ‘typed balanced parentheses.’ While these types of characterizations seem simple at first glance, they are in fact deeply related to many phenomena that occur in natural language. For example, recognizing whether the string ( [ ] ) has balanced parentheses is related to understanding sentences with center embeddings (not to be confused with LLM vector embeddings) of the form (the mouse [the cat chased] hid). Despite their internal complexity, LLMs are, behaviorally, computational models that process strings, and thus, tools from mathematical linguistics can be used to study their computational power. And indeed, there is a rich body of work on studying the expressivity of the transformer architecture (as well as other architectures like recurrent neural networks) on which modern LLMs are built, which has yielded insights as to why these neural network architectures may or may not be well equipped to model certain types of phenomena (see Lan et al., 2024). In particular, transformers (under certain assumptions) have been shown to be limited in recognizing the above language of typed balanced parentheses (Hahn, 2020), which would imply that they would have difficulty modeling the full complexity of natural language. These types of fundamental insights could explain the current failure modes of LLMs and potentially lead to new classes of LLMs that can overcome the identified limitations.
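
For concreteness, here is the textbook stack-based recognizer for typed balanced parentheses; the interesting expressivity question is whether a given neural architecture can emulate the unbounded stack this algorithm relies on, or only a bounded approximation of it.

```python
# A stack-based recognizer for typed balanced parentheses (a Dyck language).
PAIRS = {")": "(", "]": "["}

def balanced(s):
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)                   # push an opening bracket
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False                   # wrong type, or nothing to close
    return not stack                           # everything opened was closed

print(balanced("([])"))   # True
print(balanced("([)]"))   # False: bracket types are crossed
# Bounding the stack to a fixed depth yields a recognizer that fails on deep
# nesting, loosely mirroring human difficulty with multiple center embeddings.
```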

4.1. Insights for Psycholinguistics

In generative linguistic theories, syntactic structure characterizes the major paths of meaning composition by which words recursively combine into larger and larger phrases, eventually accounting for core aspects of the meaning of an entire sentence—who did what to whom, properties of the participants in denoted events or states, scopal relations among logical operators, and more. Decades of psycholinguistic research indicate that syntactic structure identification and meaning composition is highly incremental: human readers and listeners do not wait until the end of a sentence to start analyzing it but instead rapidly recruit diverse information sources to determine the form and meaning of the linguistic input (Marslen-Wilson, 1975; Tanenhaus et al., 1995). As a simple demonstration of this point, when you read a sentence beginning Jamie was clearly intimidated. . ., no special effort is needed to come to expect that the next word might be by followed by the source of intimidation. This is just one example of how effortlessly the human mind combines multiple information sources—here, lexical (the word intimidated) and syntactic (the passive-voice context, without which the expectation would be for the intimidatee and not the intimidator)—to quickly analyze linguistic input and, further, of how language comprehension is ubiquitously predictive. To give the flavor of how LLMs might be applied to the study of incremental comprehension in the mind and brain, we now discuss one such application in some detail: the theory and modeling of cognitive effort in real-time human language comprehension.

Cognitive effort is a theoretical construct intended to account for the fact that difficulty in language processing is differential and localized: not all sentences are equally easy to understand, and not all words within a given sentence are equally easy to understand (Levy, 2008). If the cognitive effort required for a linguistic input in its context is high, the input may take longer to analyze, relate, and integrate with the context; if cognitive effort is too high, understanding may fail altogether. Consider, for example, cases of so-called garden-path syntactic disambiguation such as sentence 7 below:

7. As the dog scratched the experienced veterinarian removed its muzzle.

If you have not seen sentences like this before, you probably find example 7 at least momentarily confusing, and you may notice that the source of confusion is connected to a specific word, the verb removed. This word is confusing because the preceding part of the sentence, As the dog scratched the experienced veterinarian, is structurally ambiguous, with a human subjective preference for one interpretation (here, where the dog is scratching the veterinarian, Figure 5 top panel) over the other (here, where the dog is scratching itself, Figure 5 bottom panel).

Figure 5

Structural ambiguity in incremental processing of sentence 7. The panel on the top indicates the initially preferred interpretation, in which the experienced veterinarian is taken to be the object of the verb scratched inside the subordinate clause beginning the sentence. The next word, removed, is grammatically incompatible with this initially preferred interpretation, ruling it out, leaving the interpretation illustrated on the bottom, in which the experienced veterinarian is not inside the initial subordinate clause at all and instead is the subject of the main clause, and the sentence thus far is interpreted as saying that the dog is scratching itself. These trees are labeled T1 and T2 for additional reference later in this section.

The word removed is compatible only with the latter, initially unpreferred, interpretation, hence the confusion. If the former, initially preferred, interpretation were blocked, for example, with the introduction of an appropriately placed comma as in sentence 8 below, then no confusion would ensue:

8. As the dog scratched, the experienced veterinarian removed its muzzle.

This confusion at removed (in the absence of a comma) is quantifiable behaviorally in various ways: how long people take to read that word in sentence 7 relative to reading it in sentence 8 (Frazier & Rayner, 1982; Staub, 2007), differential brain responses (Osterhout & Holcomb, 1992), and subjective reports of failure to understand the sentence. Theoretically, these measurably different processing signatures of removed in sentence 7 over 8 are taken as indications of differences in cognitive effort, and a question for psycholinguistics is how to account for this difference.

Traditional theories designed to account for this difference in cognitive effort, which can be traced to some of the earliest work in cognitive science (Miller & Chomsky, 1963; Fodor et al., 1974; Frazier & Fodor, 1978), propose that features of a sentence’s syntactic structure enter directly into the determination of cognitive effort. Such theories account for the difficulty at removed in this sentence as follows. Because sentence processing is highly incremental, the comprehender has already analyzed the veterinarian as the object of the preceding verb scratched, but if this analysis were correct, the main verb of the sentence removed should not come next because the main subject of the sentence has not yet appeared, and English word order is subject–verb–object (i.e., the verb follows the subject). So, for comprehension to succeed, that analysis must be undone, and the veterinarian must be then correctly analyzed as the subject of removed. Such theories posit that structural reanalysis processes directly contribute to cognitive effort.

Levy (2008), building on earlier work by Hale (2001), proposed an alternative theory of cognitive effort in syntactic processing in which garden-path disambiguation effects were subsumed as a special case of prediction in language. In cases not involving syntactic ambiguity, it has long been known that highly predictable words are particularly low effort to process as evidenced by reading times (Ehrlich & Rayner, 1981) and brain responses (Kutas & Hillyard, 1980). For example, the word most likely to appear next in the context

9. The children went outside to… 

is read faster than if it appears in the context 

10. My brother came inside to…

even though it is a perfectly plausible continuation in both contexts. Traditionally, these linguistic expectations were estimated using various forms of the Cloze task, which involves asking native speakers to guess what word occurs in a given context. When examples (9) and (10) are used in the Cloze task (i.e., giving the beginning of a sentence as preceding context, the most common Cloze task variant in psycholinguistics), play is indeed far more likely to be guessed in example (9) than (10). Applying that idea to our garden-path contrast in examples 7 and 8 above, it seems clear that however probable removed may be in example 8, it is even less probable in example 7, since in 7, one’s bets are on an interpretation that does not even permit a verb at that point. However, removed is not a particularly likely next word in either context, and historically, the prevailing view in psycholinguistics was that word predictability effects only mattered in the relatively high-probability realm (e.g., among words with perhaps 10% probability or above). Hale (2001) and Levy (2008) hypothesized that word predictability effects might be best characterized in terms of surprisal (negative log probability, perhaps the most fundamental information-theoretic quantity), in which it is probability ratios rather than differences that matter so that two words that both have very small probabilities in context might nevertheless differ dramatically in the cognitive effort they impose on the comprehender. This hypothesis is not practical to test using the Cloze task because enormous participant sample sizes would be required to confidently estimate differences among very small probabilities (potentially well below 1%). However, the hypothesis is testable using probabilistic language models trained on large corpora. Smith and Levy (2013) showed, contrary to the conventional wisdom of the time, that word predictability effects on reading times are indeed linear in surprisal. This earlier work used n-gram language models (next-word probabilities conditioned on only a few words of preceding context), which cannot capture the potentially long-distance syntactic effects involved in garden pathing. However, the linear relationship between surprisal and reading times has held up and indeed been generalized across numerous languages in more recent work using LLMs (Wilcox et al., 2023; Shain et al., 2024), which do capture syntactic effects as reviewed in section 3 (Evaluating the Linguistic Capabilities of LLMs). 

Figure 6

Regressing human reading times against LLMs’ next-word probabilities reveals a linear contribution of a word’s surprisal (log inverse probability in context, a basic information-theoretic measure of information content) to how long the word takes to process (figure from Wilcox et al., 2023).

These results have set the stage to test a particularly strong version of surprisal theory posited by Levy (2008), which is that the relationship between the grammatical representation of a linguistic input and cognitive effort is entirely mediated by surprisal, that is, that surprisal is a ‘causal bottleneck’ between the two. To understand this, consider how the initial ambiguity of sentence 7 is implicated in the surprisal of the disambiguating word removed. Let us denote the grammatical representation of the sentence up to but not including removed as a random variable T. From the law of total probability, we can see that the probability of the next word being removed is the weighted sum of the probability of removed under each possible value of T, where the weights are the conditional probabilities of the various values of T given the context preceding it. Slightly simplifying by assuming that the only values of T are those illustrated in the two incremental parses shown in Figure 5, we can write this probability as in Equation A below:

P(removed | Context) = P(removed | T1, Context) × P(T1 | Context) + P(removed | T2, Context) × P(T2 | Context)    (Equation A)

This analysis shows that removed should have much higher surprisal in example 7 than in 8, that is, that P(removed | Context 7) ≪ P(removed | Context 8). The grammatical interpretation illustrated in the top panel of Figure 5, T1, is initially strongly preferred but does not permit a verb to come next, so P(T1 | Context 7) ≫ P(T2 | Context 7) but P(removed | T1, Context 7) ≈ 0. And since the grammatical signal of the comma in 8 is nothing more than that the subordinate clause has ended, we have that P(T2 | Context 8) ≈ 1 and P(removed | T2, Context 7) ≈ P(removed | T2, Context 8). This pattern is exactly what we see in LLMs: in GPT-2, for example, the word removed has 14.9 bits of surprisal in context 7 but only 7.7 bits in context 8. So, qualitatively, LLMs capture this grammatical garden-pathing effect.
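
A sketch of how such surprisal values can be computed is given below, assuming the Hugging Face transformers library and GPT-2. Exact bit values depend on the model version and on tokenization details (for instance, how a word is split into subword tokens), so the figures quoted above should be treated as illustrative.

```python
# A hedged sketch of per-word surprisal computation with GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal_bits(context, word):
    """-log2 P(word | context), summed over the word's subword tokens."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    total = 0.0
    for wid in word_ids[0]:
        with torch.no_grad():
            logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
        total += -logprobs[wid].item() / math.log(2)        # nats -> bits
        ids = torch.cat([ids, wid.view(1, 1)], dim=1)       # extend the context
    return total

print(surprisal_bits("As the dog scratched the experienced veterinarian", "removed"))
print(surprisal_bits("As the dog scratched, the experienced veterinarian", "removed"))
```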

However, the quantitative relationship between surprisal and reading times might allow us to make further headway toward testing the causal bottleneck hypothesis. If this hypothesis is correct, and if LLMs give sufficiently good estimates of subjective probabilities during language comprehension, then LLM surprisal should predict cognitive effort just as well for words that involve grammatically based structural expectation violations as for words that do not. However, van Schijndel and Linzen (2018, 2020), Wilcox et al. (2021), and Huang et al. (2024) have shown that this is not the case: cognitive effort, as measured by processing times in various types of reading tasks, is greater for words that involve structural expectation violations, like removed in example 7, than for those that do not, like removed in example 8 (Figure 7). This implies that either LLM surprisal estimates are systematically miscalibrated in cases involving structural expectation violation (which would require an explanation of its own) or that surprisal is not a complete causal bottleneck between word-level expectations and cognitive effort in real-time language comprehension.

Figure 7

The relationship between GPT-2–estimated surprisal (x-axis) and the sum of the predicted processing time effect due to surprisal and residual error (y-axis) in a multiple linear regression of word-by-word processing times against surprisal and other linguistic variables (data replotted from Wilcox et al., 2021). Processing times for words that involve expectation violations regarding grammatical structure (red) are systematically underpredicted, whereas processing times in corresponding versions of the sentences in which those words do not involve structural expectation violations (blue) are predicted well. The surprisal–processing time relationship, shown as a solid black line, is calibrated using a separate set of words that do not involve structural expectation violations. Semitransparent lines show standard errors of word-specific mean processing times.

4.2. Predicting Brain Responses during Language Comprehension

Space precludes us from detailed discussion, but we briefly note that LLMs’ internal representations themselves have been argued to have predictive power for brain activation patterns in response to linguistic stimuli, as illustrated in Figure 8a (Gauthier & Levy, 2019; Schrimpf et al., 2021; Goldstein et al., 2022; Caucheteux & King, 2022). This finding is a potentially useful basis for further inquiry into aspects of the neural processing of an intended linguistic message (e.g., Tuckute et al., 2024) and for applications such as decoding language from brain activation patterns, as illustrated in Figure 8b (Tang et al., 2023; Silva et al., 2024). The theoretical relevance of this work for what is known about the neurocognitive machinery involved in the representation and processing of language remains to be discussed. Among other things, it is important to understand whether the predictive power extends from surface regularities to the abstract properties of the linguistic representation (section 2), especially in light of the competence–performance distinction discussed in section 2.1 and other discoveries pertaining to neurocognitive representation of language (e.g., Makuuchi et al., 2013; Friederici, 2017; or Grodzinsky et al., 2020).

Figure 8

The pipeline for encoding models of brain response using LLM representations (Schrimpf et al., 2021; Caucheteux & King, 2022; Goldstein et al., 2022). Brain-imaging responses to linguistic stimuli are regressed against LLMs’ internal embedding representations of those stimuli (left). This encoding model can then be used to generate language corresponding to brain activation patterns arising during comprehension of new linguistic stimuli or even during semantic processing of nonlinguistic input such as silent videos through algorithms that search for word sequences that the LLM scores as linguistically plausible and that the encoding model predicts would give rise to brain response patterns similar to those recorded (right). Figure generously provided by Jerry Tang.
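
The encoding-model step of this pipeline amounts to a regularized regression from LLM embeddings to imaging responses. The sketch below shows that step with scikit-learn's ridge regression on randomly generated placeholder arrays; in a real study, X would hold model activations for each stimulus and Y the corresponding brain responses, with cross-validation and per-voxel regularization handled far more carefully.

```python
# A hedged sketch of an encoding model on synthetic placeholder data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))                     # stand-in for LLM embeddings per stimulus
W = rng.normal(size=(768, 50))
Y = X @ W + rng.normal(scale=5.0, size=(500, 50))   # stand-in for responses at 50 voxels/sensors

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
encoder = Ridge(alpha=10.0).fit(X_tr, Y_tr)         # linear map from embeddings to responses
print("held-out R^2:", encoder.score(X_te, Y_te))   # predictivity on unseen stimuli
```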

4.3. Developing and Testing Quantitative Predictions in Language Science More Broadly

The previous section offered an example of how theories about language make quantitative predictions and of how LLMs might facilitate testing those predictions. Identifying additional cases of this sort might allow us to find other areas in which LLMs can be useful for the advancement of the science of language.

One such case might arise in the context of quantitative investigations of linguistic theories through analyzing databases of naturalistic language use. In this case, the range of questions that can be formulated and addressed depends on the database’s coverage, not only in breadth but in the depth with which the linguistic structures in the dataset are characterized. For example, the availability of Universal Dependencies (Nivre et al., 2016; De Marneffe et al., 2021), a rich multilingual database that provides accompanying linguistic parse tree annotations in addition to raw text, has enriched the investigation of a proposed quantitative linguistic universal, dependency length minimization: that there is a preference to minimize the disparity between linear distance and structural distance between words in a sentence (Ferrer i Cancho, 2004; Gildea & Temperley, 2007, 2010; Park & Levy, 2009; Futrell et al., 2015). The universal is consistent with predictions that might be made under very different types of theories of the pressures shaping language structures. Formal theories dating to the 1980s (Chomsky, 1981; Baker, 2002) posit a head directionality parameter, which might be accompanied by an inductive bias preferring a consistent (across phrase types) linear positioning of syntactic heads within their phrases. On the other hand, processing-based theories (Hawkins, 1994; Gibson, 1998; see also Behaghel, 1932) predict this pattern by positing—based on principles similar to those responsible for the difficulty of center-embedding constructions as described earlier in the paper—that disparities between linear and structural distance lead to extra cognitive effort in language comprehension and/or production and that pressure to avoid such cognitive effort can shape properties of language. These theories make different predictions at more granular levels, however, and can potentially be empirically distinguished through linguistically detailed computational analysis. Efforts by Futrell et al. (2015) and Hahn et al. (2020) to do so have so far favored processing-based theories (see also Gibson et al., 2019 for an overview of this research strategy more broadly), but in our view, there is room for further investigation.

This is an example of the potential value to basic language science of high-quality datasets characterizing the patterns of linguistic structure in naturalistic language. However, due to the expertise required for linguistic annotation, such databases are expensive to develop at scale. LLMs, and machine learning technology more broadly, might facilitate database development by lowering the costs associated with manual annotation. In particular, LLMs can generally be specialized to specific linguistic tasks (e.g., part-of-speech tagging, syntactic parsing) with many fewer annotated examples than are typically needed in traditional approaches. Such specialized LLMs could be applied on unannotated text to widen a database’s coverage in terms of both the number of languages and the number of sentences within a language. These annotations may of course contain errors—and these errors may be systematically related to an LLM’s linguistic performance—something that would need to be taken into account.
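
To illustrate the kind of automatic annotation involved, the sketch below runs an off-the-shelf parser over a sentence and then computes its total dependency length, the summed linear distance between each word and its syntactic head, which is the quantity at stake in the studies above. It assumes spaCy and its small English model (en_core_web_sm) are installed, and, as noted, automatically produced parses can contain errors.

```python
# A hedged sketch of automatic annotation, assuming spaCy and en_core_web_sm.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Most days the dog that belongs to the neighbors is barking at the cat.")

# Automatically produced annotations: part of speech, dependency relation, and head.
for tok in doc:
    print(tok.text, tok.pos_, tok.dep_, tok.head.text)

# Total dependency length: sum of linear distances between each word and its head
# (spaCy marks the root as its own head, so it is excluded from the sum).
print(sum(abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok))
```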

4.4. Learnability Considerations

Language has historically played a major part in the study of the role of the innate state of the human mind. Because language is central for the generation and expression of thought and because the languages of the world differ from one another in form and structure, accounts of language learnability have potentially broad implications in the explanation of the range of mental representations that can be entertained as the mind develops. One consideration, among many, entering into this inquiry is the Poverty of the Stimulus (Chomsky, 1980; Clark & Lappin, 2010): the idea that linguistic experience underdetermines the linguistic state of an adult native speaker (the knowledge that allows a speaker to generate linguistic representations of the sort discussed in section 2) and that therefore innate constraints need to figure prominently in the explanation of how this state is attained. A famous example brought up in this context is the rule for polar question formation in English, an instance of the principle that rules refer to hierarchical linguistic structure. This rule is exemplified in example 11 below:

11a. The boy who has talked can read.

11b. Can the boy who has talked read?

11c. *Has the boy who talked can read?

To form a yes/no polar question out of 11a, the main verb—defined in terms of the sentence’s hierarchical structure, as described in section 2—is moved to the beginning of the sentence, yielding 11b. It is rare for the main verb of an English sentence not to also be the linearly first verb appearing in the sentence, yet children never seem to acquire a generalization for polar question formation based on linear order.

There are many questions one can ask at this stage. One question relates to explanation. Specifically, what is the explanation of the attested generalization? One might think that the question is misplaced—that the generalization is a simple accident of history, one of an unstructured set of possible historical accidents that the child learning the language needs to acquire. However, there is another possibility. 

In section 2, we spoke about a similar generalization pertaining to the identity of the noun phrase that the verb agrees with, and we saw that it is always the structurally closest noun phrase (rather than the linearly closest noun phrase). Here, we are asking a question about the identity of the verb that is fronted to the beginning of the sentence, and again we see that structural rather than linear closeness is the deciding factor: it is the verb structurally closest to the root of the tree, not the linearly first verb, that moves to the beginning of the sentence. So, one might propose that there is a linguistic law underlying the two generalizations relating to structural closeness and the possible relationships that can, in principle, exist among different positions in a hierarchical representation (explaining what kind of movement and agreement rules can exist). If this is true, it is reasonable to conclude that the child does not need to learn important aspects of the generalization and that these instead reflect abstract truths that distinguish possible from impossible languages—facts about the nature of the world (e.g., the innate structures of the mind).

This possibility raises a second question about linguistic typology. Specifically, different explanations of attested generalizations, in terms of fundamental linguistic laws, lead to cross-linguistic predictions that can be studied. And indeed, there is a typologically diverse set of languages that have rules similar to verb fronting, and it is always the main verb of the sentence that is fronted, never (as far as we know) the verb that appears first in a linear string (see, e.g., Holmberg, 2015 for verb fronting and Deal, 2023 for agreement).

A third relevant question is whether a generalization is learnable from the data available to the child. A finding that it is not would strengthen the case for innately specified constraints or biases entering into the account of the generalization. Here, LLMs might be useful in helping us explore what kind of information can be extracted from the data available to the child. 

In the case of polar question formation, the increasing availability of large publicly available corpora over the past three decades has led to disputes over the precise frequency of occurrence of examples like 11b that could be diagnostic of the correct polar question formation rule for a language learner (Pullum & Scholz, 2002). In tandem with these large corpora, we can investigate what happens when a specific neural network model is trained on a specific dataset intended to provide an approximate characterization of a child learner’s linguistic experience: what generalizations does the model acquire? This has become an active area of study, and the theoretical significance of the results that have been obtained is an area of some debate and controversy, for several reasons.

First, the neural network–based learnability studies seen in recent years generally involve experiments on specific trained models using specific sets of linguistic materials. For example, Yedetore et al. (2023) trained long short-term memory networks (LSTMs) and transformers on child-directed speech from CHILDES (MacWhinney, 2000) and tested them on polar questions using two different methods. The first method involved standard LLM-style training on a 9.6-million-word sample (roughly a year's worth of linguistic experience for a typical child in the United States; Hart & Risley, 1995), followed by testing which of example 11b, example 11c, and four other ungrammatical variants the trained models assigned the highest probability. Using this method, the models performed below chance, preferring 11b less than one-sixth of the time. The second method involved training (using a smaller dataset of 189,359 sentence pairs derived from CHILDES by Pearl & Sprouse, 2013) and testing on a sentence-pair task: given a declarative input sentence of the form in example 11a, generate the corresponding polar question. Using the identity of the first word of the model's generated polar question as a diagnostic for the nature of its generalization, the transformer performed very well under this second method (above 85% accuracy, relative to a baseline of at most 50%). As this example illustrates, method and task formulation may have considerable influence on conclusions regarding an LLM's learning outcomes.
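The first (probability-comparison) method can be illustrated with a minimal sketch along the following lines. This is not the Yedetore et al. (2023) setup: it scores candidates with an off-the-shelf pretrained model ("gpt2" via the Hugging Face transformers library) rather than with models trained on CHILDES, and the candidate questions below are illustrative stand-ins rather than the paper's actual test items:

```python
# A minimal sketch: score candidate polar questions under a causal language
# model and see which one receives the highest total log-probability.
# Model choice ("gpt2") and the candidate sentences are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log-probabilities of each token given its left context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Token i is predicted from positions < i, so align logits[:-1] with ids[1:].
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    return token_log_probs.sum().item()

candidates = [
    "Has the boy who is sleeping eaten?",  # formed by the hierarchical rule
    "Is the boy who sleeping has eaten?",  # formed by the linear-order rule
]
for c in candidates:
    print(f"{sentence_logprob(c):8.2f}  {c}")
print("Preferred:", max(candidates, key=sentence_logprob))
```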

Second, even for a given task formulation, it is not at all trivial to go from a specific set of experimental tests using a specific (set of) model(s) to a broader characterization of the generalization acquired regarding the linguistic pattern that the tests are designed to evaluate. For instance, both verbs in sentence 11a are auxiliary verbs, which have especially simple behavior when they are the main verb in polar question formation: they move to the front of the sentence, as in example 11b. Other verbs lead to more complex behavior involving what is called do-insertion—for example, converting Mary sleeps to a polar question leads not to *Sleeps Mary? but to Does Mary sleep?—and there is no guarantee that a model performing well on forming polar questions involving auxiliary verbs would perform similarly on forming polar questions using other verbs. Essentially, any such set of tests presents researchers with the classic problem of scientific induction (Goodman, 1955). This is in fact exactly analogous to the scientific induction problem faced by linguists in hypothesizing a grammatical description of a language based on finite evidence from native speakers, but there are no guarantees that the inductive strategies that seem to have worked well in the development of linguistic theory will be equally reliable for understanding neural networks’ linguistic generalizations.

Third, for the science of mind, one goal of learnability studies is to place bounds on what humans can learn from human-scale experience, but there are difficulties with treating such results as either lower or upper bounds for humans. If one finds that a specific computational architecture and algorithm acquires a generalization (according to some agreed-upon criterion) from a certain quantity of linguistic data, this establishes a lower bound on what can, in principle, be learned from that data, and computational architectures and algorithms can always be improved upon. It does not, however, straightforwardly translate into a lower bound on what a human could learn from the corresponding amount of experience, because a child's experience is not the same object as the model's training data: it includes coupled nonlinguistic input (e.g., the visual and social environment) that may contain additional useful information for the learner. The converse problem arises as well: modern computational models are unlikely to offer compelling upper bounds on human learning, because they are not subject to the same perceptual and cognitive constraints (e.g., memory systems) as a human learner.

Finally, the theoretical significance of the results on learnability should be evaluated in conjunction with considerations of the sort introduced at the outset about the nature of linguistic laws. We expect that the interaction between these different considerations will continue to be a topic of lively research in the foreseeable future. 

In the previous sections, we have reviewed the architectural bases of generative AI for language and scientific theories of generative linguistics and considered how these different approaches can inform each other. One can also envision an ambitious and speculative broader goal for the interplay between the two: developing methods for mapping back and forth between human-interpretable linguistic theories and LLMs. We review initial progress in this direction.

As described earlier, LLMs’ next-word predictions often seem to reflect features of language’s hierarchical grammatical structure (although note the absence of a built-in grammaticality–likelihood distinction, as described in section 2.1). This finding alone does not offer a holistic characterization of the linguistic patterns captured by LLMs. However, there is a body of work on probing embedding representations for linguistic structure (Hewitt & Liang, 2019; Belinkov, 2022). For example, Hewitt & Manning (2019) showed that the embedding representations of sentences in BERT, an influential predecessor to modern LLMs (Devlin et al., 2019), could be linearly mapped to a vector space in which words that are closer to each other are also structurally closer in the sentence’s syntactic tree, allowing (with the introduction of additional constraints) the sentence’s grammatical structure to be decoded from the embedding representation, as illustrated in Figure 9. Methods have also been developed to determine whether the embedding-space information used by these probes is causally implicated in LLMs’ behavior in the ways that would be predicted by linguistic theory (Geiger et al., 2021; Tucker et al., 2021).
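The core idea of the structural probe can be conveyed with a minimal sketch along the following lines, an assumption-laden simplification of the Hewitt & Manning (2019) method: a linear map B is trained so that squared distances between transformed embeddings approximate path distances in the gold syntactic tree. The embedding dimension, probe rank, and the random stand-ins for BERT embeddings and gold tree distances are illustrative assumptions; the original work trains and evaluates on treebank-annotated corpora:

```python
# A minimal sketch of a structural probe: learn a linear map B such that
# ||B(h_i - h_j)||^2 approximates the syntactic tree distance between words i and j.
import torch

def probe_loss(B, embeddings, tree_distances):
    """L1 loss between probe-predicted squared distances and gold tree distances."""
    transformed = embeddings @ B.T                    # (n_words, rank)
    diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
    pred_sq_dists = (diffs ** 2).sum(-1)              # (n_words, n_words)
    return (pred_sq_dists - tree_distances).abs().mean()

dim, rank, n_words = 768, 64, 10                      # assumed sizes
B = torch.randn(rank, dim, requires_grad=True)
optimizer = torch.optim.Adam([B], lr=1e-3)

# Toy stand-ins for one sentence's contextual embeddings and its gold tree distances.
emb = torch.randn(n_words, dim)
gold = torch.randint(1, 6, (n_words, n_words)).float()
gold = (gold + gold.T) / 2                            # symmetrize
gold.fill_diagonal_(0)                                # a word is at distance 0 from itself

for step in range(200):
    optimizer.zero_grad()
    loss = probe_loss(B, emb, gold)
    loss.backward()
    optimizer.step()

print("final probe loss:", probe_loss(B, emb, gold).item())
```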

Work in this direction is in its infancy; it has been harder, for example, to develop such probing methods for autoregressive models (Eisape et al., 2022). However, developing more ambitious methods for mapping back and forth between traditional grammatical models and LLMs is a potentially high-impact area for the future interplay between generative AI and language science. Methods for mapping probabilistic grammars directly into neural network weights could be used to set LLMs’ inductive bias for better learning, especially for low-resource languages. Conversely, methods for mapping from pretrained LLMs to grammatical models might enable automated generation of scientifically testable hypotheses about the structure of a language on the basis of text samples.

Figure 9

Using ‘structural’ probes to recover hierarchical sentence structure from BERT (Devlin et al., 2019), a predecessor to today’s LLMs. BERT’s internal ‘contextualized’ embedding representation is linearly mapped to a space in which words that are closer together are more closely related in their linguistic tree-structural representation (Hewitt & Manning, 2019). Figure from Manning et al. (2020).

The science of language has at its disposal a growing array of technological tools, of which LLMs are the newest and among the most prominent. We have attempted to illustrate ways in which language science and LLM technology might each stand to benefit if research proceeds from a firm understanding of both deep learning and the theoretical goals and achievements of generative linguistics. We have briefly surveyed how language science can contribute to LLM development through evaluation strategies, ideas for architectures and training objectives, meaning-oriented representations, and formal tools for mathematical analysis. Conversely, we have described how LLMs have already played a role in psycholinguistics and neurolinguistics, how they might accelerate the development of datasets suitable for deep linguistic analysis, and how they might contribute towards an understanding of learnability considerations in the linguistic domain (this last being a topic of current controversy). We have also outlined the broader, more ambitious goal of mapping more comprehensively between LLMs’ internal representations and human-readable theoretical descriptions of natural language structure, briefly describing work to date in this area as well as its limitations. We invite readers of this paper to continue to strengthen their foundational knowledge in deep learning and linguistic theory alike and to add to the brief list of research areas and ideas we have outlined here.

Baker, Mark. The Atoms of Language. Oxford University Press, 2002.

Beguš, Gašper, Maksymilian Dąbkowski, and Ryan Rhodes. “Large Linguistic Models: Analyzing Theoretical Linguistic Abilities of LLMs.” arXiv (2023). https://doi.org/10.48550/arXiv.2305.00948.

Behaghel, Otto. Deutsche Syntax: Eine geschichtliche Darstellung. Vol. 4: Wortstellung-Periodenbau. Universitätsverlag Winter GmbH Heidelberg, 1932.

Belinkov, Yonatan. “Probing Classifiers: Promises, Shortcomings, and Advances.” Computational Linguistics 48, no. 1 (2022): 207–19. https://doi.org/10.1162/coli_a_00422.

Bock, Kathryn, and Carol A. Miller. “Broken Agreement.” Cognitive Psychology 23 (1991): 45–93. https://doi.org/10.1016/0010-0285(91)90003-7.

Bresnan, Joan. Lexical-Functional Syntax. Blackwell, 2001.

Caucheteux, Charlotte, and Jean-Rémi King. “Brains and Algorithms Partially Converge in Natural Language Processing.” Communications Biology 5, no. 1 (2022): 134. https://doi.org/10.1038/s42003-022-03036-1.

Charnavel, Isabelle. Locality and Logophoricity: A Theory of Exempt Anaphora. Oxford University Press, 2019.

Chierchia, Gennaro. Logic in Grammar. Oxford University Press, 2013.

Chomsky, Noam. Syntactic Structures. Mouton, 1957.

Chomsky, Noam. Aspects of the Theory of Syntax. MIT Press, 1965.

Chomsky, Noam. Rules and Representations. Columbia University Press, 1980.

Chomsky, Noam. Lectures on Government and Binding. Foris Publishers, 1981.

Chomsky, Noam. The Minimalist Program. MIT Press, 1995.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, et al. “PaLM: Scaling Language Modeling with Pathways.” Journal of Machine Learning Research 24, no. 1 (2023): 11324–436.

Christiansen, Morten H., and Nick Chater. “Toward a Connectionist Model of Recursion in Human Linguistic Performance.” Cognitive Science 23, no. 2 (1999): 157–205. https://doi.org/10.1016/S0364-0213(99)00003-8.

Clark, Alexander, and Shalom Lappin. Linguistic Nativism and the Poverty of the Stimulus. Wiley, 2010.

Crnič, Luka. “Non-monotonicity in NPI Licensing.” Natural Language Semantics 22, no. 2 (2014): 169–217. https://doi.org/10.1007/s11050-014-9104-6.

D’Alessandro, Roberta. “A Short History of Agree.” In The Cambridge Handbook of Minimalism, in press. https://ling.auf.net/lingbuzz/005888.

De Marneffe, Marie-Catherine, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. “Universal Dependencies.” Computational Linguistics 47, no. 2 (2021): 255–308. https://doi.org/10.1162/coli_a_00402.

Deal, Amy Rose. “Interaction, Satisfaction, and the PCC.” Linguistic Inquiry 55, no. 1 (2023): 39–94. https://doi.org/10.1162/ling_a_00455.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019.

Elman, Jeffrey L. “Learning and Development in Neural Networks: The Importance of Starting Small.” Cognition 48, no. 1 (1993): 71–99. https://doi.org/10.1016/0010-0277(93)90058-4.

Ehrlich, Susan F., and Keith Rayner. “Contextual Effects on Word Perception and Eye Movements during Reading.” Journal of Verbal Learning and Verbal Behavior 20 (1981): 641–55. https://doi.org/10.1016/S0022-5371(81)90220-6.

Eisape, Tiwalayo, Vineet Gangireddy, Roger P. Levy, and Yoon Kim. “Probing for Incremental Parse States in Autoregressive Language Models.” In Findings of the Association for Computational Linguistics. Association for Computational Linguistics, 2022.

Ferrer i Cancho, Ramon. “Euclidean Distance between Syntactically Linked Words.” Physical Review E—Statistical, Nonlinear, and Soft Matter Physics 70, no. 5 (2004): 056135. https://doi.org/10.1103/PhysRevE.70.056135.

Fodor, Jerry A., Thomas G. Bever, and Merrill F. Garrett. The Psychology of Language: An Introduction to Psycholinguistics and Generative Grammar. McGraw-Hill, 1974.

Fox, Danny, and Roni Katzir. “Large Language Models and Theoretical Linguistics.” Theoretical Linguistics 50, no. 1-2 (2024): 71–76. https://doi.org/10.1515/tl-2024-2005.

Frazier, Lyn, and Janet Dean Fodor. “The Sausage Machine: A New Two-Stage Parsing Model.” Cognition 6, no. 4 (1978): 291–325. https://doi.org/10.1016/0010-0277(78)90002-1.

Frazier, Lyn, and Keith Rayner. “Making and Correcting Errors during Sentence Comprehension: Eye Movements in the Analysis of Structurally Ambiguous Sentences.” Cognitive Psychology 14 (1982): 178–210. https://doi.org/10.1016/0010-0285(82)90008-1.

Friederici, Angela D. Language in Our Brain: The Origins of a Uniquely Human Capacity. MIT Press, 2017.

Futrell, Richard, Kyle Mahowald, and Edward Gibson. “Large-Scale Evidence of Dependency Length Minimization in 37 Languages.” Proceedings of the National Academy of Sciences 112, no. 33 (2015): 10336–41. https://doi.org/10.1073/pnas.1502134112.

Futrell, Richard, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. “Neural Language Models as Psycholinguistic Subjects: Representations of Syntactic State.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), edited by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics, 2019.

Gajewski, Jon. “Licensing Strong NPIs.” Natural Language Semantics 19 (2011): 109–48. https://doi.org/10.1007/s11050-010-9067-1.

Gauthier, Jon, and Roger P. Levy. “Linking Artificial and Human Neural Representations of Language.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan. Association for Computer Linguistics, 2019.

Geiger, Atticus, Hanson Lu, Thomas Icard, and Christopher Potts. “Causal Abstractions of Neural Networks.” In Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021.

Gibson, Edward. “Linguistic Complexity: Locality of Syntactic Dependencies.” Cognition 68, no. 1 (1998): 1–76. https://doi.org/10.1016/S0010-0277(98)00034-1.

Gibson, Edward, Richard Futrell, Steven Piantadosi, et al. “How Efficiency Shapes Human Language.” Trends in Cognitive Sciences 23, no. 5 (2019): 389–407. https://doi.org/10.1016/j.tics.2019.02.003.

Gildea, Daniel, and David Temperley. “Optimizing Grammars for Minimum Dependency Length.” In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, edited by Annie Zaenen and Antal van den Bosch. Association for Computational Linguistics, 2007.

Gildea, Daniel, and David Temperley. “Do Grammars Minimize Dependency Length?” Cognitive Science 34, no. 2 (2010): 286-310. https://doi.org/10.1111/j.1551-6709.2009.01073.x.

Goldstein, Ariel, Zaid Zada, Eliav Buchnik, et al. “Shared Computational Principles for Language Processing in Humans and Deep Language Models.” Nature Neuroscience 25, no. 3 (2022): 369–80. https://doi.org/10.1038/s41593-022-01026-4.

Goodman, Nelson. Fact, Fiction, and Forecast. Harvard University Press, 1955.

Grodzinsky, Yosef, Isabelle Deschamps, Peter Pieperhoff, et al. “Logical Negation Mapped Onto the Brain.” Brain Structure and Function 225, no. 1 (2020): 19–31. https://doi.org/10.1007/s00429-019-01975-w.

Hahn, Michael. “Theoretical Limitations of Self-Attention in Neural Sequence Models.” Transactions of the Association for Computational Linguistics 8 (2020): 156–171.

Hahn, Michael, Dan Jurafsky, and Richard Futrell. “Universals of Word Order Reflect Optimization of Grammars for Efficient Communication.” Proceedings of the National Academy of Sciences 117, no. 5 (2020): 2347–53. https://doi.org/10.1073/pnas.1910923117.

Hahn, Michael, Richard Futrell, Roger P. Levy, and Edward Gibson. “A Resource-Rational Model of Human Processing of Recursive Linguistic Structure.” Proceedings of the National Academy of Sciences 119, no. 43 (2022): e2122602119. https://doi.org/10.1073/pnas.2122602119.

Hale, John. “A Probabilistic Earley Parser as a Psycholinguistic Model.” In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics, 2001.

Hart, Betty, and Todd R. Risley. Meaningful Differences in the Everyday Experience of Young American Children. Brookes Publishing Company, 1995.

Hawkins, John A. A Performance Theory of Order and Constituency. Cambridge University Press, 1994.

Hewitt, John, and Christopher D. Manning. “A Structural Probe for Finding Syntax in Word Representations.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), edited by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics, 2019.

Hewitt, John, and Percy Liang. “Designing and Interpreting Probes with Control Tasks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.

Hinton, Geoffrey E. “Learning Multiple Layers of Representation.” Trends in Cognitive Sciences 11, no. 10 (2007): 428–34. https://doi.org/10.1016/j.tics.2007.09.004.

Holmberg, Anders. “Verb Second.” In Syntax – Theory and Analysis: An International Handbook, Volume 1, edited by Tibor Kiss and Artemis Alexiadou. De Gruyter Mouton, 2015. https://doi.org/10.1515/9783110377408.342.

Homer, Vincent. “Disruption of NPI Licensing: The Case of Presuppositions.” In Proceedings of SALT 18, edited by Tova Friedmann and Satoshi Ito. Linguistic Society of America, 2009.

Hu, Jennifer, and Roger Levy. “Prompting Is Not a Substitute for Probability Measurements in Large Language Models.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, edited by Houda Bouamor, Juan Pino, and Kalika Bali. Association for Computational Linguistics, 2023.

Huang, Kuan-Jung, Suhas Arehalli, Mari Kugemoto, et al. “Large-Scale Benchmark Yields No Evidence That Language Model Surprisal Explains Syntactic Disambiguation Difficulty.” Journal of Memory and Language 137 (2024): 104510. https://doi.org/10.1016/j.jml.2024.104510.

Katzir, Roni. “Why Large Language Models Are Poor Theories of Human Linguistic Cognition: A Reply to Piantadosi.” Biolinguistics 17 (2023): e13153. https://doi.org/10.5964/bioling.13153.

Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. “Large Language Models Are Zero-Shot Reasoners.” Advances in Neural Information Processing Systems 35 (2022): 22199–213.

Kutas, Marta, and Steven A. Hillyard. “Reading Senseless Sentences: Brain Potentials Reflect Semantic Incongruity.” Science 207, no. 4427 (1980): 203–5. https://doi.org/10.1126/science.7350657.

Ladusaw, William A. “Polarity Sensitivity as Inherent Scope Relations.” PhD thesis, University of Texas, 1979.

Lahiri, Utpal. “Focus and Negative Polarity in Hindi.” Natural Language Semantics 6 (1998): 57–123. https://doi.org/10.1023/A:1008211808250.

Lan, Nur, Emmanuel Chemla, and Roni Katzir. “Bridging the Empirical-Theoretical Gap in Neural Network Formal Language Learning Using Minimum Description Length.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Association for Computational Linguistics, 2024.

Levy, Roger. “Expectation-Based Syntactic Comprehension.” Cognition 106, no. 3 (2008): 1126–77. https://doi.org/10.1016/j.cognition.2007.05.006.

MacWhinney, Brian. “The CHILDES Project.” Computational Linguistics 26, no. 4 (2000): 657.

Mahowald, Kyle, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. “Dissociating Language and Thought in Large Language Models.” Trends in Cognitive Sciences 28, no. 6 (2024): 517–40. https://doi.org/10.1016/j.tics.2024.01.011.

Makuuchi, Michiru, Yosef Grodzinsky, Katrin Amunts, Andrea Santi, and Angela D. Friederici. “Processing Noncanonical Sentences in Broca’s Region: Reflections of Movement Distance and Type.” Cerebral Cortex 23, no. 3 (2013): 694–702.

Manning, Christopher D., Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. “Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision.” Proceedings of the National Academy of Sciences 117, no. 48 (2020): 30046–54. https://doi.org/10.1073/pnas.1907367117.

Marslen-Wilson, William D. “Sentence Perception as an Interactive Parallel Process.” Science 189, no. 4198 (1975): 226-28. https://doi.org/10.1126/science.189.4198.226.

Marvin, Rebecca, and Tal Linzen. “Targeted Syntactic Evaluation of Language Models.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, edited by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii. Association for Computational Linguistics, 2018.

Miller, George A., and Noam Chomsky. “Finitary Models of Language Users.” In Handbook of Mathematical Psychology, edited by D. Luce. Wiley, 1963.

Montague, Richard. “Universal Grammar.” Theoria 36 no. 3 (1970): 373–98. https://doi.org/10.1111/j.1755-2567.1970.tb00434.x.

Nivre, Joakim, Marie-Catherine De Marneffe, Filip Ginter, et al. “Universal Dependencies v1: A Multilingual Treebank Collection.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, et al. European Language Resources Association, 2016.

Osterhout, Lee, and Phillip J. Holcomb. “Event-Related Brain Potentials Elicited by Syntactic Anomaly.” Journal of Memory and Language 31, no. 6 (1992): 785–806. https://doi.org/10.1016/0749-596X(92)90039-Z.

Park, Y. Albert, and Roger Levy. “Minimal-Length Linearizations for Mildly Context-Sensitive Dependency Trees.” In Proceedings of the 10th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2009). Association for Computational Linguistics, 2009.

Pearl, Lisa, and Jon Sprouse. “Syntactic Islands and Learning Biases: Combining Experimental Syntax and Computational Modeling to Investigate the Language Acquisition Problem.” Language Acquisition 20, no. 1 (2013): 23–68. https://doi.org/10.1080/10489223.2012.738742.

Pollard, Carl, and Ivan A. Sag. Head-Driven Phrase Structure Grammar. University of Chicago Press, 1994.

Pullum, Geoffrey K., and Barbara C. Scholz. “Empirical Assessment of Stimulus Poverty Arguments.” The Linguistic Review 18, no. 1–2 (2002): 9–50. https://doi.org/10.1515/tlir.19.1-2.9.

Qian, Peng, Tahira Naseem, Roger Levy, and Ramón Fernandez Astudillo. “Structural Guidance for Transformer Language Models.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Association for Computational Linguistics, 2021.

Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018. https://openai.com/research/language-unsupervised.

Rumelhart, David E., and James L. McClelland. “On Learning the Past Tenses of English Verbs.” In Psycholinguistics: Critical Concepts in Psychology, vol. 4, edited by Alvin I. Goldman. MIT Press, 1986.

Sartran, Laurent, Samuel Barrett, Adhiguna Kuncoro, Miloš Stanojević, Phil Blunsom, and Chris Dyer. “Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale.” Transactions of the Association for Computational Linguistics 10 (2022): 1423–39.

Schrimpf, Martin, Idan Asher Blank, Greta Tuckute, et al. “The Neural Architecture of Language: Integrative Modeling Converges on Predictive Processing.” Proceedings of the National Academy of Sciences 118, no. 45 (2021): e2105646118. https://doi.org/10.1073/pnas.2105646118.

Shain, Cory, Clara Meister, Tiago Pimentel, Ryan Cotterell, and Roger P. Levy. “Large-Scale Evidence for Logarithmic Effects of Word Predictability on Reading Time.” Proceedings of the National Academy of Sciences 121, no. 10 (2024): e2307876121. https://doi.org/10.1073/pnas.2307876121.

Silva, Alexander B., Kaylo T. Littlejohn, Jessie R. Liu, David A. Moses, and Edward F. Chang. “The Speech Neuroprosthesis.” Nature Reviews Neuroscience 25, no. 7 (2024): 473–92. https://doi.org/10.1038/s41583-024-00819-9.

Smith, Nathaniel J., and Roger Levy. “The Effect of Word Predictability on Reading Time Is Logarithmic.” Cognition 128, no. 3 (2013): 302–19. https://doi.org/10.1016/j.cognition.2013.02.013.

Sportiche, Dominique, Hilda Koopman, and Edward Stabler. An Introduction to Syntactic Analysis and Theory. Wiley-Blackwell, 2014.

Staub, Adrian. “The Parser Doesn’t Ignore Intransitivity, After All.” Journal of Experimental Psychology: Learning, Memory, and Cognition 33, no. 3 (2007): 550–69. https://doi.org/10.1037/0278-7393.33.3.550.

Stokes, Donald E. Pasteur's Quadrant: Basic Science and Technological Innovation. Brookings Institution Press, 1997.

Strubell, Emma, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. “Linguistically-Informed Self-Attention for Semantic Role Labeling.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, edited by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii. Association for Computational Linguistics, 2018.

Sutton, Richard. “The Bitter Lesson.” Incomplete Ideas, March 13, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.

Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. “Integration of Visual and Linguistic Information in Spoken Language Comprehension.” Science 268, no. 5217 (1995): 1632-34. https://doi.org/10.1126/science.7777863.

Tang, Jerry, Amanda LeBel, Shailee Jain, and Alexander G. Huth. “Semantic Reconstruction of Continuous Language from Non-invasive Brain Recordings.” Nature Neuroscience 26, no. 5 (2023): 858-66. https://doi.org/10.1038/s41593-023-01304-9.

Taylor, Wilson L. “‘Cloze Procedure’: A New Tool for Measuring Readability.” Journalism Quarterly 30, no. 4 (1953): 415–33. https://doi.org/10.1177/107769905303000401.

Tucker, Mycal, Peng Qian, and Roger Levy. “What if This Modified That? Syntactic Interventions with Counterfactual Embeddings.” In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Association for Computational Linguistics, 2021.

Tuckute, Greta, Nancy Kanwisher, and Evelina Fedorenko. “Language in Brains, Minds, and Machines.” Annual Review of Neuroscience 47, no. 1 (2024): 277–301. https://doi.org/10.1146/annurev-neuro-120623-101142.

van Schijndel, Marten, and Tal Linzen. “A Neural Model of Adaptation in Reading.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, edited by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii. Association for Computational Linguistics, 2018.

van Schijndel, Marten, and Tal Linzen. “Neural Network Surprisal Predicts the Existence but Not the Magnitude of Human Syntactic Disambiguation Difficulty.” PsyArXiv (2020). https://doi.org/10.31234/osf.io/7j8d6.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017): 5998–6008.

Warstadt, Alex, Amanpreet Singh, and Samuel R. Bowman. “Neural Network Acceptability Judgments.” Transactions of the Association for Computational Linguistics 7 (2019): 625–41. https://doi.org/10.1162/tacl_a_00290.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35 (2022): 24824–37.

Wilcox, Ethan, Pranali Vani, and Roger P. Levy. “A Targeted Assessment of Incremental Processing in Neural Language Models and Humans.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Association for Computational Linguistics, 2021.

Wilcox, Ethan G., Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P. Levy. “Testing the Predictions of Surprisal Theory in 11 Languages.” Transactions of the Association for Computational Linguistics 11 (2023): 1451–70. https://doi.org/10.1162/tacl_a_00612.

Wong, Lionel, Gabriel Grand, Alexander K. Lew, et al. “From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought.” arXiv (2023). https://doi.org/10.48550/arXiv.2306.12672.

Yedetore, Aditya, Tal Linzen, Robert Frank, and R. Thomas McCoy. “How Poor Is the Stimulus? Evaluating Hierarchical Generalization in Neural Networks Trained on Child-Directed Speech.” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. Association for Computational Linguistics, 2023.

Yngve, Victor. “A Model and an Hypothesis for Language Structure.” Proceedings of the American Philosophical Society 104 (1960): 444–66.

Yong, Zheng-Xin, Cristina Menghini, and Stephen H. Bach. “Low-Resource Languages Jailbreak GPT-4.” arXiv (2023). https://doi.org/10.48550/arXiv.2310.02446.