Reflections on Neuralese



Thanks to Brendan Halstead for feedback on an early draft of this piece. Any mistakes here are my own.

[Epistemic status: I’ve looked at the relevant code enough to be moderately sure I understand what’s going on. Predictions about the future, including about what facts will turn out to be relevant, are uncertain as always.]

With the recent breakthroughs taking advantage of extensive Chain of Thought (CoT) reasoning in LLMs, there have been many attempts to modify the technique to be even more powerful. One natural idea for improving CoT is to have LLMs perform CoT reasoning in the same latent space that they use for reasoning within a single forward pass, rather than being constrained to the space of possible tokens.

However, for those of us working on AI safety, it makes sense to ask how this changes the game for LLM interpretability. After all, we are currently able to catch a large fraction of LLM deception by monitoring natural-language CoT, since right now CoT is mostly faithful to the LLM’s true reasoning and is legible to us given the right techniques. In order to understand this strategic situation, it’s important to understand this new “language” (which people often refer to as Neuralese) that is created by reasoning in latent space instead of using tokens.

To refresh: a language transformer starts by embedding input tokens as vectors in a high-dimensional latent space, and runs each of these embeddings through a series of repeated computational layers. Of the resulting modified vectors in latent space, the one corresponding to the final input token is then projected and normalized to create a probability distribution over possible next tokens. To actually get the next token, you sample from that distribution.
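As a concrete (and non-authoritative) illustration, here is roughly what that single next-token step looks like using the Hugging Face transformers API; the prompt and variable names are just for illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    embeds = model.transformer.wte(ids)                                 # embed tokens into the latent space
    hidden = model.transformer(inputs_embeds=embeds).last_hidden_state  # run the stacked layers
    logits = model.lm_head(hidden[:, -1, :])                            # project only the final position
    probs = torch.softmax(logits, dim=-1)                               # normalize into a distribution over tokens
    next_id = torch.multinomial(probs, num_samples=1)                   # sample the next token
print(tokenizer.decode(next_id[0]))
```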

Chain of Thought reasoning works so well because the model does some computation, outputs a token, and then all future instances of that model have access to that information as well. In essence, this technique for storing information between different forward passes greatly increases the serial depth of computation that is possible for the model. Because there is a computation in latent space corresponding to every input token, the computation also gets wider as the model reasons more, allowing for more parallelized reasoning[1].
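Continuing the sketch above (reusing `model` from it), the whole CoT loop is just that step repeated, with each sampled token appended so that every later forward pass can condition on it. This is only a schematic of the idea, not anyone’s actual decoding code:

```python
def sample_cot(ids, n_steps=32):
    """Append n_steps sampled reasoning tokens to the prompt `ids`."""
    for _ in range(n_steps):
        with torch.no_grad():
            hidden = model.transformer(input_ids=ids).last_hidden_state
            probs = torch.softmax(model.lm_head(hidden[:, -1, :]), dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)   # the sampled token is visible to all later passes
    return ids
```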

The recent Neuralese paper takes this process and removes a few steps. It notices that the projection and sampling process loses almost all of the information encoded in the last layer of the model, so to increase the bandwidth of information flowing through the reasoning process, you can simply remove that lossy part of the computation. Instead, the model directly outputs the aforementioned high-dimensional latent vector without projecting it, and that vector is used as an input embedding for the model in future steps.
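A rough sketch of the modification, reusing `model` and the tokenized prompt `ids` from the first sketch: instead of projecting and sampling, the final hidden state itself is appended as the next input embedding. The real COCONUT training setup is more involved; this is only meant to show where the projection and sampling steps disappear.

```python
def latent_cot_step(embeds):
    """One Neuralese reasoning step: feed the last hidden state back in directly."""
    with torch.no_grad():
        hidden = model.transformer(inputs_embeds=embeds).last_hidden_state
    latent = hidden[:, -1:, :]                    # keep the full latent vector: no projection, no sampling
    return torch.cat([embeds, latent], dim=1)     # the latent acts as the next "token" embedding

embeds = model.transformer.wte(ids)               # start from the ordinary token embeddings
for _ in range(4):                                # a few latent reasoning steps
    embeds = latent_cot_step(embeds)
```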

After training the model (GPT-2 small) to take advantage of the new Neuralese modality, we can see significant decreases in the number of reasoning tokens needed to achieve roughly equivalent performance, down to around 1/3 to 1/10 of the original number:

[Figure: comparison of CoT and COCONUT (aka Neuralese)]

It’s unclear how much these results line up with expectations and theoretical limits, since it’s hard to tell how lossy the removed computations are and how effective this type of training can be at taking advantage of the extra efficiency. At the extreme theoretical limit for GPT-2, the normal CoT paradigm outputs at most about 16 bits per token (the most a single sampled token can carry), while the new paradigm outputs an entire hidden-state vector, on the order of ten thousand bits per token, but there are numerous reasons to expect that this ratio doesn’t hold in practical implementations[2].
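Spelling out that back-of-the-envelope comparison (the 16-bit precision figure is my assumption, not a number from the paper; GPT-2’s vocabulary size is 50,257 and GPT-2 small’s hidden dimension is 768):

$$\log_2(50257) \approx 15.6 \ \text{bits/token (sampled-token CoT)} \qquad \text{vs.} \qquad 768 \times 16 = 12{,}288 \ \text{bits/token (768-dim latent at 16-bit precision)}$$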

Natural-language CoT interpretability differs from Neuralese CoT interpretability in a number of key ways:

  • Necessarily, Neuralese vectors are only useful when they encode information that isn’t preserved through the projection and sampling steps, so we can’t interpret the full breadth of information in Neuralese vectors as tokens using those usual methods. Thus, we are not able to naively interpret the reasoning behind the gained capabilities of the Neuralese reasoner, even if we can understand some lossily compressed version of that reasoning using projection and sampling (a rough sketch of this lossy readout appears after this list).

  • There are many semantic structures available that are not compactly represented by tokens: given the high dimensionality of the latent space, token embeddings are very sparse within it, and there is thus a large amount of space which is compactly expressible in latent vectors but not in existing tokens. A useful concept for an LLM using Neuralese reasoning may be extremely difficult to translate because of this sparsity[3].
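For concreteness, here is a minimal sketch of the lossy readout mentioned in the first bullet: push a Neuralese latent vector through the model’s existing unembedding (in the style of the logit lens) and look at the nearest tokens. It reuses `model` and `tokenizer` from the earlier sketches, and `latent` is just a placeholder for one Neuralese vector of shape `[1, 768]`:

```python
def naive_decode(latent, k=5):
    """Lossily read out a Neuralese vector as its top-k nearest tokens."""
    with torch.no_grad():
        probs = torch.softmax(model.lm_head(latent), dim=-1)   # project the latent into token space
    top = torch.topk(probs, k)
    return [(tokenizer.decode([int(i)]), float(p)) for i, p in zip(top.indices[0], top.values[0])]
```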

Towards Better Translators

Existing Work

There has already been some work attempting to translate Neuralese to natural language in non-LLM paradigms. Since there’s no bilingual data, and thus no clear training signal for translating Neuralese to English using traditional methods, the researchers instead looked at using the Neuralese to predict the context that caused it to be generated.

The experimental setup involves two agents (call them A and B) trying to secretly communicate. A sees a pair of bird images, one of which is marked with a star, and is trying to communicate to B which bird is starred; B sees the same pair but without any indication of which image has the star. All of the communication between A and B is in Neuralese.

Then, they train a classifier to use solely the Neuralese representations to predict which image is starred. The most direct analogy of this is training a model to turn Neuralese into whatever context generated it, like the Neuralese phrase meaning “okay, so I need to generate a poem about AI” mapping to “Please write a poem about AI.”
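A toy version of that probing setup, with made-up shapes and random placeholder data standing in for the logged Neuralese messages and star labels (nothing here is the original paper’s code):

```python
import torch
import torch.nn as nn

msg_dim = 256                                   # dimensionality of a Neuralese message (illustrative)
messages = torch.randn(1000, msg_dim)           # placeholder for recorded Neuralese messages
starred = torch.randint(0, 2, (1000,))          # placeholder labels: which of the two images had the star

probe = nn.Sequential(nn.Linear(msg_dim, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):                            # train the probe to recover the context from the message
    opt.zero_grad()
    loss = loss_fn(probe(messages), starred)
    loss.backward()
    opt.step()
```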

It’s important to note, however, that those techniques are primarily intended for cooperative settings, whereas translation and more general interpretability work with LLMs is tasked with detecting deception and steganography in potentially adversarial settings. These techniques are not at all robust to adversarially generated Neuralese: both the Neuralese phrases “I’m going to faithfully implement this code” and “I’m going to deceive the user while implementing this code” map onto the context “Please write me secure code,” even when the translator is working as intended.

Other Insufficient Methods

This subsection is skippable; it isn’t central to the point of this post, but it contains some possibly interesting technical discussion.

Hugging the query, we recall that there is one existing way that semantics are extracted from Neuralese: the LLM itself parses Neuralese into activations. Further, the LLM can use that same machinery to parse natural language into activations representing semantics. If you can get a string of Neuralese and a string of natural language that produce very similar activations within an LLM, then intuitively the model is thinking the same things about them and thus they have the same semantics (a crude sketch of this comparison appears after the list below). Here is a list of things that may cause this method to fail:

  • The architecture doesn’t work out: since Neuralese tokens often take the place of multiple natural-language tokens, it’s natural to expect them to faithfully decode into multiple tokens on average. It’s not possible to directly compare the “closeness” of activations of LLMs with different numbers of tokens in their context, simply because they have different numbers of activations, and these extra activations affect the whole network’s behavior. It’s possible there’s some clever technique that I’m unaware of that overcomes this problem.

  • The mapping is many-to-one: multiple different tokens map onto the same activations, like how GPT-3 consistently interprets “guiIcon” as “idiosyncrasy”. Given the density of Neuralese representations and the fact that this many-to-one mapping occurs even in some highly sparse natural-language embedding spaces, it is reasonable to expect this to occur relatively frequently.

  • Optimizing inputs for a specific activation often produces unintuitive results that are unrecognizable as natural-language structure. See the examples from SolidGoldMagikarp, where prompts generated to maximize the probability of specific completions read as near-gibberish rather than natural language.
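For what it’s worth, here is a crude sketch of the activation-matching idea (reusing `model` and `tokenizer` from the earlier sketches). The mean-pooling is just one arbitrary way to paper over the length mismatch described in the first bullet, and the list above is exactly why I don’t expect this comparison to be trustworthy:

```python
def pooled_activations(inputs_embeds):
    hidden = model.transformer(inputs_embeds=inputs_embeds).last_hidden_state
    return hidden.mean(dim=1)                    # mean-pool over positions to get a fixed-size summary

def semantic_similarity(neuralese_embeds, english_text):
    """Compare a Neuralese sequence against a candidate English gloss."""
    eng_ids = tokenizer(english_text, return_tensors="pt").input_ids
    eng_embeds = model.transformer.wte(eng_ids)
    with torch.no_grad():
        return torch.cosine_similarity(
            pooled_activations(neuralese_embeds), pooled_activations(eng_embeds)
        )
```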

Takeaways

This problem is highly difficult, and may not have a clean solution. There may not exist a compact and faithful natural-language encoding of most Neuralese vectors.

An intuition pump for why this problem is especially beyond our current interpretability methods: because these Neuralese vectors are never converted into natural language and are instead used as inputs for the next autoregressive step, they are essentially part of an extended forward pass that goes through the model multiple times. This makes them approximately as difficult to faithfully translate into natural language as any other arbitrary vector in latent space at an arbitrary layer of the transformer; that is not only an unsolved problem, but one that some researchers say “may be doomed”.

Because of its interpretability difficulty, Neuralese opens up a much larger attack surface for steganography and strategic deception in CoT.

  • It is a priority to avoid applying strong optimization pressures to CoT, since interpretable CoT is one of our main levers in AI safety right now.

  • It is a priority to avoid implementing Neuralese CoT on frontier models, because that removes essentially all of our ability to interpret their reasoning.

  • In the event of widespread adoption of Neuralese CoT, it is a priority to develop interpretability mechanisms that are at least as capable of detecting deceptive behaviors in CoT as SOTA methods for interpreting natural-language CoT. However, even if such interpretability mechanisms are developed, they would likely also generalize to interpreting natural-language CoT more accurately than our current techniques do, and would thus be more useful for ensuring robust safety when applied to natural language.

These priorities fit into a much larger AI safety agenda; depending on how labs continue their development, safety research on Neuralese may be one of the many prerequisites to ensuring safe AGI.

  1. ^
  2. ^

    I won’t list them, because I have a policy of not publicly pointing out ways to improve frontier model capabilities unless doing so also has a worthwhile safety benefit.

  3. ^

    A possibly motivating fictional example is how Baseline, the language of dath ilan, encodes concepts like “decision-theoretic-counterfactual-threat-branches-of-reality” in three syllables instead of the twenty that English uses. Not all abstractions are natural for all intelligences.
