Language Acquisition in the Absence of Explicit Negative Evidence: How Important is Starting Small?
Douglas L. T. Rohde and David C. Plaut
Carnegie Mellon University and the Center for the Neural Basis of Cognition
June 1999
To appear in Cognition
Abstract
It is commonly assumed that innate linguistic constraints are necessary to learn a natural language, based on the apparent lack of explicit negative evidence provided to children and on Gold's proof that, under assumptions of virtually arbitrary positive presentation, most interesting classes of languages are not learnable. However, Gold's results do not apply under the rather common assumption that language presentation may be modeled as a stochastic process. Indeed, Elman (1993, Cognition) demonstrated that a simple recurrent connectionist network could learn an artificial grammar with some of the complexities of English, including embedded clauses, based on performing a word prediction task within a stochastic environment. However, the network was successful only when either embedded sentences were initially withheld and only later introduced gradually, or when the network itself was given initially limited memory which only gradually improved. This finding has been taken as support for Newport's "less is more" proposal, that child language acquisition may be aided rather than hindered by limited cognitive resources. The current article reports on connectionist simulations which indicate, to the contrary, that starting with simplified inputs or limited memory is not necessary in training recurrent networks to learn pseudo-natural languages; in fact, such restrictions hinder acquisition as the languages are made more English-like by the introduction of semantic as well as syntactic constraints. We suggest that, under a statistical model of the language environment, Gold's theorem and the possible lack of explicit negative evidence do not implicate innate, linguistic-specific mechanisms. Furthermore, our simulations indicate that special teaching methods or maturational constraints may be unnecessary in learning the structure of natural language.

1 Introduction
Traditionally, the problem of language acquisition has been treated as a problem of learning to identify and produce the valid sentences in one's language. The idealized speaker is presumed to possess a set of rules, or competence grammar, capable of generating all well-formed sentences or determining whether any sentence is valid or invalid. The learning process is driven both by the learner's innate endowment of structured linguistic knowledge and by the learner's exposure to language. Fundamental questions thus concern the nature of these sources of information, how they are utilized, and the extent to which each is responsible for the eventual attainment of language skill.
The standard approach in linguistics has tended to view the input to the child learner simply as a sequence of valid sentences. Statistical properties of this input are generally overlooked or thought to bear little relevance to learning. Indeed, some consider this a feature of the approach, as attention to statistics potentially places a tremendous computational burden on the learner (see Allen & Seidenberg, 1999, for discussion). Additionally, Baker (1979), among others, has argued that children receive negligible explicit negative feedback following production errors.1
explicit evidence, such as a greater tendency for parents to rephrase ungrammatical compared with grammatical utterances. In contrast, we will use implicit negative evidence to refer to distributional properties of the input which do not depend on the language production of the learner. Implicit negative evidence is sometimes referred to as indirect, although we favor the former term.
accepted in the linguistics community and is associated with the theories of Universal Grammar and the innate Language Acquisition Device. Given the apparent lack of explicit negative evidence provided to children, strong innate linguistic constraints are regarded by many authors (e.g., Berwick, 1985; Marcus, 1993; Morgan & Travis, 1989; Morgan, Bonamo, & Travis, 1995) to be an inescapable solution to the learnability problem. On the surface, it seems perfectly reasonable to hypothesize that the set of natural languages is limited: it is unlikely that every regular or every context-free language is a possible natural language. However, even under this assumption, most interesting subsets of these language classes would still be unlearnable under Gold's model. It remains to be seen what degree of constraints, if any, would enable the learning of natural language in Gold's framework.
However, Gold made brief mention of a third possibility: that his assumption regarding the possible texts (or sequences of positive examples) for a language was too general and that "there is an a priori restriction on the class of texts which can occur" (p. 454). In Gold's model, a fair text is a series of positive examples from the language in which every legal sentence will eventually occur. Superfinite languages were found to be unlearnable only if texts are arbitrary or are produced by the powerful class of recursive functions. Such a function can prohibit learning by producing a series of examples designed specifically to confuse the learner indefinitely. However, this hardly seems an appropriate model for a child's linguistic environment: while there is ongoing debate on the extent to which child-directed speech is simplified relative to adult-directed speech (see, e.g., Gallaway & Richards, 1994; Snow & Ferguson, 1977), no one would propose that it is tailored specifically to hinder language acquisition.
An alternative is to constrain the possible texts by modeling language as a stochastic process: some sentences or grammatical constructions are more frequent than others, and language is generated by a relatively stationary distribution over these strings (see Seidenberg, 1997; Seiden-
2 The term "construction" here refers to grammatical distinctions, abstractions, or rules rather than to specific sentences. Thus, for example, Chomsky's (1957) famous sentence, "Colorless green ideas sleep furiously", is supported by the input as one of many simple active SVO sentences. Although connectionist networks might not instantiate such constructions as explicit, distinct data structures, these systems nonetheless have the capability of developing internal distributed representations that support effective generalization across sentences with similar grammatical structure (in the classic sense).
language leads to a rather different definition of what it means to learn a language. On the traditional view, learning a language involves converging on the single, correct grammar of the language; any deviation from this grammar in the actual behavior of language users must be ascribed to performance factors. Moreover, given that all learners of a language must acquire competence in equivalent grammars, it is critical to have formal guarantees that this will happen. From a stochastic perspective, by contrast, the grammars acquired by members of a language community need not be identical but only sufficiently similar to permit effective communication. The degree of agreement among individuals in, for example, making grammaticality judgments would thus be expected to be very high but not perfect. It is still possible to formulate explicit bounds on learnability, but these bounds are probabilistic rather than absolute. Moreover, on this view, the study of actual language performance plays a more central role than on traditional views because such performance is taken to reflect underlying language knowledge more directly.
This leads to a serious practical problem. The human brain is considerably restricted as a learning device due to its limited memory and analytical abilities. The principal mechanisms of language acquisition seem to operate online, with relatively little storage and subsequent analysis of the actual inputs. In contrast, the learning mechanisms proposed by Horning, Angluin, and others rely on repeated evaluation and re-evaluation of vast sets of complete, candidate grammars. They are thus unlikely to lead to reasonable computational models of our language acquisition mechanism.
Given restrictions of limited memory and online learning with iterative updates of a small set of candidate grammars, one way the statistical structure of a language can be approximated is through the formulation and testing of implicit predictions. By comparing one's predictions to what actually occurs, feedback is immediate and negative evidence derives from incorrect predictions. Although not
Table 1 note: Transition probabilities are specified and additional constraints are applied on top of this framework.

Table 2: Semantic Constraints on Verb Usage
Columns: Verb; Intransitive Subjects; Transitive Subjects; Objects if Transitive. Note: Columns indicate legal subject nouns when verbs are used intransitively or transitively, and legal object nouns when transitive.
The grammar used by Elman was nearly identical to the current one, except that it had one fewer mixed transitivity verb in singular and plural form, and the two proper nouns, Mary and John, could not be modified.
In the current work, several additional constraints were applied on top of the grammar in Table 1. Primary among these was that individual nouns could engage only in certain actions, and that transitive verbs could act only on certain objects. For example, anyone could walk, but only humans could walk something else, and the thing walked must be a dog. The full set of constraints is listed in Table 2.
Another restriction in the language was that proper nouns could not act on themselves. For example, Mary chases Mary would not be a legal sentence. Finally, constructions which repeat an intransitive verb, such as Boys who walk walk, were disallowed because of redundancy.
These and the above constraints will be referred to as semantic constraints. In the simulation, semantic constraints always applied within the main clause of the sentence as well as within any subclauses. Although number agreement affected all nouns and verbs, the degree to which the semantic constraints applied between a noun and its modifying phrase was controlled by specifying the probability that the relevant constraints would be enforced for a given phrase. In this way, effects of the correlation between a noun and its modifying phrase, or of the level of information the phrase contained about the identity of the noun, could be investigated.
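To make the probabilistic enforcement concrete, the following minimal sketch (in Python) shows how a verb for a relative clause might be chosen so that, with a specified probability, it respects the semantic constraints of the head noun it modifies. The constraint table, noun and verb sets, and function name are illustrative assumptions, not the authors' actual simulator code.

```python
import random

# Hypothetical illustration of probabilistic semantic-constraint enforcement
# between a head noun and its modifying phrase; the table and names are ours.
SUBJECTS_FOR_VERB = {
    "barks": {"dog"},
    "walks": {"boy", "girl", "Mary", "John", "dog", "cat"},  # anyone can walk
    "sings": {"boy", "girl", "Mary", "John"},
}

def relative_clause_verb(head_noun, verbs, enforce_prob, rng=random):
    """Pick a verb for a relative clause modifying head_noun.

    With probability enforce_prob, only verbs whose legal subjects include the
    head noun are considered, so the clause is informative about the noun's
    identity; otherwise any verb may be chosen.
    """
    if rng.random() < enforce_prob:
        candidates = [v for v in verbs
                      if head_noun in SUBJECTS_FOR_VERB.get(v, set())]
        if candidates:
            return rng.choice(candidates)
    return rng.choice(list(verbs))
```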
Two other parameters were used to control the behavior of the grammar. First, the framework depicted in Table 1 was modified to allow the direct specification of the percentage of simple and complex sentences produced. Second, the probability of noun phrase modification was adjusted to control the average length of sentences in the language.
When probabilities are specified for the productions in the grammar, it becomes a stochastic context-free grammar (SCFG). A grammar of this form is convenient not only for generating example sentences, but also because it allows us to calculate the optimal prediction behavior on the language. Given the stochastic nature of the language, the network cannot in general predict the actual next word in a sentence accurately. Rather, over the course of training, we expect the network to increasingly approximate the theoretically correct prediction given the sentence context up to the current point, in the form of a probability distribution over the 26 words in the vocabulary. One advantage of expressing the language as an SCFG is that this probability distribution can be computed exactly. However, the above-mentioned number agreement and semantic constraints are difficult to incorporate into the basic grammar shown in Table 1. Therefore, a program was developed (Rohde, 1999) which takes the grammar, along with the additional constraints, and produces a new, much larger SCFG with the constraints incorporated.
Table 3: Comparison of error measures: City-Block, Squared Error, Cosine, Divergence.

Table: Grammar classes A, B, C, D, E, R, and Elman's grammar by percentage of complex sentences (%Complex); used in Simulation 2.
tively infrequent. Sentences longer than 16 words were discarded in generating the corpora, but these were so rare that their loss should have had negligible effects. In order to perform well, the network cannot possibly "memorize" the training corpus but must learn the structure of the language.
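As an illustration of this style of corpus generation, the sketch below samples sentences from a toy stochastic context-free grammar and discards the rare sentences longer than 16 words. The grammar, its symbols, and the sampling routine are placeholder assumptions for illustration only; they are not the grammar of Table 1 or Rohde's (1999) program.

```python
import random

# A sketch of sampling a training corpus from a stochastic context-free
# grammar (SCFG) and discarding sentences longer than 16 words.  The toy
# grammar below is a placeholder, not the grammar of Table 1.
GRAMMAR = {
    "S":  [(("NP", "VI", "."), 0.5), (("NP", "VT", "NP", "."), 0.5)],
    "NP": [(("boy",), 0.45), (("dog",), 0.45), (("boy", "who", "VI"), 0.10)],
    "VI": [(("walks",), 1.0)],
    "VT": [(("chases",), 1.0)],
}

def expand(symbol, rng=random):
    """Recursively expand a nonterminal according to its production probabilities."""
    if symbol not in GRAMMAR:
        return [symbol]                                   # terminal symbol
    productions = GRAMMAR[symbol]
    rhs = rng.choices([p[0] for p in productions],
                      weights=[p[1] for p in productions])[0]
    return [word for part in rhs for word in expand(part, rng)]

def sample_corpus(n_sentences, max_len=16, rng=random):
    """Generate sentences, discarding the rare ones longer than max_len words."""
    corpus = []
    while len(corpus) < n_sentences:
        sentence = expand("S", rng)
        if len(sentence) <= max_len:
            corpus.append(sentence)
    return corpus
```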
2.1.4 Training procedure
In the condition Elman referred to as "starting small," he trained his network for 5 epochs on each of the four corpora, in increasing order of complexity. During training, weights were adjusted to minimize the summed squared error between the network's predicted next word and the actual next word, using the back-propagation learning procedure (Rumelhart et al., 1986) with a learning rate of 0.1, reduced gradually to 0.06. No momentum was used, and weights were updated after each word presentation. Weights were initialized to random values sampled uniformly between ±0.001.
For each of the five language classes, we trained the network shown in Figure 1 using both incremental and non-incremental training schemes. In the complex regimen, the network was trained on the most complex corpus (75% complex) for 25 epochs with a fixed learning rate. The learning rate was then reduced for a final pass through the corpus. In the simple regimen, the network was trained for five epochs on each of the first three corpora in increasing order of complexity. It was then trained on the fourth corpus for 10 epochs, followed by a final epoch at the reduced learning rate. The final six epochs of training on the fourth corpus, not included in Elman's design, were intended to allow performance with the simple regimen to approach asymptote.
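For reference, the two regimens can be restated as explicit training schedules. The sketch below lists them as (corpus, epochs, learning-rate) stages; the corpus and rate labels are ours, introduced only to summarize the text above.

```python
# The two training regimens as schedules of (corpus, epochs, learning_rate)
# stages; labels are ours, not the authors' code.
CORPORA = ["0%_complex", "25%_complex", "50%_complex", "75%_complex"]

# Complex regimen: 25 epochs on the most complex corpus, then one final
# epoch at a reduced learning rate.
complex_regimen = [
    (CORPORA[3], 25, "base_lr"),
    (CORPORA[3], 1, "reduced_lr"),
]

# Simple regimen: 5 epochs on each of the first three corpora in increasing
# order of complexity, 10 epochs on the fourth, then one reduced-rate epoch.
simple_regimen = (
    [(corpus, 5, "base_lr") for corpus in CORPORA[:3]]
    + [(CORPORA[3], 10, "base_lr"), (CORPORA[3], 1, "reduced_lr")]
)
```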
Because we were interested primarily in what performance level was possible under optimal conditions, we searched a wide range of training parameters to determine a set which consistently achieved the best performance overall.3 We trained our network with back-propagation using momentum of 0.9, a learning rate of 0.004 reduced to 0.0003 for the final epoch, a batch size of 100 words per weight update, and initial weights sampled uniformly between ±1.0 (cf. ±0.001 for Elman's network). Network performance for both training and testing was measured in terms of divergence (see Table 3). In addition to being an appropriate measure of the difference between two distributions from an information theoretic standpoint (see Rumelhart et al., 1995), divergence has the feature that, during training, error is injected only at the unit representing the actual next word. This is perhaps more plausible than functions which provide feedback to every word in the vocabulary.
Because divergence is well-defined only over probability distributions (which sum to 1.0), normalized Luce ratios (Luce, 1986), also known as softmax constraints, were applied to the output layer. In this form of normalization, the activation of output unit $i$ is calculated as $a_i = e^{x_i} / \sum_j e^{x_j}$, where $x_i$ is the unit's net input and $j$ ranges over all of the output units. The remaining units in the network used the standard logistic activation function, $1/(1 + e^{-x})$.
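The output normalization and error measure can be made concrete with a short sketch (assuming NumPy; the function and variable names are ours). With a one-hot target for the actual next word, the divergence reduces to the negative log of that word's output activation, so error is injected only at that unit; with the exact next-word distribution from the grammar as the target, the same function serves as the evaluation measure.

```python
import numpy as np

# A sketch (assuming NumPy) of the softmax output normalization and the
# divergence error measure; names are ours.

def softmax(net_input):
    """Normalized Luce ratio over the output units: a_i = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(net_input - net_input.max())   # subtract max for numerical stability
    return e / e.sum()

def divergence(target, output):
    """Kullback-Leibler divergence, sum_i t_i * log(t_i / a_i), between two
    probability distributions over the vocabulary."""
    mask = target > 0                          # terms with t_i = 0 contribute nothing
    return float(np.sum(target[mask] * np.log(target[mask] / output[mask])))

# During training, target is a one-hot vector for the actual next word, so the
# divergence reduces to -log(output[next_word]) and error is injected only at
# that unit.  For evaluation, target can be the exact next-word distribution
# computed from the stochastic grammar.
```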
4 The comparison for simple sentences and for very complex sentences is unreliable because there were very few novel simple sentences and no very complex sentences that appeared both during training and testing.
Table: Novel and familiar sentences by number of relative clauses. Columns: Relative Clauses, Total Sentences, Unique Sentences, Percent Novel, Familiar Sentences, Novel Sentences, Example Novel Sentence. Overall row: 10,000 total sentences.
5 To match the average lengths of sentences generated by grammar R as closely as possible to those produced by Elman's grammar, the selection probabilities for intransitive verbs across the levels of complexity (0%, 25%, 50%, and 75%) were increased from 50% for each (as in grammar classes A–E) to 54%, 65%, 75%, and 50%, respectively.

In the former case, we used the divergence error measure, momentum of 0.9, eleven epochs of training on the final corpus, a batch size of 10 words, a learning rate of 0.004 reduced to 0.0003 for the last epoch, and initial weights between ±1.0. In the latter case, we used logistic output units, squared error, no momentum, five epochs of training on the fourth corpus, online weight updating (after every word), a learning rate of 0.1 reduced to 0.06 in equal steps with each corpus change, and initial weights between ±0.001.
3.2 Results and discussion
Even when training on sentences from a grammar with no semantic constraints, our learning parameters resulted in an advantage for the complex regimen. Over the best 12 of 15 trials, the network achieved an average divergence of 0.025 under the complex condition compared with 0.036 for the simple condition (= 34.8, p < .001). Aside from the learning parameters, one important difference between our training method and Elman's was that we added 6 extra epochs of training on the final corpus to both conditions. This extended training did not, however, disproportionately benefit the complex condition in some way. Between epochs 20 and 25, the average divergence error under the simple regimen dropped from 0.085 to 0.061. During the same period, the error under the complex regimen fell only from 0.051 to 0.047.6
It is again important to establish that the network was actually learning to perform the task well. Otherwise the apparent advantage for starting large might be an artifact of settling into local minima due to poor training methods. The best measure of network performance would appear to be a direct comparison with the results published by Elman (1991). However, as discussed earlier, Elman evaluated his network using empirically derived probabilities, rather than predictions generated directly from the grammar.
7 Goldowsky & Newport (1993) provide an illustration of how randomly degraded input could aid learning in a morphology-like association task. However, the results appear to depend largely on their use of a learning mechanism that collects co-occurrence statistics rather than perhaps more appropriate correlations. It is not clear whether similar results could be obtained in a mechanism attempting to learn natural language syntax.

during processing by setting the activations at this layer to 0.5. For the first 12 epochs of training, this was done randomly after 3–4 words had been processed, without regard to sentence boundaries. For the next 5 epochs the memory window was increased to 4–5 words, then to 5–6, 6–7, and finally, in the last stage of training, the memory was not interfered with at all.
In the current simulation, the training corpus consisted of 75% complex sentences, although, as mentioned above, Elman's may have extended to 100% complexity. Like Elman, we extended the first period of training, which used a memory window of 3–4 words, from 5 epochs to 12 epochs. We then trained for 5 epochs each with windows of 4–5 and 5–7 words. The length of the final period of unrestricted memory depended on the training methods. When using our own methods (see Simulation 2), as when training on the final corpus in the simple regimen, this period consisted of 10 epochs followed by one more with the reduced learning rate. When training with our approximation of Elman's methods on grammar R, this final period was simply five epochs long. Therefore, under both conditions, the memory-limited network was allowed to train for a total of 7 epochs more than the corresponding full-memory network in Simulations 1 and 2. When using our methods, the learning rate was held fixed until the last epoch, as in Simulation 1. With Elman's method, we reduced the learning rate with each change in memory limit.
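The memory manipulation itself can be sketched as follows. The network interface used here (step, reset_context) is hypothetical; the sketch only illustrates wiping the context-layer activations to 0.5 after a randomly chosen window of words, irrespective of sentence boundaries, with the window widening across training stages.

```python
import random

# A sketch of the memory limitation described above: the context-layer
# activations are reset to 0.5 after a randomly chosen window of words,
# without regard to sentence boundaries.  The network interface used here
# (step, reset_context) is hypothetical.

def train_with_limited_memory(net, words, window=(3, 4), rng=random):
    """Train on a stream of words, wiping the context layer every
    window[0] to window[1] words (e.g., 3-4 words in the first stage)."""
    until_reset = rng.randint(*window)
    for current_word, next_word in zip(words, words[1:]):
        net.step(current_word, target=next_word)   # one prediction and weight update
        until_reset -= 1
        if until_reset == 0:
            net.reset_context(value=0.5)            # limited memory: erase prior context
            until_reset = rng.randint(*window)
```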
4.2 Results and discussion
Although he did not provide numerical results, Elman (1993) reported that the final performance was as good as in the prior simulation involving progressive inputs. Again, this was deemed a success relative to the complex, full-memory condition, which was reportedly unable to learn the task.
Using our training methods on language R, the limited-memory condition resulted in equivalent performance to that of the full-memory condition, in terms of divergence