Formal Theory of Creativity, Fun, and Intrinsic Motivation


Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010)

Jürgen Schmidhuber

Abstract—The simple but general formal theory of fun & intrinsic motivation & creativity (1990-) is based on the concept of maximizing intrinsic reward for the active creation or discovery of novel, surprising patterns allowing for improved prediction or data compression. It generalizes the traditional field of active learning, and is related to old but less formal ideas in aesthetics theory and developmental psychology. It has been argued that the theory explains many essential aspects of intelligence including autonomous development, science, art, music, humor. This overview first describes theoretically optimal (but not necessarily practical) ways of implementing the basic computational principles on exploratory, intrinsically motivated agents or robots, encouraging them to provoke event sequences exhibiting previously unknown but learnable algorithmic regularities. Emphasis is put on the importance of limited computational resources for online prediction and compression. Discrete and continuous time formulations are given. Previous practical but non-optimal implementations (1991, 1995, 1997-2002) are reviewed, as well as several recent variants by others (2005-). A simplified typology addresses current confusion concerning the precise nature of intrinsic motivation.

Index Terms—Formal theory of creativity, fun, surprise, novelty, novel patterns, attention, active learning, aesthetics theory, developmental psychology, limited computational resources, typology of intrinsic motivation, science, art, music, humor.

I. INTRODUCTION

To solve existential problems such as avoiding hunger or heat, a baby has to learn how the initially unknown environment responds to its actions. Therefore, even when there is no immediate need to satisfy thirst or other built-in primitive drives, the baby does not run idle.
Instead it actively conducts experiments: what sensory feedback do I get if I move my eyes or my fingers or my tongue just like that? Being able to predict effects of actions will later make it easier to plan control sequences leading to desirable states, such as those where heat and hunger sensors are switched off. The growing infant quickly gets bored by things it already understands well, but also by those it does not understand at all, always searching for new effects exhibiting some yet unexplained but easily learnable regularity. It acquires more and more complex behaviors building on previously acquired, simpler behaviors. Eventually it might become a physicist discovering previously unknown physical laws, or an artist creating new eye-opening artworks, or a comedian coming up with novel jokes.

For a long time I have been arguing, using various wordings, that all this behavior is driven by a very simple algorithmic mechanism that uses reinforcement learning (RL) to maximize the fun or internal joy for the discovery or creation of novel patterns. (J. Schmidhuber is with the Swiss AI Lab IDSIA, Galleria 2, 6928 Manno-Lugano, the University of Lugano & SUPSI, Switzerland. See ~juergen/interest.html.)

Both concepts are essential: pattern, and novelty. A data sequence exhibits a pattern or regularity if it is compressible [45], that is, if there is a relatively short program that encodes it, for example, by predicting some of its components from others (irregular noise is unpredictable and boring). Relative to some subjective observer, a pattern is temporarily novel or interesting or surprising if the observer initially did not know the regularity but is able to learn it. The observer's learning progress can be precisely measured and translated into intrinsic reward for a separate RL controller selecting the actions causing the data. Hence the controller is continually motivated to create more surprising data.
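The pattern-vs-noise distinction can be made concrete with a minimal sketch (not from the paper; the frequency-count predictor and all names are illustrative assumptions): a trivial predictor makes measurable learning progress on regular data but not on coin-flip noise, so only the former would generate intrinsic reward.

```python
import random

def prediction_error(counts, symbol):
    """Error = 1 minus the predicted probability of the observed symbol,
    under a simple frequency-count predictor."""
    total = sum(counts.values())
    return 1.0 - (counts.get(symbol, 0) / total if total else 0.5)

def learning_progress(sequence):
    """First prediction error minus last one: how much the frequency
    predictor improved while observing the sequence."""
    counts, errors = {}, []
    for s in sequence:
        errors.append(prediction_error(counts, s))
        counts[s] = counts.get(s, 0) + 1
    return errors[0] - errors[-1]

random.seed(0)
pattern = [1] * 100                                  # trivially regular
noise = [random.randint(0, 1) for _ in range(100)]   # incompressible

# only the learnable pattern yields substantial progress (intrinsic reward)
print(learning_progress(pattern), learning_progress(noise))
```

On the regular sequence the error drops from the ignorant prior to nearly zero; on the noise it hovers around the same level throughout, so the measured progress stays near zero.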
Since 1990, agents have been built that implement this idea. They may be viewed as simple artificial scientists or artists with an intrinsic desire to build a better model of the world and of what can be done in it. To improve their models, they acquire skills to create / discover more data predictable or compressible in hitherto unknown ways [67], [71], [69], [70], [111], [77], [79], [85], [92], [93], [97], [96], [94], [99]. They are intrinsically motivated to invent and conduct experiments, actively exploring their environment, always trying to learn new behaviors exhibiting previously unknown algorithmic regularities. They embody approximations of a simple, but general, formal theory of creativity and curiosity and interestingness and fun, explaining essential aspects of human or non-human intelligence, including selective attention, science, art, music, humor [85], [92], [97], [96], [94]. Crucial ingredients are:

1. An adaptive world model, essentially a predictor or compressor of the continually growing history of actions / events / sensory inputs, reflecting what's currently known about how the world works,
2. A learning algorithm that continually improves the model (detecting novel, initially surprising spatio-temporal patterns that subsequently become known patterns),
3. Intrinsic rewards measuring the model's improvements (first derivative of learning progress) due to the learning algorithm (thus measuring the degree of subjective surprise or fun),
4. A separate reward optimizer or reinforcement learner, which translates those rewards into action sequences or behaviors expected to optimize future reward.

A simple example may help to see that it is really possible to learn from intrinsic reward signals à la Item 3 that one can learn even more in places never visited before.
In an environment with red and blue boxes, whenever the learning agent opens a red box, it will find an easily learnable novel geometric pattern (that is, its predictor will make progress and thus generate intrinsic reward), while all blue boxes contain a generator of unpredictable, incompressible white noise. That is, all the RL controller has to learn is a simple policy: open the next unopened red box.

Ignoring issues of computation time, it is possible to devise mathematically optimal, universal RL methods for such systems [85], [92], [97], [96] (2006-2009). More about this in Section II. However, the practical implementations so far [69], [70], [111], [77], [79] were non-universal and made approximative assumptions. Among the many ways of combining algorithms for 1-4 the following variants were implemented:

A. 1990: Non-traditional RL (without restrictive Markovian assumptions [72]) based on an adaptive recurrent neural network as a predictive world model [68] is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors [67], [71].
B. 1991: Traditional RL [35], [114] is used to maximize intrinsic reward created in proportion to expected improvements (first derivatives) of prediction error [69], [70].
C. 1995: Traditional RL maximizes intrinsic reward created in proportion to relative entropies between the learning agent's priors and posteriors [111].
D. 1997-2002: Non-traditional RL [103] (without restrictive Markovian assumptions) learns probabilistic, hierarchical programs and skills through zero-sum intrinsic reward games of two players (called the right brain and the left brain), each trying to out-predict or surprise the other, taking into account the computational costs of learning, and learning when to learn and what to learn [77], [79].

B-D (1991-2002) also showed experimentally how intrinsic rewards can substantially accelerate goal-directed learning and external reward intake.

Outline.
Section II will summarize the formal theory of creativity in a nutshell, laying out a mathematically rigorous but not necessarily practical framework. Section III will then discuss previous concrete implementations of the non-optimal but currently still more practical variants A-D mentioned above, and their limitations. Section IV will discuss relations to work by others, explain how the theory extends the traditional field of active learning, and how it formalizes and extends previous informal ideas of developmental psychology and aesthetics theory. Section V will offer a natural typology of computational intrinsic motivation, and Section VI will briefly explain how the theory is indeed sufficiently general to explain all kinds of creative behavior, from the discovery of new physical laws through active design of experiments, to the invention of jokes and music and works of art.

II. FORMAL DETAILS OF THE THEORY OF CREATIVITY

The theory formulates essential principles behind numerous intrinsically motivated creative behaviors of biological or artificial agents embedded in a possibly unknown environment. The corresponding algorithmic framework uses general RL (Section II-G; [32], [98]) to maximize not only external reward for achieving goals such as the satisfaction of hunger and thirst, but also intrinsic reward for learning a better world model, by creating / discovering / learning novel patterns in the growing history of actions and sensory inputs, where the theory formally specifies what exactly is a pattern, what exactly is novel or surprising, and what exactly it means to incrementally learn novel skills leading to more novel patterns.

A. The Agent and its Improving Model

Let us consider a learning agent whose single life consists of discrete cycles or time steps t = 1, 2, . . . , T. Its complete lifetime T may or may not be known in advance.
In what follows, the value of any time-varying variable Q at time t (1 ≤ t ≤ T) will be denoted by Q(t), the ordered sequence of values Q(1), . . . , Q(t) by Q(≤t), and the (possibly empty) sequence Q(1), . . . , Q(t − 1) by Q(<t). At any given t the agent receives a real-valued input x(t) from the environment and executes a real-valued action y(t) which may affect future inputs. At times t < T its goal is to maximize future success or utility

u(t) = Eµ [ Σ_{τ=t+1}^{T} r(τ) | h(≤t) ],   (1)

where the reward r(t) is a special real-valued input (vector) at time t, h(t) the ordered triple [x(t), y(t), r(t)] (hence h(≤t) is the known history up to t), and Eµ(· | ·) denotes the conditional expectation operator with respect to some possibly unknown distribution µ from a set M of possible distributions. Here M reflects whatever is known about the possibly probabilistic reactions of the environment. As a very general example, M may contain all computable distributions [110], [45], [32]. This essentially includes all environments one could write scientific papers about.

There is just one life, no need for predefined repeatable trials, no restriction to Markovian interfaces between sensors and environment [72]. (Note that traditional Markovian RL [114] assumes that the world can be modeled as a Markov Decision Process (MDP), and that the perceptual system reveals the current state. In realistic scenarios, however, robots have to learn to memorize previous relevant inputs in the form of appropriate internal representations, which motivates the work on RL in partially observable MDPs or POMDPs, e.g., [72], [35].) The utility function implicitly takes into account the expected remaining lifespan Eµ(T | h(≤t)) and thus the possibility to extend the lifespan through appropriate actions [83], [98]. Note that mathematical analysis is not simplified by discounting future rewards like in traditional RL theory [114]—one should avoid such distortions of real rewards whenever possible.
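For concreteness, the undiscounted expectation in eq. (1) can be approximated by Monte Carlo rollouts. The toy distribution µ below (every remaining step pays reward 1 with a fixed probability, independent of the history) and all numbers are invented for illustration:

```python
import random

def estimate_utility(steps_left, reward_prob, n_rollouts=20000, seed=1):
    """Monte Carlo estimate of eq. (1): u(t) = E[sum of future r(tau) | h(<=t)]
    for a toy mu where each remaining step pays reward 1 with a fixed
    probability, independent of the history (so conditioning is trivial)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rollouts):
        total += sum(1 for _ in range(steps_left) if rng.random() < reward_prob)
    return total / n_rollouts

# analytic value for this toy mu: 20 remaining steps * 0.3 = 6.0 (no discounting)
print(estimate_utility(20, 0.3))
```

Note that, in line with the paper's remark, no discount factor appears: the estimate targets the plain sum of future rewards.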
To maximize u(t), the agent may profit from an adaptive, predictive model p of the consequences of its possible interactions with the environment. At any time t (1 ≤ t < T), the model p(t) will depend on the observed history so far, h(≤t). It may be viewed as the current explanation or description of h(≤t), and may help to predict future events, including rewards. Let C(p, h) denote some given model p's quality or performance evaluated on a given history h. Natural quality measures will be discussed in Section II-B.

To encourage the agent to actively create data leading to easily learnable improvements of p [71], [70], [111], [79], [85], [92], [97], [96], [94], [99], the reward signal r(t) is simply split into two scalar real-valued components: r(t) = g(rext(t), rint(t)), where g maps pairs of real values to real values, e.g., g(a, b) = a + b. Here rext(t) denotes traditional external reward provided by the environment, such as negative reward in response to bumping against a wall, or positive reward in response to reaching some teacher-given goal state. The formal theory of creativity, however, is especially interested in rint(t), the intrinsic reward, which is provided whenever the model's quality improves—for purely creative agents rext(t) = 0 for all valid t:

The current intrinsic reward or creativity reward or curiosity reward or aesthetic reward or fun rint(t) of the action selector is the current surprise or novelty measured by the improvements of the world model p at time t.

Formally, the intrinsic reward in response to the model's progress (due to some application-dependent model improvement algorithm) between times t and t + 1 is

rint(t + 1) = f[C(p(t), h(≤t + 1)), C(p(t + 1), h(≤t + 1))],   (2)

where f maps pairs of real values to real values. Various alternative progress measures are possible; most obvious is f(a, b) = a − b. This corresponds to a discrete time version of maximizing the first derivative of the model's quality.
Note that both the old and the new model have to be tested on the same data, namely, the history so far. So progress between times t and t + 1 is defined based on two models of h(≤t + 1), where the old one is trained only on h(≤t) and the new one also gets to see h(t + 1). This is like p(t) predicting data of time t + 1, then observing it, then learning something, then becoming a measurably better model p(t + 1).

The above description of the agent's motivation conceptually separates the goal (finding or creating data that can be modeled better or faster than before) from the means of achieving the goal. Let the controller's RL mechanism figure out how to translate such rewards into action sequences that allow the given world model improvement algorithm to find and exploit previously unknown types of regularities. It is the task of the RL algorithm to trade off long-term vs short-term intrinsic rewards of this kind, taking into account all costs of action sequences [71], [70], [111], [79], [85], [92], [97], [96], [94], [99]. The universal RL methods of Section II-G as well as RNN-based RL (Section III-A) and SSA-based RL (Section III-D) can in principle learn useful internal states containing memories of relevant previous events; less powerful RL methods (Sections III-B, III-C) cannot.

B. How to Measure Model Quality Under Time Constraints

In theory C(p, h(≤t)) should take the entire history of actions and perceptions into account [85], like the following performance measure Cxry:

Cxry(p, h(≤t)) = Σ_{τ=1}^{t} [ || pred(p, x(τ)) − x(τ) ||² + || pred(p, r(τ)) − r(τ) ||² + || pred(p, y(τ)) − y(τ) ||² ],   (3)

where pred(p, q) is p's prediction of event q from earlier parts of the history [85]. Cxry ignores the danger of overfitting through a p that just stores the entire history without compactly representing its regularities, if any.
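A minimal sketch of eq. (2) with f(a, b) = a − b, using a squared-error quality measure in the spirit of eq. (3). The scalar running-mean "model" and its improvement step are illustrative assumptions, not the paper's method; the key point is that old and new model are scored on the same history:

```python
def quality(model, history):
    """C(p, h): total squared prediction error of model p on history h
    (lower is better), in the spirit of eq. (3)."""
    return sum((model - x) ** 2 for x in history)

def improve(model, history, lr=0.5):
    """One step of a trivial model-improvement algorithm: move the
    scalar 'model' (a predicted constant) toward the history's mean."""
    mean = sum(history) / len(history)
    return model + lr * (mean - model)

def intrinsic_reward(p_old, p_new, history):
    """r_int = f(C(p_old, h), C(p_new, h)) with f(a, b) = a - b:
    reward is the measured improvement on the *same* data."""
    return quality(p_old, history) - quality(p_new, history)

history = [4.0, 4.2, 3.8, 4.1, 3.9]    # regular data near 4.0
p_old = 0.0                             # untrained model
p_new = improve(p_old, history)         # moves halfway toward the mean
print(intrinsic_reward(p_old, p_new, history) > 0)  # True: progress was made
```

Once the model has converged, further improvement steps change nothing and the intrinsic reward drops to zero, matching the paper's point that learnt regularities stop being interesting.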
The principle of Minimum Description Length (MDL) [37], [115], [116], [110], [62], [45], however, also takes into account the description size of p, viewing p as a compressor program of the data h(≤t). This program p should be able to deal with any prefix of the growing history, computing an output starting with h(≤t) for any time t (1 ≤ t < T). (A program that wants to halt after t steps can easily be fixed / augmented by the trivial method that simply stores any raw additional data coming in after the halt.)

Cl(p, h(≤t)) denotes p's compression performance on h(≤t): the number of bits needed to specify both the predictor and the deviations of the sensory history from its predictions, in the sense of loss-free compression. The smaller Cl, the more regularity and lawfulness in the observations so far. For example, suppose p uses a small predictor that correctly predicts many x(τ) for 1 ≤ τ ≤ t. This can be used to encode x(≤t) compactly: Given the predictor, only the wrongly predicted x(τ) plus information about the corresponding time steps τ are necessary to reconstruct input history x(≤t), e.g., [73]. Similarly, a predictor that learns a probability distribution on the possible next events, given previous events, can be used to efficiently encode observations with high (respectively low) predicted probability by few (respectively many) bits (Section III-C; [31], [101]), thus achieving a compressed history representation.

Alternatively, p could also make use of a 3D world model or simulation. The corresponding MDL-based quality measure C3D(p, h(≤t)) is the number of bits needed to specify all polygons and surface textures in the 3D simulation, plus the number of bits needed to encode deviations of h(≤t) from the predictions of the simulation. Improving the 3D model by adding or removing polygons may reduce the total number of bits required.
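The predictor-based encoding described above (store the predictor itself plus the mispredicted inputs and their time steps) might be sketched as follows. The cost model, the 16-bit predictor size, and the "repeat the previous bit" predictor are all made-up illustrations, assuming binary observations:

```python
import math

def code_length(history, predictor, predictor_bits):
    """Cl-style measure: bits for the predictor itself, plus bits to list
    each misprediction (its time-step index and its true value).
    Assumes binary observations, so a corrected value costs 1 bit."""
    t = len(history)
    step_bits = math.ceil(math.log2(t))   # bits to name one time step
    mistakes = sum(1 for i, x in enumerate(history)
                   if predictor(history[:i]) != x)
    return predictor_bits + mistakes * (step_bits + 1)

def repeat(prefix):
    """Tiny hypothetical predictor: 'the next bit equals the previous one'."""
    return prefix[-1] if prefix else 0

regular = [0] * 512 + [1] * 512   # two long runs: a single misprediction
print(code_length(regular, repeat, 16))   # prints 27: 16 + 1 * (10 + 1) bits
```

Against the 1024 raw bits, 27 bits reflect the data's regularity; on data this predictor cannot model (e.g., alternating bits), the same measure exceeds the raw size, so no compression is claimed.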
The ultimate limit for Cl(p, h(≤t)) would be K*(h(≤t)), a variant of the Kolmogorov complexity of h(≤t), namely, the length of the shortest program (for the given hardware) that computes an output starting with h(≤t) [110], [37], [45], [80]. Here there is no need to worry about the fact that K*(h(≤t)) in general cannot be computed exactly, only approximated from above (indeed, for most practical predictors the approximation will be crude). This just means that some patterns will be hard to detect by the limited predictor of choice, that is, the reward maximizer will get discouraged from spending too much effort on creating those patterns.

Cl(p, h(≤t)) does not take into account the time τ(p, h(≤t)) spent by p on computing h(≤t). A runtime-dependent performance measure inspired by concepts of optimal universal search [43], [81], [82], [85], [96], [99] is

Clτ(p, h(≤t)) = Cl(p, h(≤t)) + log τ(p, h(≤t)).   (4)

Here compression by one bit is worth as much as runtime reduction by a factor of 1/2. From an asymptotic optimality-oriented point of view this is one of the best ways of trading off storage and computation time [43], [81], [82].

In practical applications (Section III) the predictor / compressor of the continually growing data typically will have to calculate its output online, that is, it will be able to use only a constant number of computational instructions per second to predict / compress new data. The goal of the possibly much slower learning algorithm must then be to improve the compressor such that it keeps operating online within those time limits, while compressing / predicting better than before. The costs of computing Cxry(p, h(≤t)) and Cl(p, h(≤t)) and similar performance measures are linear in t, assuming p consumes equal amounts of computation time for each single prediction.
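Eq. (4) is easy to evaluate for hypothetical compressors; the bit counts and runtimes below are invented purely to show the trade-off (one bit of description length is worth a factor-2 change in runtime):

```python
import math

def c_l_tau(compressed_bits, runtime_steps):
    """Runtime-aware quality measure of eq. (4):
    Cl(p, h) + log2(runtime). Lower is better; saving one bit of
    description length is worth halving the runtime, and vice versa."""
    return compressed_bits + math.log2(runtime_steps)

fast_weak   = c_l_tau(1000, 2**10)   # 1000 bits in 1024 steps   -> 1010
slow_strong = c_l_tau(995, 2**20)    # 995 bits in ~1e6 steps    -> 1015
print(fast_weak < slow_strong)       # True: 5 saved bits don't pay for
                                     # a 1024-fold slowdown here
```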
Therefore online evaluations of learning progress on the full history so far generally cannot take place as frequently as the continually ongoing online predictions. At least some of the learning and its progress evaluations may take place during occasional “sleep” phases [85], [96]. But practical implementations so far have looked only at parts of the history for efficiency reasons: The systems described in Sections III, III-A, III-B, III-C, III-D [71], [70], [111], [79] used online settings (one prediction per time step, and constant computational effort per prediction), non-universal adaptive compressors or predictors, and approximative evaluations of learning progress, each consuming only constant time despite the continual growth of the history.

C. Feasibility of Loss-Free Compression, with Examples

Any set of raw data, such as the history of some observer’s previous actions & sensations & rewards including suspected noise, exhibits a pattern or regularity if there exists an algorithm that is significantly shorter than the raw data but is able to encode it without loss of information [37], [109], [110], [45]. Random noise is irregular and arbitrary and incompressible. But random-dot stereograms (e.g., a single foreground square against a more distant background) are compressible since parts of the data are just copied from others. Videos are regular as most single frames are very similar to the previous one. By encoding only the deviations, movie compression algorithms can save lots of storage space. Complex-looking fractal images are regular, as they usually look a lot like their details, being computable by very short programs that re-use the same code over and over again for different image parts. The entire universe is regular and apparently rather benign [74], [78], [88], [90]: every photon behaves the same way; gravity is the same on Jupiter and Mars; mountains usually don’t move overnight but tend to remain where they are; etc.
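The movie example can be sketched as naive delta encoding (an illustration with made-up frame data, not a real codec): only the deviations between consecutive frames are stored, yet the reconstruction is loss-free.

```python
def delta_encode(frames):
    """Keep the first frame verbatim; for each later frame store only the
    (position, new_value) pairs where it differs from its predecessor."""
    deltas = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([(i, v) for i, (u, v) in enumerate(zip(prev, cur))
                       if u != v])
    return deltas

def delta_decode(deltas):
    """Loss-free reconstruction of all frames from the deltas."""
    frames = [list(deltas[0])]
    for delta in deltas[1:]:
        frame = list(frames[-1])
        for i, v in delta:
            frame[i] = v
        frames.append(frame)
    return frames

def stored_numbers(deltas):
    """Crude size measure: how many numbers the encoding stores."""
    return len(deltas[0]) + sum(2 * len(d) for d in deltas[1:])

# 100 "frames" of 1000 pixels, each frame changing exactly one pixel
frames, frame = [], [0] * 1000
for t in range(100):
    frame = list(frame)
    frame[t] = 1
    frames.append(frame)

enc = delta_encode(frames)
assert delta_decode(enc) == frames             # loss-free, as required
print(stored_numbers(enc), 100 * 1000)         # 1198 vs 100000 raw numbers
```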
Many data analysis techniques are natural by-products of loss-free compression. For example, data set compression is possible if the data can be separated into clusters of numerous close neighbors and few outliers. Abstraction is another typical by-product. For example, if the predictor / compressor uses a neural net, the latter will create feature hierarchies, higher layer units typically corresponding to more abstract features, fine-grained where necessary. Any good compressor will identify shared regularities among different already existing internal data structures, to shrink the storage space needed for the whole.

Consciousness may be viewed as a by-product of this [97], [96], since there is one thing that is involved in all actions and sensory inputs of the agent, namely, the agent itself. To efficiently encode the entire data history, it will profit from creating some sort of internal symbol or code (e.g., a neural activity pattern) representing itself. Whenever this representation is actively used, say, by activating the corresponding neurons through new incoming sensory inputs or otherwise, the agent could be called self-aware or conscious [97], [96].

True, any loss-free compression method will require space that grows without bound over time. But this is not a fundamental practical obstacle. Soon storage for 100 years of high resolution video will be cheap. If you can store the data, do not throw it away! The data is holy as it is the only basis of all that can be known about the world [97], [96]. Attempts at predicting / compressing the raw data (by finding regularities / abstractions) should take place in a separate, typically smaller part of the storage. Even humans may store much of the incoming sensory data. A human lifetime rarely lasts much longer than 3 × 10^9 seconds. The human brain has roughly 10^10 neurons, each with 10^4 synapses on average.
Assuming that only half of the brain’s capacity is used for storing raw data, and that each synapse can store at most 6 bits, there is still enough capacity to encode the lifelong sensory input stream with a rate of at least 10^5 bits/s, comparable to the demands of a movie with reasonable resolution, but possibly at a much higher rate, assuming that human compressors are much smarter than those of cameras.

D. Optimal Predictors vs Optimal Compressors

For the theoretically inclined: There is a deep connection between optimal prediction and optimal compression. Consider Solomonoff’s theoretically optimal, universal way of predicting the future [109], [110], [45], [32]. Given an observation sequence q(≤t), the Bayes formula is used to predict the probability of the next possible q(t + 1). The only assumption is that there exists a computer program that can take any q(≤t) as an input and compute its a priori probability according to the µ prior. (This assumption is extremely general, essentially including all environments one can write scientific papers about, as mentioned above.) In general this program is unknown, hence a mixture prior is used instead to predict:

ξ(q(≤t)) = Σ_i wi µi(q(≤t)),   (5)

a weighted sum of all distributions µi ∈ M, i = 1, 2, . . ., where the sum of the constant positive weights satisfies Σ_i wi ≤ 1. This is indeed the best one can possibly do, in a very general sense [110], [32]. The drawback of the scheme is its incomputability, since M contains infinitely many distributions. One may increase the theoretical power of the scheme by augmenting M by certain non-enumerable but limit-computable distributions [80], or restrict it such that it becomes computable, e.g., by assuming the world is computed by some unknown but deterministic computer program sampled from the Speed Prior [81] which assigns low probability to environments that are hard to compute by any method.
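Eq. (5) can be illustrated with a two-element M of biased-coin hypotheses (the biases and weights below are made up): the mixture's Bayes-predictive probabilities converge toward those of whichever model explains the data.

```python
def mu(bias, seq):
    """mu_i(q): probability that a coin with P(1) = bias emits sequence q."""
    p = 1.0
    for s in seq:
        p *= bias if s == 1 else 1.0 - bias
    return p

def xi(seq, models, weights):
    """Mixture prior of eq. (5): xi(q) = sum_i w_i * mu_i(q)."""
    return sum(w * mu(b, seq) for b, w in zip(models, weights))

models, weights = [0.5, 0.9], [0.5, 0.5]   # a toy two-element M
seq = [1] * 10                              # data strongly favour the 0.9 coin

# Bayes-predictive probability that the next symbol is 1:
p_next = xi(seq + [1], models, weights) / xi(seq, models, weights)
print(p_next)   # close to 0.9: the mixture has learned which coin it is
```

With infinitely many distributions in M this sum is incomputable, which is exactly the drawback noted in the text; the two-model toy only shows the mechanics.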
Remarkably, under very general conditions both universal inductive inference [109], [110], [45] and the compression-oriented MDL approach [37], [115], [116], [62], [45] converge to the correct predictions in the limit [56]. It should be mentioned, however, that the former converges faster.

As far as discoveries of regularity and compressibility are concerned, it does not make an essential difference whether we force the system to predict the entire history of inputs and actions, or just parts thereof, or whether we allow it to focus on internal computable abstractions thereof, like the system discussed in Section III-D. Partial compressibility of selected data covered by the system’s limited focus of attention implies compressibility of the whole, even if most of it is random noise.

E. Discrete Asynchronous Framework for Maximizing Creativity Reward

Let p(t) denote the agent’s current compressor program at time t, s(t) its current controller, and DO:

Controller: At any time t (1 ≤ t < T) do:
1) Let s(t) use (parts of) history h(≤t) to select and execute y(t + 1).
2) Observe x(t + 1).
3) Check if there is non-zero creativity reward rint(t + 1) provided by the asynchronously running improvement algorithm of the compressor / predictor (see below). If not, set rint(t + 1) = 0.
4) Let the controller’s RL algorithm use h(≤t + 1) including rint(t + 1) (and possibly also the latest available compressed version of the observed data—see below) to obtain a new controller s(t + 1), in line with objective (1).

Note that some actions may actually trigger learning algorithms that compute changes of the compressor and the controller’s policy, such as in Section III-D [79]. That is, the computational cost of learning can be taken into account by the reward optimizer, and the decision when and what to learn can be learnt as well [79].

Compressor / Predictor: Set pnew equal to the initial data compressor / predictor.
Starting at time 1, repeat forever until interrupted by death at time T:
1) Set pold = pnew; get current time step t and set hold = h(≤t).
2) Evaluate pold on hold, to obtain performance measure C(pold, hold). This may take many time steps.
3) Let some (possibly application-dependent) compressor improvement algorithm (such as a learning algorithm for an adaptive neural network predictor, possibly triggered by a controller action) use hold to obtain a hopefully better compressor pnew (such as a neural net with the same size and the same constant computational effort per prediction but with improved predictive power and therefore improved compression performance [101]). Although this may take many time steps (and could be partially performed offline during “sleep” [85], [96]), pnew may not be optimal, due to limitations of the learning algorithm, e.g., local maxima. (To inform the controller about beginnings of compressor evaluation processes etc., augment its input by unique representations of such events.)
4) Evaluate pnew on hold, to obtain C(pnew, hold). This may take many time steps.
5) Get current time step τ and generate creativity reward

rint(τ) = f[C(pold, hold), C(pnew, hold)],   (6)

e.g., f(a, b) = a − b. (Here the τ replaces the t + 1 of eq. 2.)

This asynchronous scheme [85], [92], [96] may cause long temporal delays between controller actions and corresponding creativity rewards, and may impose a heavy burden on the controller’s RL algorithm whose task is to assign credit to past actions. Nevertheless, Section II-G will discuss RL algorithms for this purpose which are theoretically optimal in various senses [85], [92], [97], [96].

F. Continuous Time Formulation

In continuous time [99], let O(t) denote the state of subjective observer O at time t.
The subjective simplicity or compressibility or regularity or beauty B(D, O(t)) of a sequence of observations and/or actions D is the negative number of bits required to encode D, given O(t)’s current limited prior knowledge and limited compression / prediction method. The observer-dependent and time-dependent subjective Interestingness or Surprise or Aesthetic Value I(D, O(t)) is

I(D, O(t)) ∼ ∂B(D, O(t)) / ∂t,   (7)

the first derivative of subjective simplicity: as O improves its compression algorithm, formerly apparently random data parts become subjectively more regular and beautiful, requiring fewer and fewer bits for their encoding. Note that there are at least two ways of having fun: execute a learning algorithm that improves the compression of the already known data (in online settings: without increasing computational needs of the compressor / predictor), or execute actions that generate more data, then learn to compress / understand the new data better.

G. Optimal Creativity, Given the Predictor’s Limitations

The previous sections discussed how to measure compressor / predictor improvements and how to translate them into intrinsic reward signals, but did not say much about the RL method used to maximize expected future reward. The chosen predictor / compressor class typically will have certain computational limitations. In the absence of any external rewards, one may define optimal pure curiosity behavior relative to these limitations: At discrete time step t this behavior would select the action that maximizes

u(t) = Eµ [ Σ_{τ=t+1}^{T} rint(τ) | h(≤t) ].   (8)

Since the true, world-governing probability distribution µ is unknown, the resulting task of the controller’s RL algorithm may be a formidable one.
As the system is revisiting previously incompressible parts of the environment, some of those will tend to become more subjectively compressible, and while the corresponding curiosity rewards may first go up, they will eventually decrease once the new regularity has been learnt. A good RL algorithm must somehow detect and then predict this decrease, and act accordingly. Traditional RL algorithms [35], however, do not provide any theoretical guarantee of optimality for such situations.

Is there a best possible, universal RL algorithm that comes as close as any other computable one to maximizing objective (8)? Indeed, there is. Its drawback, however, is that it is not computable in finite time. Nevertheless, it serves as a reference point for defining what is achievable at best, that is, what’s optimal creativity. Readers who are not interested in the corresponding theory may skip the remainder of this section and jump immediately to the practical implementations of Section III. For the others, the next paragraphs will outline how the universal approaches work.

Optimal inductive inference as defined in Section II-D can be extended by formally including the effects of executed actions, to define an optimal action selector maximizing future expected reward. At any time t, Hutter’s theoretically optimal (yet uncomputable) RL algorithm AIXI [32] uses such an extended version of Solomonoff’s scheme to select those action sequences that promise maximal future reward up to some horizon T (e.g., twice the lifetime so far), given the current data h(≤t). That is, in cycle t + 1, AIXI selects as its next action the first action of an action sequence maximizing ξ-predicted reward up to the given horizon, appropriately generalizing eq. (5).
AIXI uses observations optimally [32]: the Bayes-optimal policy pξ based on the mixture ξ is self-optimizing in the sense that its average utility value converges asymptotically for all µ ∈ M to the optimal value achieved by the Bayes-optimal policy pµ which knows µ in advance. The necessary and sufficient condition is that M admits self-optimizing policies. The policy pξ is also Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν ∈ M and a strictly higher value in at least one [32].

AIXI as above needs unlimited computation time. Its computable variant AIXI(t,l) [32] has asymptotically optimal runtime but may suffer from a huge constant slowdown. To take the consumed computation time into account in a general, optimal way, one may use the recent Gödel machines [83], [86], [84], [98] instead. They represent the first class of mathematically rigorous, fully self-referential, self-improving, general, optimally efficient problem solvers, and are applicable to the problem embodied by objective (8). The initial software S of such a Gödel machine contains an initial problem solver, e.g., some typically sub-optimal RL method [35]. It also contains an asymptotically optimal initial proof searcher, typically based on an online variant of Levin's Universal Search [43], which is used to run and test proof techniques. Proof techniques are programs written in a universal language implemented on the Gödel machine within S. They are in principle able to compute proofs concerning the system's own future performance, based on an axiomatic system A encoded in S. A describes the formal utility function, in the present case eq. (8), the hardware properties, axioms of arithmetic and probability theory and data manipulation etc., and S itself, which is possible without introducing circularity [98].
Inspired by Kurt Gödel's celebrated self-referential formulas (1931), the Gödel machine rewrites any part of its own code (including the proof searcher) through a self-generated executable program as soon as its Universal Search variant has found a proof that the rewrite is useful according to objective (8). According to the Global Optimality Theorem [83], [86], [84], [98], such a self-rewrite is globally optimal—no local maxima possible!—since the self-referential code first had to prove that it is not useful to continue the search for alternative self-rewrites. If there is no provably useful optimal way of rewriting S at all, then humans will not find one either. But if there is one, then S itself can find and exploit it. Unlike the previous non-self-referential methods based on hardwired proof searchers [32], Gödel machines not only boast an optimal order of complexity but can optimally reduce (through self-changes) any slowdowns hidden by the O()-notation, provided the utility of such speed-ups is provable [87], [91], [89].

Limitations of the "universal" approaches: The methods above are optimal in various ways, some of them not only computable but even optimally time-efficient in the asymptotic limit. Nevertheless, they leave open an essential remaining practical question: if the agent can execute only a fixed number of computational instructions per unit time interval (say, 10 trillion elementary operations per second), what is the best way of using them to get as close as possible to the theoretical limits of universal AIs? Especially when external rewards are very rare, as is the case in many realistic environments? As long as there is no good answer to this question, one has to resort to approximations and heuristics when it comes to practical applications.
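The dynamic that any such RL algorithm must anticipate, curiosity reward first rising and then vanishing once a regularity is learnt, can be simulated in a few lines. This is a toy illustration under my own assumptions (a running-average predictor with learning rate 0.1, reward defined as the decrease of squared prediction error), not any of the algorithms above:

```python
# Predict the bits at odd positions of a strictly alternating sequence;
# they are always 1, so the regularity is fully learnable.
seq = [0, 1] * 200
p, lr = 0.5, 0.1          # predictor: running estimate of P(odd-position bit = 1)
errors = []
for t, bit in enumerate(seq):
    if t % 2 == 1:
        errors.append((bit - p) ** 2)
        p += lr * (bit - p)

# Curiosity reward = improvement of the prediction error between visits.
rewards = [errors[i - 1] - errors[i] for i in range(1, len(errors))]
# Rewards are positive at first, then decay toward zero as the pattern is learnt.
```

A practical intrinsically motivated agent has to notice this decay and move on to other, still-learnable parts of its world.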
The next section reviews what has been achieved so far along these lines, discussing our implementations of IM-based agents from the 1990s; quite a few aspects of these concrete systems are still of relevance today.

III. PREVIOUS IMPLEMENTATIONS OF INTRINSICALLY MOTIVATED AGENTS: PROS AND CONS

The above mathematically rigorous framework for optimal curiosity and creativity (2006-) was established after first approximations thereof were implemented (1991, 1995, 1997-2002). Sections III-A, III-B, III-C, III-D will discuss advantages and limitations of online learning systems described in the original publications on artificial intrinsic motivation [71], [70], [111], [77] which already can be viewed as example implementations of a compression progress drive or prediction progress drive that encourages the discovery or creation of surprising patterns. Some elements of this earlier work are believed to remain essential for creating systems that are both theoretically sound and practical.

A. Intrinsic Reward for Prediction Error (1990)

Early work [67], [71] describes a predictor based on an adaptive world model implemented as a recurrent neural network (RNN) (in principle a rather powerful computational device, even by today's machine learning standards), predicting sensory inputs including reward signals from the entire previous history of actions and inputs. A second RNN (the controller) uses the world model and gradient descent to search for a control policy or program maximizing the sum of future expected rewards according to the model. Some of the rewards are intrinsic curiosity rewards, which are proportional to the predictor's errors. So the same mechanism that is used for normal goal-directed learning is used for implementing creativity and curiosity and boredom—there is no need for a separate system aiming at improving the world model.
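A minimal sketch of this error-proportional reward (my simplification: a scalar running-mean predictor stands in for the paper's RNN world model). It also exhibits the scheme's known weakness: on irreducibly noisy inputs the error, and hence the reward, never goes away.

```python
import random

def curiosity_reward(prediction: float, observation: float) -> float:
    """1990-style intrinsic reward: proportional to the predictor's error."""
    return (prediction - observation) ** 2

random.seed(0)
p, lr = 0.5, 0.05
rewards = []
for _ in range(2000):
    obs = random.random()              # an irreducibly noisy sensor reading
    rewards.append(curiosity_reward(p, obs))
    p += lr * (obs - p)                # the predictor adapts but cannot win

late_avg = sum(rewards[-500:]) / 500
# late_avg stays near the noise variance (1/12 for uniform noise): the reward
# never vanishes, so the agent would be drawn to unlearnable randomness.
```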
This first description of a general, curious, world-exploring RL agent implicitly and optimistically assumes that the predictor will indeed improve by motivating the controller to go to places where the prediction error is high. One drawback of the prediction error-based approach is that it encourages the controller to focus its search on those parts of the environment where there will always be high prediction errors due to noise or randomness, or due to computational limitations of the predictor. This may prevent learning progress instead of promoting it, and motivates the next subsection, whose basic ideas could be combined with the RL method of [67], [71], but this has not been done yet. Another potential drawback is the nature of the particular RNN-based RL method. Although the latter has the potential to learn internal memories of previous relevant sensory inputs, and thus is not limited to Markovian interfaces between agent and environment [72], like all gradient-based methods it may suffer from local minima, as well as from potential problems of online learning, since gradients for the recurrent RL controller are computed with the help of the dynamically changing, online learning recurrent predictive world model. Apart from this limitation, the RNNs of that time were less powerful than today's LSTM RNNs [28], [100], which have yielded state-of-the-art performance in challenging applications such as connected handwriting recognition [24], and should be used instead.

B. Intrinsic Reward for World Model Improvements (1991)

Follow-up work [69], [70] points out that one should not focus on the errors of the predictor, but on its improvements. The basic principle can be formulated as follows: learn a mapping from actions (or action sequences) to the expectation of future performance improvement of the world model. Encourage action sequences where this expectation is high. This is essentially the central principle of Section II-A.
Two implementations were described: the first models the reliability of the predictions of the adaptive predictor by a separate, so-called confidence network. At any given time, reinforcement for the model-building control system is created in proportion to the current change or first derivative of the reliability of the adaptive predictor. The "curiosity goal" of the control system (it might have additional "pre-wired" external goals) is to maximize the expectation of the cumulative sum of future positive or negative changes in prediction reliability. The second implementation replaces the confidence network by a network H which at every time step is trained to predict the current change of the model network's output due to the model's learning algorithm. That is, H will learn to approximate the expected first derivative of the model's prediction error, given the inputs. The absolute value of H's output is taken as the intrinsic reward, thus rewarding learning progress.

While the neural predictor of the implementations is computationally less powerful than the recurrent one of Section III-A [71], there is a novelty, namely, an explicit (neural) adaptive model of the predictor's improvements, measured in terms of mean squared error (MSE). This model essentially learns to predict the predictor's changes (the prediction derivatives). For example, although noise is unpredictable and leads to wildly varying target signals for the predictor, in the long run these signals do not change the adaptive predictor's parameters much, and the predictor of predictor changes is able to learn this. A variant of the standard RL algorithm Q-learning [114] is fed with curiosity reward signals proportional to the expected long-term predictor changes; thus the agent is intrinsically motivated to make novel patterns within the given limitations.
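The key property, that noise produces large instantaneous errors but little net change of the predictor's parameters, can be checked with a small sketch. This is my own toy setup (a running-mean predictor; reward = magnitude of the net parameter change over a window, a crude stand-in for the H network's learned estimate), not the paper's neural implementation:

```python
import random

def windowed_progress(stream, lr=0.1, window=50):
    """Reward = |net parameter change over a window|: approximates the
    *expected* first derivative of the predictor, so noise averages out."""
    p, trace = 0.5, [0.5]
    for x in stream:
        p += lr * (x - p)              # gradient step of a running-mean predictor
        trace.append(p)
    return [abs(trace[i] - trace[i - window]) for i in range(window, len(trace))]

random.seed(1)
noise = [random.random() for _ in range(2000)]   # unlearnable targets
pattern = [1.0] * 2000                           # a learnable regularity
r_noise = windowed_progress(noise)
r_pattern = windowed_progress(pattern)
# The pattern yields a large initial reward that vanishes once learnt;
# the noise never yields comparably large sustained progress.
```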
In fact, one may say that the system tries to maximize an approximation of the (discounted) sum of the expected first derivatives of the data's subjective predictability, thus also maximizing an approximation of the (discounted) sum of the expected changes of the data's subjective compressibility (the surprise or novelty). Both variants avoid the theoretically desirable but impractical regular evaluations of the predictor on the entire history so far, as discussed in Section II-B. Instead they monitor the recent effects of learning on the learning mechanism (a neural network in this case). Experiments illustrate the advantages of this type of directed, curious exploration over traditional random exploration. One RL method-specific drawback is given by the limitations of standard Markovian RL [72], which assumes the current input tells the agent everything it needs to know, and does not work well in realistic scenarios where it has to learn to memorize previous relevant inputs to select optimal actions. For general robot scenarios more powerful RL methods are necessary, such as those mentioned in Section III-A and other parts of the present paper. Any RL algorithm has to deal with the fact that intrinsic rewards vanish where the predictor becomes perfect. In the simple toy world [69], [70] this is not a problem, since the agent continually updates its Q-values based on recent experience. But since the learning rate is chosen heuristically (as usual in RL applications), this approach lacks the theoretical justification of the general framework of Section II. For probabilistic worlds there are prediction error measures that are more principled than MSE. This motivates research described next.

C. Intrinsic Reward Depending on the Relative Entropy between Agent's Prior and Posterior (1995)

Follow-up work (1995) describes an information theory-oriented variant of the approach in non-deterministic worlds [111].
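A sketch of such an information-gain reward, using a simple Laplace-smoothed count model as the predictor (my stand-in for the learning predictor of [111]): the reward for an observation is the relative entropy between the predictor's belief over next events after and before that observation.

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_gain(counts, event):
    """1995-style curiosity reward: relative entropy between the predictor's
    posterior and prior over next events, after observing `event`."""
    total = sum(counts)
    prior = [c / total for c in counts]
    post_counts = list(counts)
    post_counts[event] += 1
    posterior = [c / (total + 1) for c in post_counts]
    return kl_bits(posterior, prior)

# A surprising early observation moves the belief a lot; after a thousand
# confirmations the same observation barely changes it.
early = information_gain([1, 1], 0)      # Laplace prior over two events
late = information_gain([1000, 1], 0)
```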
Here the curiosity reward is proportional to the predictor's surprise / information gain [15], measured as the Kullback-Leibler distance [39] between the learning predictor's subjective probability distributions on possible next events before and after new observations - the relative entropy between its prior and posterior, essentially another measure of learning progress. Again experiments show the advantages of this type of curious exploration over conventional random exploration. Since this implementation also uses a traditional RL method [114] instead of a more general one, the discussion of RL method-specific drawbacks in previous subsections remains valid here as well. Note the connection to Section II: the concepts of Huffman coding [31] and relative entropy between prior and posterior immediately translate into a measure of learning progress reflecting the number of saved bits—a measure of improved data compression. Note also, however, a drawback of this naive probabilistic approach to data compression: it is unable to discover more general types of algorithmic compressibility [45] as discussed in Section II. For example, the decimal expansion of π looks random and incompressible but isn't: there is a very short algorithm computing all of π, yet any finite sequence of digits will occur in π's expansion as frequently as expected if π were truly random, that is, no simple statistical learner will outperform random guessing at predicting the next digit from a limited time window of previous digits. More general program search techniques are necessary to extract the underlying algorithmic regularity. This motivates the universal approach discussed in Section II, but also the research on a more general practical implementation described next.
D. Learning Programs & Skills Through Zero Sum Intrinsic Reward Games (1997-2002)

The universal variants of the principle of novel pattern creation of Section II focused on theoretically optimal ways of measuring learning progress & fun, as well as mathematically optimal ways of selecting action sequences or experiments within the framework of artificial creativity [85], [92], [97], [96]. These variants take the entire lifelong history of actions and observations into account, and make minimal assumptions about the nature of the environment, such as: the (unknown) probabilities of possible event histories are at least enumerable. The resulting systems exhibit "mathematically optimal curiosity and creativity" and provide a yardstick against which all less universal intrinsically motivated systems can be measured. However, most of them ignore important issues of time constraints in online settings. For example, in practical applications one cannot frequently measure predictor improvements by testing predictor performance on the entire history so far. The costs of learning and testing have to be taken into account. This insight drove the research discussed next.

To address the computational costs of learning, and the costs of measuring learning progress, computationally powerful controllers and predictors [77], [79] were implemented as two very general, co-evolving, symmetric, opposing modules called the right brain and the left brain, both able to construct self-modifying probabilistic programs written in a universal programming language (1997-2002). An internal storage for temporary computational results of the programs is viewed as part of the changing environment. Each module can suggest experiments in the form of probabilistic algorithms to be executed, and make predictions about their effects, betting intrinsic reward on their outcomes. The opposing module may accept such a bet in a zero-sum game by making a contrary prediction, or reject it.
In case of acceptance, the winner is determined by executing the algorithmic experiment and checking its outcome; the intrinsic reward eventually gets transferred from the surprised loser to the confirmed winner. Both modules try to maximize their intrinsic reward using a rather general RL algorithm (the so-called success-story algorithm SSA [103]) designed for complex stochastic policies—alternative RL algorithms could be plugged in as well. Thus both modules are motivated to discover truly novel algorithmic patterns, where the dynamically changing subjective baseline for novelty is given by what the opponent already knows about the (external or internal) world's repetitive patterns. Since the execution of any computational or physical action costs something (as it will reduce the cumulative reward per time ratio), both modules are motivated to focus on those parts of the dynamic world that currently make learning progress easy, to minimize the costs of identifying promising experiments and executing them. The system learns a partly hierarchical structure of more and more complex skills or programs necessary to solve the growing sequence of self-generated tasks, reusing previously acquired simpler skills where this is beneficial. Experimental studies [79] exhibit several sequential stages of emergent developmental sequences, with and without external reward. Many ingredients of this system may be just what one needs to build practical yet sound curious and creative systems that never stop expanding their knowledge about what can be done in a given world, although future re-implementations should probably use alternative reward optimizers that are more general and powerful than SSA [103], such as variants of the Optimal Ordered Problem Solver [82].

E. Improving Real Reward Intake (1991-)

The references above demonstrated in several experiments that the presence of intrinsic reward or curiosity reward can actually speed up the collection of external reward.
However, the previous papers also pointed out that it is always possible to design environments where the bias towards regularities introduced through artificial curiosity can lead to worse performance—curiosity can indeed kill the cat.

IV. RELATION TO WORK BY OTHERS

A. Beyond Traditional Information Theory

How does the notion of surprise in the theory of creativity differ from the notion of surprise in traditional information theory? Consider two extreme examples of uninteresting, unsurprising, boring data: a vision-based agent that always stays in the dark will experience an extremely compressible, soon totally pred...