In a previous article, "Interview with Li Jigang: Dialogue with the Large Model from a Philosophical Perspective", it was mentioned that Jigang summarized the core idea of writing prompts last year as "articulate clearly", and this year as "compressed expression"—both before his Lisp-style prompts went viral across the internet.
So why compression, and where did the inspiration for compressed expression come from? From the following article we can catch a glimpse of the source of the Lisp-style prompts.
The author of the original text is the German computer scientist Jürgen Schmidhuber, who has had a profound influence on the field of artificial intelligence. Prof. Schmidhuber is not only a pioneer in reinforcement learning, neural networks, and the history of artificial intelligence, but has also founded several successful companies, including NNAISENSE, which focuses on advanced artificial intelligence algorithms.
Because of Prof. Schmidhuber's rich experience and the relatively deep subject matter, this article may be somewhat difficult to read, so we suggest setting aside some time and reading it carefully.
In this paper, Schmidhuber makes the point that when people or intelligent systems learn new skills that let them predict or compress information more effectively, that information becomes more appealing—just as puzzles naturally become more attractive once you learn skills that let you solve them faster.
He further explains that new things are exciting not because they are unusual, but because they allow us to discover new, simple regularities we did not know before. This desire pushes us to explore the unknown, to learn new things, and to make our brains, or artificial agents, smarter.
This inner drive motivates very different people to excel in their own fields: babies exploring the world, mathematicians solving puzzles, musicians composing, dancers dancing, comedians telling jokes, and even you when you try something new.
Original source:《 Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes 》
Abstract
I believe that data become temporarily interesting when an observer with a desire for self-improvement but limited computational power learns to predict or simplify them in a better way. Such a process makes the data more concise and aesthetically pleasing to the observer.
Curiosity is the pursuit of creating or discovering non-random, regular data that are new and surprising not because they defy convention, but because their regularities are not yet known to us, which allows us to further compress and understand them.
This pursuit maximizes what is called "interestingness": the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. This intrinsic drive has motivated everyone from exploring babies to mathematicians, composers, artists, dancers, comedians, yourself, and, since 1990, artificial systems.
1. Storing, compressing and rewarding compression progress
If the entire history of the universe is computable [123, 124], and there is no evidence against this possibility [84], then its simplest explanation would be the shortest program that computes it [65, 70]. Unfortunately, we have no general method for finding the shortest program that computes any given data [34, 106, 107, 37]. Physicists have therefore traditionally proceeded incrementally, analyzing only a small part of the world at a time and trying to find simple laws that describe their limited observations better than the best previously known laws. In essence, this is a search for a program that compresses the observations better than the best previously known program.
For example, Newton's law of universal gravitation can be expressed as a small piece of code that allows many observation sequences involving falling apples and other falling objects to be greatly compressed. Although its predictive power is limited—for example, it cannot account for the quantum fluctuations of the apple's atoms—it still greatly reduces the number of bits needed to encode the data stream under the assumption that the law holds [1], by assigning short codes to events that are predictable with high probability [28].
Einstein's general relativity brought additional compression progress, because it succinctly explained many deviations that could not be predicted using Newton's theory.
Most physicists agree that there is still room for further compression progress in physics. But it is not only physicists who are interested in compressing their observations better.
Simply put, those who can find patterns in the past and predict the future will find ways to make their own understanding and judgment of things more concise and efficient.
A long time ago, Piaget explained children's exploratory learning behavior through his concepts of "assimilation" and "accommodation".
Assimilation—incorporating new knowledge into existing knowledge—is somewhat like information compression. Accommodation—adjusting existing knowledge to fit the new—is somewhat like improving the compression technique.
But Piaget's ideas were informal and did not provide enough detail for a computer to simulate them. So how do we model this kind of compression progress in an artificial system?
Imagine an agent interacting with a world it does not understand at first. We can use our reinforcement learning framework to motivate the agent to seek out data that allows for additional compression progress and improved predictive power.
Reinforcement learning is a machine learning methodology that trains algorithms with rewards and penalties to learn which actions to take in a given situation to maximize some kind of cumulative reward. Simply put, it is a method in which a computer program learns to make the best choices in a given situation through trial and error.
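To make the trial-and-error idea concrete, here is a minimal, hypothetical sketch of tabular Q-learning in Python—the classic textbook update rule, not anything specific to the framework discussed in this article. The environment's `step` function and all parameters are placeholders.

```python
import random

# Minimal tabular Q-learning sketch: the agent tries actions, observes rewards,
# and gradually learns which action maximizes cumulative reward in each state.
# `step(state, action)` is a hypothetical environment returning (next_state, reward).

def q_learning(step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0                               # assume every episode starts in state 0
        for _ in range(100):                    # cap episode length
            if random.random() < epsilon:       # explore ...
                action = random.randrange(n_actions)
            else:                               # ... or exploit current knowledge
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward = step(state, action)
            # Q-learning update: move Q toward reward + discounted best future value
            best_next = max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```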
This framework helps the agent understand the world better by motivating it to explore actively, even when there is little or no external reward. The agent is driven by intrinsic rewards (curiosity rewards) to explore and discover previously unknown, action-dependent regularities in the input data stream.
1.1 Overview
In Section 1.2 we briefly describe our algorithmic framework, which is based on three basic elements:
- A predictor or compressor that continually learns from and improves upon its data history.
- A computable metric is used to measure the progress of the compressor, which is used as the basis for intrinsic rewards.
- A reward optimizer or reinforcement learner that converts rewards into actions in order to obtain more rewards in the future.
Specific technical details and theoretical concepts will be elaborated in the appendix, including the discrete-time implementation.
In Section 1.3 we explore the relationship between the framework and external rewards, where "external" refers to rewards that originate outside the brain, which in turn controls the actions of its "external" body.
As Section 2 will show informally, many basic elements of intelligence and cognition—novelty, surprise, interestingness, unsupervised attention, subjective aesthetics, curiosity, creativity, art, science, music, and jokes—can be seen as natural consequences of our framework.
In particular, we will move beyond the traditional Boltzmann/Shannon notion of surprise and show that both science and art can be seen as a desire to compress data better by creating or discovering new ways of compressing it.
Section 3 provides an overview of approximate implementations of this framework that have been used in past work.
Section 4 applies the theory to images tailored to human observers, illustrating the rewarding learning process that leads from low to high subjective compressibility.
Section 5 outlines how to improve on our previous implementations, and how to further test the predictions of our theory in psychology and neuroscience.
1.2 Algorithmic Architecture
This simple set of algorithmic principles is essentially a distillation and summary of some of the core ideas from previous publications on the subject [57, 58, 61, 59, 60, 108, 68, 72, 76, 81, 88, 87, 89]. As we mentioned before, the specific technical details are in the Appendix. In Section 2, we discuss these principles, which at least qualitatively explain many aspects of intelligent subjects (e.g., human beings). This inspired us to implement and test these principles in cognitive robots and other artificial systems.
- Storage. While interacting with the world, keep complete, raw records of all actions and sensory observations, including reward signals. This data is valuable because it is the only basis for our knowledge of the world. Storing the complete raw data is not unrealistic: a human life rarely lasts much longer than about three billion seconds. The human brain has approximately 86 billion neurons, each with an average of 7,000 synapses. Assuming that half of the brain's capacity is used to store raw data, and that each synapse can store at most six bits, there is still enough room to record the lifelong stream of sensory input at a rate of about 100 Mbps, comparable to the demands of a high-definition movie—and the storage capacity of modern technical systems will soon exceed this level anyway. In short: if you can store the data, don't throw it away!
- Improve data compressibility. In principle, any regularity in the data can be used to compress it, and the compressed data can be viewed as a simplified explanation of the original data. So, to explain the world more efficiently, part of the computational resources can be devoted to adaptive compression algorithms that try to partially compress the data. For example, an adaptive neural network may learn to predict or infer some of the data from other parts of the history, gradually reducing the number of bits needed to encode the whole. See Appendices A.3 and A.5.
- Let intrinsic curiosity rewards reflect progress in data compression. The agent needs to monitor the improvements of the adaptive data compressor: whenever it learns to encode the historical data with fewer bits, generate an intrinsic or curiosity reward proportional to this learning or compression progress, i.e., the number of bits saved. See Appendices A.5 and A.6.
- Maximize intrinsic curiosity reward. A general reinforcement learning (RL) algorithm lets an action selector or controller, which can observe the current state of the adaptive compressor, maximize expected reward, in particular the intrinsic curiosity reward. To optimize this reward, a good RL algorithm will direct the agent's attention and learning toward those aspects of the world where new, previously unknown but learnable regularities can be discovered or created; in short, it will try to steepen the compressor's learning curve. This kind of active unsupervised learning helps the agent figure out how the world works. See Appendices A.7, A.8, A.9, A.10 for details. (A rough code sketch of the whole loop follows this list.)
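Here is a rough sketch of how the four steps above could fit together in code. It assumes a pluggable compressor with `encoded_size`/`improve` methods and a generic RL policy with `act`/`learn` methods; all of these names and interfaces are illustrative stand-ins, not part of the original framework.

```python
# Illustrative skeleton of the store / compress / reward / act loop.
# `compressor` and `policy` are hypothetical stand-ins for an adaptive
# compressor (e.g. a predictive network) and a reinforcement learner.

def curiosity_loop(env, compressor, policy, steps=1000):
    history = []                                    # step 1: store everything
    for t in range(steps):
        action = policy.act(history, compressor)    # step 4: RL chooses actions
        observation, external_reward = env.step(action)
        history.append((action, observation, external_reward))

        bits_before = compressor.encoded_size(history)
        compressor.improve(history)                 # step 2: improve compression
        bits_after = compressor.encoded_size(history)

        intrinsic_reward = bits_before - bits_after # step 3: bits saved = curiosity reward
        policy.learn(history, external_reward + intrinsic_reward)
    return history
```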
This framework essentially specifies the objectives of a curious or creative system, not the particular compressor, predictor, or reinforcement learning algorithm used to achieve them. We will discuss later which concrete choices are available for instantiating the framework, including those that have already been implemented.
1.3 Relationship to external incentives
Of course, the goal of many agents is not just to satisfy their curiosity but also to solve practical problems. Any problem that can be stated explicitly can be cast as a reinforcement learning problem: an agent tries to maximize its expected future reward over a limited lifetime in a possibly unknown environment.
Since the turn of the millennium, general reinforcement learning algorithms have appeared that can be regarded as universal problem solvers or universal artificial intelligences (see Appendices A.8 and A.9). These algorithms are optimal in theory but not necessarily the best choice in practice, for reasons discussed in [29, 79, 82, 83, 86, 85, 92].
Such universal methods would automatically discover and exploit concepts like learning progress, compression progress, and curiosity wherever they are useful. Why, then, go to the trouble of building an explicit, proactive curiosity framework?
One answer is that these general methods tend to ignore constant slowdown factors that have nothing to do with the nature of the problem, hiding them inside the asymptotic notation of theoretical computer science.
This raises a key question: if an agent can execute only a limited number of computational instructions per unit of time (say, ten trillion elementary operations per second), how can we make the best use of them to get as close as possible to the recent theoretical limits of universal AI, especially when external rewards are rare, as is the norm in many real-world environments?
The central idea of this paper is that, in reward-sparse environments, the curiosity drive is such a general and useful concept for resource-limited reinforcement learning that it should be built in from the start rather than learned from scratch, in order to save computation time. The saving may be only a constant factor, but that constant can be very large.
Inherent in this approach is the assumption that, in the real world, a better understanding of the past helps us predict the future more accurately and speeds up the search for solutions to external tasks; it ignores the possibility that curiosity can sometimes backfire, as in the saying "curiosity killed the cat".
2. Consequences of the compression progress drive
Let us discuss how many fundamental aspects of intelligence and cognition arise as natural by-products of the principles above.
2.1 Compact internal representations or symbols as by-products of efficient history compression
In order to compress the history of observations so far, a compressor (such as a predictive neural network) will automatically create internal representations or symbols (for example, activation patterns of certain neural feature detectors) for things that occur frequently.
Even with limited predictive power, we can still compress information very effectively by assigning short codes to events that are predictable with high probability.
For example, it is a common phenomenon that the sun rises in the east and sets in the west every day. Therefore, it is much more efficient to summarize this repetition in the data history with an internal symbol such as "daylight" and describe it through a small piece of internal code that can be used repeatedly, rather than simply storing the raw data.
Indeed, predictive neural networks are often observed to create internal (and hierarchical) codes of this kind as a by-product of minimizing the prediction error on the training data.
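As a toy illustration of how short codes for frequent, predictable events shrink the encoding, here is a small sketch comparing a fixed-length code with an idealized entropy-based code length for a repetitive observation stream. The symbols and their frequencies are made up for the example.

```python
import math
from collections import Counter

# A repetitive observation stream: "day" events dominate, so an adaptive code
# can assign them a very short codeword and save many bits overall.
stream = ["day"] * 900 + ["eclipse"] * 5 + ["storm"] * 95

counts = Counter(stream)
n = len(stream)

fixed_bits = n * math.ceil(math.log2(len(counts)))                   # fixed-length code
entropy_bits = sum(-c * math.log2(c / n) for c in counts.values())   # idealized adaptive code

print(f"fixed-length encoding : {fixed_bits} bits")
print(f"entropy-based encoding: {entropy_bits:.0f} bits (short codes for frequent events)")
```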
2.2 Consciousness as a unique by-product of the compression process
One thing that is always present in an agent's actions and sensory inputs is the agent itself. To encode the entire data history efficiently, the agent needs to create an internal symbol or code, such as a pattern of neural activity, to represent itself. Whenever this self-representation is activated, whether by new sensory inputs or otherwise, we may call the agent self-aware or conscious.
This straightforward explanation, without losing our intuitive grasp of the concept of consciousness, is more concise than other recent views. In the remainder of this paper we attach no mystical quality to consciousness; we regard it simply as a natural by-product of the data compression the agent performs as it keeps solving problems and modeling the world. Accordingly, consciousness will not play a prominent role in the discussion that follows.
2.3 The Lazy Brain's Subjective Perception of Beauty Changes Over Time
Let O(t) denote the state of a subjective observer O at a particular time t. According to the "lazy brain" theory, the subjective beauty B(D, O(t)) of a new observation D (never mind how interesting it is—that is a separate topic, see Section 2.4) depends on the amount of information needed to encode D: the fewer bits the observer's adaptive compressor, with its limited prior knowledge, currently needs to encode D, the more subjectively beautiful D is.
For example, to encode previously seen faces efficiently, a compressor such as a neural network may find it useful to build an internal representation of a "standard face". To encode a new face, it then only needs to record the deviations from this standard face. New faces that deviate little from the standard face therefore need fewer bits and are subjectively more beautiful. The same goes for faces that exhibit regularities such as symmetry or simple contrast: in principle, the compressor can exploit any regularity to reduce the number of bits needed to store the data.
Simply put, among several sub-patterns that a given observer regards as comparable, the subjectively most beautiful is the one with the simplest (shortest) description under the observer's current particular way of encoding and memorizing it. For example, mathematicians prefer simple proofs that have a short description in the formal language they use; others prefer pictures of simple, low-complexity objects, beautifully drawn.
This also explains, quite logically, why many people prefer faces similar to their own: what they see in the mirror every day shapes the "standard face" in their mind, and the preference ultimately rests on coding efficiency.
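The "standard face" argument can be caricatured numerically, under the assumption that a face is summarized by a small feature vector and that encoding cost grows with the deviation from a learned prototype. The feature values and the cost model below are invented purely for illustration.

```python
import numpy as np

# Caricature of the "lazy brain": encode a face as (prototype + difference).
# The closer a new face is to the observer's learned prototype, the fewer bits
# the difference needs, and the more "beautiful" it is under this measure.

prototype = np.array([0.50, 0.30, 0.20, 0.80])    # learned "standard face" features

def encoding_cost_bits(face, bits_per_unit=8.0):
    difference = face - prototype
    return bits_per_unit * float(np.abs(difference).sum())   # crude cost model

def subjective_beauty(face):
    return -encoding_cost_bits(face)              # fewer bits needed => more beautiful

familiar_face = np.array([0.52, 0.31, 0.19, 0.78])
unusual_face  = np.array([0.10, 0.90, 0.70, 0.20])
print(subjective_beauty(familiar_face), subjective_beauty(unusual_face))
```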
2.4 Subjective interestingness: the first derivative of subjective beauty, the steepness of the learning curve
What is beautiful is not always interesting. It is interesting only while it is novel, that is, while its simple regularities are not yet fully known to the observer, who is still learning to compress the data better. We can define the subjective interestingness I(D, O(t)) of data D for observer O at time t as

I(D, O(t)) = ∂B(D, O(t)) / ∂t,

the first derivative of subjective beauty. As the learner keeps improving its compression algorithm, data that once seemed chaotic becomes subjectively more ordered and beautiful, and the number of bits needed to encode it shrinks. As long as this process goes on, the data remains attractive and rewarding. A discrete-time version of this concept is elaborated in the Appendix and in the discussion of previous implementations in Section 3; see [59, 60, 108, 68, 72, 76, 81, 88, 87] for a deeper treatment.
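One way to approximate the discrete-time version of this idea is to treat the change in the compressed size of the history as the slope of the learning curve. The sketch below uses zlib as a crude stand-in for the observer's adaptive compressor—only a loose analogy to the learning compressors the paper has in mind, since zlib itself does not learn.

```python
import zlib

observations = [b"the sun rises in the east and sets in the west." for _ in range(100)]

# "Before learning": each observation is encoded on its own, so the regularity
# linking them (they are all the same event) goes unnoticed.
bits_before = sum(8 * len(zlib.compress(obs)) for obs in observations)

# "After learning": the whole history is encoded together, so the repetition
# is exploited and the description becomes much shorter.
bits_after = 8 * len(zlib.compress(b"".join(observations)))

interestingness = bits_before - bits_after   # bits saved ~ slope of the learning curve
print(bits_before, bits_after, interestingness)
```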
2.5 Primitive Beauty and Fun vs. External Rewards
Note that the concepts of beauty and interestingness discussed here are limited and "pure" in the sense that they are not tied to pleasure derived from external rewards (see Section 1.3). For example, someone might say that a hot bath on a cold day feels "beautiful" because an external temperature sensor's preset target value has been reached, triggering a reward (here, "external" means outside the brain that controls the body's actions). Others might call a song "beautiful" for emotional reasons, say because they associate it with the memory of a first kiss. But that is not our focus here—we are concerned with intrinsic rewards based on learning progress.
2.6 Genuine Novelty and Surprise: Contrasting with Traditional Information Theory
Imagine two extreme cases of uninteresting, bland, boring data. A vision-based agent that is always kept in the dark receives an extremely compressible, soon perfectly predictable, never-changing stream of visual input. If instead it faces a screen full of white noise, then—although by the traditional definitions of Boltzmann and Shannon [102] this screen carries maximal information, "novelty", and "surprise"—the data it experiences is essentially unpredictable and incompressible. In both cases the data is boring [72, 88], because it allows no further compression progress. We therefore reject the traditional notion of surprise: data that is either random or fully predictable is neither truly novel nor surprising—only data that still contains unknown algorithmic regularities is! [57, 58, 61, 59, 60, 108, 68, 72, 76, 81, 88, 87, 89]
2.7 Attention, Curiosity and Active Exploration
When external rewards are unavailable, or when we cannot find new ways of increasing the expected external reward, our brain essentially tries to maximize interestingness: the first derivative of subjective beauty or compressibility, the steepness of the learning curve. Within the limits of its compression techniques and improvement algorithms, the brain does its best to select those action sequences that promise the greatest expected future progress in information compression.
It learns to focus its attention and actively choose its experiments, concentrating on things that are difficult to compress so far, but which, through further learning, are expected to become predictable and compressible. It gradually loses interest in things that can already be subjectively understood and compressed. Similarly, the brain gets bored with things that appear to be difficult to compress at the moment and seem to stay that way based on existing experience, or things that cost far more to make compressible than other things.
2.8 Discovery
An unusually large jump in compression progress deserves to be called a discovery. As mentioned at the beginning of this article, the simple law of gravitation can be expressed in a small piece of code, yet it greatly compresses all our previous observations of falling apples and other falling objects.
2.9 Unsupervised Learning Beyond the Conventional
Traditional unsupervised learning is centered on uncovering regularities in data: by clustering the data, by re-encoding it with statistically disentangled factorial codes [4, 64], or by predicting some parts of the data from others. All of this is essentially a form of data compression. For example, where data points form clusters, a point can be encoded very efficiently by its cluster center plus a small amount of deviation information. Where the data contains redundancy, a factorial code [64] that removes the redundancy is more compact than the raw data. Even where the data is merely predictable, compression can be achieved by assigning short codes to those parts that are highly predictable from earlier observations [28, 95]. Overall, the main goal of traditional unsupervised learning is to improve the compression of the observed data, by discovering a program that computes, and thereby explains, the history more compactly than the shortest previously known program, without taking too long to do so.
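The "cluster center plus residual" encoding mentioned above can be sketched in a few lines using scikit-learn's KMeans. The bit-cost model at the end is deliberately simplistic and only meant to convey the intuition that small residuals are cheap to store.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Data with obvious cluster structure: points scattered around a few centers.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-8.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(200, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
residuals = points - kmeans.cluster_centers_[kmeans.labels_]

# Naive cost model: raw coordinates need ~32 bits each, residuals need fewer
# bits because they are small; plus 2 bits per point for the cluster index.
raw_bits = points.size * 32
residual_bits = points.shape[0] * 2 + residuals.size * 8   # toy numbers
print(f"raw: {raw_bits} bits, cluster-index + residual: {residual_bits} bits")
```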
However, conventional unsupervised learning is still not enough: it can only analyze and encode data, not choose the data. We need to add the dimension of active choice: our unsupervised learner must not only analyze the data but also select the actions that influence what is observed next—just as a scientist chooses an experiment, a baby chooses a toy, an artist chooses colors, a dancer chooses moves, or any attentive system [96] chooses its next sensory input. That is exactly what our reinforcement-learning-based curiosity framework is meant to achieve.
2.10 Art and music: by-products of compressed progress
Art and music have value far beyond the social level. Although some people think art is superfluous, good art carries a deeper meaning: at least the works of art I admire, considered in relation to a subjective observer, can broaden our understanding of the world, or of possible worlds.
They reveal previously unknown regularities hidden in compressible data; they connect previously isolated, unconnected patterns in surprising ways, making their combination subjectively more compressible (art as an eye-opener), until eventually the work becomes familiar and no longer novel.
Let me put forward a hypothesis: all kinds of artistic creation and attentive perception of art are really just by-products of our drive for novelty and curiosity, a drive that keeps supplying rewards for improvements of the compressor.
Let us explore this idea in more depth, continuing the discussion of [81, 88]. Both artificial and human observers must perceive a work of art sequentially, and usually actively.
For example, looking at a sculpture through a series of eye shifts or camera movements, or filtering and highlighting the sound of a pianist's performance while ignoring background noise through internal attention shifts. There is no doubt that many people derive pleasure and satisfaction from enjoying works of art, such as certain paintings or songs.
However, different observers, with different sensory devices and compression algorithms, may prefer different input sequences. Therefore, any objective theory of what constitutes good art must consider the subjective observer as a variable. We need to answer the question: what sequences of actions and resulting shifts in attention should observers perform in order to maximize their pleasure?
According to our principle, observers should select those action sequences that promise maximal, quickly learnable compression progress: sequences that match their current knowledge and can be accommodated by their (often limited) ways of incorporating, learning, or compressing new data.
2.11 Music
For example, when a person is choosing which song to play next, which one should he pick? Certainly not the one he has already heard ten times in a row—it has become completely predictable. But equally not some bizarre new piece with entirely alien rhythms and sounds: that would be too unconventional, too subjectively random. He should try a song that is unfamiliar in that it contains somewhat unexpected harmonies, melodies, or rhythms, yet familiar enough that new patterns in the sound stream can be picked up quickly. Sooner or later, of course, that song too will wear out—but not yet.
Observer dependence is illustrated by the fact that Schoenberg's twelve-tone music is much less popular than most pop music, perhaps because its harmonies are more complex and its algorithmic structure is less obvious to many listeners. For example, in twelve-tone music the frequency ratios of successive notes often cannot be expressed as ratios of small integers. Yet listeners with prior knowledge of the basic concepts, goals, and limitations of the twelve-tone system tend to appreciate Schoenberg's work more than those without such a background.
All of this is consistent with our principle: a subjective observer equipped with a learning compressor tries to compress the history of its sounds and other inputs as well as possible, while the action selector looks for actions that shape this history in ways that improve the compressor's performance.
Interesting sounds and other sequences are those that push the compressor forward, because they contain new, previously unknown but learnable patterns. Boring patterns are those that appear entirely random, or whose structure seems too hard to understand.
2.12 Art forms such as painting, sculpture, dance and film
This applies not only to dynamic arts such as film and dance, whose observation streams are inherently sequential, but also to painting and sculpture, because the shifts of the viewer's attention while looking at such works also produce a dynamic sequence of patterns.
2.13 No Absolute "Ideal Ratio" Between the Expected and the Unexpected
In earlier work, some researchers tried to explain aesthetic experience in terms of information theory [33, 34, 35], emphasizing an "ideal" ratio between the expected and the unexpected information conveyed by an aesthetic object (i.e., a balance between its "order" and its "complexity"). Note, however, that our approach assumes no such objectively existing ideal ratio. Instead, we propose a new dynamic measure of interestingness, based on the change in the number of bits needed to encode an object, taking into account the observer's prior knowledge and the limitations of its compression algorithm.
2.14 Active creation by the artist and passive perception by the audience: blurring the boundaries
Just as the viewer is intrinsically satisfied by successive attention to works of art that reveal previously unknown patterns, the artist is similarly rewarded by the creation of these works.
For example, I felt no greater sense of accomplishment than when, after hundreds of disappointing failures, I finally discovered the simple rules that allowed me to create the drawings in Figures 1 and 2.
However, the line between artist and viewer is blurred: both perform sequences of actions that reveal new forms of compressibility, and the intrinsic motivation of both conforms to our principle. (Translator's note: Figures 1 and 2 appear later in this article.)
Certainly, the artist derives an intrinsic, novelty-based satisfaction from creating a truly original work. At the same time, artists may also want external recognition, whether praise or money or both. Our principle, however, conceptually separates these two forms of reward.
2.15 Similarities between artists and scientists
From this perspective, scientists and artists are actually quite similar: both try to compress their growing histories of observations by carefully choosing experiments or actions that expose simple but previously unknown regularities.
Whether they are painters, dancers, musicians, pure mathematicians, or physicists, their creative drive can be seen as a by-product of our compression-progress-based curiosity framework: they all try to create new, non-random, regular data containing surprising, previously unknown regularities.
For example, many physicists design experiments to produce data governed by as-yet-unknown laws, in order to compress the data further. Artists, on the other hand, combine known objects in subjectively novel ways, so that the observer's subjective description of the result is more concise than a description of each part separately, because the parts share some previously unnoticed regularity.
So what is the main difference between science and art? The essence of science lies in formally pinning down compression progress through the discovery of new laws; a law of gravitation, for example, can be expressed in a few symbols. In the fine arts, by contrast, the compression progress achieved by observing a work that combines previously disparate things in a new way (art as an eye-opener) may remain subconscious: observers may be unable to describe it formally, yet they can feel the intrinsic reward of this progress, even though they cannot say exactly which memories have become subjectively more compressible in the process.
The framework in the Appendix is formal enough to implement our principle on a computer. The resulting artificial observers will vary in the computational power of their history compressors and learning algorithms. This will determine what counts as good art or science for them, and what they find interesting.
2.16 Jokes and Other Entertainment Resources
Comedians, like other artists, like to combine familiar concepts in novel ways, so that the audience can describe the whole more concisely than its parts taken separately, because the parts share patterns that had not been noticed before.
The laughter provoked by witty jokes is in many ways similar to that of babies and adults when they are learning new skills. For example, I learned to throw and catch three balls after the age of 25. It wasn't an overnight process, but a gradual and satisfying journey: at first I could only last a second, then two, then four, progressing gradually until I succeeded. Following the advice of my juggling instructor, I practiced in front of a mirror, and each time I made progress, a silly grin appeared unconsciously on my face. Similarly, when my little girl stood unaided for the first time, she had a smile of triumph on her face.
From an algorithmic point of view, all of these phenomena have a natural explanation: the smiles may be triggered by a reward mechanism in our brains that rewards us for creating data streams containing previously unknown regularities. For example, the sensory experience of juggling balls yourself—quite different from watching someone else juggle—is novel and intrinsically rewarding, until the brain has adapted to the new skill.
3. Early concrete implementations of the compression progress drive
As previously stated, the predictor and compressor are closely linked. Any incoming stream of sensory data that has a predictable component can be used to enhance the overall compression effect. Thus, the systems described in earlier research on reinforcement learning [57, 58, 61] can actually be seen as prime examples of compression progress-driven implementations.
3.1 Prediction Error as Reward (1990)
Early work [57, 58, 61] proposed a predictor based on a recurrent neural network [115, 120, 55, 62, 47, 78], a fairly powerful computational tool even by today's machine learning standards. It predicts sensory inputs, including reward signals, from the history of all previous inputs and actions. The curiosity reward is positively correlated with the predictor's error, which reflects an optimistic assumption: wherever the prediction error is large, the predictor is expected to be able to improve.
3.2 Rewarding Compression Progress Through Predictor Improvements (1991)
Follow-up work [59, 60] pointed out that the traditional prediction-error reward is not appropriate in all situations, especially in probabilistic environments full of uncertainty. Instead of rewarding the predictor's errors, we should reward its improvements. Otherwise the system gets stuck on parts of the data that remain unpredictable because of noise, randomness, or limited computational power, which prevents any further improvement of the data's subjective compressibility.
Although the neural predictor used in this work was computationally less powerful than the recurrent network of [61], it introduced an innovation: an explicit, adaptive model of the predictor's improvements, whose core function is to learn to predict changes in the predictor itself.
For example, although external noise is unpredictable and can cause dramatic fluctuations in the predictor's target signal, in the long run these fluctuations will not affect the parameters of the adaptive predictor too much. It is by learning from these variations that the predictor changes to continually improve its own predictive power.
In addition, a standard reinforcement learning algorithm [114, 33, 109] was introduced to maximize, under the given constraints, the information gain: it receives a curiosity reward signal proportional to the expected change of the predictor over time [16, 31, 38, 51, 14]. In effect, the system tries to maximize an approximation of the discounted sum of the expected first derivatives of the data's subjective predictability, which also means maximizing an approximation of the discounted sum of expected changes in the data's subjective compressibility.
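The contrast between Sections 3.1 and 3.2 can be summarized in two tiny reward functions: one rewards raw prediction error (and therefore also loves unlearnable noise), the other rewards only the improvement of the predictor. This is a schematic sketch, not the original implementations.

```python
def error_based_reward(prediction, observation):
    # Section 3.1 style: big errors look "interesting", even if they are pure noise.
    return abs(observation - prediction)

def progress_based_reward(error_before_update, error_after_update):
    # Section 3.2 style: reward only the improvement of the predictor.
    # Irreducible noise keeps the error high but yields no improvement, hence no reward.
    return max(0.0, error_before_update - error_after_update)
```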
3.3 Rewarding the Relative Entropy Between the Agent's Prior and Posterior (1995)
In 1995, an information-theoretic variant of the approach was derived for non-deterministic worlds. The curiosity reward was again proportional to the predictor's surprise or information gain, this time measured by the Kullback-Leibler divergence between the learning predictor's subjective probability distributions before and after a new observation—the relative entropy between its prior and posterior.
In 2005, Baldi and Itti named this measure "Bayesian surprise" and showed experimentally that it explains patterns of human visual attention better than previous approaches.
It is worth noting that the concepts of Huffman coding and of relative entropy between prior and posterior translate immediately into a measure of learning progress that reflects the number of bits saved—a measure of improved data compression.
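A minimal sketch of such an information-gain reward: the intrinsic reward is the relative entropy (KL divergence) between the predictor's distribution before and after a new observation. The Dirichlet-style count update over a categorical distribution used here is just one convenient stand-in for "the predictor's belief".

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

# Predictor's belief as counts over 3 possible next events.
counts = np.array([4.0, 4.0, 4.0])
prior = counts / counts.sum()

observed_event = 2                      # a new observation arrives
counts[observed_event] += 1.0
posterior = counts / counts.sum()

intrinsic_reward = kl_divergence(posterior, prior)   # "surprise" / information gain
print(f"curiosity reward: {intrinsic_reward:.4f} bits")
```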
Note that while such a simple probabilistic approach to data compression is useful, it cannot uncover deeper, algorithmic forms of compressibility. For example, the decimal expansion of π looks random and statistically incompressible, but it is not: there is a very short algorithm that computes all of its digits.
Nevertheless, any finite sequence of digits occurs in the expansion of π about as often as would be expected if π were truly random; in other words, no simple statistical learner can predict the next digit better than random guessing from a limited number of previously observed digits. To uncover such underlying algorithmic regularities, we need more general program-search techniques, such as [36, 75, 15, 46].
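To illustrate the point about π: the few lines below (using the mpmath library) constitute a short program that produces as many digits of π as desired, even though the resulting digit stream looks statistically unremarkable; a purely statistical learner would see nothing to compress.

```python
from collections import Counter
from mpmath import mp

mp.dps = 1000                 # ask for roughly 1000 decimal digits of precision
digits = str(mp.pi)[2:]       # drop the leading "3."

# The digit frequencies look roughly uniform (statistically "random") ...
print(Counter(digits))
# ... yet this whole stream was produced by a program only a few lines long,
# i.e. it is highly compressible in the algorithmic sense.
```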
3.4 Compression progress in zero-sum reward games revealed by algorithmic experiments (1997)
The work of 1997 [68, 72] significantly increased the computational power of controller and predictor by implementing them as two co-evolving, symmetric, opposing modules. These modules consist of adaptive, probabilistic programs [97, 98] written in a universal programming language [18, 111] that supports loops, recursion, and hierarchies. The programs' internal memory is treated as part of the environment and used for temporary storage of intermediate results. Each module can propose experimental algorithms and predict their outcomes in the form of "bets", which act as built-in incentives; the opposing module can accept or reject a bet by making the opposite prediction, creating a zero-sum game.
Once the bet is accepted, the outcome of the algorithmic experiment will determine the winner, and the "stake" will be transferred from the loser to the winner. Both modules attempted to maximize their own gains using a generic reinforcement learning algorithm designed for complex stochastic strategies [97,98] (of course, other types of reinforcement learning algorithms can be inserted as needed). This motivates both modules to continually explore new algorithmic laws and compression, where the criterion of novelty is defined by the degree of knowledge of the repeating laws of the world by the pair of hands.
This approach can be viewed as system identification through the co-evolution of computable models and tests. In 2005, Bongard and Lipson [11] adopted a similar co-evolutionary approach based on less general models and tests.
3.5 Improving External Reward Intake
These papers also demonstrated experimentally that intrinsic, curiosity-driven rewards can actually speed up the collection of external rewards.
3.6 Other Realization Methods
More recently, other researchers have refined and extended the curiosity framework. Singh, Barto, and colleagues implemented curiosity within the options framework of reinforcement learning, using prediction error as the curiosity reward, as in Section 3.1. They were also among the first to use the terms "intrinsic reward" and "intrinsically motivated reinforcement learning". Additional implementations were presented at the 2005 AAAI Spring Symposium; you can read more in the special issue of Connection Science.
4. Visual illustrations of subjective beauty and its first derivative, "interestingness"
As mentioned earlier (Section 3.3), the probabilistic variant of our theory [108] (1995) can explain certain patterns of human visual attention [32] (2005). Similarly, we can apply the approach to constructing images that contain easily learnable regularities, again emphasizing that the motivations of a creative artist and of a passive observer of visual art are not fundamentally different (Section 2.14).
Both generate sequences of actions that lead to interesting inputs, where "interestingness" is a measure of learning progress, for example based on the relative entropy between prior and posterior (Section 3.3), on the number of bits saved when encoding the data (Section 1), or on similar measures (Section 3).
Here we present some examples of subjective beauty designed for human observers, and show the learning process leading from low to high subjective beauty. Because of the limitations of the print medium, we can only use visual examples, not audio or tactile ones. The examples are meant to support the idea that unsupervised attention and the creativity of artists, dancers, musicians, and pure mathematicians are simply by-products of their drive for compression progress.
4.1 A fairly simple face with a short algorithmic description
Figure 1 shows a construction of a face that some human observers would consider "beautiful". The basic features of this face follow a very simple pattern [69] that can be defined with very few bits of information.
In other words, the data stream generated by observing the image (say, through a sequence of quick eye movements) is more compressible than it would be without such regularities.
Although few people can immediately understand how the painting was made without a grid overlay to explain it, most people do notice that the facial features harmonize with each other in a way that shows regularity.
According to our hypothesis, the observer's satisfaction comes from discovering, consciously or unconsciously, this compressibility of the data. The face keeps its attractiveness only as long as the observer can still discover new, previously unknown regularities; after that it becomes boring, even to those who find it beautiful—as noted several times above, beauty and interestingness are two different things.
Figure 1 shows the construction of a female face published in 1998, which some observers consider very "beautiful". Although the face contains many irregular details, such as texture, that cannot be described concisely, its basic features, such as the positions and shapes of the facial elements, follow a simple encoding scheme—simpler and more precise than those used in the old face studies of Leonardo da Vinci and Dürer.
This means the image contains a highly compressible regularity that can be described with few bits of information. The observer can perceive it through a sequence of attentive eye movements or saccades, and consciously or unconsciously detect the compressibility of the incoming data stream.
So how was the painting made?
First, the sides of a square are divided into several equal intervals. Then the boundaries of certain intervals are connected to form three rotated, superimposed grids based on lines with slopes of 1, 2, or 3. Higher-resolution detail of the grids is obtained by iteratively selecting two previously created, neighboring, parallel lines and inserting a new line equidistant between them.
Finally, the grid is compressed vertically by a factor of four. The resulting lines and their intersections define the basic boundaries and shapes of the eyebrows, eyes, eyelid shadows, mouth, nose, and face contours in a simple way that is easy to see from the construction scheme.
Although the scheme looks simple, it was actually very hard to find: I tried and failed hundreds of times before finding such a precise match between a simple construction rule and a beautiful face.
4.2 Another Picture That Can Be Encoded with Few Bits
Figure 2 shows an example: a butterfly and a vase with flowers. This picture can be described with very little information because it can be constructed using a simple algorithm based on a fractal circle pattern - see Figure 3 for details.
People who understand the algorithm tend to appreciate the picture more than those who do not, because they can see its simplicity. But this is not an all-or-nothing, black-and-white matter.
Since most people have a great deal of visual experience with circles, they quickly notice that the curves fit together in a regular way. However, very few people can immediately say the precise principles behind this picture. This pattern can be learned from Figure 3.
The reward generated by this conscious or subconscious discovery process—moving from longer to shorter data descriptions, from less to more compression, from lower to higher subjectively perceived beauty—depends on the first derivative of subjective beauty, that is, on the steepness of the learning curve.
Figure 2 shows a drawing of a butterfly and a vase with a flower, reprinted from [67, 81]. Figure 3 shows how the drawing was constructed and explains why it can be described so concisely.
Figure 3: The construction of Figure 2 is realized by a minimalist algorithm using fractal circles [67]. The frame itself is a circle. Its leftmost point is also the center of another circle of the same size.
Whenever two circles of equal size touch or intersect, the point of contact becomes the center of two further circles, one of equal size and one of half the size.
Each line in the drawing is an arc of one of these circles, and its endpoints are points where circles touch or intersect. There are only a few large circles but many small ones. In general, the smaller the circle, the more bits are needed to specify it; the drawing is simple (and easy to compress) because it is built mainly from a few large circles.
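Below is a rough, hypothetical re-creation of the growth rule just described (equal-size circles spawning one equal-size and one half-size circle at each touching or intersection point), capped at a small number of circles. The actual drawing in [67] selects and prunes circles far more carefully; this sketch is only meant to convey the flavor of the construction.

```python
import math
import matplotlib.pyplot as plt

def intersections(c1, c2, r):
    """Touching/intersection points of two circles of equal radius r (if any)."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > 2 * r:
        return []
    h = math.sqrt(max(r * r - (d / 2) ** 2, 0.0))
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    ux, uy = (x2 - x1) / d, (y2 - y1) / d          # unit vector from c1 to c2
    return [(mx - uy * h, my + ux * h), (mx + uy * h, my - ux * h)]

def grow(max_circles=60):
    R = 1.0
    circles = [((0.0, 0.0), R), ((-R, 0.0), R)]     # frame circle + circle at its leftmost point
    i = 1
    while i < len(circles) and len(circles) < max_circles:
        (ci, ri) = circles[i]
        for (cj, rj) in circles[:i]:
            if abs(ri - rj) < 1e-9:                 # only equal-sized pairs spawn children
                for p in intersections(ci, cj, ri):
                    circles.append((p, ri))         # one new circle of equal size ...
                    circles.append((p, ri / 2))     # ... and one of half the size
        i += 1
    return circles[:max_circles]

fig, ax = plt.subplots(figsize=(5, 5))
for (cx, cy), r in grow():
    ax.add_patch(plt.Circle((cx, cy), r, fill=False, linewidth=0.6))
ax.set_xlim(-3, 2); ax.set_ylim(-2.5, 2.5); ax.set_aspect("equal"); ax.axis("off")
plt.show()
```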
Many observers report a special pleasure upon discovering this simplicity. The observer's learning process reduces the subjective complexity of the data, producing a temporary peak of subjective beauty: a steep learning curve. (It also took me a long time to find a satisfying and rewarding way of using fractal circles to create a good drawing.)
5. Conclusions and outlook for the future
We found that a surprisingly simple algorithmic principle, based on the notion of data compression and its progression, can informally articulate the principles of attention, novelty, surprise, interest, curiosity, creativity, subjective beauty, humor, and the fundamentals of science and art.
The core elements of this formal framework include:
- A continuously upgraded data history predictor or compressor.
- A computable compressor progress metric is used to evaluate intrinsic reward.
- A reward optimizer or reinforcement learner that translates rewards into action sequences expected to maximize future reward.
To improve on our previous implementations of these elements (see Section 3), we will:
- Investigate better adaptive compressors, in particular recently emerged recurrent neural networks (RNNs) and other general yet practical prediction methods.
- Explore the conditions under which learning progress metrics can be computed both accurately and efficiently, avoiding the need for frequent and costly compressor performance evaluations of entire historical data.
- Study the use of recently improved reinforcement learning techniques, such as policy gradients and artificial evolution.
Beyond building more advanced reinforcement learning agents, we can also test the predictions of our theory in psychological studies, extending previous related work beyond merely anecdotal evidence.
It should not be difficult to design controlled experiments in which participants predict sequences of events that are initially unknown but causally linked, and which may contain complex or simple, learnable patterns or regularities.
Participants need to quantify their intrinsic reward for improving predictions. Are the rewards really strongest when the forecast improves the fastest? Do intrinsic rewards really disappear when the prediction reaches perfection or stops improving?
Finally, how can we validate our predictions through neuroscience?
So far, our understanding of the human nervous system is limited. However, it is known that certain neurons seem to be able to predict the activity of other neurons, and brain scans show that certain brain regions become active when they receive a reward.
Thus, the psychological experiments mentioned above should be combined with neurophysiological studies to localize the source of intrinsic rewards, and possibly to link it to improvements of neural predictors.
Success in this line of work would further encourage us to put these ideas into practice in robotics.
Appendix A
The contents of this appendix are mainly based on the literature [81, 88].
Better compression allows a deeper understanding of the world. By improving our data compression algorithms, we discover more efficient ways of processing data. The question now is how to build an agent that not only collects external rewards but also learns and explores on its own, actively seeking out data that is easy to understand and compress. (Insights gained through such self-directed exploration may ultimately also help it to better accomplish tasks set by a teacher.)
Let us formally consider a learning agent whose life consists of discrete time steps t = 1, 2, ..., T. Its lifetime T may or may not be known in advance. In what follows, Q(t) denotes the value of a variable Q at time t, Q(≤ t) the sequence of its values from time 1 up to and including time t, and Q(<t) the sequence excluding time t.
At each time t the agent receives a real-valued input x(t) from the environment and executes a real-valued action y(t) that may influence future inputs. At any time t < T, its goal is to maximize its expected future utility
u(t) = Eμ [ r(t+1) + r(t+2) + ... + r(T) | h(≤ t) ],
where r(t) is the additional real-valued reward input at time t, and h(t) is the ordered triple [x(t), y(t), r(t)] (so that h(≤ t) is the history known up to time t).
Eμ(·|·) denotes the conditional expectation operator with respect to some possibly unknown distribution μ from a set ℳ of possible distributions, where ℳ reflects whatever is known about the possibly stochastic reactions of the environment.
For example, ℳ may contain all computable distributions. Note that there is just one single lifelong trial, with no predefined repeatable experiments, and the interaction between sensors and environment is not required to be Markovian; moreover, the utility function implicitly takes into account the expected remaining lifetime Eμ(T | h(≤ t)), and thus the possibility of extending that lifetime through appropriate actions.
Recent research has produced the first learning machines that are universal and optimal in various very general senses. As mentioned in the introduction, such machines can in principle decide for themselves whether curiosity and world-modeling are useful in a given context, and adjust their behavior accordingly.
However, this appendix will assume a priori that historical compression or interpretation is beneficial and should be pursued. For the time being, we will not consider the possible harms of curiosity.
To achieve this, in the spirit of our work since 1990, we split the reward signal r(t) into two real-valued scalar components: r(t) = g(r_ext(t), r_int(t)), where g maps pairs of real values to real values, for example g(a, b) = a + b.
r_ext(t) here refers to the traditional external rewards provided by the environment, such as negative rewards for hitting a wall, or positive rewards for reaching a certain teacher-specified target state.
In this paper, however, we are particularly interested in r_int(t), the internal, intrinsic, curiosity-based reward, which is provided whenever the agent's data compressor or internal world model improves in some measurable way. Initially we focus on the case r_ext(t) = 0 for all valid t. The basic principle, which we have published before in several variants, is essentially the following:
Principle 1
Generate a curiosity reward for the controller whenever its predictor or history compressor improves.
So we first conceptually separate the goal (compressing and explaining the history) from the means of achieving it. Once the goal is formally defined as an algorithm for computing curiosity rewards, it is left to the controller's reinforcement learning (RL) mechanism to figure out how to translate those rewards into action sequences that allow the given compressor-improvement algorithm to discover and exploit previously unknown kinds of compressibility.
A.1 Predictors and compressors
Much of our previous work in reinforcement learning has been prediction-oriented, e.g., [57, 58, 61, 59, 60, 108, 68, 72, 76]. However, prediction and compression are closely related. A method that correctly predicts many x(τ) , given the history h(<τ), for 1 ≤ τ ≤ t, can be used to compactly encode h(≤ t).
Given a predictor, only the incorrectly predicted x(τ), together with the corresponding time steps τ, are necessary to reconstruct the history h(≤t), e.g., [63]. Similarly, a predictor that learns a probability distribution over the possible next events, given previous events, can be used to obtain a compressed representation of the history, by encoding observations with high (respectively low) predicted probability with few (respectively many) bits [28, 95].

In general, we can think of the predictor as a program p that recomputes h(≤t). If this program is short compared to the raw data h(≤t), then h(≤t) is regular or non-random [106, 34, 37, 73], presumably reflecting underlying environmental laws, and p may then also be useful for predicting future, still unseen x(τ) for τ > t.

It should be mentioned, however, that prediction-oriented compressors based on the Minimum Description Length (MDL) principle [34, 112, 113, 54, 37] do not necessarily converge to correct predictions as quickly as Solomonoff's universal inductive inference [106, 107, 37], although both methods converge in the limit under general conditions [52].
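To make the prediction/compression link concrete, here is a minimal sketch (my own illustration, not code from the paper; `code_length_bits` and the toy sequence are invented): a probabilistic predictor assigns each observed symbol a probability, and the ideal code length of the history is the sum of −log2 of those probabilities, so better predictions mean fewer bits.

```python
import math

def code_length_bits(history, predict):
    """Ideal (arithmetic-coding) length of the history in bits, given a predictor.

    `predict(prefix)` returns a dict mapping each possible next symbol to its
    predicted probability: high-probability symbols cost few bits, low-probability
    symbols cost many bits.
    """
    total = 0.0
    for t, symbol in enumerate(history):
        p = predict(history[:t]).get(symbol, 1e-12)  # tiny floor avoids log(0)
        total += -math.log2(p)
    return total

# A uniform predictor versus one that has learned the alternating regularity.
history = "ababababab"
uniform = lambda prefix: {"a": 0.5, "b": 0.5}
learned = lambda prefix: {"a": 0.9, "b": 0.1} if len(prefix) % 2 == 0 else {"a": 0.1, "b": 0.9}
print(code_length_bits(history, uniform))  # 10.0 bits
print(code_length_bits(history, learned))  # about 1.52 bits
```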
A.2 Which predictor or history compressor?
The complexity of evaluating some compressor p on the history h(≤t) depends both on p and on the chosen performance measure C. Let us first focus on the former. Given t, one of the simplest p is a linear map that predicts x(t+1) from x(t) and y(t+1).

More complex p, such as adaptive recurrent neural networks (RNNs) [115, 120, 55, 62, 47, 26, 93, 77, 78], use a nonlinear mapping and the entire history h(≤t) as the basis for prediction. In fact, our first work in this line [61] focused on predictors of this type. Theoretically, the best possible predictor is Solomonoff's universal induction scheme [106, 107, 37].
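As a sketch of the "simplest p" mentioned above (my own illustration with made-up data; the variable names are placeholders), a linear map predicting x(t+1) from x(t) and y(t+1) can be fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical history: 2-dimensional inputs x(t) and 1-dimensional actions y(t+1).
X_t  = rng.normal(size=(100, 2))                                  # x(t)
Y_t1 = rng.normal(size=(100, 1))                                  # y(t+1)
X_t1 = X_t @ np.array([[0.8, 0.1], [0.0, 0.9]]) + 0.5 * Y_t1      # x(t+1), generated linearly

# Fit the linear predictor W on [x(t), y(t+1)] -> x(t+1) via least squares.
features = np.hstack([X_t, Y_t1])
W, *_ = np.linalg.lstsq(features, X_t1, rcond=None)
prediction = features @ W                    # predicted x(t+1) for each time step
print(np.abs(prediction - X_t1).max())       # near zero: the regularity is captured
```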
A.3 Compressor performance measures
At any time t (1 ≤ t < T), given some compressor program p able to compress the history h(≤t), let C(p, h(≤t)) denote its compression performance on h(≤t). An appropriate performance measure would be

Cl(p, h(≤t)) = l(p),

where l(p) denotes the length of p in bits: the shorter p, the more algorithmic regularity, compressibility, predictability, and lawfulness we have found in the observations so far. The ultimate limit of Cl(p, h(≤t)) would be K*(h(≤t)), a variant of the Kolmogorov complexity of h(≤t), namely the length of the shortest program (for the given hardware) whose output starts with h(≤t) [106, 34, 37, 73].
A.4 Compressor Performance Measures That Take Time into Account
Cl(p, h(≤t)) does not take into account the time τ(p, h(≤t)) that p needs to compute h(≤t). An alternative performance measure, inspired by concepts of optimal universal search [36, 75], is

Clτ(p, h(≤t)) = l(p) + log τ(p, h(≤t)).

Here, compression by one additional bit is worth as much as reducing the running time by a factor of 2. From the point of view of asymptotic optimality, this is one of the best ways of trading off storage space and computation time [36, 75].
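A minimal sketch of the two measures side by side (my own, with placeholder numbers, assuming a base-2 logarithm): under Clτ, one bit of extra compression exactly offsets a doubling of the runtime.

```python
import math

def C_l(program_length_bits: float) -> float:
    """Pure description-length measure: shorter programs score better (lower)."""
    return program_length_bits

def C_lt(program_length_bits: float, runtime_steps: float) -> float:
    """Time-aware measure: length plus log2 of runtime, so 1 bit ~ a runtime factor of 2."""
    return program_length_bits + math.log2(runtime_steps)

# Two hypothetical compressors of the same history:
fast_but_longer  = C_lt(1001, runtime_steps=1_000)   # 1001 + log2(1000)
slow_but_shorter = C_lt(1000, runtime_steps=2_000)   # 1000 + log2(2000)
print(fast_but_longer, slow_but_shorter)             # identical scores under C_lt
```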
A.5 Measures of Compressor Progress/Learning Progress
The previous sections discussed measures of compressor performance, but not of performance improvement, which is the central issue in our curiosity-oriented context. To repeat the point made above: it is not the compression performance of the compressor itself that matters, but its performance improvement. Our curiosity reward should reflect the progress the compressor makes between times t and t+1 (due to some application-dependent compressor improvement algorithm):

f[ C(p(t), h(≤t+1)), C(p(t+1), h(≤t+1)) ],

where f is a function that maps a pair of real values to a real value, and p(t) denotes the agent's compressor at time t (see Section A.6). Several alternatives for measuring progress exist; the most obvious is f(a, b) = a − b. This corresponds to a discrete-time version of maximizing the first derivative of subjective data compressibility: progress in compressor performance is measured as the difference between two consecutive measurements.
Note that both old and new compressors must be tested on the same data (i.e., historical data to date).
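A minimal sketch (mine, not the paper's code; `curiosity_reward` and the sizes are illustrative) of the progress measure with f(a, b) = a − b, where both compressors are scored on the same history:

```python
def curiosity_reward(C_old: float, C_new: float) -> float:
    """Intrinsic reward = compression progress: how many fewer bits the improved
    compressor needs for the *same* history (positive only if it really improved)."""
    return C_old - C_new

# Hypothetical compressed sizes of the history so far, in bits:
print(curiosity_reward(C_old=5_400.0, C_new=5_150.0))  # 250.0 bits of progress
```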
A.6 An Asynchronous Framework for Creating Rewards for Curiosity
Let p(t) denote the agent's current compressor program at time t, and s(t) its current controller, and execute the following (a minimal code sketch of this asynchronous scheme follows the list below):
Controller: At any time t (1 ≤ t < T), do:
- Let s(t) use the (partial) history h(≤t) to select and execute y(t+1).
- Observe x(t+1).
- Check whether the compressor improvement algorithm (see below), which runs separately and asynchronously, has generated a non-zero curiosity reward r_int(t+1); if not, set r_int(t+1) = 0.
- Let the controller's reinforcement learning (RL) algorithm use h(≤t+1), including r_int(t+1) (and possibly a compressed version of the latest observations, see below), to obtain a new controller s(t+1), in line with Objective (2).
Compressor: Set p_new to an initial data compressor. Starting at time 1, repeat the following steps continuously until interrupted or until termination at time T:
- Set p_old = p_new; get the current time step t and set h_old = h(≤t).
- Evaluate p_old on h_old to obtain C(p_old, h_old) (Section A.3). This may take many time steps.
- Let some (application-dependent) compressor improvement algorithm, such as a learning algorithm for an adaptive neural network predictor, use the stored history h_old to improve p_old and obtain a hopefully better compressor p_new, e.g., a network of the same size with greater predictive power and therefore better compression performance [95]. Although this process may take many time steps (and may partly take place during "sleep"), p_new may not be optimal, owing to limitations of the learning algorithm such as the possibility of getting stuck in local maxima.
- Evaluate p_new on h_old to obtain C(p_new, h_old). This may also take many time steps.
- Get the current time step τ and generate the curiosity reward

r_int(τ) = f[ C(p_old, h_old), C(p_new, h_old) ],

e.g., f(a, b) = a − b; see Section A.5.
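Purely as an illustrative sketch of the asynchronous scheme above (all components are stubbed-out placeholders of my own, not the paper's implementation; `evaluate` stands in for C(p, h), and the "improvement" step is faked), the two loops could look roughly like this:

```python
import threading, time, random

history = []           # shared h(<=t): list of (x, y, r) triples
pending_rewards = {}   # curiosity rewards waiting to be consumed, keyed by time step
lock = threading.Lock()

def evaluate(compressor, data):
    """Stand-in for C(p, h): compressed size of the data under this compressor, in bits."""
    return len(data) * compressor["bits_per_step"]

def compressor_loop():
    p_new = {"bits_per_step": 8.0}                       # initial, poor compressor
    while True:
        with lock:
            p_old, h_old = dict(p_new), list(history)    # snapshot p_old and h_old
        if not h_old:
            time.sleep(0.01)
            continue
        c_old = evaluate(p_old, h_old)                                # may take many steps
        p_new = {"bits_per_step": p_old["bits_per_step"] * 0.95}      # stand-in "improvement"
        c_new = evaluate(p_new, h_old)
        with lock:
            tau = len(history)                           # current time step
            pending_rewards[tau] = c_old - c_new         # f(a, b) = a - b

def controller_loop(steps=50):
    for t in range(steps):
        y = random.choice([-1.0, 1.0])                   # stand-in action selection by s(t)
        x = random.random()                              # stand-in observation x(t+1)
        with lock:
            r_int = pending_rewards.pop(t, 0.0)          # zero if no compressor progress yet
            history.append((x, y, r_int))                # r_ext = 0: the pure-curiosity case
        time.sleep(0.005)

threading.Thread(target=compressor_loop, daemon=True).start()
controller_loop()
```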
Obviously, this asynchronous approach may result in a long delay between a controller action and the corresponding curiosity reward. This may place a heavy burden on the controller's reinforcement learning algorithm, whose task is to assign credit to past actions (to inform the controller about, for example, the beginning of a compressor evaluation process, we may augment its input with unique representations of such events). However, there are reinforcement learning algorithms that are theoretically optimal in various senses, as discussed below.
A.7 Optimizing Curiosity, Creativity and Concentration
The compressor of choice will typically have certain computational limitations. In the absence of any external rewards, we can define optimal pure curiosity behavior relative to these limitations: at time t, this behavior selects the action that maximizes

u(t) = Eμ( ∑_{τ=t+1}^{T} r_int(τ) | h(≤t) ).    (7)
Since the true probability distribution μ governing the world is unknown, the task facing the controller's reinforcement learning algorithm can be formidable. As the system revisits previously incompressible parts of the environment, some of those parts will tend to become subjectively easier to compress, and the corresponding curiosity rewards will diminish over time.
A good reinforcement learning algorithm must be able to detect this reduction, predict it, and then act accordingly. However, traditional reinforcement learning algorithms [33] do not provide any theoretical optimality guarantees for this situation. (This is not to say that suboptimal reinforcement learning algorithms may not lead to success in some applications; experimental studies may provide interesting insights.)
First, we make a natural assumption: the compressor is not super-complex in the manner of Kolmogorov complexity itself, i.e., its output, and hence r_int(t), is computable for all t. The question then arises whether there exists an optimal reinforcement learning algorithm that comes as close to the maximum of objective (7) as any other. Such an algorithm does exist. Its disadvantage, however, is that it cannot be computed in finite time. Nevertheless, it serves as a reference point for defining the best achievable performance.
A.8 An Optimal but Incomputable Action Selector
There is an optimal way of selecting actions that makes use of the theoretically optimal universal predictors and their Bayesian learning algorithms [106, 107, 37, 29, 30]. The latter only assume that the reactions of the environment are sampled from an unknown probability distribution μ contained in a set ℳ of all enumerable distributions; compare the text after Eq. (2). More precisely, given an observation sequence q(≤t), our only assumption is that there exists a computer program that can take any q(≤t) as input and compute its a priori probability μ(q(≤t)). In general we do not know this program, so we predict using a mixture prior instead:

ξ(q(≤t)) = ∑_i w_i μ_i(q(≤t)),    (8)

a weighted sum over all distributions μ_i ∈ ℳ, i = 1, 2, ..., where the sum of the constant positive weights satisfies ∑_i w_i ≤ 1. In a very general sense, this is indeed the best we can do [107, 29]. The drawback of the scheme is its incomputability, since ℳ contains infinitely many distributions. One may increase the theoretical power of the scheme by augmenting ℳ in certain ways, or make it computable by restricting it: for example, by assuming that the world is computed by some unknown but deterministic computer program sampled from a prior [74] that assigns low probability to environments that are hard to compute by any method.
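A tiny sketch (my own, over a finite toy class rather than all enumerable distributions; the model class, weights, and observation sequence are invented) of predicting with such a mixture: each model's weight is multiplied by the probability it assigned to the observed symbol, and the mixture prediction is the weight-normalized average.

```python
# Toy Bayes mixture over a finite model class: each "model" is just a fixed
# probability of observing symbol 1 at every step (a stand-in for the mu_i).
models  = [0.1, 0.5, 0.9]          # hypothetical mu_i(next symbol = 1)
weights = [1/3, 1/3, 1/3]          # constant positive prior weights w_i, summing to <= 1

def mixture_predict(weights, models):
    """xi(next = 1 | history): weight-normalized average of the models' predictions."""
    return sum(w * m for w, m in zip(weights, models)) / sum(weights)

def bayes_update(weights, models, symbol):
    """Reweight each model by the probability it assigned to the observed symbol."""
    return [w * (m if symbol == 1 else 1 - m) for w, m in zip(weights, models)]

for symbol in [1, 1, 0, 1, 1, 1]:              # hypothetical observation sequence
    print(round(mixture_predict(weights, models), 3))
    weights = bayes_update(weights, models, symbol)
```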
Once such an optimal predictor is available, it can be extended, by formally taking into account the effects of executed actions, to define an optimal action selector that maximizes expected future reward. At any time t, Hutter's theoretically optimal (but as yet incomputable) reinforcement learning algorithm AIXI [29] uses an extended version of Solomonoff's prediction scheme to select those action sequences that promise the maximal future reward up to some horizon T, given the current data h(≤t).
That is, in cycle t+1, AIXI selects, as its next action, the first action of an action sequence that maximizes the ξ-predicted reward up to the given horizon, appropriately generalizing Eq. (8). AIXI uses observations optimally [29]: the Bayes-optimal policy pξ based on the mixture ξ is self-optimizing in the sense that, for all μ ∈ ℳ, its average utility value converges asymptotically to the optimal value achieved by the Bayes-optimal policy pμ, which knows μ in advance. The necessary condition that ℳ admits self-optimizing policies is also sufficient. The policy pξ is also Pareto-optimal in the sense that no other policy yields a higher or equal value in all environments ν ∈ ℳ and a strictly higher value in at least one [29].
A.9 A Computable Selector of Provably Optimal Actions
The above AIXI requires unlimited computation time. Its computable variant, AIXI(t,l) [29], has asymptotically optimal running time, but may suffer from a huge constant slowdown. To take the consumed computation time into account in a general, optimal way, we may instead use the recent Gödel machines [79, 82, 80, 92]. They represent the first class of mathematically rigorous, fully self-referential, self-improving, general-purpose, optimally efficient problem solvers, and they are applicable to the problem posed by Objective (7).
The initial software 𝒮 of such a Gödel machine contains an initial problem solver, e.g., a typical sub-optimal method [33]. It also contains an asymptotically optimal initial proof searcher based on an online variant of Levin's Universal Search [36], which is used to run and test proof techniques.
Proof techniques are programs written in a universal language implemented on the Gödel machine. They are, in principle, able to compute proofs about the system's own future performance, based on an axiomatic system 𝒜 encoded in 𝒮. 𝒜 describes the formal utility function, in our case Eq. (7), the hardware properties, axioms of arithmetic, probability theory and data manipulation, etc., as well as 𝒮 itself, which is possible without introducing circularity [92].
Inspired by Kurt Gödel's celebrated self-referential formulas (1931), the Gödel machine rewrites any part of its own code (including the proof searcher) by means of a self-generated executable program, as soon as its Universal Search variant has found a proof that the rewrite is useful according to Objective (7). According to the Global Optimality Theorem [79, 82, 80, 92], such a self-rewrite is globally optimal, and no local maxima are possible, since the self-referential code first had to prove that it is not useful to continue searching for alternative self-rewrites.
If there is no provably useful, globally optimal way of rewriting 𝒮 at all, then humans will not find one either. But if there is one, then 𝒮 itself can find and exploit it. Unlike the previous non-self-referential approaches based on hard-wired proof searchers [29], the Gödel machine not only has an optimal order of complexity, but it can also optimally reduce (through self-changes) any slowdowns hidden by the O() notation, provided the utility of such speed-ups is provable. Compare [83, 86, 85].
A.10 Non-Universal but Still General and Practical Reinforcement Learning Algorithms
Recently, there have been significant advances in reinforcement learning algorithms that are not as general as the algorithms described above, but are capable of learning very general, program-like behaviors.
In particular, evolutionary methods [53, 99, 27] can be used to train recurrent neural networks (RNNs), which are general-purpose computers; many evolutionary approaches to training RNNs have been developed [40, 122, 121, 45, 39, 103, 42].
One particularly effective family of methods uses cooperative coevolution to search the space of network components (neurons or individual synapses) rather than complete networks. The components are coevolved by combining them into networks and selecting, for reproduction, those that participated in the best-performing networks [43, 20, 21, 19, 22, 24].
Other recent RL techniques for RNNs are based on the concept of policy gradients [110, 119, 118, 56, 100, 117]. It would be interesting to evaluate variants of such control learning algorithms within the curiosity reward framework.
A.11 Acknowledgements
We thank Marcus Hutter, Andy Barto, Jonathan Lansey, Julian Togelius, Faustino, Giovanni Pezzulo, Gianluca Baldassarre, and Martin Butz for their invaluable comments, which helped to improve the first version of this text.