Patchwork-Intelligent Systems and Prompting Them
Defining Intelligence (Spoiler: We Don’t Know It Yet)
Very soon, systems equipped with artificial intelligence will permeate our lives. They will be everywhere: around us, and perhaps even inside us. What will they be like? In the past, we learned about this from science fiction writers and futurists. Today, we can confidently describe the first generation of intelligent systems, providing detailed insights into their structure, behavior, strengths and weaknesses, as well as their economic characteristics.
What we still cannot do is clearly explain what intelligence truly is, nor how to confirm that a system presented to us fully possesses it. The rule commonly referred to as the Turing Test sounds enticingly simple, but it is unlikely to hold up to scrutiny in reality. We don’t know who exactly the person is that an intelligent system would need to emulate so closely in behavior to pass this test. If the criterion assumes this person always acts rationally, makes no mistakes, and is capable of answering any question correctly, then we won't be able to conduct the test—because such a person simply doesn't exist. If the criterion allows this person to make mistakes, get confused, or act foolishly, then how far can they go in doing so before we stop agreeing that they are a suitable benchmark for testing a system claiming intelligence? How often do the people around you live up to your expectations regarding their own intelligence?
Why not simply say that errors come in different qualities: human and non-human? Dropping a pan on the floor and making a child cry—that’s a human error. Throwing the child in the trash to stop the crying—that’s not a human error. If an intelligent system is a system we agree to trust, then its behavior must be entirely free of non-human errors. But how many human errors are we willing to tolerate? An oak tree stump makes no mistakes at all, yet that doesn’t make it intelligent.
What kinds of tests should be included in the methodology for evaluating an intelligent system to determine if it truly deserves that title? Will it be enough to confirm the system’s ability to pass exams in various academic disciplines? Reports claiming that this threshold has been crossed frequently appear in the media. Try asking Midjourney to create your portrait based on a verbal description, and then compare it to a portrait drawn by a human artist from the same description. Find an obscure short story in literature or write one yourself. Let the plot be believable but not entirely conventional. Then ask a large language model to explain what emotions the characters experienced, what motivated them, and why the events led to that particular outcome. Afterward, ask the same questions to human readers.
It seems that neither sufficiently well-defined criteria nor tests that would adequately assess a system’s intelligence currently exist. Perhaps their development should not be the task of engineers, neurophysiologists, or even psychologists, but rather philosophers. If anyone can explain what intelligence, consciousness, or personhood truly are, it would be them. However, as far as we know, they have yet to be contracted for the job.
Lacking an intensional definition of an intelligent system, we are forced to act extensionally, simply expanding the scope of artificial intelligence to include areas of work that, in one way or another, produce human-like results. So far there have been two such directions. They have developed quite successfully, but largely independently of each other. Now they have begun a great convergence.
Large Language Models
The first direction is large language models (LLMs). Solutions of this class form the basis of widely known public services like ChatGPT, Claude, and others. An LLM receives an instruction in natural language and responds with the result of executing that instruction. A specific terminology has already emerged to describe such instructions. They are now called “prompts,” and the skill of crafting these prompts is known as “prompt engineering.”
The result of executing a prompt is usually delivered as content in a specific format: graphical, textual, or multimedia. Programming code, or data in a textual markup such as the source of a UML diagram, can be treated as a special case of text.
The performance of LLMs is undeniably impressive, yet it remains inconsistent. Take a look at the guides and articles on prompt engineering, the publication of which has surged over the past year. The authors of these works approach the topic from different angles: some aim to teach their readers how to make easy money, others focus on eliminating routine from creative tasks, while some, like the author of this text, take the stance of a researcher exploring new technologies. At their core, however, their discussions boil down to one key question: how to limit the degrees of freedom of artificial intelligence, how to direct its generative capabilities in the desired direction, and how to curb unnecessary improvisation. Here are a few of the most common techniques that are widely considered effective.
Assigning the responder's role. At the beginning of the prompt, the user asks the system to structure its response as if it were a person in a specific role: a copywriter, a nutrition expert, a rejected admirer, or a loving grandmother.
Specifying the recipient's role. At the beginning of the prompt, the user informs the system who the response will be directed to, detailing the recipient's knowledge, expectations, and intentions. Naturally, this technique is linked to the one described above. For example, we can ask the system to respond as a nutritionist to a patient or as a grandmother to her grandchild. We are free to either include explicit descriptions of these roles in the prompt or rely on the system's ability to adequately simulate such characters.
Cascading prompts. The first prompt invites the system to broadly outline a work plan. If necessary, we modify this plan. Subsequent prompts instruct the system to execute each point of the finalized plan. For instance, we can first ask the system to describe the characters: a grandmother and her grandson. Then, we prompt it to generate a dialogue that could occur between them. We may also include our revised response to the first prompt as input for the second. Cascades can be multi-layered. We can refine the initial plan point by point and level by level until it evolves into a fully developed text.
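As an illustration, a two-level cascade might be driven from code roughly as follows. This is a minimal sketch: call_llm is a hypothetical helper standing in for whatever LLM service is actually used, and the prompts themselves are only examples.

def call_llm(prompt: str) -> str:
    """Hypothetical stub: send a prompt to the LLM service in use and return its reply.
    Replace this with a real client call; here it only returns a placeholder."""
    return "(LLM response for: " + prompt[:40] + "...)"

# Prompt 1: ask for a broad outline first.
characters = call_llm(
    "Describe two characters, a grandmother and her grandson: "
    "their ages, temperaments, and manner of speech."
)

# Checkpoint: review the outline and edit it by hand if necessary.

# Prompt 2: build on the (possibly revised) result of prompt 1.
dialogue = call_llm(
    "Here are two characters:\n" + characters + "\n\n"
    "Write a short dialogue that could occur between them."
)
print(dialogue)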
Templating successful examples. This technique works well in ChatGPT, where templates are supported. Let’s say we’ve come across a well-written advertisement and want to create several of our own in a similar style. First, we ask the system to convert the advertisement into a template. It recognizes the specific elements of the text and replaces them with placeholders, which ChatGPT formats in square brackets. After refining the template, we use it in prompts to generate our own ads. For example, an ad for a refrigerator becomes similar ads for a sofa, a television, and a washing machine. The key point is that we don’t simply instruct the system to write an ad that resembles the given sample, we maintain a checkpoint in the process. This approach applies to all other techniques as well: we keep checkpoints to monitor the system, much like we would supervise a child or an inexperienced colleague.
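The same workflow can be scripted. A minimal sketch, again with a hypothetical call_llm stub; the sample advertisement text and the product list are invented for the example.

def call_llm(prompt: str) -> str:
    """Hypothetical stub; replace with a real call to the LLM service in use."""
    return "(LLM response for: " + prompt[:40] + "...)"

sample_ad = "The FrostLine 400 refrigerator keeps your food fresh for weeks..."  # invented sample

# Checkpoint 1: turn the successful sample into a template with [placeholders].
template = call_llm(
    "Convert the following advertisement into a reusable template. "
    "Replace product-specific details with placeholders in square brackets.\n\n" + sample_ad
)
# Review and refine the template by hand before reusing it.

# Checkpoint 2: fill the refined template for new products.
for product in ("sofa", "television", "washing machine"):
    print(call_llm("Fill in this template for a " + product + ":\n\n" + template))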
Including a persistent part in prompts. A frustrating characteristic of many current systems that rely on LLMs is their occasional loss of context. For example, you might be writing a JavaScript program for the Windows Script Host environment, but midway through, the system suddenly forgets this and starts using functions available only in Node.js. Another common situation occurs when, after composing five functions through a series of prompts, you ask the system to write a sixth function that is supposed to use the first five. However, upon receiving the response, you realize it ignored the previous results and wrote the sixth function from scratch, completely different from what you needed. To prevent such misunderstandings, users create a persistent part and include it in every prompt within a related series. This persistent part serves as a constant reminder to the LLM of what it might otherwise forget. The persistent part can grow incrementally; for example, when developing a program, you can gradually append the functions you have already written to it, as a running synopsis of the existing code.
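In code, the technique amounts to prepending the persistent part to every request and letting it grow as results accumulate. A minimal sketch with the same hypothetical call_llm stub; the function names in the requests are invented.

def call_llm(prompt: str) -> str:
    """Hypothetical stub; replace with a real call to the LLM service in use."""
    return "// (generated code would appear here)"

# The persistent part: everything the model must not be allowed to forget.
persistent = (
    "We are writing JavaScript for the Windows Script Host environment. "
    "Do not use APIs that exist only in Node.js.\n\n"
    "Functions written so far:\n"
)

requests = [
    "Write a function readConfig(path) that reads a configuration file.",
    "Write a function writeLog(message) that appends a line to a log file.",
    "Write a function main() that uses readConfig and writeLog.",
]

for request in requests:
    code = call_llm(persistent + "\n" + request)
    print(code)
    persistent += "\n" + code  # later prompts will see everything written earlier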
Result verification. Despite these numerous and rather sophisticated methods of guiding LLMs, they do not prevent them from making obvious blunders. There is a known case where one of the generative networks depicted Wehrmacht soldiers as people of African and Asian descent, dressed in Nazi uniforms. The model had been trained with a focus on political correctness, which typically emphasizes racial and gender diversity among characters. However, it failed to recognize that in this particular case, it was not just making an error but committing a tactless blunder—one for which any human artist would have been harshly criticized.
A curious incident from our own practice. We needed a technical text excerpt for exercises in literary editing, which we use in our training sessions for technical writers. We created a brief description of a fictional project management program and asked ChatGPT to write the procedure for creating a new project card. The system generated a generally plausible and well-written text, which we even had to intentionally ‘spoil’ so that our trainees would have something to correct. However, we did find a glaring mistake in the description of the first step of the procedure. “Click the ‘Create’ button or a similar one,” the system wrote. Anyone who has read or, especially, developed user documentation knows that nothing like this would ever appear in a user manual. There are no ‘similar’ buttons in software interfaces. The user needs to press a specific button, and if it is duplicated by a menu item or a keyboard shortcut, any writer would list them explicitly.
We’re not citing these examples of errors by chance—mistakes that a human is unlikely to make unless they are disconnected from society or intentionally joking. The key characteristic of such errors is their unpredictability. You never know when and to what extent an LLM will go off track.
Fortunately, users still have a universal superpower at their disposal to eliminate any errors: reviewing and editing the model's outputs.
Mechanical Bodies
The second direction is the creation of mechanical bodies. To clarify, we haven’t encountered this term in any publications; we are introducing it ourselves. In this category, we include phenomena such as robots from Boston Dynamics, whose demonstrations most people have likely seen—and, admittedly, they look quite impressive. The strength of such devices is, of course, their ability to operate in the physical world across a wide variety of landscapes and situations. They can move, carefully pick up and place objects, navigate obstacles, and even avoid attacks.
At the same time, mechanical bodies lack even imperfect intelligence. If you were to tell a Boston Dynamics robot, “Transfer the jars of cucumbers from the cellar to the trailer,” you couldn’t expect the task to be completed. How are mechanical bodies controlled then?
When it comes to mechanical bodies, tasks are distinguished by two horizons: short-term and long-term.
Short-term tasks involve elementary actions that are built into the mechanical bodies’ capabilities. These are similar to actions we perform on autopilot, without conscious thought. For example: walking downstairs, picking up an object, placing it down, or stepping over a log lying across the road. The number of such tasks can be large, but it is still finite.
Long-term tasks are aimed at achieving a practical result, such as moving a jar of cucumbers from the cellar to the trailer. The number of such tasks is infinite, as the variety of real-world situations is endless. Developers suggest building long-term tasks out of the short-term tasks available to the mechanical body, arranged in a specific sequence. The planning and execution of long-term tasks are carried out by an operator equipped with a remote control or a more advanced device (think of the film “Avatar”).
Embodied Artificial Intelligence
The task that today’s most ambitious tech companies are focused on is creating a robot equipped with artificial intelligence. Solutions of this type have not yet become a reality but have already been given the name ‘embodied AI.’ Systems with embodied AI will be able to handle long-term tasks independently, without the need for micromanagement by an operator. At least, that is the expectation for them.
There are various approaches to creating embodied artificial intelligence. One of them involves using LLMs to control mechanical bodies. This doesn’t mean that the hardware and software running the LLM must fit into a space comparable to the size of the human skull. Such a setup would be required for a robot astronaut or an ocean-floor explorer, but for now it would suffice to connect the robot’s ‘soul’ to its body via a wireless channel.
Formally, integrating a mechanical body with an LLM into a single system doesn’t present any major difficulty. Figuratively speaking, you could connect a hardware voice assistant to a robot vacuum cleaner and have it give your guests a tour of the apartment [1]. However, the range of long-term tasks available to such a system would likely be limited to a few specific cases that you would need to program manually. Additionally, any unforeseen situation would disrupt the predefined scenario.
How can we make a robot’s behavior more meaningful and flexible? Engineers at Microsoft have proposed a simple technique that allows transforming a long-term task, formulated in natural language, into a series of short-term tasks performed by the robot [2]. This approach leverages two capabilities already at our disposal. These capabilities are:
- Automatic scene labeling and image description
- Automatic generation of program code in response to a prompt
These two capabilities serve as the foundation for making the robot’s behavior more structured and adaptable. Solutions for automatic image labeling enable the recognition of objects present in images, the creation of object lists, and the generation of textual descriptions of scenes [3].
Since image labeling results in text, we can incorporate it into the prompt for the LLM. Imagine a person who weaves their impressions of what they see into their train of thought, thinking, “I see a sofa!” or “I see a painting!” and so on. Our robot would behave in a similar way. Advanced labeling systems can not only identify and list the objects in images, but also perform more complex operations, such as determining their coordinates in space or measuring the distance between them and the camera.
Automatic code generation with ChatGPT is now familiar to everyone, from amateurs to professional programmers. Some view it with disdain, while others use it actively and unapologetically, but almost everyone takes advantage of this capability. The system generates code in response to a prompt that contains the problem statement.
Let’s imagine that our robot is equipped with an API that allows it to execute the short-term tasks available to it. For the sake of clarity, let’s assume that this API allows calling functions from a program written in a widely known programming language like Python or JavaScript. ChatGPT is certainly capable of generating code in these languages. The robot would also be equipped with something akin to an operating system or command processor, enabling it to run programs that interact with its API.
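For concreteness, the short-term task API might be exposed to the generated programs as a set of Python functions along the following lines. This is only an illustrative stub: the function names match those used in the generated program shown later, and the signatures are assumptions.

from typing import Tuple

Coords = Tuple[float, float, float]  # x, y, z in the robot's frame of reference (an assumption)

def detect_coords(object_name: str) -> Coords:
    """Locate a named object in the current scene and return its coordinates."""
    ...

def go_to_coords(coords: Coords) -> None:
    """Move to the given coordinates, stepping around obstacles along the way."""
    ...

def take_object(coords: Coords) -> None:
    """Grasp the object located at the given coordinates."""
    ...

def put_object(coords: Coords) -> None:
    """Place the currently held object at the given coordinates."""
    ...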
The robot's API description is available to it (and the LLM it uses) in the form of a concise developer guide.
Finally, let’s assume that the robot’s overall activity is controlled by a continuously running script. The task of this controlling script is to ensure that the robot executes user commands given in natural language.
Now, we have everything we need to set up the following working cycle for our robot.
1. The user gives the robot a command using natural language.
For example, the user instructs the robot:
“Move the mug from the table to a cabinet.”
2. The script retrieves an image of the surrounding scene from the robot’s cameras.
Let’s assume the robot is standing in the middle of a room. In addition to the robot and the user, there is a table, a chair, and a cabinet in the room. A mug is on the table.
3. The script receives a textual description of the scene using the image labeling subsystem.
4. The script generates a prompt for the LLM. The following elements are combined into a single text for the prompt:
- The command received from the user.
- The description of the initial scene.
- The developer guide (API documentation).
- Instructions like the following:
"Write a program that will make the robot carry out the task described above. The robot starts performing the task in the situation described above. The program should be written in Python and should use the functions from the API described above."
5. The script sends the composed prompt to the LLM for processing.
6. The script receives the program that the LLM has generated for the robot. The program might look like this:
table_coords = detect_coords("table")
go_to_coords(table_coords)
mug_coords = detect_coords("mug")
take_object(mug_coords)
# and so on.
7. The script executes the generated program on the robot.
This cycle allows the robot to fulfill user instructions by combining its sensory inputs with the generative capabilities of the LLM.
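Put together, the controlling script could look roughly like the following minimal sketch. Everything here is an assumption for illustration: capture_image, describe_scene, call_llm, and run_on_robot are hypothetical helpers standing in for the cameras, the image-labeling subsystem, the LLM service, and the robot’s command processor, and robot_api_guide.txt is a placeholder name for the developer guide.

def capture_image() -> bytes:
    """Hypothetical: grab a frame from the robot's cameras."""
    ...

def describe_scene(image: bytes) -> str:
    """Hypothetical: return the textual scene description produced by the labeling subsystem."""
    ...

def call_llm(prompt: str) -> str:
    """Hypothetical: send the prompt to the LLM and return the generated program."""
    ...

def run_on_robot(program: str) -> None:
    """Hypothetical: hand the generated Python program to the robot's command processor."""
    ...

with open("robot_api_guide.txt") as f:  # the concise developer guide (placeholder file name)
    API_GUIDE = f.read()

def handle_command(command: str) -> None:
    # Steps 2-3: look at the scene and turn it into text.
    scene = describe_scene(capture_image())

    # Step 4: assemble the prompt from its parts.
    prompt = (
        "Task given by the user:\n" + command + "\n\n"
        "Current scene:\n" + scene + "\n\n"
        "Robot API documentation:\n" + API_GUIDE + "\n\n"
        "Write a program that will make the robot carry out the task described above. "
        "The robot starts performing the task in the situation described above. "
        "The program should be written in Python and should use the functions "
        "from the API described above."
    )

    # Steps 5-7: get a program from the LLM and execute it on the robot.
    program = call_llm(prompt)
    run_on_robot(program)

handle_command("Move the mug from the table to a cabinet.")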
Thus, the LLM generates a program that, when executed, will direct the artificial body to fulfill the given task in the real world. What happens if, in the meantime, a mischievous experimenter moves the mug from the table to the chair or closes the cabinet door is another question. Presumably, such situations could be handled by making repeated requests to the LLM in case of failure. Additionally, the persistent part of the prompt could be designed to encourage the model to generate control programs with branches that account for such changes within reasonable bounds. The key would be to ensure that the system can adjust the control program in real time.
A major advantage of this approach is that everything necessary for its implementation already exists. A system based on this method can be quickly designed and assembled from available components, including pre-trained models that are fully ready for use.
On the other hand, we have neither practical nor philosophical certainty that every situation, even in such a limited physical environment as a household setting, can be accurately described in words with the necessary precision. As a result, a robot’s behavior, based solely on verbalization of what is happening, may often be clumsy and inappropriate—what we might call stiff or awkward.
A more advanced version of the same approach assumes that LLMs are capable of successfully processing not only text in the traditional sense, but also any data that shares similar properties: linearity, division into meaningful blocks, and syntax [4]. After all, no one gives them lectures on vocabulary and grammar, as happens with those of us trying to learn a foreign language. LLMs are trained by processing massive datasets composed of various examples. If that’s the case, we can feed them any ‘languages’ whose data is structured like text.
If the verbal descriptions of scenes that the robot’s control script includes in prompts for the LLM are inherently imprecise, why not replace them with the scenes themselves? To do this, we only need to learn how to translate scenes into a form that allows them to function in the text as an integral part of it. This task is solved using so-called embeddings.
Imagine that we have learned to associate with any scene, or any part of a scene, a specific sequence of numbers, with every sequence having the same fixed length. Mathematicians would say that we have constructed a mapping from the set of images to the set of vectors of this dimension.
Additionally, we managed to define this mapping in such a way that the vector size is always much smaller than the size of the image it corresponds to. Here, by size we mean the number of bytes or characters in their machine representations. The specific measure is not that important. Essentially, images can be quite large, but the vectors derived from them are relatively small.
But that's not all. Imagine we have three images: two of them show roughly the same thing, for example, a pile of items a lady checked in as luggage, and the third shows something entirely different, like a mountain cabin or a drum set. In this case, the distance between the ends of the vectors corresponding to the two similar images will be smaller than the distance from either of them to the end of the third vector.
In short, our transformation converts similar images into nearby points, while dissimilar images are mapped to points far apart from each other. We’re intentionally avoiding the question of what specific distance we mean here. Different metrics are possible, but for now, it’s not critical.
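A tiny numerical illustration of what ‘nearby’ and ‘far apart’ mean here, using cosine distance and made-up three-dimensional vectors (real systems work with hundreds or thousands of components):

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity: small for similar vectors, larger for dissimilar ones."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented vectors: two views of the same pile of luggage, and a drum set.
luggage_view_1 = np.array([0.9, 0.1, 0.30])
luggage_view_2 = np.array([0.8, 0.2, 0.35])
drum_set       = np.array([0.1, 0.9, 0.70])

print(cosine_distance(luggage_view_1, luggage_view_2))  # ~0.01: the views are close
print(cosine_distance(luggage_view_1, drum_set))        # ~0.64: the scenes are far apart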
The vectors that meet the outlined requirements are called embeddings. There is hope that embeddings of scenes and individual objects will be more informative for the LLM than their textual descriptions. An embedding of a carpetbag, for instance, would likely not only represent the carpetbag, as the word ‘carpetbag’ does, but also carry information about its color, size, position in space, and neighboring objects. This gives rise to the idea of including embeddings in natural language text, just as we include words or phrases. Technically, an embedding can be written out as a string of characters, so nothing prevents us from handling it this way. The system generates embeddings at the same step of the command execution process as the textual descriptions. Text descriptions are still necessary, because the user interacts with the robot in natural language. To replace the word ‘carpetbag’ with the carpetbag’s embedding in the command text, the system first needs to identify in the scene image what corresponds to that word.
Prompts created using embeddings might look approximately like this:
Move <emb.carpetbag> from <emb.platform> to <emb.train car>.
For brevity, we use placeholders for embeddings here, enclosed in angle brackets. In reality, these positions would be occupied by the actual embeddings: sequences of numbers, automatically generated and meaningless to most ordinary people.
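Mechanically, substituting embeddings for words in the command text could look like the sketch below. All names here are assumptions: locate and embed_region stand in for a subsystem that finds the image region matching a word and computes its embedding, and the way the vector is serialized into the prompt is invented for illustration.

import numpy as np

def locate(word: str, image: bytes) -> bytes:
    """Hypothetical: crop the region of the scene image that corresponds to the word."""
    ...

def embed_region(region: bytes) -> np.ndarray:
    """Hypothetical: return the embedding vector of an image region."""
    ...

def to_token(vector: np.ndarray) -> str:
    """Serialize the embedding so it can sit inside the prompt text."""
    return "<emb:" + ",".join(f"{x:.3f}" for x in vector) + ">"

def embed_command(command: str, image: bytes, words: list) -> str:
    """Replace each listed word in the command with the embedding of the matching object."""
    for word in words:
        vector = embed_region(locate(word, image))
        command = command.replace(word, to_token(vector))
    return command

# "Move the carpetbag from the platform to the train car." would become a prompt
# in which "carpetbag", "platform", and "train car" are replaced by their embeddings:
# prompt = embed_command(command, scene_image, ["carpetbag", "platform", "train car"])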
In response to a prompt with embeddings, the LLM would still generate a program to control the robot’s actions via the API. Since embeddings are more informative than plain text, one could expect the robot using them to execute commands with greater accuracy and precision. On the other hand, to fully realize this advantage, additional training of the LLM on specially prepared datasets with embeddings would be required. Such development would demand time, funding, and expertise available only to industry giants.
The threshold for creating a system that generates ‘human-like’ prompts for itself seems significantly lower and more affordable to regular companies. It seems logical to assume that this is how the first generation of mass-produced intelligent systems will be structured. How intelligent will they be?
How Do We Communicate With Them?
Will the behavior of such a system be as consistent as that of a conscientious, competent person with common sense? If not, where exactly will the fabric of behavior we consider acceptable be torn? What might be self-evident to any of us but suddenly an impenetrable mystery to the machine? Answering this question means determining the level of trust we can place in a system claiming intelligence, and the extent of additional oversight we will need to implement—otherwise, its use may be impractical or, in certain cases, unsafe.
Will the transition from patchwork-intelligent systems to truly intelligent ones happen in the near future? Technological optimists look forward to this with hope, while alarmists approach it with fears bordering on apocalyptic concerns. At the same time, doubts remain about whether this is even possible given the current state of our fundamental knowledge.
Roger Penrose, in his work “Shadows of the Mind,” mathematically argues that human consciousness contains an element fundamentally irreducible to computations and likely based on as-yet-unknown physical effects [5]. If he is right, then no purely computational system can match the capabilities of the human mind. Modern artificial neural networks, even the most complex and trained on the largest datasets, remain computational systems by nature. Increasing their computational power and loading them with additional datasets will strengthen them many times over, but it will not lead to the next qualitative leap. You can harness all the horses in the world to a cart, but it still won’t make it an aircraft. Of course, you might disagree with Penrose.
Since we don’t yet have reliable tests to confirm the full intelligence of a system, we will refer to them as ‘patchwork-intelligent’ because they are capable of delivering unpleasant surprises: in one situation, the system behaves predictably and correctly, while in another, similar or adjacent one, it makes an absurd mistake.
If the primary components of a patchwork-intelligent system are an LLM and an artificial body, then the measures of additional control over it will stem from the characteristics of these components, which we already know. Not entirely, admittedly, but at least we won’t be starting from scratch.
Since the robot operates in real time, we won’t be able to test the long-term task-solving programs it generates for itself. We’ll need to compensate for this lack of oversight somehow—perhaps by equipping the robot with an extensive library of texts that it can use as persistent parts of prompts. Creating this library might be delegated to technical writers, as they are the best prepared to work with large volumes of text that need not only to be written but also managed, stored, and composed.
In prompts, we will have to explicitly state directives and prohibitions stemming from desires, fears, and social norms that are clear to most people. To achieve this, prompt authors will need to employ a technique known in literary studies as ‘defamiliarization’. In literature, defamiliarization typically involves portraying ordinary situations or feelings as strange (hence the name of the technique), illogical, or exaggerated. The author might force the reader to view them through the eyes of another being, such as an animal or an alien.
The authors of the library of persistent parts for prompts will be forced to explicitly describe what is usually implied in human interactions. We can assume that many of these fragments will describe roles and the social situations that bind them: what is permissible or impermissible around children, in the presence of guests, or when the owners are asleep, and so on. For instance, one prompt might forbid the robot from saving video recordings of improperly dressed people, even for technical purposes. What kind of genre will this be? We will have to develop it. Will such prompts, in the end, resemble commandments or moral teachings like those found in religious texts in terms of composition, structure, and organization into blocks?
It’s likely that in prompts designed for patchwork-intelligent systems, we will use evaluative adjectives like ‘big,’ ‘small,’ ‘close,’ ‘far,’ ‘good,’ and ‘bad’ much less frequently and more cautiously. We may have to explicitly clarify their meanings for each specific case—what constitutes a ‘big’ dog versus a ‘big’ problem—or develop a more descriptive or precise vocabulary.
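What such a clarification might look like inside a persistent prompt fragment, as a purely illustrative sketch with invented thresholds:

# A fragment from a hypothetical library of persistent prompt parts.
SIZE_AND_DISTANCE_RULES = (
    "In this household, treat a dog as 'big' if its shoulder height exceeds 50 cm. "
    "Treat an object as 'close' to the child if it is within arm's reach, roughly 60 cm. "
    "Do not rely on the words 'big' or 'close' without checking these definitions."
)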
Prompt engineers will likely need to thoroughly study the ‘internal world’ of the system for which they are creating instructions. Does the system distinguish between a sideboard and a buffet, or a buffet and a dresser? What is the set of short-term tasks that make up its activity? This knowledge will be essential for the author to make directives and prohibitions as precise as possible, avoiding misinterpretation by the system. For example, to forbid certain actions in the presence of a specific object or under certain circumstances, such as not making noise around a sleeping child, the prompt engineer will need to know how the system internally labels this object, and which specific short-term tasks constitute the undesirable action. Otherwise, the instruction might be too vague for the system to properly follow.
Sources
1. Boston Dynamics. Robots That Can Chat.
2. Microsoft. ChatGPT for Robotics: Design Principles and Model Abilities.
3. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh. VQA: Visual Question Answering.
4. Google. An embodied language model, and also a visual-language generalist.
5. Roger Penrose. Shadows of the Mind: A Search for the Missing Science of Consciousness, 1994.
#AI #trustworthyAI #intelligent_systems