MMRole Brings Personal AI-Asssitants to Life

MMRole Brings Personal AI-Asssitants to Life

In the large digital pile on my desktop called scientific research, I came across a paper from the folks at Renmin University of China. They have developed this wild new concept:

Role-Playing Agents powered by artificial intelligence.

But not just any AI — This is all about multimodal role-playing agents (mmRPAs).

Multimojowha? Don't worry, I'll come to that later on.

This time, I am not talking about AI-Assistants for a change (Agentic AI), nor AI-Native businesses. No, this is about a particular development of the world of Personal AI, which I wrote about in a previous article called: "We need to develop a vision for Machine Customers".


The place of mmRPA in the grand scheme of things



Before we start!

If you like this topic and you want to support me:

  1. Share the article, or comment; that will really help my articles to reach more people 🙌
  2. Or Connect with me on Linkedin 🙏
  3. Subscribe to TechTonic Shifts to get your daily dose of tech 📰


These MRPA characters don't just behave like your average AI.

They can actually see and react to images like a real person would. You literally take your Personal AI-assistant, like your Character.AI or Replika, out of a boring text conversation and drop them into a vibrant, visual world where they can analyze and interact with images, sound, etc..

I can almost see your frown at this...

Allow me to explain it with two examples.


Example 1: a virtual shopping assistant....

Say, you are browsing an online clothing store. And instead of just reading text descriptions of outfits, you snap a picture of an item you like or a style you're inspired by. Your AI shopping assistant analyzes the image and responds visually and textually.

You: Snaps a picture of a celebrity outfit at a red carpet event

Personal AI Assistant: "Nice choice! That jacket looks a lot like this one from our new collection. Pair it with these trousers for a similar look. Here's how it would look together."

The assistant then displays a mock-up of the outfit on a model, suggesting matching accessories.

This assistant doesn’t just offer generic suggestions; it tailors its responses based on the visual context that you provide. That would make the shopping experience more interactive and personalized.


Example 2: a virtual Travel Companion

And now you are planning a vacation and want to explore potential destinations. Instead of typing “*Tell me about Paris*”, you just upload a photo of a famous Parisian landmark, like the Eiffel Tower.

You: Uploads a photo of the Eiffel Tower

Personal AI Assistant: "Aha, ze Eiffel Toweeeerrrrr! (*poor imitation of Frenglish*) Did you know it was initially met with criticism by many Parisians when it was first built? Here’s what the area looks like now, and here are some nearby cafes where you can grab a classic croissant."

The assistant then shows a current street view of the Eiffel Tower’s surroundings, complete with recommendations for nearby attractions, restaurants, and even a simulated walking tour.

Here, the AI doesn’t just rattle off facts but interacts with the image you’ve provided, offering a visually rich, informative experience that makes your travel planning more engaging and immersive.

These two scenarios are I think a good example of how we are going to move from text-based to multimodal interactions which can make AI feel more intuitive and lifelike. It will improve the user experience by integrating visual context directly into the conversation, AND it will become portable.

How cool is that?

At least, I think that it's cool - yet I know that some of you don't like this development one bit.


Intermezzo:

How cool would it be to be able to walk around with your Personal AI-Assistant, wherever you go. The Assistant listens to you, can interpret commands, sees what you are looking at and can augment that with information or graphics, and even knows where you are to give you directions for instance. You could carry your personal AI Assistant with you either on your phone, the countless AI gadgets that have hit the market, or the AI / VR glasses from Rayban | Meta, the G1 or the RayNeo X2.

I happen to know a company that will exactly allow you to do just that.

I don't know the inside outs of their proposition, because they are very hush about it, but given the sheer number of patents, and the brilliant team they have gathered, this must be a game-changer. However, they are still in Beta, and will probably launch somewhere in October.

I'll definitely hope to be one of the first to try it out.

Meanwhile my RayNeo X2 glasses will have arrived by then, so do count me in !


The MMRole framework

The people behind this idea are the researchers at Renmin University. They have created this framework called MMRole, and it's not just another run-of-the-mill AI toolkit.

MMRole is the solution for creating these super-advanced MRPAs.


Don't be scared, just close your eyes and scroll on!


And here's the best part:

It's not just about making personal agents talk better; it's about making them see and feel better too.

So, why is MMRole such a big deal ?

And you are right to question this because Role-Playing Agents have been around for a while, but they've been quite limited.

If you are a gamer like me (with or without seizures), you are familiar with Non Player Characters or NPC's. Role-Playing Agents could function similarly to NPCs in a game. Because just like NPCs, RPAs are designed to simulate specific characters and interact with players. But the main difference is that RPAs, especially with multimodal capabilities, could take NPCs to the next level because they have more dynamic, and lifelike interactions that go way beyond pre-scripted responses.

So in practice they would be able to respond to both text and visual cues in a way that is more immersive and personalized.

The MMRole framework is really pushing boundaries by giving these RPAs multimodal abilities. Now, they can not only understand and generate text but also analyze and respond to images, and even add personal traits to those characters.


It may seem like a minor update, but given the possible uses in the near future, the implications of this are huge.


The MMRole isn't a fancy gimmick; the framework has two main components:

1. MMRole-Data: This is a massive dataset packed with 85 characters, over 11,000 images, and more than 14,000 dialogues. And these aren't just random conversations; they're crafted to reflect each character's personality and how they'd react to different visual cues. Think of it like this....when Harry Potter sees Hermione at the Yule Ball, he's not just blurting out generic lines; he's reliving that awkward teenage moment with Ron by his side.

2. MMRole-Eval: This is the core of the framework. Evaluating how well these agents perform isn't just about checking if they can string together a coherent sentence. The evaluation model uses eight different metrics. The metrics range from basic conversational skills to more complex things like personality and tone consistency. So, for instance, does Iron Man still sound like the snarky Tony Stark when he is commenting on a photo of the Avengers? Does Hermione still have that characteristic earnestness when she's discussing something from the Wizarding World?

To demonstrate the power of their framework, the same researchers have built a MMRole-Agent. That is a specialized MRPA that's like exactly the Personal AI-Assistant I used as an examples.

They have trained it using the MMRole-Data, and guess what? This agent didn't just hold its own—it outshined many existing models, especially when it comes to staying in character while interacting with visual content.

They even put MMRole-Agent up against some heavyweight (pun intended) like Claude 3 Opus and LLaVA-NeXT-34B (LLaVA = Large Language and Vision Assistant on GitHub). While Claude 3 Opus knocked it out of the park in the heavyweight category (*over 100 billion parameters*!), MMRole-Agent was the champ in its own weight class. It just goes to show that with the right training data and fine-tuning, you can create an AI that's a real contender.

Of course, MMRole isn't perfect. But to me this is a Proof of Concept that Personal AI-Assistants will be the future.

The complexity that they are facing now, is that these Agents need to maintain a consistent personality and tone across complex multimodal scenarios. Think of it like trying to keep an actor in character while they improvise scenes based on random images thrown at them.

That is the difficulty the practical applications of it face.

But despite the challenges, the MMRole framework is a huge leap forward in creating more immersive and realistic AI-driven characters. The researchers are working hard to improve how these agents handle multimodal inputs. They need to make sure they stay true to their personalities no matter what scenario they're in. The future of AI role-playing isn't just about better conversations; it's about creating characters that feel as real as the people they're based on.

If you want to geek out on all the juicy details, check out their paper, or fool around with their Agent.

Signing off - Marco


Well, that's a wrap for today. Tomorrow, I'll have a fresh episode of TechTonic Shifts for you. If you enjoy my writing and want to support my work, feel free to buy me a coffee ♨️

Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn appreciates your likes by making my articles available to more readers.


Top-rated articles:



Nick Richtsmeier

Creator of the Trust-Made Growth™️ Index | Giver of Damns

4mo

Absolutely not.

To view or add a comment, sign in

More articles by Marco van Hurne

Insights from the community

Others also viewed

Explore topics