Generative AI - a more colorful take

It’s been 7 months since I started spending weekends and late nights in this space of generative AI, thanks to the release of Midjourney, ChatGPT and Stable Diffusion. What I’ve learned from this journey is what I’m sharing today, and hopefully more in the future. I’ll try to keep it in simple English, and if anything is factually incorrect, please forgive me.

The Big Step Changes (at least to me)

The three big breakthrough moments in this AI space in recent history:

  1. “Attention is all you need”
  2. GPT-3 Moment - [Large] Size matters
  3. Stable Diffusion - [Small] Size matters


“Attention is all you need”

A language model is a probability distribution over sequences of tokens, symbols or words in a language. Its main purpose is to predict, with as high a probability as possible, the correct next word, based on the trained model and the prompted inputs.

For example, a good language model of English will be able to understand and generate ‘correct’ responses with high probability.
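To make the “probability distribution over the next token” idea concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public GPT-2 checkpoint (illustrative only, not anything we use internally), that prints the most likely next words for a prompt:

```python
# A minimal sketch of next-token prediction: the model outputs a probability
# distribution over its whole vocabulary, and we look at the top candidates.
# Assumes the Hugging Face `transformers` library and the public "gpt2" weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, sequence, vocab)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>10}  p={prob:.3f}")
```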

The paper “Attention is all you need” in 2017 was a big breakthrough due to 2 main innovations:

  • It proposes the transformer architecture with multi-headed self-attention, which allows training to be parallelized and thus harnesses the power of multiple GPUs
  • Multi-headed attention allows the neural network to learn multiple ways to capture relationships between words and to choose which parts it pays more ‘attention’ to (see the sketch after this list)
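As an illustration only, here is a minimal sketch of scaled dot-product attention, the building block that multi-head attention runs several times in parallel, written in plain PyTorch with toy dimensions chosen just for the example:

```python
# A toy sketch of scaled dot-product attention (the core of self-attention).
# Multi-head attention runs several of these in parallel on different learned
# projections of the same input, then concatenates the results.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    # How much each token should "attend" to every other token.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy example: a sequence of 4 tokens, each with an 8-dimensional representation.
x = torch.randn(1, 4, 8)
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
print(out.shape, attn.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```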

Transformers give us sequence-to-sequence (seq2seq) models, which are comprised of an encoder and a decoder. The encoder takes in a sequence of input data and creates a fixed-length representation of that input known as a context vector. The decoder then uses that context vector to generate a sequence of output data. We can use different models either as “encoders” or “decoders”, or combine multiple of them to solve specific tasks.
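As a rough sketch only (plain PyTorch, with made-up toy dimensions), this is the encoder-decoder shape being described: the encoder turns the input sequence into a representation (“memory”), and the decoder generates the output sequence while attending to it:

```python
# A minimal encoder-decoder (seq2seq) sketch using PyTorch's built-in
# transformer layers. Dimensions are toy values picked for the example.
import torch
import torch.nn as nn

d_model, n_heads = 32, 4

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=2,
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=2,
)

src = torch.randn(1, 10, d_model)   # input sequence (e.g. embedded source tokens)
tgt = torch.randn(1, 7, d_model)    # output sequence generated so far

memory = encoder(src)               # the encoder's representation of the input
out = decoder(tgt, memory)          # the decoder attends to that memory at every step
print(out.shape)                    # torch.Size([1, 7, 32])
```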

Seq2seq models have been used successfully in a variety of applications, e.g. BERT, ChatGPT, DALL-E and Stable Diffusion!


[Large] Size matters

GPT-2 was interesting, but couldn’t make an impact. GPT-3 in 2020, with its 175 billion trained parameters, was the watershed moment and the breakthrough in LLMs that we needed. And 2 years later, ChatGPT, an interface overlaid on top with a rich-affordance UX, coupled with fine-tuning via reinforcement learning from human feedback (RLHF), gave us the amazing product we have all come to know and use today.

In LLMs, large SIZE indeed does matter. What awaits us now is the exciting arms race to deliver extremely large models trained with trillions of parameters, coupled with further RLHF. We will all have Microsoft, Facebook, Google, OpenAI, and Moore’s Law driving AI compute power and energy efficiency to thank for it!

For context, GPT-2 XL has 1.5B parameters, GPT-3 has 175B parameters, and GPT-4 is rumored to have around 100,000B parameters (its actual size has not been disclosed), a roughly 600x increase in size.

Internally, we are already using ChatGPT to help us significantly in the narrative and creative writing process: exploring multiple options and iterating very quickly on writing style, plot changes and more. We then feed the results into our own Stable Diffusion workflow to convert our written words into visual language to communicate our ideas internally.

These images are created by us using a combination of sketches, AI tools and prompt engineering.


[Small] Size matters

The next significant step change came with the release of Stable Diffusion in 2022. Being a small model, it allowed people to see that:

  • we don’t need to own a huge model to benefit, thanks to its open-source nature
  • we can run the model on a home computer or mobile device (the SD 1.5 checkpoint is about 5 GB in size). We can experiment, and build things around and on top of it. We can train it using accessible hardware (see the sketch after this list)
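As a rough sketch of what “running it on a home computer” looks like in practice, assuming the Hugging Face diffusers library and the public runwayml/stable-diffusion-v1-5 weights (illustrative only, not our internal workflow):

```python
# A minimal sketch of running Stable Diffusion 1.5 locally with `diffusers`
# on a consumer GPU. Model name and prompt are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,   # half precision keeps VRAM usage modest
)
pipe = pipe.to("cuda")

image = pipe(
    "concept art of a sci-fi character, clean line work, studio lighting",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

image.save("concept.png")
```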

And within the span of less than 1 year, we saw:

Different diffusion models trained on different visual datasets (from realism to anime)

Open-source communities releasing hundreds of add-ons to SD WebUIs (namely Automatic1111) that allow us to:

  • control the output in a much better way than before using ControlNet, Inpaint & Outpaint (see the sketch after this list)
  • train cheap and effective LoRA models to combine with base models to further control outputs
  • and so many other amazing plugins being released on a weekly basis
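To give a feel for what that control looks like in code, here is a hedged sketch using the diffusers ControlNet pipeline with the public Canny edge model; the model names and input image are assumptions for the example, and the WebUI add-ons expose the same idea through a UI:

```python
# A minimal sketch of ControlNet (Canny) guidance with `diffusers`: an edge map
# extracted from a reference image constrains the composition of the generated
# image. Model names and file paths are illustrative only.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Turn a reference sketch/photo into a Canny edge map for ControlNet to follow.
reference = np.array(Image.open("reference_pose.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
edges = np.stack([edges] * 3, axis=-1)          # single channel -> RGB
control_image = Image.fromarray(edges)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "game character concept, dynamic pose, painterly style",
    image=control_image,
    num_inference_steps=30,
).images[0]
image.save("controlled_output.png")
```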

The ability to have control over outputs is exactly what got me super excited about Stable Diffusion’s potential use cases in our industry.

Here are some examples of the outputs I’m able to achieve for Sipher at Ather Labs using a combination of Stable Diffusion, ControlNet (OpenPose, Canny & Depth), Inpainting and LoRA model training.

Training our own internal model based on our own character design.
Applying this workflow in our process allows us to speed up our art creative process.

If you are interested in learning more about our efforts in making use of this amazing technology, please say hello!



What’s next?

This is an exciting space, especially now that we have efficient LLMs that can make meaningful connections between embeddings in vector space.
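As a small illustrative sketch of what I mean by embedding connections, assuming the sentence-transformers library and its public all-MiniLM-L6-v2 model (an assumption for the example): semantically related text lands close together in the vector space, which is what powers semantic search and retrieval on top of LLMs.

```python
# A minimal sketch of text embeddings: semantically similar sentences end up
# close together in vector space. Model name is an assumption for the example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A knight defends the castle gates.",
    "A warrior guards the fortress entrance.",
    "The quarterly report is due on Friday.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: the two fantasy sentences should score highest together.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # low
```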

Size matters here as well: smaller, more nimble indie studios like us will be able to embrace these types of technologies and fold them into our workflow much faster, with the goal of freeing up our teammates to focus on the higher-creativity work that AI cannot perform.

My future posts will be about what we at Ather Labs are using this technology for, especially in the creative, game pre-production, production & game marketing context!
