Let's say you have a text and you wish to enhance it with illustrations, or even transform it into a short animated movie. Hiring illustrators and voice actors can be both costly and time-consuming, and their individual styles might not align with your vision. Fortunately, with Generative AI you can swiftly generate and experiment with illustrations and text-to-speech narration.

In this guide, I will take you through the process of creating visual novels. We will delve into how to break down unstructured text into scenes, how to incorporate specific actors into a scene, and so on.

Let's commence our storytelling journey.

Split text into scenes

First, we must divide the unstructured text into scenes. This gives each scene a similar structure, making it possible to customize and visualize each one individually. To accomplish this, we will employ OpenAI text models; GPT-3 should suffice. The most challenging part is supplying an effective prompt. For instance, we could use the following:

Please split the following story into three scenes while maintaining
its original meaning and intent as closely as possible, word for word.
Before each scene, put in square brackets a detailed description,
taken from the scene text, of the place where the scene takes place,
for example: [environment: room in the hotel, red carpet, white walls, raining outside].
Also before each scene, put in square brackets the main character
of the scene, taken from the scene text,
for example: [main-character: woman]

The generated result can then be parsed using regular expressions.

[environment: hillside with sheep grazing][main-character: woman]
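Here is a minimal parsing sketch in Python (the regular expression and the parse_scenes helper are assumptions, not part of the original pipeline; the tag names match the prompt above):

import re

# Matches tags such as [environment: ...] and [main-character: ...]
TAG_PATTERN = re.compile(r"\[(environment|main-character):\s*([^\]]+)\]")

def parse_scenes(generated_text):
    """Split the model output into scenes with their bracketed metadata."""
    scenes = []
    # Each scene starts at its [environment: ...] tag
    for chunk in re.split(r"(?=\[environment:)", generated_text):
        chunk = chunk.strip()
        if not chunk:
            continue
        tags = dict(TAG_PATTERN.findall(chunk))
        scene_text = TAG_PATTERN.sub("", chunk).strip()
        scenes.append({**tags, "sceneText": scene_text})
    return scenes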

Alternatively, we can provide a prompt to generate a valid JSON response:

Please split the text into 3 scenes and provide output in valid JSON format
{
	"scenes": [
		{ "environment": "", "character": "", "emotion": "", "sceneText": "" },
		{ "environment": "", "character": "", "emotion": "", "sceneText": "" },
		{ "environment": "", "character": "", "emotion": "", "sceneText": "" }
	]
}
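As a sketch, the request might look like this with the OpenAI Python SDK (the model name and the split_into_scenes helper are assumptions; any sufficiently capable text model should work):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Please split the text into 3 scenes and provide output in valid JSON format\n"
    '{ "scenes": [ { "environment": "", "character": "", "emotion": "", "sceneText": "" } ] }'
)

def split_into_scenes(story):
    """Ask the model to split a story into structured scenes."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT + "\n\nText:\n" + story}],
    )
    # The model returns the JSON as plain text; parse it into a dict
    return json.loads(response.choices[0].message.content)

Parsing can still fail if the model wraps the JSON in extra text, so it is worth validating the result and retrying on errors.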

Once we have structured the scenes, we are prepared to visualize them.

Image Generation

After we've structured the scenes, we need to establish a mechanism for image generation. The primary requirements for the image generation model include:

- keeping the same characters and visual style across all scenes;
- supporting fine-tuning on our own images;
- being open source, so it can be customized and self-hosted.

In light of the requirements stated above, an ideal choice is a custom open-source Stable Diffusion model with Dreambooth fine-tuning.

Keep the same character and visual style

To maintain consistent characters in the Visual Novel, we should fine-tune the Stable Diffusion model. Dreambooth can assist in this process by enabling you to train a model using several images. You can experiment in Google Colab until you achieve satisfactory results. Please note that a unique model will need to be created for each character.
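Once a character model is trained, generating a scene image takes only a few lines. Here is a minimal inference sketch using the Hugging Face diffusers library (the local model path and the "sks" identifier token are assumptions; use whatever path and rare token you picked during Dreambooth training):

import torch
from diffusers import StableDiffusionPipeline

# Load the Dreambooth fine-tuned model for one character
# (hypothetical local path; remember, one model per character)
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-models/woman",
    torch_dtype=torch.float16,
).to("cuda")

# "sks" is the rare identifier token the character was bound to during training;
# the rest of the prompt comes from the scene's environment description
prompt = "photo of sks woman, hillside with sheep grazing, detailed, soft lighting"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("scene_1.png")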

There are alternative approaches to fine-tuning your model. However, we have tested Dreambooth, and it has proven to be quite effective.

After we have created a set of characters, it is time to place them in a scene.