Midjourney has COMPETITION & it’s FREE/Open Source
Recently, the AI image generation space has slowed down a little bit. I mean, there was the release of Stable Diffusion XL, which made a little splash, and then, of course, Midjourney V5. But in terms of AI speed, that was quite some time ago.
I have an AI image generator I want to show you guys today. It is going to be fully open source, which is amazing. I mean, we always love to see that. That’s the reason Stable Diffusion was so successful and was able to evolve so greatly over time because it was fully open source.
This model is incredibly competitive. It’s high resolution, it’s high fidelity, it can spell. I have talked about it on the channel before, but now it is almost about to fully release. This is one of the best, if not the best AI image generator we have seen to date. The fact that it’s going to be open source too means that it’s going to evolve exponentially, just like Stable Diffusion, meaning that this is base level stuff we’re seeing right now. Once people start modifying it because it’s open source, it’s going to get that much better.
I want you guys to think about DreamBooth plus this model throughout the whole video because it’s going to be nuts. Finally, folks, DeepFloyd’s IF model is going live. The code is already live on GitHub, which is really exciting. But in the coming days, we can expect the model weights to be released.
Hey guys, Matt here. After I already recorded this entire video you’re watching now, I have to butt in here with a quick side note. Guys, the weights for DeepFloyd IF were released just minutes after I recorded this video, so yeah, this thing is now fully open source for anyone to mess around with. The weights are here, everything is here. DeepFloyd IF is here, super exciting. I’m going to link some extra things down below that I don’t mention in this video because some more stuff came out after I finished recording. Some more blog posts, inferences, and tools for it, and research as well. So this is a fully open source, Stable Diffusion level of access for everybody. And here are some examples right off the bat that they provide for us.
Spoiler alert, I have gotten access to beta versions of this model in the past, and it is incredible. But yeah, this is supposed to be Dark Side of the Moon, Pink Floyd, but like papier-mâché. Really cool imagery here that it produces. By the way, their claims here for benchmarks are that it beats Imagen, beats DALL-E 2, beats Parti, beats eDiff-I, beats Stable Diffusion, all on benchmarks, essentially putting it on top. I don’t know about Midjourney though. They don’t say that they beat Midjourney, but I know this thing can spell and Midjourney cannot spell, so it’s very interesting in that regard.
Here is the GitHub page, by the way. This is directly linked to Stability AI. I should mention that it’s still by DeepFloyd’s lab, though, and again, the model is known as IF. They have a Discord server that I’ll link down below if you want to look more into this. But here are the examples we get so far, and obviously all of these are nuts. Right off the bat, I don’t see a single bad image. First of all, perfect spelling here, this is not DALL-E. We’ve got all of these, I don’t know, what are these, muskrats technically? With all the very similar but different-colored sweaters. Again, some really good spelling here. I mean, look at how many words are in this here. “Strike, dear mistress, and cure his heart,” that is directly spelled out in the prompt, I’m sure, and it’s right next to a perfect image of a boot. So, this thing is next-level technology. This is a very complex one as well. This is a drawing of someone drawing a drawing, which is very hard to describe for an AI image generator, yet it produced it, as you can see.
This thing also can do multiple aspect ratios. “What if it is more than text printed on a sign?” here that it was able to generate. But yeah, you can see there are just a lot of really great examples of this thing producing really good AI imagery, it’s nuts.
Here is how they describe it: a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. It is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules: a base module that generates 64 by 64 pixel images based on the text, and two super-resolution models, each designed to generate images of increasing resolution, 256 by 256 and 1024 by 1024. All stages of the model utilize a frozen text encoder based on the T5 Transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
Absolutely, and if that text didn’t make any sense to you, here is a visual representation of it. As you can see, the prompt put in is “a photo of a violet baseball cap with yellow text ‘Deep Floyd is better than text.’” It generates a 64 by 64 image, upscales it to 256 by 256, then to 1024 by 1024. And you can see it’s actually quite simple how it all works. Frozen T5 XXL, the different IF models, then the IF upscalers. Simple process, yet quite effective.
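To make that three-stage flow a bit more concrete, here is a minimal Python sketch of just the data flow: one text encoding, reused by a base stage and two upscaling stages. It is only an illustration under stated assumptions. The names `Stage`, `CASCADE`, `encode_text`, `run_stage`, and `run_cascade` are hypothetical and are not part of the DeepFloyd code, and no actual diffusion happens here. Only the 64, 256, and 1024 pixel resolutions come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                 # illustrative label, not an official model name
    in_size: Optional[int]    # None: the base stage starts from the text alone
    out_size: int             # side length of the square image it produces


# Resolutions are the ones quoted in the video: 64 -> 256 -> 1024.
CASCADE: List[Stage] = [
    Stage("base", None, 64),
    Stage("super_resolution_1", 64, 256),
    Stage("super_resolution_2", 256, 1024),
]


def encode_text(prompt: str) -> Tuple[int, ...]:
    """Hypothetical stand-in for the frozen T5 text encoder."""
    return tuple(ord(ch) for ch in prompt)


def run_stage(stage: Stage, embedding: Tuple[int, ...],
              current_size: Optional[int]) -> int:
    """Check that the stage receives the size it expects, then 'produce' its output size."""
    if stage.in_size != current_size:
        raise ValueError(f"{stage.name} expected {stage.in_size}, got {current_size}")
    # A real stage would run a diffusion UNet conditioned on `embedding` here.
    return stage.out_size


def run_cascade(prompt: str) -> List[Tuple[str, int, int]]:
    """Return (stage name, width, height) for each stage, in order."""
    embedding = encode_text(prompt)  # computed once, shared by every stage
    size: Optional[int] = None
    trace = []
    for stage in CASCADE:
        size = run_stage(stage, embedding, size)
        trace.append((stage.name, size, size))
    return trace


if __name__ == "__main__":
    for name, width, height in run_cascade("a photo of a violet baseball cap"):
        print(f"{name}: {width}x{height}")
```

The point of the sketch is the shape of the pipeline: the text is encoded once, and each upscaler takes the previous stage's output and multiplies the side length by four.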
Things get even better because this thing has in-painting as well. Check this out, adding a hat onto his head. Pretty darn good in-painting. I mean, you never would have known that it was in-painting in the first place, which is the kind of in-painting you want. You can see the different parameter sizes for all of the different models as well.
The research paper is coming out soon, but yeah, this is a really nicely set up little GitHub page here and a nice release for DeepFloyd IF. They’ve been working on this for a long time, and it’s a really, really great model. Let’s take a look at some examples.
There are many, many places where DeepFloyd IF results have been posted, so we’re going to start on their Instagram. We’ve got “Deep Floyd IF” written out on a sign. I mean, the photo of the sign looks very realistic too. You can tell it’s like a rusty sign. You can tell this is paint. The text actually fits really well with the rest of the sign, and the background is blurry. It looks almost like a real photo.
Again, we have another one, Deep Floyd Street. Same thing going, we have a nice, really perfect blurred background. You can even see some grain because this is like a low light shot, so it’s actually been able to pick up on the fact that there should be a little bit of grain in the photo itself. But yeah, we’ve got this rusty, gross-looking pole and this rusty street sign, and the text, it all fits together perfectly. It looks like a photo, and it spelled it all perfectly with text that isn’t like gross-looking either, it looks good.
And you can see more photorealism here. It’s able to produce this beautiful swan image where it’s like this rainbow iridescent swan. That swan’s head is a little screwed up and scary, I would say. But the background looks really nice, and the rest of this swan and the water all look pretty fantastic.
“Delicious” is spelled out in noodles, which is pretty funny, but it did a pretty good job making the lettering look like noodles, and it’s in cursive, by the way, which is pretty crazy. And the text for this was literally “an alphabet soup, ramen with the word ‘delicious’ written with noodles.” That was literally the only thing for the prompt, not that advanced.
This was their classic test prompt, “a 4K DSLR photo of a rainbow owl with deer horns in the woods.” It’s a pretty high-res photo of a rainbow owl with the horns, and it’s in the woods, and it’s definitely like a nice 4K high-res photo. This was pretty funny. This was “Make AI Open Again” written on a hat. It actually looks like it’s stitched onto the hat, though, if you guys will notice, like the stitching and everything. It looks like it almost was threaded on there, so it’s really good at picking up those fine details. It’s a really phenomenal model. I mean, you’ll never ever see Midjourney produce anything that looks like this. Midjourney cannot spell.
A near flawless photo of a burger. This was literally just “delicious freshly cooked burger with extra bacon and melty cheese between a fluffy brioche bun.” You can see the cheese is very melty, the bacon looks good. It all looks pretty appetizing. “Make Floyd Deep Again” spray-painted on the wall of a train station or a subway. I actually produced this next one during the beta, which was pretty funny, as you can see. It’s by MattVidPro, which is really funny to see, and this was “a 4K photograph of a cat dressed as Walter White from Breaking Bad by MattVidPro.” And guys, look how good this came out. You’ve got the cat perfectly in the center, he’s got the Walter White hat on, he’s dressed in the Walter White-like chemistry protective garment, and he’s got the Breaking Bad logo right on his shirt, which is pretty cool. So this was a pretty incredible generation, I think it came out fantastic.
More food, which is very easy for these models to do but looks very appetizing. Capybara podcast, we literally have the glowing letters right here with a capybara just standing there, pretty awesome. “Open Source Me” written on the wall of a bar, this one was really cool as well. This is literally a tennis ball that is turning into a little bird, and the prompt to get this was “cute yellow canary bird head with tennis ball body,” and as you can see, they actually use natural language in this prompt, saying, “Wow, that’s detailed, hyper-realistic, ultra-fine details.”
A nearly perfect image of a frog; its toes are just a little bit messed up. And this one was really cool, saying “really soon”: letters made of clouds that say “really soon” above a beautiful ocean. So yeah, you can see it actually looks like someone drew it out or spelled it out with their fingers, but it was able to generate this based off of text alone. So yes, viewers, here is the DeepFloyd AI server. I’ll go ahead and link this server down below. Yeah, you can see there are a lot of really great generations in the Cherry Picks section. I like the fact that they’re saying this is cherry-picked.
This one was pretty cool. This was a Game Boy made of cheese. That was literally it. It literally says “Game Boy” right on the actual Game Boy itself. It looks just like a Game Boy too, by the way, like the whole Game Boy is very accurate, no mismatching buttons or anything weird like that, and it’s made of cheese, like you would expect.
Photo of a Welsh Corgi wearing a tuxedo, and you can see that’s exactly what we get out of this generation. This thing is able to handle some really, really difficult prompts like this. Vader stuff is not easy. This is “ultra-realistic, beautiful, epic DSLR photo of Darth Vader riding a rainbow unicorn, hyper-realistic.” Like, that’s a crazy prompt, and here we go, we’ve got Darth Vader riding a rainbow unicorn. Let’s see, you know what, let’s just see what Midjourney does. We’ll put Midjourney V5 up against it for this image right here. Here’s what Midjourney is giving us. It’s alright, not as good, I think, as DeepFloyd necessarily. Maybe some tough competition there, I don’t know, it’s somewhat of a close call almost. But you can see how DeepFloyd is going to be extremely competitive. This is pretty crazy as well. We’ve got two people, one with the Google shirt and one with the Microsoft shirt. Like, it has to get Google right and Microsoft right, that’s not easy, and it did it again over here as well, and this Google logo is actually correct too, and so is the Microsoft logo. Like, isn’t that actually insane? I’ve never seen an image model that could ever do that, ever.
We’ve got Pink Freud with a picture of Sigmund Freud, which is really, really cool. I mean, the stuff we’re going to be able to make with this model is going to be off the charts, and it’s going to be open source, so it’s going to get better as well. I mean, check this out as well. Literally a picture of a cat saying “Let’s do this” on the side, so cool-looking. God listens to Frog, very specific text and very specific images like this purple cat and everything. I mean, you guys that are really good prompt crafters are going to have a field day with this AI. The shadows of prehistory, it’s like an actual movie poster.
A portrait of Kurt Cobain saying “I hate AI,” that’s pretty funny. This is kind of funny as well, “No, I don’t have a gun,” and then there’s a gun on the other side of the heart, like “You do.” Just crazy specific stuff with this AI image generator. You can’t do this kind of stuff with Midjourney, you just can’t. Really cool picture of some mushrooms as well, so it’s not just text stuff, it’s a very realistic image generator. It’s like DALL-E on steroids or something. I mean, with some of these images, I don’t know if I could ever see another image generator being able to produce these. Stability AI here, I mean, it’s just insane what it can do.
Sloth holding a sign that says “Mondays, am I right?” Like, it’s ridiculous what this thing is capable of. The torch logos as well; of course, it can do fantastic logos. It’s just so specific, it’s so, so specific. I think this takes it beyond what is possible with Midjourney, for sure. Midjourney might beat it in some very specific ways, maybe, but man, the ability to spell this well is just insane. Like, look at the jokes people are already able to create with this model, it’s ridiculous. “Product Design” here: the Batman thermos, like, a chicken nugget instead of the CPU, a chicken nugget instead of a diamond ring. Like, it actually produces it really, really well.
Here’s an example of an image where I would not be able to tell that it was generated by AI at all, to be honest. I mean, there’s not a lot going on, but it looks real. “Will prompt for food.” Really, really specific stuff gets generated by this thing that I’ve never seen another AI image generator do like this. This thing really pushes it to the max, and I’m really excited to see what people do with this.
So here are some of the fails, as you can see. It does fail to spell quite a lot, honestly. Like, look at all these different failed spellings. I have had it spell perfectly on the first try when I’ve used it in the past, but I’ve also had it mess up quite a lot, so it’s not perfect all of the time. But it’s going to be open source, which means they’re going to find out how to make this thing cheap and even better with fine-tuned models. So like, yeah, I’m excited for DeepFloyd IF, I don’t know about you guys. It’s literally going to be one of the best AI image models to ever hit the scene. This honestly is almost just an opinion. If you really want the spelling and the control over what you’re able to do with it, this might be the best model for you. And maybe it’s not as cinematic sometimes as Midjourney, or as polished, but it might be better in quite a lot of cases. We’ll have to do another video directly comparing this to Midjourney V5, but it is going to be insane when this fully releases with the weights.
Tell me what you guys think down in the comments below, viewers. Are you excited for DeepFloyd IF? I’m ready for the weights to release. I’m ready for everyone to get their hands on this and start creating amazing things. If you are able to generate anything with DeepFloyd IF, share it with me in my Discord. We’re almost at 10,000 members on the Discord server, which is awesome. But thanks so much for watching. We’ll do more comparison and more actual testing with DeepFloyd IF in a future video.
I want to compare it directly with Midjourney V5 and really see how this thing stacks up. But it’s a fantastic model. See you guys in the next one and thanks for watching.