Gemini In, Gemini Out

This year’s Google I/O event was a strange affair. There was an unhinged DJ who yelled “NO ONE WROTE THIS! GOOGLE WROTE THIS!” while he sort of (?) demoed generative music that he was looping.

Sundar Pichai came out a few minutes later and, with the vitality of a mannequin, announced that this was “The Gemini Era” and talked about how much progress Google has made with Gemini since the last Google I/O.

Keep in mind that Bard was first made available to everyone at last year’s Google I/O. Then, this February, Google renamed Bard to Gemini. This year they announced an improved version of Gemini 1.5 Pro (a.k.a. Gemini Advanced, for some reason?) without changing the version number, as well as Gemini 1.5 Flash, a lighter model, and Gemini Nano, which will now be embedded in Chrome, not just Android phones. Not to be confused with AI Overviews for Google Search, which can be turned on with the Google Labs flask icon.

The only name Google has left untouched is DeepMind, which is perhaps the most sinister-sounding name possible for LLM and general AI research (Project Astra).

That doesn’t mean that all of this is in any way sinister, but a lot of it seemed misguided. A lot of it is also very confusing, since there are many Geminis, and they’re going to appear in a variety of places.

There are some demos that everyone in Google’s C-Suite is wild for, regardless of the specific product:

Summarizing. Every executive wanted a summary of everything. One summarized an email chain between herself, her husband, and a prospective roofer. The summary said that there was a quote for the work and when the work could start, but it didn’t include the quote itself. Only when she asked a follow-up question comparing the quotes did she see the price. Another exec didn’t have time to watch a 3-minute video on pickleball rules. Wild that these were selected as demos.
Meal planning. We saw two sets of meal planning examples in the presentation. They showed how you could load up a prompt (a question) with terms and get back breakfast, lunch, and dinner recipes. Individual UI elements let you override a particular item, so you weren’t locked in, but these weren’t really any different from the recipes you’d get from a Google search today. It wasn’t writing a recipe, showing the recipe, doing measurement calculations, or generating a shopping list. These are links to all the recipe sites that are laden with shady ad-tech cruft and SEO keyword stuffing to try and get into Google search results. I wasn’t as wowed as these busy professionals.

These are dreadful things to watch, and are not really as impressive as executives seem to think that they are. I hope that Apple doesn’t fall into this trap at WWDC.

There was only one travel planning demo, so I didn’t include it above, but it was a lengthy one. The exec had already booked flights and a hotel, and that information was in her Gmail. She constructed a prompt to get help organizing what to do and where to eat based on that flight and hotel information. The results were produced and she could browse and override individual bits, but budget and prices didn’t really seem to factor in. These restaurants are also things you could just … Google for instead of paying $19.99 a month for Gemini Advanced. Who’s so stressed about planning that they’d pay that fee?

Presumably, at some point that will filter down to regular Google Search, but maybe Google is counting on Gemini being so exciting that people start paying for it?

There were some good demos of loading up a bunch of documents and picking out important information from them, which is more than just opening each one and performing a text search. Google also said that this data is explicitly not used for training models and that Google doesn’t use it. That sort of thing could have interesting applications.

I was a lot less happy with the demonstration of a virtual teammate that sits in Google Workspace, in this case named Chip. The first hypothetical scenario the presenter invents for Chip is to “quickly catch up” by asking the Google Chat space, “[Does] Anyone know if our IO storyboards are approved?”

Asking the whole group such a general question spams everyone. He should have read the channel updates first, searched for “storyboards,” or maybe checked in with the person responsible for approving them. Instead, everyone gets spammed, and then gets spammed again by Chip’s reply, which is, “Based on the Google I/O chat between Aparna and Kristina it looks like the storyboard is approved”. Yeah, for some reason it doesn’t use punctuation, perhaps to appear more human-like. Also, it couches its response with “it looks like”, seemingly to avoid legal liability? Remember, Gemini, like all LLMs, isn’t a reliable source of truth.

Congratulations: you spammed everyone in the chat so you look like a fool, got a bot that replied without any certainty, and still need to check on the approval state. If those storyboards weren’t approved, you’d be in the position of trying to tell everyone this was Chip’s fault.

Then he follows that up by demoing Chip summarizing where they’re at on their schedule, and it highlights a potential conflict. Another person offscreen asks for a summary.

These are not tasks that require automation, because you should have hired capable people. We should appreciate labor that goes into all aspects of communication and not treat our conversations with one another like a free-flowing firehose.

What is not demoed, and what I’m sure will appeal to bad bosses around the world, is the capacity to use this tool to micromanage employees, or generally snoop on progress in an invasive and disrespectful way. Chip doesn’t care about summarizing your status for that boss, or making any mistakes, because Chip isn’t a person.

Creativity

A constant source of tension with generative AI is training sources, and whether the application is a tool or a replacement for an artist. Google is not transparent about the datasets it trains on, so we’ll just take it as a given that there’s stuff in that training data that people would object to.

Setting that aside, we started the I/O event with the guy using Google to make a short clip of nonsensical music that he looped. The looping was very much not done with Google’s tool; it just generated that little snippet, and that was it.

Doug Eck came out on stage later in the presentation to talk about Generative Media: image, music, and video.

Imagen 3

More photo-real, fewer distortions and artifacts, better text rendering, and “independent evaluators preferred Imagen 3 over other popular image generation models.” It really doesn’t seem all that distinct in the demo, and I am definitely not the target audience for this. There’s little an artist can do with the output, so this continues to be mostly for someone who couldn’t produce artwork.

Music AI Sandbox

Creates instrumental sections “from scratch” and transfers “styles” between tracks. Wyclef Jean appears in a video to describe how he considers the tool to be akin to sampling: “As a hip hop producer, we dug in the crates. We playin’ these vinyls and the part where there’s no vocals, we pull it, we sample, and we create an entire song around that. So right now we’re diggin’ in the infinite crate. It’s endless.”

Then my nemesis Marc Rebillet appears and talks about how he uses it to generate a bunch of loops. “Google’s loops right here. These are Gloops.”

Sigh.

Veo

“High quality” 1080p videos from text, image, and video prompts. One of the demos started from a video and extended it. Then, to show us what it can really do, they put it in Donald Glover’s hands. Cut to Donald Glover saying he’s interested in AI. Then there are a lot of vague clips where you can see some warbling, and the ground surface artifacting like crazy around the cowboy boots. That’s it, though; they didn’t actually have the short film they were allegedly making with Donald Glover.

Veo will apparently only be available to select creators at labs.google, and there’s a waitlist open now. But… what does it do? How can you edit or adjust the output? Can someone fix those cowboy boots? Can someone keep any kind of consistency from shot to shot so it doesn’t look completely different each time you generate a video? How are you going to handle generating sound to match the video you’re generating?

Update: The videos have a maximum length of 60 seconds. Good grief.

At the end of the day, I’m most skeptical of generative video. These things approximate stock footage. Is that because they used a lot of stock footage in their training data? Possibly. There are some more videos on their Labs site, so you can see things tearing and burbling for yourself.

I don’t think it is responsible for Google, or OpenAI for that matter, to sell fully generative video as being something that’s right around the corner.

Not many producers are technically savvy; they’ll believe this stuff, and it’ll make a big mess.

In Summary

I think this was a cynical event trying to apply AI to things as fast as Google can get it out the door. Building a business model on the fly to charge for computing resources. Inserting LLMs into things that are not always improved by having them there. Impressing the inveterate gamblers of Wall Street to show that you have “AI” like OpenAI does.

There’s intriguing stuff in here, to be sure, like the Astra demo, and checking through your personal files with a level of context awareness that a search lacks.

But summarizing? Meal planning? Increasing office dysfunction? Suspicious generative video?

Sundar even made a heavily scripted, cringeworthy joke out of it at the end of the presentation, where he mentioned that someone was probably counting how many times they said “AI” in the presentation. Then the script text file (not even the video output up to that point) went into a prompt, and a Gemini model counted 120 times. Was that even correct?

I know it’s to show off feeding data to the model and asking it to do something, but it’s an oddly accurate metaphor for this presentation where Gemini didn’t really need to be used, and it didn’t really improve anything.

2024-05-14 17:00:00

Category: text

Unauthoritative Pronouncements