
Experiments with ChatGPT functions and browser automation

Back in 2018 I gave a talk at a few different meetups and conferences called “Your Web Page Never Listens To Me”. It was all about the Web Speech API and what voice / conversational user interfaces could look like for the web.

At the time, speech recognition was finally getting pretty fast and accurate, but all my demos were limited to saying things the computer was expecting to hear. Much like a CLI, you had to give exactly the right command for things to happen the way you wanted, and the discoverability of what commands might exist was non-existent.
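To give a sense of what those demos looked like under the hood, here's a rough sketch of the approach (the command phrases are made up for illustration, but the Web Speech API usage is the general shape of it):

```typescript
// A 2018-style demo: the Web Speech API hands us a transcript, and we only
// react if it matches a command we were expecting. Command phrases here are
// invented for illustration.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-AU";

recognition.onresult = (event: any) => {
  const transcript: string = event.results[0][0].transcript.trim().toLowerCase();
  const goToMatch = transcript.match(/^go to (.+)$/);

  // Exact string / regex matching: anything outside this list is simply ignored.
  if (transcript === "scroll down") {
    window.scrollBy({ top: window.innerHeight, behavior: "smooth" });
  } else if (goToMatch) {
    document.getElementById(goToMatch[1])?.scrollIntoView();
  } else {
    console.log(`No idea what to do with: "${transcript}"`);
  }
};

recognition.start();
```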

In 2023, we’re getting used to Large Language Models like ChatGPT, which are remarkably good at holding a conversation, and it feels like they do a decent job of understanding what you’re trying to say. It certainly makes my string-matching and regex-based conversation handling from 2018 look like something from the stone age. Perhaps ChatGPT or similar could help me get a conversational UI going to interact with web pages?

Most people who’ve just had a quick play with ChatGPT are impressed by how much it seems to know and how well it writes, but you quickly realise it can’t do anything. It can’t look things up. It can’t open some other app and do tasks for you. All it can do is chat. And when it doesn’t know something it makes it up.

My main source for following updates in the world of generative AI and large language models is Simon Willison’s blog, and in March one of his blog posts titled “Beat ChatGPT in a Browser” stood out to me: you can get these large language models to interact with “commands” or “functions”. You tell the model about a function you will make available to it, what the function does, and how to use it, and it will send a chat message with the syntax for calling that function or command as part of the conversation. As a programmer you can then wire that up to a real function or command that interacts with the outside world, and get it to do something useful. Very cool idea!
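As I understood it, the trick is roughly this: describe the available commands in the prompt, ask for replies in a specific syntax, and parse them out of the model's messages. A minimal sketch (the command names, syntax, and parsing here are all invented for illustration):

```typescript
// Describe the available "functions" in the prompt, ask the model to reply
// using a specific syntax, and parse its reply. The syntax and function names
// below are made up for illustration.
const systemPrompt = `
You can use these commands to help answer questions:
  getUrl(searchTerm) - search the web and return the most relevant URL
  openPage(url)      - open a web page in the browser

When you want to use a command, reply with only:
  CALL: commandName({"arg": "value"})
`;

// Check an assistant reply for something that looks like a command call.
function parseCommand(reply: string): { name: string; args: unknown } | null {
  const match = reply.match(/^CALL:\s*(\w+)\((.*)\)\s*$/s);
  if (!match) return null;
  return { name: match[1], args: JSON.parse(match[2]) };
}
```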

At the time I think I experimented for an hour or so trying to prompt ChatGPT to respond in ways that I could use programmatically with little luck, and figured it was a bit harder than it sounded, especially for someone who hasn’t spent much time learning how to work with large language models.

Well, last week OpenAI announced an update that introduced “Function Calling” in their Chat APIs. This uses the GPT-3.5 or GPT-4 models with an API for declaring the functions you want to make available to the chat bot, and it can then utilise them as part of the conversation. Again I experimented for an hour or so, and this time, it seemed to mostly work!
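In API terms, you declare each function with a name, a description, and a JSON Schema for its parameters, and the model may reply with a `function_call` instead of a normal message. Here's a minimal sketch of what that request looks like, using a raw fetch rather than a client library, with a `getUrl` function like the one in my experiment below (treat the details as illustrative):

```typescript
// A sketch of declaring a function via the Chat Completions "functions" parameter.
async function askWithFunctions(question: string) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo-0613",
      messages: [{ role: "user", content: question }],
      functions: [
        {
          name: "getUrl",
          description: "Search the web and return the URL of the most relevant result",
          parameters: {
            type: "object",
            properties: {
              searchTerm: { type: "string", description: "The search query" },
            },
            required: ["searchTerm"],
          },
        },
      ],
    }),
  });

  const data = await response.json();
  // If the model decides to use a function, the assistant message contains a
  // `function_call` with the name and JSON-encoded arguments instead of content,
  // e.g. { name: "getUrl", arguments: '{"searchTerm": "DDD Perth"}' }
  return data.choices[0].message;
}
```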

So far my experiment looks like this:

A screen recording of me using a command line app. In the screen recording I type a prompt. You can then see a browser open, perform a web search, and open a web page. In the background the terminal is printing progress information. Once it finishes, the terminal prints the answer. The dialogue is below.

The dialogue of this interaction, in terms of messages between the user and the assistant, looks like this:

User: When and where is DDD Perth happening this year?

Assistant: DDD Perth is happening on 7th October 2023. The venue for the conference is Optus Stadium in Perth.

But there’s a lot more hidden dialogue making function calls, with 8 messages in total, 6 of them “behind the scenes”:

User: When and where is DDD Perth happening this year?

Assistant call function: getUrl({searchTerm: "DDD Perth"})

Function getUrl() result: "https://dddperth.com/"

Assistant call function: openPage({url: "https://dddperth.com/"})

Function openPage() result: {"loaded":true}

Assistant call function: getTextFromPage()

Function getTextFromPage() result: "Skip to contentMenuDDD PerthFacebook iconFacebook, opens in new windowInstagram iconInstagram...."

Assistant: DDD Perth is happening on 7th October 2023. The venue for the conference is Optus Stadium in Perth.
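The wiring behind those hidden messages is basically a loop: send the conversation, and whenever the assistant replies with a `function_call`, run the matching local function, append its result as a “function” message, and go around again. A rough sketch (here `callOpenAI` is an assumed wrapper around the Chat Completions request shown earlier, and the function bodies are placeholders standing in for the real browser automation):

```typescript
// Rough sketch of the loop behind the hidden messages. `callOpenAI` is assumed
// to wrap the Chat Completions request (including the function declarations)
// and return the assistant's reply message.
type ChatMessage = {
  role: "system" | "user" | "assistant" | "function";
  content: string | null;
  name?: string;
  function_call?: { name: string; arguments: string };
};

declare function callOpenAI(messages: ChatMessage[]): Promise<ChatMessage>;

// The real versions of these drive the browser; these bodies are placeholders.
const localFunctions: Record<string, (args: any) => Promise<unknown>> = {
  getUrl: async ({ searchTerm }) => "https://example.com/",
  openPage: async ({ url }) => ({ loaded: true }),
  getTextFromPage: async () => "…page text…",
};

async function runConversation(messages: ChatMessage[]): Promise<string> {
  while (true) {
    const reply = await callOpenAI(messages);

    if (reply.function_call) {
      // The model wants to call a function: run it, then feed the result back
      // as a "function" role message so the model can keep going.
      const { name, arguments: argsJson } = reply.function_call;
      const result = await localFunctions[name](JSON.parse(argsJson));
      messages.push(reply);
      messages.push({ role: "function", name, content: JSON.stringify(result) });
    } else {
      // No function call means the model has an answer for the user.
      return reply.content ?? "";
    }
  }
}
```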

Now, there’s a whole bunch of problems here:

  • The language model I’m using only allows 4097 tokens, and the only option it has to read a page is to read all the text, which on most pages I try is at least 10,000 tokens. I had to cherry-pick the example above.
  • This is completely vulnerable to prompt injection. If I get it to visit a website, the website could give it new instructions to run functions I don’t want it to run, including opening web pages I don’t want it to open (porn, bitcoin miners, something that tries to take a photo of me using a webcam and post it to reddit…)
  • Interacting with it via a terminal isn’t really great.
  • I’m relying on a public search engine to locate the page I’m looking for, so it’ll often land on the wrong page.
  • The function to get all the text often accidentally leaks JavaScript and CSS code into the output. (There’s a sketch of a possible fix after this list.)
  • And probably many more things!
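On the code-leaking point, one likely fix is to walk the DOM inside the page and only collect text from visible, non-script, non-style elements, rather than grabbing all the text wholesale. Something like this sketch of what getTextFromPage() could do (not what it does today):

```typescript
// Collect only visible text, skipping script/style/noscript and hidden elements.
// A sketch of a possible getTextFromPage() implementation, run in the page context.
function getVisibleText(): string {
  const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT, {
    acceptNode(node) {
      const parent = (node as Text).parentElement;
      if (!parent) return NodeFilter.FILTER_REJECT;
      if (["SCRIPT", "STYLE", "NOSCRIPT"].includes(parent.tagName)) {
        return NodeFilter.FILTER_REJECT;
      }
      const style = window.getComputedStyle(parent);
      if (style.display === "none" || style.visibility === "hidden") {
        return NodeFilter.FILTER_REJECT;
      }
      return NodeFilter.FILTER_ACCEPT;
    },
  });

  const parts: string[] = [];
  while (walker.nextNode()) {
    const text = walker.currentNode.textContent?.trim();
    if (text) parts.push(text);
  }
  return parts.join(" ");
}
```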

…but it’s exciting to see it work. I define a set of functions I want the LLM to be able to interact with, I give it a natural language prompt, and it successfully navigates its way through my functions to answer the prompt using data from the real live internet.

I’m definitely keen to keep playing with this.

I’ll share my work on GitHub here: https://github.com/jasononeil/screen-chatter/ (and keep posting on this blog)
