View Demo: English Tutor demo
Brief
I've been experimenting with LLMs lately and finding them really fascinating! In July 2024 I gave a talk to BrisPHP developers about LLMs and how they can be used in application development.
What I wanted to experiment with here was using an LLM in a way that is neither a chat interface nor a basic summarisation task, as we've seen enough of those already.
What I wanted to test the waters on was having the LLM "augment" the UI.
The idea
The idea was to have a reference text (i.e. a few sentences) that the user can ask questions about. Instead of "the system" simply answering the question in a chat thread, "the system" can also highlight parts of the reference text if it thinks it would help communicate.
The build (webapp bits)
This was built on Cloudflare Pages / Pages Functions, with a React frontend. The Pages Functions invoke Langchain.js callables, which are instrumented with Langsmith.
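To make that concrete, here's a minimal sketch of what one of those Pages Functions could look like; the file path, request shape, and answerChain module are assumptions for illustration, not the demo's actual code.

// functions/api/ask.ts (hypothetical path); needs @cloudflare/workers-types for the PagesFunction type.
import { answerChain } from "../../src/llm/chains"; // hypothetical Langchain.js chain module

interface AskRequest {
  question: string;
  referenceText: string;
}

export const onRequestPost: PagesFunction = async (context) => {
  const { question, referenceText } = (await context.request.json()) as AskRequest;

  // Invoke the Langchain.js callable (the demo instruments these with Langsmith).
  const result = await answerChain.invoke({ question, referenceText });

  return new Response(JSON.stringify(result), {
    headers: { "Content-Type": "application/json" },
  });
};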
The trickiest bit for me was determining the best way to have the LLM "augment" the UI. I first tried getting the LLM to respond like below:
{
"response": "Lily found a <Highlight>shiny</Highlight> <Highlight>red</Highlight> key under a <Highlight>big</Highlight> oak tree."
}
My thinking here was that I had a <Highlight> React component to handle the highlighting, and it should be reasonable to parse the response and render the highlights. The code I ended up with to parse the response was a bit gnarly and didn't spark joy.
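To give a sense of what that parsing involves, here's a minimal sketch of splitting on the pseudo-tags; the regex and the Highlight component are illustrative, not the actual demo code.

import React from "react";

// A simple highlight wrapper; the real component presumably does more styling.
const Highlight = ({ children }: { children: React.ReactNode }) => (
  <mark>{children}</mark>
);

function renderTaggedResponse(response: string): React.ReactNode[] {
  // Splitting on a regex with a capture group keeps the captured inner text,
  // so odd-indexed parts are the spans that were wrapped in <Highlight> tags.
  const parts = response.split(/<Highlight>(.*?)<\/Highlight>/g);
  return parts.map((part, i) =>
    i % 2 === 1 ? <Highlight key={i}>{part}</Highlight> : part
  );
}

A split like this gets awkward quickly once you start guarding against malformed or nested tags coming back from the model.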
I then refactored to something more like below, which felt a lot simpler to implement on the React side of things, but required more nudging on the LLM side.
{
"response": [
"Lily found a ",
{
"type": "highlight",
"content": "shiny"
},
" ",
{
"type": "highlight",
"content": "red"
},
" key under a ",
{
"type": "highlight",
"content": "big"
},
" oak tree."
]
}
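The React side of this format really is just a map over the segments. A minimal sketch, with illustrative types and the same kind of Highlight component as above:

import React from "react";

// A segment is either plain text or a highlighted span, matching the JSON above.
type Segment = string | { type: "highlight"; content: string };

const Highlight = ({ children }: { children: React.ReactNode }) => (
  <mark>{children}</mark>
);

function renderSegments(segments: Segment[]): React.ReactNode[] {
  // Strings pass straight through; highlight objects become <Highlight> elements.
  return segments.map((segment, i) =>
    typeof segment === "string" ? (
      segment
    ) : (
      <Highlight key={i}>{segment.content}</Highlight>
    )
  );
}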
The LLM bits
I started by trying to get the LLM (OpenAI, various models) to answer all in one go: responding with both the text answer and the augmented text for the UI in a single call. What I found was that the LLM struggled to answer the question well (following the instructions about how to answer) when it was also tasked with generating structured output for the UI.
As with many things involving LLMs, I found that breaking the problem down into smaller, more manageable pieces was the way to go. I ended up with a chain of Langchain callables (sketched below):
- Answer the question (without knowing about the UI)
- Plan how the UI should be augmented
- Generate the response data following the annotation plan
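For a sense of how that could be wired up in Langchain.js, here's a rough sketch; the prompt wording, Zod schema, and chain structure are assumptions for illustration rather than the demo's actual chain.

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { z } from "zod";

const model = new ChatOpenAI({ model: "gpt-4o-2024-08-06" });

// Step 1: answer the question, with no knowledge of the UI.
const answer = ChatPromptTemplate.fromTemplate(
  "Reference text:\n{referenceText}\n\nAnswer the student's question: {question}"
).pipe(model).pipe(new StringOutputParser());

// Step 2: plan which parts of the reference text should be highlighted.
const plan = ChatPromptTemplate.fromTemplate(
  "Given this answer:\n{answer}\n\nand this reference text:\n{referenceText}\n\n" +
    "List the words or phrases that should be highlighted to support the answer."
).pipe(model).pipe(new StringOutputParser());

// Step 3: generate the structured segments the UI will render.
// Note: depending on the provider's structured-output support, the mixed
// string/object union may need to be flattened into tagged objects instead.
const segmentSchema = z.object({
  response: z.array(
    z.union([
      z.string(),
      z.object({ type: z.literal("highlight"), content: z.string() }),
    ])
  ),
});
const generate = ChatPromptTemplate.fromTemplate(
  "Split this reference text into segments, highlighting according to the plan.\n" +
    "Text:\n{referenceText}\n\nPlan:\n{plan}"
).pipe(model.withStructuredOutput(segmentSchema));

type Input = { question: string; referenceText: string };

// Thread the intermediate results through so each step sees what it needs.
export const answerChain = RunnableSequence.from([
  async (input: Input) => ({ ...input, answer: await answer.invoke(input) }),
  async (input: Input & { answer: string }) => ({
    ...input,
    plan: await plan.invoke(input),
  }),
  async (input: Input & { answer: string; plan: string }) => ({
    answer: input.answer,
    segments: (await generate.invoke(input)).response,
  }),
]);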
Model selection
There were enough signs that I could get gpt-4o-mini to work for this task, but I ended up just using gpt-4o-2024-08-06 as it was more correct with less prompting work. I set up a simple Langsmith experiment (with only a few examples) to evaluate the two models.
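For reference, that kind of comparison can be sketched with the Langsmith JS SDK's evaluate helper; the dataset name, evaluator, and chain import below are hypothetical.

import { evaluate } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
import { answerChain } from "./chains"; // hypothetical chain module

// A simple pass/fail check that the output parses into the expected segment shape.
const segmentsAreWellFormed = (run: Run, _example?: Example) => {
  const segments = run.outputs?.segments as unknown[] | undefined;
  const ok =
    Array.isArray(segments) &&
    segments.every(
      (s) =>
        typeof s === "string" ||
        (typeof s === "object" && s !== null && (s as { type?: string }).type === "highlight")
    );
  return { key: "segments_well_formed", score: ok ? 1 : 0 };
};

await evaluate(
  (inputs) => answerChain.invoke(inputs as { question: string; referenceText: string }),
  {
    data: "english-tutor-examples", // hypothetical Langsmith dataset
    evaluators: [segmentsAreWellFormed],
    experimentPrefix: "gpt-4o-mini-vs-gpt-4o",
  }
);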
If I were to continue with this, I would likely set up some more benchmarks in a Langsmith experiment (I'm not sure exactly how the evals would work here) and work on the prompting to get gpt-4o-mini performing better at a lower cost.
My gut is that few-shotting a few examples would quickly nudge the model in the right direction.
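As a sketch of what that could look like, a couple of worked examples could be baked straight into the segment-generation prompt; the example content here is invented, and the literal braces in the example output are doubled so the prompt template doesn't treat them as variables.

import { ChatPromptTemplate } from "@langchain/core/prompts";

const fewShotSegmentsPrompt = ChatPromptTemplate.fromMessages([
  ["system", "Split the reference text into segments, highlighting according to the plan."],
  // Worked example: plain inputs in, segmented JSON out.
  ["human", "Text: The cat sat on the mat.\nPlan: highlight 'cat'"],
  ["ai", '{{"response": ["The ", {{"type": "highlight", "content": "cat"}}, " sat on the mat."]}}'],
  // The real inputs come last.
  ["human", "Text: {referenceText}\nPlan: {plan}"],
]);

This prompt would then be piped into model.withStructuredOutput in place of the zero-shot one in the earlier chain sketch.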