Could anyone explain the capability of this in plain English? Can it learn and retain the context of a chat, and build up some kind of long-term memory? Thanks
I'm not an LLM expert by any means but here is my take.
It's Speech Recognition -> Llama -> Text to Speech, running on your own PC rather than that of a third party.
The context limitations are those of the model being used (e.g. Llama 2, Wizard Vicuna, whatever is chosen), in whatever compatible configuration the user sets regarding context window etc., and given a preliminary transcript. The LLM doesn't "reply" to the user in a sense; it just predicts the best continuation of a transcript between the user and a useful assistant, resulting in it successfully pretending to be a useful assistant, and thus being a useful assistant - it's confusing.
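To make the "preliminary transcript" point concrete, here's a minimal sketch in plain Python (all wording made up for illustration) of the kind of text the model is asked to continue - the trailing "Assistant:" is just more text for it to complete:

```python
# Minimal sketch of a preliminary transcript: the model doesn't "reply",
# it just predicts what text should follow "Assistant:".
# All names and wording here are made up for illustration.
PREAMBLE = (
    "A transcript of a conversation between a user and a helpful assistant.\n"
    "The assistant answers concisely and accurately.\n"
)

def build_prompt(history: list[tuple[str, str]], user_utterance: str) -> str:
    """Assemble the text the LLM will be asked to continue."""
    lines = [PREAMBLE]
    lines += [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"User: {user_utterance}")
    lines.append("Assistant:")  # the model's "reply" is its continuation of this line
    return "\n".join(lines)

print(build_prompt([("User", "Hello"), ("Assistant", "Hi, how can I help?")],
                   "Remind me to call my daughter at 5pm."))
```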
I can imagine it's viable to get that kind of behaviour (retaining context and building up a long-term memory) by modifying the pipeline.
If the architecture was instead Speech Recognition -> Wrapper[Llama] -> Text to Speech, where "Wrapper" is some process that lets Llama do its thing but hooks onto the input text to add some additional processing, then things could get interesting.
The wrapper could analyse the conversation and pick out key aspects ("The person's name is Bob, male, 35, he likes dogs, he likes things to be organised, he wants a reminder at 5pm to call his daughter, he is an undercover agent for the Antarctic mafia, and he prefers to be spoken to in a strong Polish accent") and perform actions based on that:
- Set a reminder at 5pm to call his daughter (through e.g. HomeAssistant)
- Configure the text-to-speech engine to use a Polish accent
- Modify the starting transcript for future runs:
  - Put his name as the human's name within the underlying chat dialogue
  - Provide a condensed representation of his interests and personality within the preliminary introduction to the next chat dialogue
This way there's some interactivity involved (through actions performed by some other tool), some continuity (by modifying the next chat dialogue) and so on - a rough sketch of such a wrapper follows below.
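A rough sketch of that wrapper idea, assuming a hypothetical run_llama() backend call and a set_reminder() action hook (neither is a real API from this project, and the fact extraction here is a toy regex - in practice you might ask a second LLM pass to do the extraction):

```python
import re

PROFILE: dict[str, str] = {}  # persisted facts, baked into the next preliminary transcript

def run_llama(prompt: str) -> str:
    """Stand-in for whatever local LLM backend is actually used (llama.cpp etc.)."""
    return "Sure, I'll remind you at 5pm."

def set_reminder(text: str) -> None:
    """Stand-in for an action hook, e.g. a call out to Home Assistant."""
    print(f"[reminder scheduled] {text}")

def wrapper(transcript: str, user_text: str) -> str:
    # 1. Let Llama do its thing as usual.
    reply = run_llama(transcript + f"\nUser: {user_text}\nAssistant:")

    # 2. Hook onto the input text and pick out key aspects.
    #    (Toy regex heuristics; a second LLM could do this extraction instead.)
    if m := re.search(r"my name is (\w+)", user_text, re.IGNORECASE):
        PROFILE["name"] = m.group(1)   # continuity: reused in future chat dialogues
    if "remind me" in user_text.lower():
        set_reminder(user_text)        # interactivity: performed by some other tool

    return reply

print(wrapper("A chat between User and Assistant.",
              "My name is Bob, remind me to call my daughter at 5pm"))
print(PROFILE)
```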
I've been wondering how feasible it is to simulate long-term memory by running multiple LLMs at the same time. One of them would be tasked with storing and retrieving long-term memories from disc, so it'd need to be instructed about some data structure where memories are persisted. You'd then feed it the current context and instruct it to navigate the memory data structure to any potentially relevant memories. Whatever data was retrieved could be injected into the prompt to the next LLM, which would just respond to the given prompt.
No idea what sort of data structure could work. Perhaps a graph database could be feasible, and the memory prompt could instruct it to write a query for the given database.
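A very rough sketch of that two-LLM setup, with a flat JSON file standing in for the memory data structure (a graph database would mostly change what the "librarian" LLM is asked to produce, e.g. a query instead of a list of keys). ask_llm() is a placeholder, not a real API:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("memories.json")  # hypothetical on-disk store: {"key": "memory text", ...}

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a local model; the 'librarian' and the responder
    could even be the same model given different instructions."""
    return ""

def retrieve_memories(current_context: str) -> str:
    """LLM #1: given the current context, navigate the memory store for relevant entries."""
    store = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    librarian_prompt = (
        "Stored memory keys:\n" + "\n".join(store) +
        f"\n\nCurrent conversation:\n{current_context}\n"
        "List the keys of any relevant memories, one per line."
    )
    keys = [k.strip() for k in ask_llm(librarian_prompt).splitlines() if k.strip() in store]
    return "\n".join(store[k] for k in keys)

def respond(current_context: str, user_text: str) -> str:
    """LLM #2: just responds to a prompt with the retrieved memories injected into it."""
    prompt = (
        f"Things you remember about the user:\n{retrieve_memories(current_context)}\n\n"
        f"{current_context}\nUser: {user_text}\nAssistant:"
    )
    return ask_llm(prompt)
```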
This is achieved using vector databases to store memories as embeddings. Then you can retrieve a “memory” closest to the question in the embedding space.
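For anyone curious what that looks like, here's a toy version: the embed() below just hashes words into buckets so the example runs standalone, whereas a real setup would use a proper embedding model and an actual vector database rather than an in-memory list:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hash words into buckets and normalise."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# Toy "vector database": memories stored alongside their embeddings.
memories = ["Bob likes dogs", "call his daughter at 5pm", "Bob prefers a Polish accent"]
vectors = np.stack([embed(m) for m in memories])

def recall(question: str, k: int = 1) -> list[str]:
    """Return the k stored memories closest to the question in embedding space."""
    scores = vectors @ embed(question)  # cosine similarity (all vectors are unit length)
    return [memories[i] for i in np.argsort(scores)[::-1][:k]]

print(recall("what pets does Bob have"))
```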
This is an active area of research. The best we currently have is vector databases and/or sparse hierarchical information storage (you retrieve a summary of a summary via vector search, find the associated summaries via vector search once more, then pluck out the actual data item and add it to the prompt).
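A rough sketch of that hierarchical variant, reusing the same word-hashing stand-in for an embedding model as the previous example (all texts made up): vector-search the summaries of summaries, then the summaries under the winner, then pluck out the actual data item:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Same toy word-hashing stand-in for a real embedding model as above."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

def nearest(query: str, candidates: list[str]) -> str:
    """Return the candidate closest to the query in embedding space."""
    q = embed(query)
    return max(candidates, key=lambda c: float(embed(c) @ q))

# Two-level store: summaries of summaries -> summaries -> actual data items.
hierarchy = {
    "preferences and personality of Bob": {
        "animals Bob likes": "Bob likes dogs but is allergic to cats.",
        "how Bob likes to be spoken to": "Bob prefers a strong Polish accent.",
    },
    "reminders and schedule for Bob": {
        "reminders set for today": "Call his daughter at 5pm.",
        "upcoming appointments": "Dentist on Friday morning.",
    },
}

def hierarchical_recall(question: str) -> str:
    top = nearest(question, list(hierarchy))       # vector search over top-level summaries
    sub = nearest(question, list(hierarchy[top]))  # vector search over the associated summaries
    return hierarchy[top][sub]                     # pluck out the actual data item

print(hierarchical_recall("what reminders does Bob have today"))
```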