In my experience, at this point all the flagship multi-modal LLMs provide roughly the same accuracy. I see very little, if any, drift in output between them, especially if you have your prompts dialed in.
For the Gemini 1.5 Flash model, GCP pricing[0] treats each PDF page as an image, so you're looking at a per-image price ($0.00002) plus the text input price ($0.00001875 per 1k characters) for the base64 string encoding of the entire PDF and the context you provide.
10-page PDF ($0.0002) + ~3,000 characters of context/base64 ($0.00005625) = $0.00025625
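As a sanity check, here's that arithmetic as a throwaway Python snippet. The rates are hard-coded from the pricing page above; this is purely illustrative, not an official pricing API:

    # Back-of-envelope cost math using the Gemini 1.5 Flash rates quoted above.
    PRICE_PER_PAGE = 0.00002          # each PDF page billed as one image
    PRICE_PER_1K_CHARS = 0.00001875   # text input rate

    def estimate_cost(pages: int, chars: int) -> float:
        return pages * PRICE_PER_PAGE + (chars / 1000) * PRICE_PER_1K_CHARS

    print(estimate_cost(pages=10, chars=3000))  # ~$0.00025625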
Cut that total in half if you use Batch Prediction jobs[1], and even at scale you're looking at a rounding error in costs.
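If you're on the Vertex AI SDK, submitting a batch job looks roughly like this. Treat it as a sketch: the project, bucket paths, and model ID are placeholders, and you should check the batch prediction docs[1] for the current interface:

    # Sketch of submitting a Batch Prediction job via the Vertex AI SDK.
    import time
    import vertexai
    from vertexai.batch_prediction import BatchPredictionJob

    vertexai.init(project="my-project", location="us-central1")

    job = BatchPredictionJob.submit(
        source_model="gemini-1.5-flash-002",
        input_dataset="gs://my-bucket/pdf_requests.jsonl",  # one request per line
        output_uri_prefix="gs://my-bucket/batch_output/",
    )

    # Poll until the job finishes; results land under output_uri_prefix.
    while not job.has_ended:
        time.sleep(30)
        job.refresh()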
For ongoing accuracy tracking I take a fixed proportion of the generations (say 1%, or 10 PDFs for every 1,000) and run them through an evaluation[2] workflow. Depending on how/what you're extracting from the PDFs the eval method will change, but for "unstructured to structured" use-cases I find the fulfillment evaluation is a fair test.
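The sampling itself is trivial. A minimal sketch, where submit_to_eval_workflow is a stand-in for whatever eval pipeline you actually run:

    import random

    SAMPLE_RATE = 0.01  # 1%, i.e. ~10 PDFs for every 1,000

    def submit_to_eval_workflow(pdf_id: str, extraction: dict) -> None:
        # Stand-in for your real eval pipeline (e.g. an evaluation job[2]).
        print(f"queued {pdf_id} for evaluation")

    def record_generation(pdf_id: str, extraction: dict) -> None:
        # Sample a fixed proportion of generations into the eval workflow.
        if random.random() < SAMPLE_RATE:
            submit_to_eval_workflow(pdf_id, extraction)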