In my experience, at this point all the flagship multi-modal LLMs provide roughly the same accuracy. I see very little, if any, drift in output between them, especially if you have your prompts dialed in.
For the Gemini 1.5 Flash model, GCP pricing[0] treats each PDF page as an image, so you're looking at a per-image price ($0.00002) plus the text input price ($0.00001875 per 1k characters) for the base64 string encoding of the entire PDF and the context you provide.
10-page PDF ($0.0002) + ~3,000 characters of context/base64 ($0.00005625) = $0.00025625
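As a sanity check, here's that arithmetic as a throwaway Python snippet. The rates are hard-coded from the pricing page above; this is purely illustrative, not an official pricing API:

    # Back-of-envelope cost math using the Gemini 1.5 Flash rates quoted above.
    PRICE_PER_PAGE = 0.00002          # each PDF page billed as one image
    PRICE_PER_1K_CHARS = 0.00001875   # text input rate

    def estimate_cost(pages: int, chars: int) -> float:
        return pages * PRICE_PER_PAGE + (chars / 1000) * PRICE_PER_1K_CHARS

    print(estimate_cost(pages=10, chars=3000))  # ~$0.00025625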
Cut that total in half if you use Batch Prediction jobs[1], and even at scale you're looking at a rounding error in costs.
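If you're on the Vertex AI SDK, submitting a batch job looks roughly like this. Treat it as a sketch: the project, bucket paths, and model ID are placeholders, and you should check the batch prediction docs[1] for the current interface:

    # Sketch of submitting a Batch Prediction job via the Vertex AI SDK.
    import time
    import vertexai
    from vertexai.batch_prediction import BatchPredictionJob

    vertexai.init(project="my-project", location="us-central1")

    job = BatchPredictionJob.submit(
        source_model="gemini-1.5-flash-002",
        input_dataset="gs://my-bucket/pdf_requests.jsonl",  # one request per line
        output_uri_prefix="gs://my-bucket/batch_output/",
    )

    # Poll until the job finishes; results land under output_uri_prefix.
    while not job.has_ended:
        time.sleep(30)
        job.refresh()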
For ongoing accuracy tracking I take a fixed proportion of the generations (say 1%, or 10 PDFs for every 1,000) and run them through an evaluation[2] workflow. Depending on how/what you're extracting from the PDFs the eval method will change, but for "unstructured to structured" use-cases I find the fulfillment evaluation is a fair test.
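The sampling itself is trivial. A minimal sketch, where submit_to_eval_workflow is a stand-in for whatever eval pipeline you actually run:

    import random

    SAMPLE_RATE = 0.01  # 1%, i.e. ~10 PDFs for every 1,000

    def submit_to_eval_workflow(pdf_id: str, extraction: dict) -> None:
        # Stand-in for your real eval pipeline (e.g. an evaluation job[2]).
        print(f"queued {pdf_id} for evaluation")

    def record_generation(pdf_id: str, extraction: dict) -> None:
        # Sample a fixed proportion of generations into the eval workflow.
        if random.random() < SAMPLE_RATE:
            submit_to_eval_workflow(pdf_id, extraction)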