barefeg's comments

For some time, authentication was not part of MCP. Now it's there (https://modelcontextprotocol.io/specification/2025-03-26/bas...), so I'm wondering what is being addressed in Klavis. Is it something that the reference implementation of MCP lacks? If so, will it eventually make it into MCP?

I think it’s important to release SDKs that are secure by default, so not providing this in the reference MCP would be a big issue.

In my view, MCP servers should be maintained by the vendors themselves. It's too complicated to use them in the enterprise if everything comes from the community with questionable security. So I applaud initiatives that try to solve this. I think smithery.ai provides something similar while also being a repository of servers (I'm not associated with them), but again the problem is needing to trust an extra middleman vendor.

Does anyone else share this view? For example, will AWS (or insert any other hyperscaler) end up providing the “Bedrock” of MCP where security is native to the platform? Or will individual companies (Box, Google, MS, etc.) start rolling them out as part of their standard developer APIs?


Yes, thank you! The newest MCP spec added the authentication part, but it seems people think it is still not perfect and are making further modifications to the auth part, e.g. https://github.com/modelcontextprotocol/modelcontextprotocol.... We will also keep an eye on the spec's development.


> For example, will AWS (or insert any other hyperscaler) end up providing the “Bedrock” of MCP where security is native to the platform?

Cloudflare already provides something along those lines with MCP on Workers plus authentication (via their Zero Trust product, AFAIK): https://blog.cloudflare.com/remote-model-context-protocol-se...

Sounds like they were one of the partners with Anthropic in their recent “Integrations” announcement.


Thanks for sharing your blog post. We had a similar journey: I installed and tried both Langfuse and Phoenix and ended up choosing Langfuse due to some versioning conflicts with the Python dependency. I'm curious whether your thoughts have changed after V3? I also liked that it only depended on Postgres, but the scalable version requires other dependencies.

The thing I liked about Phoenix is that it uses OpenTelemetry. In the end, we're building our Agents SDK so that the observability platform can be swapped (https://github.com/zetaalphavector/platform/tree/master/agen...), and the abstraction is OpenTelemetry-inspired.
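
To illustrate what I mean by a swappable, OpenTelemetry-inspired abstraction, here is a minimal sketch with made-up names (not the actual SDK interface): agent code calls a backend-agnostic tracer, and the concrete exporter can be swapped without touching the agents.

    # Minimal sketch of an OTel-inspired, swappable tracer abstraction.
    # All names are illustrative; the real interface lives in the repo above.
    from abc import ABC, abstractmethod
    from contextlib import contextmanager

    class Tracer(ABC):
        @abstractmethod
        def start_span(self, name: str, **attributes):
            """Return a context manager that opens and closes a span."""

    class ConsoleTracer(Tracer):
        """Trivial backend; a Langfuse- or Phoenix-backed tracer would
        implement the same interface and export spans to its platform."""

        @contextmanager
        def start_span(self, name: str, **attributes):
            print(f"start span {name} {attributes}")
            try:
                yield
            finally:
                print(f"end span {name}")

    tracer: Tracer = ConsoleTracer()  # swap for a platform-backed tracer here

    with tracer.start_span("agent.retrieve", query="what is MCP?"):
        pass  # the retrieval call would go here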


As you mentioned, this was a significant trade-off. We faced two choices:

(1) Stick with a single Docker container and Postgres. This option is simple to self-host, operate, and iterate on, but it suffers from poor performance at scale, especially for the analytical queries that become crucial as a project grows. Additionally, as more features emerged, we needed a queue and benefited from caching and asynchronous processing, which required splitting into a second container and adding Redis. These features would have been blocked had we stayed with this setup.

(2) Switch to a scalable setup with a robust infrastructure that enables us to develop features that interest the majority of our community. We have chosen this path and prioritized templates and Helm charts to simplify self-hosting. Please let us know if you have any questions or feedback as we transition to v3. We aim to make this process as easy as possible.

Regarding OTel, we are considering adding a collector to Langfuse, as the OTel semantic conventions are now developing well. The needs of the Langfuse community are evolving rapidly, and starting with our own instrumentation allowed us to move quickly while the semantic conventions were still immature. We are tracking this here and would greatly appreciate your feedback, upvotes, or any comments you have on this thread: https://github.com/orgs/langfuse/discussions/2509


So we are still on V2.7; it works pretty well for us. Haven't tried V3 yet, and we're not looking to upgrade. I think the next big feature set we are looking for is a prompt evaluation system.

But we are coming around to the view that it is a big enough problem to warrant a dedicated SaaS, rather than piggybacking on an observability SaaS. At NonBioS, we have very complex requirements, so we might just end up building it from the ground up.


He hinted at wanting to be convinced to work at the company. It seems as if he wants people to just know he's good and offer him the job, instead of him applying for it. It's a bit contradictory, since he applied for the job in the first place.


From this interview https://youtu.be/Nlkk3glap_U?feature=shared it seemed Anthropic was focusing more on those topics than on winning the race. He hinted at being "forced" by the competition to release their models.


I was confused for a bit but there is no relation to https://haystack.deepset.ai/


I shared the confusion; it's a bit of an unfortunate name considering the pretty mature and fleshed-out LLM framework of the same name.


Yeah I was excited at first since the framework is pretty solid and a vector DB from the same folks would have been interesting.


This technique had a very recent resurgence via https://txt.cohere.com/int8-binary-embeddings/. Hugging Face also covered the technique here: https://huggingface.co/blog/embedding-quantization. It seems like a very good trade-off compared to shorter embeddings, which require fine-tuning via the Matryoshka technique. On the other hand, Nils Reimers suggests that trivial quantization of the full-precision embeddings is not as good as using "compression-friendly" embeddings like Cohere's Embed V3. Does anyone know what the difference in precision is between trivial quantization and optimized embeddings?
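
For reference, "trivial quantization" here means something like the following sketch (the thresholds and calibration are illustrative, not Cohere's or Hugging Face's exact recipe):

    import numpy as np

    # Toy full-precision embeddings: 1000 vectors of dimension 384.
    emb = np.random.randn(1000, 384).astype(np.float32)

    # Binary quantization: keep only the sign of each dimension, packed
    # 8 dimensions per byte -> 1536 bytes become 48 bytes per vector.
    binary = np.packbits((emb > 0).astype(np.uint8), axis=1)

    # int8 quantization: map each dimension into [-127, 127] using per-dimension
    # calibration ranges (min/max here; quantiles are a common alternative).
    lo, hi = emb.min(axis=0), emb.max(axis=0)
    int8 = np.clip(np.round((emb - lo) / (hi - lo) * 254 - 127), -127, 127).astype(np.int8)

    # At search time, binary vectors are compared with Hamming distance
    # (XOR + popcount), optionally rescoring top candidates with int8/float32.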


Could you give more details/resources on the distributed in-cluster BuildKit cache?


We use Dagger’s implementation. The basic approach is to run a BuildKit engine as a DaemonSet on the cluster, with clients pointing at the same Docker socket that BuildKit uses. The magic is in cache synchronization, e.g. only lazily pulling layers as the client requests them. This is scalable, but since caching is hard, there are some complexities around efficiently synchronizing cache layers and cache volumes. This is currently a long-lived service that runs as a deployment alongside a bunch of ephemeral runners to manage the cache synchronization.

There are several other architectures, ranging from simpler to more complex. The architecture I recommend people start with is a single long-lived, beefy BuildKit instance that a bunch of runners share, since that is much, much simpler to implement. It of course has the downside that you have to refresh/rebuild the cache if the instance ever goes down. For runs that need read/write locks on volumes (e.g. a Gradle build cache), my recommendation after trial and error is to rsync those to the runners and then rsync them back after the run completes, so you don’t have a bunch of locks fighting each other for the same folder.
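
Roughly, that rsync pattern looks like this (the paths, flags, and Gradle invocation are illustrative, not the exact setup):

    import subprocess

    SHARED = "cache-host:/caches/gradle/"   # hypothetical shared cache location
    LOCAL = "/tmp/gradle-cache/"            # runner-local working copy

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Pull the shared cache onto this runner so the build only takes local locks.
    run(["rsync", "-a", SHARED, LOCAL])

    # 2. Run the build against the runner-local copy of the cache.
    run(["./gradlew", "build", "--gradle-user-home=" + LOCAL])

    # 3. Push updated entries back; --update avoids clobbering files that other
    #    runners wrote more recently.
    run(["rsync", "-a", "--update", LOCAL, SHARED])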


I would also like to know more


Have you tried the Arc browser? It embraces having lots of open tabs and organizing them as you wish (or not): https://arc.net/


I updated their slogan just now:

"Firefox is the Chrome replacement I've already been using."


Do you need to have the same number of positives and negatives? Is there any meaning to pairing a positive and a negative in the triplet?


It's because of the model's loss. I ask the model to produce a higher similarity between the query and the positive document than between the query and the negative document. I'll add more losses soon so there are more choices.
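
That objective is essentially a triplet/margin loss over similarities, which is also why positives and negatives come in matched pairs. A minimal sketch (the margin, cosine similarity, and shapes are illustrative, not necessarily what the library actually uses):

    import torch
    import torch.nn.functional as F

    def triplet_similarity_loss(query, positive, negative, margin=0.2):
        """Push sim(query, positive) above sim(query, negative) by at least `margin`."""
        pos_sim = F.cosine_similarity(query, positive, dim=-1)
        neg_sim = F.cosine_similarity(query, negative, dim=-1)
        return torch.relu(margin - (pos_sim - neg_sim)).mean()

    # Each triplet pairs one query with one positive and one negative document,
    # which is why the counts need to line up.
    q, p, n = (torch.randn(8, 256) for _ in range(3))
    loss = triplet_similarity_loss(q, p, n)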


Is the loss the usual LambdaRank?


Out of curiosity, how does it compare to YouTube’s own generated transcripts?


I would say that overall they are much, much better than the auto-generated ones from YouTube. If the speaker speaks incredibly clearly and slowly, without slang, etc., then the built-in ones are good enough. But in tougher situations, the biggest Whisper model achieves near-superhuman accuracy, which is way better.


I found the YouTube one to be really bad for my voice and the way I speak; Whisper does it perfectly (using the large model).

English is my second language, and I mumble.


While it seems YouTube's auto-generated transcripts are hit or miss, I wonder if feeding them through an LLM could fix the mistakes and still get the video's idea out of them.


I've found that to be the case. I typically don't want a full transcript -- I want the materials list, or a summary, or a counterargument. I've found it is totally sufficient to just plop the transcript into an LLM and ask for my desired output. No need to clean up the transcript ahead of time.
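
That workflow is basically this sketch (the model name, prompt, and file path are just examples; any chat-style LLM API works):

    from openai import OpenAI

    client = OpenAI()
    transcript = open("transcript.txt").read()  # raw auto-generated transcript, uncleaned

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; use whatever you have access to
        messages=[
            {"role": "system", "content": "You extract information from noisy video transcripts."},
            {"role": "user", "content": "List the materials mentioned in this transcript:\n\n" + transcript},
        ],
    )
    print(response.choices[0].message.content)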


Whisper is generally better than the one built into YouTube.

