
I was not aware this existed, and it looks cool! I am definitely going to set aside some time to explore it further.

I have a couple of questions for now: (1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense".
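
To make (1) concrete, here is roughly the pattern I have in mind (the libraries and parameters are just illustrative choices):

    import numpy as np
    import umap      # umap-learn
    import hdbscan

    emb = np.random.rand(1000, 384)   # stand-in for precomputed text embeddings

    # The clustering algorithm runs *on* (a reduction of) the embeddings.
    reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(emb)
    labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(reduced)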

(2) Have you tried PaCMAP [1]? I found it to produce high-quality results quickly when I tried it. I haven't tried it in a while though - and I vaguely remember it wouldn't install properly on my machine (a Mac) the last time I reached for it. Their group has some new stuff coming out too (on the linked page).

[1] https://github.com/YingfanWang/PaCMAP


We generally run UMAP on regular semi-structured data like database query results. We automatically feature-encode that for dates, bools, low-cardinality vals, etc. If there is text, and the right libs are available, we may also use text embeddings for those columns. (cucat is our GPU port of dirtycat/skrub, and pygraphistry's .featurize() wraps around that.)
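
Roughly, the encoding step has this shape (a simplified sklearn stand-in, not our actual .featurize() internals):

    import pandas as pd
    import umap
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "created": pd.to_datetime(["2024-01-01", "2024-02-15"] * 50),
        "active": [True, False] * 50,
        "plan": ["free", "pro"] * 50,              # low-cardinality categorical
    })
    df["created"] = df["created"].astype("int64")  # date -> epoch ns

    enc = ColumnTransformer([
        ("num", StandardScaler(), ["created", "active"]),
        ("cat", OneHotEncoder(), ["plan"]),
    ])
    xy = umap.UMAP().fit_transform(enc.fit_transform(df))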

My last sentence was about higher-value problems: we are finding it makes sense to go straight to GNNs, LLMs, etc., and embed multidimensional data that way, rather than via UMAP dim reductions. We can still use UMAP as a generic hammer to control further dimensionality reductions, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially skip the UMAP for that too.
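
I.e., something of this shape, where the model choice is just illustrative:

    import umap
    from sentence_transformers import SentenceTransformer

    records = ["each row serialized as text ..."] * 500
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(records)  # the model does the hard part
    xy = umap.UMAP(n_components=2).fit_transform(emb)              # UMAP only for the final layout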

Re: PaCMAP - we have been eyeing several new tools here, but so far haven't felt the need internally to move from UMAP to them. We'd need to see significant improvements, given that the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but the creators haven't made the engineering investment, so internally we'd rather stay with UMAP. Our API is pluggable, so you can pass in results from other tools, but we haven't heard much from others going down that path.


Thank you. Your comment about using LLMs to semantically parse diverse data as a first step makes sense. In fact, come to think of it, in the area of prompt optimization too - such as in MIPROv2 [1] - the LLM is used to create initial prompt guesses based on its understanding of the data. And I agree that UMAP still works well out of the box, and has been that way pretty much since its introduction.

[1] Section C.1 in the Appendix here https://arxiv.org/pdf/2406.11695


I’m working on a new UMAP alternative - curious what kinds of improvements you’d be interested in?

A few things

Table stakes for our bigger users:

- parity or improvement on perf, for both CPU & GPU mode

- better support for learning (fit->transform) so we can embed billion+ scale data (see the sketch after this list)

- expose inferred similarity edges so we can do interactive and human-optimized graph viz, vs overplotted scatterplots

New frontiers:

- alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change, and compare across runs, e.g., day-over-day analyses. This area is not yet well-defined, but it is common for anyone operational, so it seems ripe for innovation

- maybe better support for mixing input embeddings. This seems increasingly common in practice, and seems worth examining as special cases
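
On the fit->transform point above, the pattern we want to scale is basically:

    import numpy as np
    import umap

    X = np.random.rand(200_000, 64)                # stand-in for a big table
    sample = X[np.random.choice(len(X), 20_000, replace=False)]

    reducer = umap.UMAP().fit(sample)   # learn the map once on a sample...
    emb = reducer.transform(X)          # ...then push everything else through transform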

Always happy to pair with folks on getting new plugins into the pygraphistry / graphistry community, so if/when you're ready, I'm glad to help push a PR & demo through!


> alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change, and compare across runs, e.g., day-over-day analyses. This area is not yet well-defined, but it is common for anyone operational, so it seems ripe for innovation

It is probably not all the things you want, but AlignedUMAP can do some of this right now: https://umap-learn.readthedocs.io/en/latest/aligned_umap_bas...
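
The basic shape of the API, with toy data standing in for two time slices (see the docs for real usage):

    import numpy as np
    import umap

    day1 = np.random.rand(500, 32)                 # toy "day over day" slices
    day2 = day1 + 0.05 * np.random.randn(500, 32)
    relation = {i: i for i in range(500)}          # row i in day1 <-> row i in day2

    mapper = umap.AlignedUMAP().fit([day1, day2], relations=[relation])
    emb1, emb2 = mapper.embeddings_                # layouts aligned across the slices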

If you want to do better than that, I would suggest that the quite new landmarked parametric UMAP options are actually very good for this: https://umap-learn.readthedocs.io/en/latest/transform_landma...

Training the parametric UMAP is a little more expensive, but the new landmark-based updating really does allow you to steadily update with new data and have new clusters appear as required. Happy to chat as always, so reach out if you haven't already looked at this and it seems interesting.


Timefold looks very interesting. This might be irrelevant but have you looked at stuff like InfoBax [1]?

[1] https://willieneis.github.io/bax-website/


I haven't; from a quick reading, InfoBax is for when you have an expensive function and want to do limited evaluations. Timefold works with cheap functions and does many evaluations; it does this efficiently via Constraint Streams. So a function like:

    // Naive full recompute: penalize each pair of distinct shifts that are
    // assigned to the same employee and overlap in time.
    var score = 0;
    for (var shiftA : solution.getShifts()) {
        for (var shiftB : solution.getShifts()) {
            if (shiftA != shiftB && shiftA.getEmployee() == shiftB.getEmployee() && shiftA.overlaps(shiftB)) {
                score -= 1;
            }
        }
    }
    return score;
usually takes |shifts|^2 evaluations of overlaps(); with Constraint Streams, we only re-check the shifts affected by a change (turning it from O(N^2) into O(1) per move, usually).

That being said, it might be useful for a move selector. I need to give it a more in-depth reading.


Thanks for the example. Yes, true, this is for expensive functions - to be precise, for functions that depend on data that is hard to gather, so you interleave computing the function's value with strategically gathering just as much data as is needed to compute it. The video on their page [1] is quite illustrative: calculate the shortest path on a graph where the edge weights are expensive to obtain. Note how the edge weights they end up obtaining form a narrow band around the shortest path they find.

[1] https://willieneis.github.io/bax-website/


You don't - the way I use LLMs for explanations is that I keep going back and forth between the LLM explanation and Google search / Wikipedia. And of course asking the LLM to cite sources helps.

This might sound cumbersome, but without the LLM I wouldn't have (1) known what to search for, (2) in a way that lets me incrementally build a mental model. So it's a net win for me. The only gap I see is coverage/recall: when asked for different techniques to accomplish something, the LLM might miss some - and what is missed depends on the specific LLM. My solution here is asking multiple LLMs and going back to Google search.


Love awk. In the early days of my career, I used to write ETL pipelines, and awk helped me condense a lot of stuff into a small number of LOC. I particularly prided myself on writing terse one-liners (some probably undecipherable, ha!); but I did occasionally write scripts. Now I mostly reach for Python.


I'm curious to know if Anthropic mentions anywhere that they use speculative decoding. For OpenAI, they do seem to use it, based on this tweet [1].

[1] https://x.com/stevendcoffey/status/1853582548225683814


Wouldn't this be an optimization problem - that's to say, something z3 should be able to handle [1], [2]?

I was about to suggest probabilistic programming, e.g., PyMC [3], as well, but it looks like you want the optimization to occur autonomously after you've specified the problem - which is different from the program drawing insights from organically accumulated data.
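
For example, z3's Optimize interface lets you state constraints plus an objective and have the solver search on its own (toy numbers):

    from z3 import Int, Optimize, sat

    x, y = Int("x"), Int("y")
    opt = Optimize()
    opt.add(x >= 0, y >= 0, x + y <= 10)   # constraints
    opt.maximize(3 * x + 2 * y)            # objective
    if opt.check() == sat:
        print(opt.model())                 # e.g., x = 10, y = 0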

[1] https://github.com/Z3Prover/z3?tab=readme-ov-file

[2] https://microsoft.github.io/z3guide/programming/Z3%20Python%...

[3] https://www.pymc.io/welcome.html


Hadn't seen this before, very nice read, thank you!


This is the definitive reference on the topic! I have some notes as well, if you want something concise that doesn't ignore the math [1].
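
(And if you just want the central object - a posterior mean plus an uncertainty estimate - in one screen, here is a minimal sklearn sketch:)

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    X = np.linspace(0, 10, 20).reshape(-1, 1)
    y = np.sin(X).ravel()

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)
    mean, std = gp.predict([[5.5]], return_std=True)   # posterior mean and std at a new point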

[1] https://blog.quipu-strands.com/bayesopt_1_key_ideas_GPs#gaus...


These are very cool, thanks. Do you know what kind of jobs are more likely to require Gaussian process expertise? I have experience in using GP for surrogate modeling and will be on the job market soon.

Also a resource I enjoyed is the book by Bobby Gramacy [0] which, among other things, spends a good bit on local GP approximation [1] (and has fun exercises).

[0] https://bobby.gramacy.com/surrogates/surrogates.pdf

[1] https://arxiv.org/abs/1303.0383


Aside from Secondmind [1], I don't know of any companies (only because I haven't looked)... But if I had to look for places with a strong research culture on GPs (if that's what you're after), I would find relevant papers on arXiv and Google Scholar, and see if any of them come from industry labs. If I had to guess at industries using Bayesian tools at work, maybe the ones to look at would be advertising and healthcare. I would also look out for places that hire econometricians.

Also thank you for the book recommendation!

[1] https://www.secondmind.ai/


Your tutorials show a real talent for visualization. I never grokked SVMs before I came across your Medium page at https://medium.com/cube-dev/support-vector-machines-tutorial... . Thanks!


Thank you for your kind comment!


Active Learning is a very tricky area to get right ... over the years I have had mixed luck with text classification, to the point that my colleague and I decided to perform a thorough empirical study [1] that normalized the various experiment settings individual papers had reported. We observed that, post normalization, randomly picking instances to label is better!

[1] https://aclanthology.org/2024.emnlp-main.1240/


We successfully built an AL pipeline by building models that quantify both aleatoric and epistemic uncertainties, and using those quantities to drive our labeling efforts.

Specifically, post training, you measure those on a holdout set and then slice the results based on features. While these models tend to be more complex and potentially less understandable, we feel the pros outweigh the cons.

Additionally, giving your end users access to a confidence score is really useful for getting them to trust the predictions, and in case there is a non-zero cost to acting on false positives/negatives, you can try to come up with a strategy that minimizes the expected costs.
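
(For a flavor of how the aleatoric/epistemic split can be computed - one common recipe is ensemble-based; this sketch is illustrative, not necessarily our exact setup:)

    import numpy as np

    def entropy(p):
        return -(p * np.log(p + 1e-12)).sum(axis=-1)

    # probs: class probabilities from an ensemble, shape (n_members, n_samples, n_classes)
    probs = np.random.dirichlet(np.ones(3), size=(5, 100))   # toy stand-in

    total = entropy(probs.mean(axis=0))       # predictive entropy
    aleatoric = entropy(probs).mean(axis=0)   # expected entropy across members
    epistemic = total - aleatoric             # disagreement (mutual information)

    to_label = np.argsort(-epistemic)[:10]    # prioritize epistemically uncertain samples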


Evals somehow seem to be very, very underrated, which is concerning in a world where we are moving towards (or trying to move towards) systems with more autonomy.

Your skepticism of "LLM-as-a-judge" setups is spot on. If your LLM can make mistakes/hallucinate, then of course your judge LLM can too. In practice, you need to validate your judges and possibly adapt them to your task based on sample annotated data. You might adapt them by trial and error, by prompt optimization, e.g., using DSPy [1], or by learning a small correction model on top of their outputs, e.g., LLM-Rubric [2] or Prediction-Powered Inference [3].

In the end, using the LLM as a judge confers just these benefits:

1. It is easy to express complex evaluation criteria. This does not guarantee correctness.

2. Seen as a model, it is easy to "train", i.e., you get all the benefits of in-context learning, e.g., prompt-based few-shot learning.

But you still need to evaluate and adapt them. I have notes from a NeurIPS workshop from last year [4]. Btw, love your username!
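
As a toy illustration of the "small correction model" idea, a calibrator learned on a handful of human labels (the numbers are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    judge = np.array([[0.2], [0.9], [0.7], [0.1], [0.8], [0.4]])  # LLM-judge scores
    human = np.array([0, 1, 1, 0, 1, 0])                          # gold labels on the same items

    calib = LogisticRegression().fit(judge, human)                # learn the correction
    p = calib.predict_proba([[0.75]])[0, 1]                       # calibrated P(good)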

[1] https://dspy.ai/

[2] https://aclanthology.org/2024.acl-long.745/

[3] https://www.youtube.com/watch?v=TlFpVpFx7JY

[4] https://blog.quipu-strands.com/eval-llms

