
My current holy grail is my attempt to convert a Shipibo (an indigenous Peruvian language)-to-Spanish dictionary into a Shipibo-to-English dictionary. The PDF I have (freely available on archive.org) isn't a great scan (though I think it'd be a heck of a lot easier than some of the handwritten examples they show). The layout (two columns) along with headers/footers can cause some headaches, but it is all Latin script. This seems to fall on its face pretty badly (not even a couple of pages in), so my search continues. (The other major problem I'm having is trying to separate out the Shipibo definitions/examples from the Spanish ones, and only translating the Spanish to English...so pretty complex I guess. I've been taking fresh stabs at this project every few months when I see OCR/LLM news pop up, and I continue to be disappointed.)




I'm assuming you're interested in studying Ayahuasca traditions?

I recently learned that traditionally in Shipibo culture, ayahuasca was never meant to be given to "the normal mind". Instead the maestras would be the ones taking the ayahuasca in order to help guide them into diagnosing people dealing with various sicknesses.

These maestras were also ranked by how many different plants they'd done a dieta on. A dieta is kinda similar to fasting. You can't shower with soap, you can't have sex, you can't have too much salt/seasoning, can't be exposed to too much smoke, can't have alcohol, etc. And you use that specific plant throughout your time. Basically you want to eliminate any conflicting variables so you can experience the plant as purely as possible to understand its effects. Traditionally these dietas could last over a year but modern day maestros typically do them for just a few weeks.

I don't really have a point to this. Just found it fascinating how deeply and strictly they study certain plant medicines and wanted to share


Yes, essentially. I've got a few resources cobbled together over the last few years but it'd be really nice to have this reference (my Spanish isn't the best, and running to the translator for a definition can be a little annoying). Also to share with fellow learners/apprentices I know. There are a couple of classes out there (which are actually geared more toward the ceremonial/icaro language, not purely conversational Shipibo, which is a bit simpler as you don't need to worry as much about conjugation and other complexities) which I might look into eventually.

(Fwiw I've accumulated a couple of years' worth of dieta under my belt and am well aware of the restrictions! It's indeed very fascinating. I've been pretty serious about it the last few years and I've barely scratched the surface.)


Couldn't you use your smartphone and Google Lens (on Android, Google app on iOS includes Google Lens functionality) to translate the Spanish to English?

FYI - Lens on Android does in-place language translation, including attempting to use the same or a similar font to the one the original text is printed in.

Unfortunately, I don't think Lens can be used in an automated batch translation mode to convert an entire book/multiple pages.


I suppose both of us watched the same YouTube video by Metta Beshay (I think that is his name?)

I applaud your efforts, but that seems difficult to me. There's so much nuance in language, and the original Spanish translation would even depend on the locale the original dictionary was written for. It would also depend on when it was written, as language changes over time.

And that translation is likely only a rough approximation, as words don't often translate directly. Adding an extra step (Spanish -> English) seems like another layer of imperfect (due to language) abstraction.

Of course your efforts are targeting a niche, so likely people will understand the attempt and be thankful. I hope this suggestion isn't too forward, but this being an electronic version, you could allow some way for the original Spanish to be shown if desired. That sort of functionality would be quite helpful; even non-native Spanish speakers might get a clearer picture.

What tools are you using to abstract all of this?

If the spacing and columns of the images are consistent, I'd think ImageMagick would allow you to automate extraction by column (e.g., cutting the individual pages up), and OCR could then get to work.
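
A rough sketch of that column cut, assuming a two-column layout that splits at the page midpoint and ImageMagick 7's `magick` binary (on ImageMagick 6 the command would be `convert`). The function names, file names, page size and `gutter` parameter are all made up for illustration. It only builds and prints the commands; actually running them is left commented out.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (width, height, x_offset, y_offset)


def column_boxes(page_w: int, page_h: int, gutter: int = 0) -> List[Box]:
    """Return crop boxes for the left and right columns of a two-column page.

    Assumes the columns split at the horizontal midpoint. `gutter` is a
    hypothetical number of pixels trimmed from the inner edge of each column.
    """
    mid = page_w // 2
    left = (mid - gutter, page_h, 0, 0)
    right = (page_w - mid - gutter, page_h, mid + gutter, 0)
    return [left, right]


def crop_command(src: str, dst: str, box: Box) -> List[str]:
    """Build an ImageMagick crop command using WxH+X+Y geometry."""
    w, h, x, y = box
    # +repage drops the leftover canvas offset after cropping.
    return ["magick", src, "-crop", f"{w}x{h}+{x}+{y}", "+repage", dst]


if __name__ == "__main__":
    # 2480x3508 is just an example page size (A4 at 300 dpi).
    for i, box in enumerate(column_boxes(2480, 3508, gutter=20), start=1):
        cmd = crop_command("page-1.png", f"page-1-col{i}.png", box)
        print(" ".join(cmd))
        # import subprocess; subprocess.run(cmd, check=True)  # to really run it
```

If the gutter drifts from page to page, the midpoint assumption breaks, which is where the manual y/n check comes in.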

For the Shipibo side, I'd want to turn off all LLM interpretation. That tends to use known groupings of words to probabilistically determine best-match, and that'd wreak havoc in this case.

Back to the images, once you have ImageMagick chop and sort, writing a very short script to iterate over the pages, display them, and prompt with y/n would be a massive time saver. Doing so at each step would be helpful.

For example, one step? Cut off header and footer, save to dir. Using helpful naming conventions (page-1, and page-1-noheader_footer). You could then use ImageMagick to combine page-1 and page-1-noheader_footer side by side.
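
Something like this is what I have in mind for that step, as a sketch only. It assumes ImageMagick 7's `magick` binary (`convert` on version 6) and fixed header/footer heights in pixels. The helper names and the `-compare` suffix are invented here; the `-noheader_footer` suffix is the naming from above. The functions just build the command lists so you can inspect them before running anything.

```python
from pathlib import Path
from typing import List


def trimmed_name(page: str) -> str:
    """page-1.png -> page-1-noheader_footer.png"""
    p = Path(page)
    return str(p.with_name(f"{p.stem}-noheader_footer{p.suffix}"))


def trim_command(page: str, header_px: int, footer_px: int) -> List[str]:
    """Chop header_px rows off the top and footer_px rows off the bottom."""
    return [
        "magick", page,
        "-gravity", "North", "-chop", f"0x{header_px}",
        "-gravity", "South", "-chop", f"0x{footer_px}",
        "+repage", trimmed_name(page),
    ]


def compare_command(page: str) -> List[str]:
    """Join the original and the trimmed page left to right with +append."""
    p = Path(page)
    out = str(p.with_name(f"{p.stem}-compare{p.suffix}"))
    return ["magick", page, trimmed_name(page), "+append", out]


if __name__ == "__main__":
    # The 120/90 pixel heights are placeholders; measure a few real pages.
    print(" ".join(trim_command("page-1.png", 120, 90)))
    print(" ".join(compare_command("page-1.png")))
```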

Now run a simple bash vet script. Each of 500 pages pops up, you instantly see the original and the cut result, and you hit y or n. One could go through 500 pages like this in 10 to 20 minutes, and you'd be left with a small subset of pages that didn't get cut properly (extra large footer or whatever). If it's down to 10 pages or some such, that's an easy tweak and fix for those.
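
I said bash, but the vet loop is just as short in Python. A minimal sketch, where `vet_pages` and its parameters are names I made up: `show` is whatever displays the comparison image (for example a function that launches your image viewer through `subprocess`), and `ask` defaults to the built-in `input` so it can be swapped out.

```python
from typing import Callable, Iterable, List


def vet_pages(
    pages: Iterable[str],
    show: Callable[[str], None],
    ask: Callable[[str], str] = input,
) -> List[str]:
    """Show each comparison image, ask y/n, and return the rejected pages."""
    rejected: List[str] = []
    for page in pages:
        show(page)
        answer = ""
        # Keep asking until we get a clear y or n.
        while answer not in ("y", "n"):
            answer = ask(f"{page} cut correctly? [y/n] ").strip().lower()
        if answer == "n":
            rejected.append(page)
    return rejected
```

The returned list is the small subset of pages that needs a tweaked cut, and the same function works unchanged for the column step.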

Once done, you could do the same for column cuts. You'd already have all the scripts, so it's just tweaking.

I'm mentioning all of this because a combination of automation and human intervention is often the best approach to something like this.

Anyhow, good luck!


Once you have managed to get the data out and structured, you may want to check out dict.press. It's a dictionary publishing and management tool (which I maintain). Multiple widely used Indian dictionary projects run on it.


