BETA: Skritter | Generating vocab lists for full texts


Generating vocab lists for full texts

葛修远 October 14th, 2011 8:50p.m.

I think it'd be really cool to generate complete vocab lists for famous texts such as the Analects, so that students could choose to study these and know they're covering all the vocab.

Generating the list isn't hard, but now I have 1637 items in a text file.

I have a couple of questions about this:

1. Is there any performance issue with having such a huge list? At 100 items per section, it would need 17 sections. What I'm thinking is that Skritter would skip the items you already know and add only the ones you don't. Would that be efficient?

2. Do people think it's a good idea to handle the issue in this way? Or should I try to create a more specialised vocab list? It's just quite difficult to efficiently determine what's "specialised" and what's general vocab that shouldn't be on the list.

Personally I think it'd be cool to just have all 1600-odd items and let Skritter handle it; I'm just worried it'll have performance issues.
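For anyone who wants to check the section arithmetic, here is a minimal sketch of splitting a flat list into sections of 100. The function name and the placeholder items are made up for illustration; only the count of 1637 and the section size of 100 come from the post above.

```python
def split_into_sections(items, section_size=100):
    """Split a flat list of vocab items into consecutive sections."""
    return [items[i:i + section_size] for i in range(0, len(items), section_size)]

# Placeholder entries standing in for the real vocab; only the count (1637) matters here.
items = [f"item{n}" for n in range(1637)]
sections = split_into_sections(items)

print(len(sections))      # 17
print(len(sections[-1]))  # 37 (the last section is only partly filled)
```

So 1637 items at 100 per section comes to 16 full sections plus one short one.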

nick October 14th, 2011 9:48p.m.

I think you'll be doing a ton of extra work to Skritter-memorize every single word and character in the Analects, many of which may be in the long tail of words used once that you would never see again. A better approach might be to only make a list out of those which appear X times or more.
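The "X times or more" cutoff could be sketched roughly like this. This is an illustration only, not anything Skritter provides: `frequent_items` is a hypothetical helper, the sample sentence is just a stand-in for a full text, and it counts single characters because no word segmentation is assumed.

```python
from collections import Counter

def frequent_items(tokens, min_count):
    """Return the tokens that occur at least min_count times, most frequent first."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.most_common() if n >= min_count]

# Stand-in sample text; a real run would read the whole text from a file.
text = "学而时习之，不亦说乎？有朋自远方来，不亦乐乎？"

# Count single characters, skipping punctuation (isalpha() is False for it).
chars = [ch for ch in text if ch.isalpha()]

print(frequent_items(chars, 2))  # ['不', '亦', '乎']
```

With a segmented text you would pass the list of words instead of `chars`; the counting step stays the same.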

Skritter can handle a list that large, although if you split it into more, smaller sections, each section will load faster (but the list overview page, where you see all the sections, will load more slowly).

Are you just doing characters, or do you have a segmented copy of the text such that it's broken down into words, too? That's generally the stumbling block in doing this--how to split text into words appropriately. We haven't implemented anything that does this yet, although it's on our list.

Netbrian October 14th, 2011 10:24p.m.

For Japanese, I've been using the utility http://forum.koohii.com/viewtopic.php?pid=132000#p132000 to generate word frequency lists from the text I want to practice on, and then importing all words above a certain frequency threshold into a Skritter vocabulary list. I haven't figured out how well this works yet though. :)

My understanding is that this is even easier to do with Chinese, and that there are tools online to do it with, but I'm not familiar with it.

雅各 October 14th, 2011 10:35p.m.

@nick Splitting words using the "longest match" dictionary lookup method for segmenting text seems to work reasonably well; I use it for my study all the time. Is there a particular reason you see that kind of segmentation as not being appropriate?
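For readers who haven't seen it, "longest match" segmentation can be sketched as below. This is a bare-bones forward version under some assumptions: the tiny `dictionary` set is a hypothetical stand-in for a real word list, `max_len` is an arbitrary cap on word length, and anything not found in the dictionary falls back to a single character.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical mini-dictionary; a real one would be loaded from a word list file.
dictionary = {"君子", "小人"}

print(longest_match_segment("君子不器", dictionary))  # ['君子', '不', '器']
```

Because the method is greedy, it can pick a long dictionary word where two shorter ones were intended, which is the kind of inaccuracy a more complicated segmenter tries to avoid.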

GrandPoohBlah October 14th, 2011 11:34p.m.

If you are reading texts such as the Lunyu and Daodejing, then I suspect you already have a decent background in classical Chinese, and so it would probably be easier for you to simply go through the text and pick out the characters you aren't familiar with. That said, I do think it would be much more efficient to study those characters in the context of the text in which they appear, since context greatly affects the meaning of characters in classical Chinese.

(looking at your website, I see that you've already had practice with translating classical texts, so perhaps my advice isn't so useful after all...)

葛修远 October 15th, 2011 12:27a.m.

Well, I spend quite a lot of time working on Classical Chinese for my degree. Basically I'd like to use Skritter as a sort of catch-all safety net to guarantee I've covered every word / character that might come up.

So long as the Skritter system can handle such a list, I think I might go ahead and do it (I have software to segment large texts into vocab).

葛修远 October 15th, 2011 1:30a.m.

OK, so the list is here:

http://www.skritter.cn/vocab/list?list=124191330

My idea is just that if you cover that list, you know you can have a good stab at the Analects.

It has a lot of items in it, but if you've already been studying a while most of them should already be covered. Skritter can then pick out the ones you don't know.
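The idea of picking out only what you don't know boils down to a set difference. The sketch below is not how Skritter does it internally (that isn't described in this thread); `items_to_add` and the sample data are hypothetical, just to show the principle.

```python
def items_to_add(list_items, known_items):
    """Return the list items not already known, in list order and without duplicates."""
    known = set(known_items)
    seen = set()
    new_items = []
    for item in list_items:
        if item not in known and item not in seen:
            seen.add(item)
            new_items.append(item)
    return new_items

# Made-up example: two of the three distinct list items are already known.
list_items = ["学", "而", "君子", "学"]
known_items = ["学", "而"]

print(items_to_add(list_items, known_items))  # ['君子']
```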

nick October 15th, 2011 9:47a.m.

Yeah, I guess longest match works okay for most purposes. We're looking to do something more accurate (and so more complicated) for Skritter, is all.

Great list, aeriph!

Kewt October 21st, 2011 11:28a.m.

I've got a question about all this… I would like to generate vocabulary lists from texts I read in Chinese. It seems not to be so difficult, according to 葛修远, but I don't know how to do it! I think it would be very useful for me.

thx
