Date: Friday 2020.07.24 @ 10:34 am
From: R. David Zorc
To:   April Almarines
RE:   How to make a CORPUS

My very dear April,

  Also, before you reinvent the wheel, please do note that someone has already done a Tagalog Corpus. Please check out:
                http://sealang.net/tagalog/corpus.htm
You certainly should make use of this. I don’t know if they give much detail about their sources, but it is an excellent start.

  As I mentioned in the email I just sent, I realized that this would be a necessary step for your Tagalog Dictionary. When you sent me Quezon’s inaugural address in Tagalog, I decided to do a corpus for just that article. See the 29 series from a through f.
All of these are available via <https://zorc.net/RDZorc/LINGUISTICS_IN_20_HOURS/>.

  First, you should be aware of the 8 steps. { 29a=FUTURE_PROJECTS.txt}

  Second, you need to create a TXT file of one or more complete texts. {29b=Quezon-Inaugural_Address=TAGALOG.txt}

  Third, you should do or have an English translation of the texts {29c=Quezon-Inaugural_Address=ENGLISH.txt}. Depending on whether the translation is literal, free, or creative, this will give you an overall guide as to the meaning of the original text. Here is where a literal translation helps most, but a free or creative one will be somewhat of a guide too.

  Fourth, you need to PARSE (break down) the text into a single line for EACH word. {29d=Quezon-Inaugural_Address=PARSING-step1.txt} Note the number of words = tokens = lines that are generated altogether. This particular article has 2,017.
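
  The parsing step can also be done by a short script rather than by hand. What follows is only a minimal sketch in Python, not the method used for the 29d file: the function name parse_tokens and the rule for what counts as a word (letters or digits, with hyphens and apostrophes allowed inside a word) are assumptions that you would adjust for Tagalog spelling conventions. The sample sentence is the Klata one quoted later in this letter.

```python
import re

def parse_tokens(text):
    """Return the words of `text` in their original order, one per list item."""
    # A word is a run of letters/digits; a hyphen or apostrophe is kept
    # only when it sits between two such runs (e.g. diya’t, be-friends).
    return re.findall(r"\w+(?:[-'’]\w+)*", text)

sample = "Kulli, kinna ponnu ole akap ngo holalak."
tokens = parse_tokens(sample)

print("\n".join(tokens))          # one word per line
print(len(tokens), "tokens")      # the token count to write down
```

  Writing the list out to a TXT file, one word per line, gives the same kind of file as the parsing step describes, and the printed count is the number of tokens.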

  Fifth, you need to SORT alphabetically all the words that appeared. You then need to edit (scan) the list for extra line breaks (blank lines) and also for words that you might wish to break up. Note that somehow or other this file has 2,006 lines, which is 11 fewer than 2,017. If you compare the two you should see why there is a difference.
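
  The sorting and the clean-up of blank lines can be sketched in Python as well. This is an illustration only, under two assumptions: that upper and lower case should sort together, and that a line holding nothing but spaces counts as blank. The name sort_tokens and the six sample lines are made up for the example and are not taken from the Quezon files.

```python
def sort_tokens(lines):
    """Drop blank lines, then sort the remaining words alphabetically."""
    kept = [ln.strip() for ln in lines if ln.strip()]
    # casefold() makes the sort ignore the difference between Ang and ang
    return sorted(kept, key=str.casefold)

parsed = ["ang", "", "bayan", "Ang", "at", "  "]   # hypothetical parsed lines
sorted_list = sort_tokens(parsed)

print("\n".join(sorted_list))
print(len(parsed) - len(sorted_list), "lines removed")
```

  Comparing the two line counts in this way shows how many lines the clean-up removed, which is one way to account for a difference between the parsed file and the sorted one.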

  Sixth, you need to count the words that are identical (that is, eliminate the duplicates and replace them with the number of occurrences = FREQUENCY). This is easy if a word occurs only once, twice, thrice, etc. But for entries that go on and on, I create a blank working document and ask WORD to replace the word with the same word (i.e., REPLACE <ang> with <ang>). Word will give you a count as to how many it replaced, and you need to enter this figure after the first occurrence of that word, and then delete all the rest. Note some of the HIGH FREQUENCY words are: <ang> 96, <at> 89, <ating> 19, <ay> 41 etc.
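
  The counting can also be done without Word. Here is a minimal sketch using Python's standard Counter class; it is an alternative to the find-and-replace trick, not a description of it. The name frequency_list is hypothetical, the short token list is invented for the example, and treating Ang and ang as the same word is an assumption you may not want for proper names.

```python
from collections import Counter

def frequency_list(tokens):
    """Collapse duplicates into (word, frequency) pairs, sorted by word."""
    counts = Counter(t.casefold() for t in tokens)   # assumption: ignore case
    return sorted(counts.items())

tokens = ["ang", "at", "Ang", "ay", "ang", "at"]     # hypothetical sample
for word, freq in frequency_list(tokens):
    print(f"{word}\t{freq}")                         # word, TAB, frequency
```

  Each output line holds the word once, followed by a tab and its frequency, which is the same shape as a list with the duplicates deleted and the count entered after the first occurrence.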

  Seventh, and lastly, you need to translate what you have. I tend to give brief glosses for WORDS or LEXEMES which can be put in quotes <”>, single quotes <’>, or just TABBED <CNTRL ^> over. For FUNCTORS (grammatical items) I tend to give abbreviations in square brackets, such as: [pro-1-sg-nom] = ako ‘I’ (first person singular nominative or topic pronoun), [dp-limiting] = lang, lamang (limiting discourse particle) ‘only, just’. No matter how you treat them, it is imperative (for sorting purposes among others) that your glosses and codes be consistent. I have not done this yet.

  If your CORPUS has many different articles, texts, recordings, or whatever, you must be sure to assign each one a unique abbreviated code, for which you will maintain careful records. Since this is a speech by Quezon, the code could be <QZ> or, if you wind up having lots of things by Quezon, then <Q1> or <QZ01>. This is super important, because when push comes to shove, you will need to know the source of a contentious word so you can check its context quickly. If you have an enormous source of articles, which would be required for something like a dictionary, then the codes might become more complex, although it is wise to keep the code to the minimum number of digits that can accommodate the maximum number of materials you have within that corpus. So, if you wind up with nineteen billion, say 19,876,543,210, entries in your total corpus, you may need an 11- or 12-digit code. If fewer, then the number of digits that can accommodate ALL of your resources.
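
  The arithmetic behind the number of digits is simple enough to sketch in Python. The names code_width and source_code are hypothetical, and the scheme shown (a letter prefix followed by a zero-padded running number) is just one way of building codes like <QZ01>; it assumes you know, or can estimate, the largest number of sources you will ever have.

```python
def code_width(n_sources):
    """Digits needed to give every source from 1 to n_sources its own number."""
    return len(str(n_sources))

def source_code(prefix, number, n_sources):
    """Build a code such as QZ01: prefix plus a zero-padded running number."""
    return f"{prefix}{number:0{code_width(n_sources)}d}"

print(source_code("QZ", 1, 99))        # room for up to 99 Quezon texts
print(code_width(19_876_543_210))      # digits needed for the large example
```

  Zero-padding keeps all codes the same length, so they sort in the right order alongside the words they are attached to.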

  The other thing I need to address is how and why I call translations: LITERAL, FREE, and CREATIVE. A LITERAL translation sticks close to the original (SOURCE) language; a FREE translation sticks close to the TARGET language (English in most of our cases); and a CREATIVE translation waxes eloquent and restates the overall EFFECT of the original language in the equivalent EFFECT of the TARGET language. This involves, for example, translating Aklanon <patugsiling> as “The Golden Rule” = ‘Do unto others as you would have them do unto you.’

  There is also what is called an INTERLINEAR translation, where one stretches out the SOURCE language entries to fit the TARGET language beneath it. Here is an example where I have done this for the Klata version of the “Monkey and Turtle” story.

ENG    NGUMA  NENG      PONNU   OLE  AKAP
[top]  story  [cm-obl]  turtle  and  monkey

Kulli,    kinna    ponnu   ole  akap    ngo     holalak.
long ago  [exist]  turtle  and  monkey  [link]  be-friends

Hotung    oddow  neyye  bonnoow      hila  diya’t       beled.
one-[lk]  day    these  [past]-walk  they  there-[loc]  river
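
  Stretching out the lines by hand is tedious, so here is a minimal Python sketch that pads each SOURCE word and its gloss to the same width. It assumes one gloss per source word and a fixed-width font; the function name interlinear is made up for the example, and the words and glosses are those of the second Klata line above.

```python
def interlinear(source_words, glosses):
    """Return two lines of text with each gloss aligned under its source word."""
    if len(source_words) != len(glosses):
        raise ValueError("each source word needs exactly one gloss")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(s), len(g)) for s, g in zip(source_words, glosses)]
    top = "  ".join(s.ljust(w) for s, w in zip(source_words, widths)).rstrip()
    bottom = "  ".join(g.ljust(w) for g, w in zip(glosses, widths)).rstrip()
    return top + "\n" + bottom

src = ["Kulli,", "kinna", "ponnu", "ole", "akap", "ngo", "holalak."]
gls = ["long ago", "[exist]", "turtle", "and", "monkey", "[link]", "be-friends"]
print(interlinear(src, gls))
```

  A gloss of two words, such as "long ago", stays in one column because it is passed as a single item in the list.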


  I hope I have explained this sufficiently, but if not, ask away, my dear.

Cheers,
Sir David