A Syntactic Feature Counting Method for Selecting Machine Translation Training Corpora


Chapter Summary

Recently, the idea of “domain tuning”, or customizing lexicons to improve results in machine translation and summarization tasks, has driven the need for better testing and training corpora. Traditional approaches to automated document identification rely on word-based methods to find the genre, domain, or authorship of a document. However, selecting good training corpora, especially for machine translation systems, requires automated document selection methods that do not rely on these traditional lexically based techniques. Because syntactic structures and syntactic feature densities can heavily affect machine translation quality, syntactic feature-based methods of document selection should be used in choosing training and testing corpora. This paper provides evidence that document genres can be distinguished on the basis of syntactic-tag densities alone, supporting the idea that automated document identification is possible using alternative methods. Such methods would be ideal for creating corpora that are syntactically as well as lexically balanced for both genre and subject matter.
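As a rough illustration of what a syntactic-tag density could look like in practice, the sketch below computes the relative frequency of each tag in a document and compares documents by cosine similarity. Everything in it is an assumption made for illustration: the summary does not say which tag set, which features, or which comparison measure the chapter uses, so the Penn Treebank-style tags, the hand-tagged toy sentences, and the function names `tag_densities` and `cosine_similarity` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def tag_densities(tagged_tokens):
    """Return the relative frequency of each tag in one tagged document.

    tagged_tokens is a list of (token, tag) pairs. The tagger that would
    produce them is outside this sketch (assumption: tagging is done upstream).
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two tag-density dictionaries."""
    tags = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in tags)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, hand-tagged documents (illustrative only, not data from the chapter).
news_a = [("The", "DT"), ("minister", "NN"), ("announced", "VBD"),
          ("the", "DT"), ("budget", "NN"), (".", ".")]
news_b = [("A", "DT"), ("court", "NN"), ("rejected", "VBD"),
          ("the", "DT"), ("appeal", "NN"), (".", ".")]
chat = [("you", "PRP"), ("coming", "VBG"), ("or", "CC"),
        ("what", "WP"), ("?", ".")]

same_genre = cosine_similarity(tag_densities(news_a), tag_densities(news_b))
cross_genre = cosine_similarity(tag_densities(news_a), tag_densities(chat))
print(f"news vs news: {same_genre:.3f}")
print(f"news vs chat: {cross_genre:.3f}")
```

Note that the two news sentences share no content words yet have identical tag profiles, which is the kind of signal a purely lexical method would miss. A real selection pipeline would need much longer documents and a properly evaluated classifier or clustering step before drawing any conclusion about genre.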



In: Corpus Linguistics Beyond the Word