WebDoc

http://corpus.leeds.ac.uk/serge/webgenres/google.html

The project aims to classify web pages automatically. The web contains many different kinds of documents, from games and shopping pages to journalism and blogs, and each kind has quite different uses and characteristics. A query for 'Venice', for example, returns pages of various types: recent news, historical information, guidebooks, hotel lists, opinions about hotels and restaurants, etc. For many applications (language teaching, machine translation, information retrieval and extraction) it is also important to be able to select a subcorpus according to specific parameters, such as encyclopedic knowledge vs. instructions, texts written for professionals vs. for the general public, or opinions vs. factual text.

In this project we will work with languages from different families, so that the method can be shown to be portable to further languages. We will test the approach on web pages in English, Chinese, German and Russian.

Hand in hand with classifying pages, we need to identify the categories we shall classify them into. The web is new, and this is not an area that has been widely researched to date. We shall adopt an iterative approach: classifying samples of web pages to see which pages fit the existing classification scheme, and amending the scheme to accommodate those that do not.
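
The iterative approach can be illustrated with a toy sketch. It is not the project's actual method: the cue-word matching below is only a stand-in for a human or automatic classifier, and the names `Scheme`, `assign` and `classify_sample`, the category labels and the sample pages are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumption: a scheme maps each category name to a set of cue words.
# This is a toy stand-in for whatever judgement decides that a page "fits".
Scheme = Dict[str, Set[str]]


def assign(page: str, scheme: Scheme) -> Optional[str]:
    """Return the first category whose cue words occur in the page, else None."""
    words = set(page.lower().split())
    for category, cues in scheme.items():
        if words & cues:
            return category
    return None


def classify_sample(
    pages: List[str], scheme: Scheme
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split a sample into pages that fit the scheme and pages that do not."""
    fitted: Dict[str, List[str]] = {}
    unfitted: List[str] = []
    for page in pages:
        category = assign(page, scheme)
        if category is None:
            unfitted.append(page)
        else:
            fitted.setdefault(category, []).append(page)
    return fitted, unfitted


# Hypothetical starting scheme and sample pages.
scheme: Scheme = {
    "news": {"reported", "yesterday"},
    "shopping": {"price", "basket"},
}
sample = [
    "Floods were reported in Venice yesterday",
    "Add the guidebook to your basket",
    "In my opinion the hotel was overrated",
]

# Round 1: classify the sample and collect the pages that fit no category.
fitted, unfitted = classify_sample(sample, scheme)

# Round 2: amend the scheme to cover what did not fit, then reclassify.
scheme["opinion"] = {"opinion", "overrated"}
fitted2, unfitted2 = classify_sample(sample, scheme)
```

In the first round the third page fits no category; once the scheme has been amended with a new category, the second round leaves no page unclassified. In practice the amendment step would be a human decision about which categories to add, merge or split.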

Contacts: Serge Sharoff
