Computing the ambiguities part 1
Natural language processing is a field of data science in which we use textual information to comprehend semantic meaning or to extract information.
Nowadays, many big firms like Google and Facebook are interested in understanding social communities, their needs, and their interactions with other communities. To perform those tasks they found themselves in need of a model of human behaviour: a predictive model based on human activities and communication. Such a model relies on features that people consider sensitive, like monitored locations or social networking activity, and many people are not willing to give up such information due to privacy concerns. So the firms turned to another tempting model, one that cannot be altered by consumers' choices: the language model. It cannot be altered because it is based on a neo-fundamental need, the need for a social life that extends beyond physical limits. Since it is becoming a need for anyone to have a virtual social life, to promote a career and to share experiences, big firms are confident that the basis of their model is going to last.
This model uses social content such as posts, likes, notes, comments, and messages to profile people into categories of interest, which can be crucial when it comes to running efficient advertisement campaigns, or in simple words, sending the right ad to the right person. Since we only think using a language, the profiling model can only be a language model, and we have at least these two use cases: the content is written either in a standard language like English or Arabic, or in a non-standard language like the Tunisian dialect.
Making a language model for a standard language is not that hard; we just need a large dataset of verified textual content. From that dataset we first perform layout segmentation to extract paragraphs, titles, and sentences. Then we apply the simplest statistical approach, counting occurrences, in order to extract a vocabulary and relations between words. Afterwards, from those relations and with the use of grammatical standards, we extract semantic features such as the syntactic tree.
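The occurrence-counting step above can be sketched in a few lines of Python. This is only a minimal illustration: the whitespace tokenizer, the toy corpus, and the function name are assumptions for the sake of the example, not a real segmentation pipeline.

```python
from collections import Counter

def extract_statistics(sentences):
    """Count word occurrences and adjacent-word pairs (bigrams).

    Word frequencies give a first vocabulary; bigram counts give a
    crude picture of which words tend to occur together.
    """
    vocab = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        vocab.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return vocab, bigrams

# Tiny hypothetical corpus, standing in for a large verified dataset.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
vocab, bigrams = extract_statistics(corpus)
print(vocab["the"])            # → 4  (occurrences of "the")
print(bigrams[("sat", "on")])  # → 2  (how often "sat" precedes "on")
```

From counts like these one can estimate conditional probabilities between words, which is the starting point for the grammatical and semantic analysis mentioned above.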
For a non-standard language this can be a little bit tricky. Take the Tunisian community: people use a dialect that is a mix of Arabic, French, English, and others. This language has no standard; it is a spoken language that is widely used and written on social networks, and it is rare, compared to the usual, to find a Tunisian person writing content in a standard language.
A non-standard language does not have a proper structure, so what we need is a structure model for non-standard language: a model that extends the capabilities of a standard language model to comprehend the semantics behind code switching. For example, let's take this sentence:
مشيت لل pharmacie نلقى روحي ناسيا ال ordonnance.
The English translation is: "I went to the pharmacy only to find that I had forgotten the prescription."
In this case the code switching can help the reader focus on sentence elements that generally carry little information (in terms of probabilities), here the place descriptor. More than that, in terms of entropy, the code switching (a sudden code switch is easily noticed within an entire text) can hold information that accentuates a fact or an element inside the sentence.
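One way to make the "a sudden code switch is easily noticed" observation concrete is to detect where the Unicode script of the tokens changes. The sketch below is an assumption-laden illustration, not a real dialect processor: it uses naive whitespace tokenization and classifies tokens as Arabic or Latin from Unicode character names.

```python
import unicodedata

def token_script(token):
    """Classify a token as ARABIC, LATIN, or OTHER by its letters' Unicode names."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "ARABIC"
            if name.startswith("LATIN"):
                return "LATIN"
    return "OTHER"

def code_switch_points(sentence):
    """Return (index, token) pairs where the script differs from the previous token."""
    switches = []
    prev = None
    for i, tok in enumerate(sentence.split()):
        script = token_script(tok)
        if script == "OTHER":
            continue  # skip punctuation-only or digit-only tokens
        if prev is not None and script != prev:
            switches.append((i, tok))
        prev = script
    return switches

sentence = "مشيت لل pharmacie نلقى روحي ناسيا ال ordonnance"
print(code_switch_points(sentence))
# → [(2, 'pharmacie'), (3, 'نلقى'), (7, 'ordonnance')]
```

The switch points land exactly on the two French nouns, the place descriptor and the forgotten object, which is the intuition above: the rarity of the switch makes those low-probability elements stand out to the reader.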