Simply put, an anglicism is a word or construction borrowed from English into another language. Our project will focus on integral borrowing, which is when a foreign word is borrowed directly into a second language.
When starting this project, we looked at two other studies of loanwords in French, both of which covered a number of source languages in addition to English. Unsurprisingly, many borrowings came from the languages of nearby and neighboring countries, like Spain and Italy. In both studies, though, English loanwords greatly outnumbered loanwords from other languages.
It may be hard for English speakers to understand French's aversion to English loanwords at first, since English itself is very accepting of loanwords. In French, though, a sizable proportion of all borrowings come from English alone, so it is understandable that the Académie française may see English words as invasive.
France is known for being very protective of its language, and even has the Académie française, whose mission is to correct unfavorable speech and keep the French language prestigious. Lately this has involved discouraging invasive loanwords, many of which come from English. We wanted to focus on these anglicisms and look at the Académie's success in promoting its suggested replacements.
We had trouble finding an established French corpus to use that would fit our needs. It takes time to create and update a corpus, so the newest additions could be a few years old, and we wanted to look at French as it's used right now. We did not want to use newspapers or other types of formal writing, because we thought that writers of that style would actively avoid anglicisms. We very briefly considered using a promising corpus of Quebecois French, but abandoned the idea because Quebecois speakers and writers are often very strict about not letting anglicisms creep in, an attitude heavily influenced by Quebec being bordered by English-speaking Canada and the US.
We considered using Reddit posts and comments from French subreddits because the writing would be informal and from younger people, but the posts are not professionally edited, are often quite short, and are sometimes written by users who are not native French speakers.
We finally decided to use BuzzFeed France as a corpus, because the articles are colloquial, professionally written and edited, and produced by and for young people, who are the most likely to use anglicisms. A further benefit of BuzzFeed articles is that they are relatively short and follow a standard format. Occasionally there were full-length articles, but most were lists that could be easily marked up in XML. A caveat of the corpus is that many of the articles are translated from or based on English articles, and many of the topics (especially TV and movies) are American.
We could have created our corpus by going to BuzzFeed France and copying and pasting articles into XML files, but that would have been very tedious. Instead, we used a Python framework called Scrapy to make a spider (a bot that crawls the web) that gathered articles and saved them to our server. With the help of Na-Rae and Andrew, the spider was able to bring back only French articles, and bring them back marked up with everything we needed except the anglicisms.
One of the more challenging parts of building the spider was getting it to scrape only French articles. Nothing in the URLs indicated what language a BuzzFeed article was in, so while we could make the spider stay within BuzzFeed, getting it to stay within French BuzzFeed wasn't so easy. French BuzzFeed has links to English BuzzFeed, but English BuzzFeed only has links to English BuzzFeed (US, UK, Australia, and India). The spider harvests every link on a page, so a single harvested English link would lead it into English BuzzFeed, where it would find no links back, and contaminate the entire corpus.
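The limits of a domain-only filter can be sketched in a few lines of plain Python. This is not the project's spider code: the function names, the ALLOWED_DOMAIN value, and the example paths in the usage note are all hypothetical, and the sketch uses only the standard library rather than Scrapy. It shows that a check on the host name keeps the crawl inside BuzzFeed but says nothing about the language of the page behind each link.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the site's articles all live under this domain.
ALLOWED_DOMAIN = "buzzfeed.com"


def in_allowed_domain(url, domain=ALLOWED_DOMAIN):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def links_to_follow(page_url, hrefs):
    """Resolve harvested hrefs against the page URL and drop off-site links."""
    absolute = (urljoin(page_url, href) for href in hrefs)
    return [url for url in absolute if in_allowed_domain(url)]
```

Calling links_to_follow("https://www.buzzfeed.com/page", ["/a", "https://example.org/x"]) keeps only the first link. A French article and an English article on the same host would both pass, which is why the URL alone was not enough.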
By viewing the source code of BuzzFeed pages in different languages, Devra was able to find something in the HTML of each page that indicated what language it was in. The head HTML element of a page has an attribute called lang. If the page is in English, lang is equal to "en"; for German pages it is "de", and for French pages it is "fr".
Using the lang attribute, Andrew and Devra were able to write an if statement in the spider's code that made it only write the contents of an article out to a new file if the lang attribute was equal to fr.
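A check of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the code Andrew and Devra wrote: it uses Python's standard html.parser module instead of Scrapy, the names LangFinder, page_lang, is_french, and save_if_french are invented for the example, and it looks for lang on either the html or the head tag because lang is most commonly set on the root html element. Accepting regional tags such as "fr-FR" is also an assumption beyond the plain "fr" comparison described above.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the first lang attribute found on an <html> or <head> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if self.lang is None and tag in ("html", "head"):
            value = dict(attrs).get("lang")
            if value:
                self.lang = value.strip().lower()


def page_lang(html_text):
    """Return the page's declared language code, or None if there is none."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang


def is_french(html_text):
    """True if the declared language is "fr" or a regional variant of it."""
    lang = page_lang(html_text)
    return lang is not None and lang.split("-")[0] == "fr"


def save_if_french(html_text, path):
    """Write the page to path only when it is French; report what happened."""
    if is_french(html_text):
        with open(path, "w", encoding="utf-8") as out:
            out.write(html_text)
        return True
    return False
```

Pages with no lang attribute at all are skipped rather than saved, which errs on the side of keeping the corpus clean.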
Majors: French and Linguistics
Certificate: Arabic Studies
Linguistic Interests: Language contact, Dialectal variation and stratification, Phonetics, Phonological change
Project Duties: Developed project idea, XML mark-up, comparative analysis, made fun of the Académie française a lot
Majors: Linguistics and Music
Certificate: Asian Studies
Linguistic Interests: ESL, Computational Linguistics
Project Duties: Corpus creation, spider, analysis
Majors: French and Linguistics
Minor: Economics
Linguistic Interests: Historical Linguistics, Second Language Acquisition
Project Duties: XML markup, XSLT development, analysis
Major: Linguistics
Certificates: Russian and East European Studies, Global Studies
Linguistic Interests: Second Language Acquisition, Sociolinguistics
Project Duties: Site design, research
This project was completed as part of the Computational Methods in the Humanities course at the University of Pittsburgh, under the direction of David Birnbaum. We would also like to thank our project mentor, Mako Ishikawa, along with Andrew Nitz and Na-Rae Han for their help in creating the spider.