EBLUL closes

One of the mailing lists I subscribe to has just carried an announcement that the European Bureau for Lesser Used Languages has closed. From their site I notice that the news service Eurolang is also currently closed.

Does anyone know if there is a story behind this, or is it just coincidence – perhaps funding sources that just happened to run out at more or less the same time? Are new organisations emerging to champion the cause of lesser used languages or is the language agenda less important in these times of economic crisis?


Changing rooms at Debenhams

…I really was confused, but it really wasn’t my fault (this time).

I had clearly followed the sign that said “Ystafell wisgo” – unfortunately this really wasn’t a good translation of the English “This way to pay”.

The ladies on the till seemed mildly amused and faintly embarrassed when this was pointed out to them.

A different sign revealed that it wasn’t a good translation for “Cosmetics” either (no, I was still looking for the changing rooms actually).

Why would a company spend money on translating their signs but then have no quality control procedures in place? Surely you wouldn’t need to be a Welsh speaker to think it unlikely that the Welsh for “This way to pay” and “Cosmetics” was the same!


Google gyfieithu

…but how rough? On their Research Blog Google suggest that it is “often good enough to give a basic understanding of the text”, so I decided to put it to the test with some text borrowed from murmur.

Original Welsh:

Dulliau ystadegol o gyfieithu peirianyddol sy’n galluogi’r cyfieithiadau yma – hynny yw, mae Google yn defnyddio’r swmp anferthol o destunau sydd ar gael ar y we i ganfod patrymau tebyg yn eu geiriau a thrwy ddefnyddio testunau cyfochrog mewn gwahanol ieithoedd maent yn chwydu allan yr hyn sy’n cyfateb agosaf mewn iaith arall.

Google translation:

Statistical machine translation methods which enable the translation here – that is, Google will use huge quantities of texts available on the web to find similar patterns in their words and by using parallel texts in different languages they vomiting out what is the nearest equivalent in another language.

Original English:

These translations are produced by statistical approaches to machine translation – that is, Google uses the vast amounts of text on the web in order to find patterns between the words, and by using parallel texts of different languages are able to produce what appear to be translations.

Google translation:

Mae’r rhain cyfieithiadau yn cael eu cynhyrchu gan ddulliau ystadegol i peiriant cyfieithu – hynny yw, mae Google yn defnyddio symiau helaeth o’r testun ar y we er mwyn canfod patrymau rhwng y geiriau, a thrwy ddefnyddio ochr yn ochr testunau o ieithoedd gwahanol yn cael eu gallu i gynhyrchu hyn yn ymddangos yn cyfieithiadau .

Well the Welsh to English is a little colourful perhaps, but pretty good I think. English to Welsh looks like something I might have written, but again I think the gist is there (more fluent Welsh speakers may disagree?). Also it will apparently improve over time.

One of the interesting comments that Google make is that “We’ve found that one of the most important factors in adding new languages to our system is the ability to find large amounts of translated documents from which our system automatically learns how to translate. As a result, the set of languages that we’ve been able to develop is more closely tied to the size of the web presence of a language and less to the number of speakers of the language.” Whilst we might be critical of the number of obscure forms and dull documents which are produced bilingually on the web under the mantle of Welsh Language Schemes, perhaps we are now seeing some unanticipated benefits for the language – or maybe people cleverer than me had planned this all along?


Web2.0 and Bilingualism

Researching and writing the report really was very interesting and stimulating indeed, especially the meetings with stakeholders (many thanks to all those who took part).

The report raises a number of different issues which we might like to discuss, but here are just a couple to get us going.

How does an organisation meet the need to be agile and responsive in web2.0 if it needs to outsource its translation?

What should be done with User Generated Content – should it be translated, deleted if in the “wrong” language…?

In the light of comments to my previous posts about the Welsh language blog deficit, is it inevitable that the language which is perceived as having the largest audience will be more attractive to users generating content, and what implications would this have for the other language(s)?

What actually are the most significant concerns for organisations who wish to deliver web2.0 services bilingually?

Feel free to discuss these points, the report, or any other points you feel are relevant.


Welsh Language Blog Deficit 2

…and wrong where?

The base of the 22% from the OII was internet users, whereas for the Welsh language estimate I used the whole population. Doh! Should have spotted that one!

Now I don’t have the historical figures to hand, but I have good reason to believe that a reasonable current estimate for use of the internet by fluent Welsh speakers is 67%.

This would give a base of 183,342 internet-using, competent Welsh writers. If 22% of these were blogging, we would have about 40,335 blogs.

Well I guess we have accounted for 20,000 “missing” blogs, now we just need to account for the other 40,000.
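For anyone who wants to check the sums, here is a minimal sketch of the revised estimate. It assumes the 67% internet-use figure can simply be applied to the 273,646 competent writers from the earlier post, which is a crude assumption on my part; the variable names are mine.

```python
# Figures quoted in these posts; applying them uniformly is an assumption.
competent_writers = 273_646   # Welsh speakers aged 14+ who write Welsh "well" or "very well"
internet_use_rate = 0.67      # estimated share of fluent Welsh speakers using the internet
blog_rate = 0.22              # OII 2009: share of internet users who write a blog

internet_users = competent_writers * internet_use_rate
revised_estimate = internet_users * blog_rate
original_estimate = competent_writers * blog_rate  # the earlier, wrong-base figure

print(f"Internet-using competent writers: {internet_users:,.0f}")
print(f"Revised expected blogs:           {revised_estimate:,.0f}")
print(f"Blogs explained by the fix:       {original_estimate - revised_estimate:,.0f}")
```

The last line is the gap between the two estimates, which is where the rough figure of 20,000 "missing" blogs comes from.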

Actually once I get back to work (yes I am on holiday) I should be able to calculate a pretty accurate estimate for 2006 as I think I have a complete set of all the necessary statistics for that year.


Welsh language blog deficit?

According to the Oxford Internet Institute’s The Internet in Britain 2009 report (pdf 1.9MB) 22% of people in Britain aged 14+ write a blog.

Messing about rather crudely with figures from the Welsh Language Board’s The Welsh Language Use Surveys of 2004-06 report (pdf) I arrive at a figure of 273,646 Welsh speakers aged 14+ who describe the extent to which they can write Welsh as “very well” or “well”.

Now these numbers are somewhat rough around the edges and the time periods don’t match – but if we assumed that 22% of these Welsh speakers were blogging, that would suggest they would create somewhere in the region of 60,200 blogs. If we use the 2005 figure for blogging in Britain (17%) that would suggest somewhere in the region of 46,500 blogs.
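The arithmetic behind those two figures can be sketched as follows. This is only a rough check, assuming the British blogging rate applies uniformly to competent Welsh writers; the function name is my own.

```python
# Figures quoted in this post; the uniform-rate assumption is deliberately crude.
competent_writers = 273_646   # Welsh speakers aged 14+ who write Welsh "well" or "very well"
blog_rate_2009 = 0.22         # OII, The Internet in Britain 2009
blog_rate_2005 = 0.17         # 2005 figure for blogging in Britain


def expected_blogs(population: int, rate: float) -> int:
    """Expected number of bloggers if `rate` applied evenly across `population`."""
    return round(population * rate)


print(expected_blogs(competent_writers, blog_rate_2009))
print(expected_blogs(competent_writers, blog_rate_2005))
```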

So how many Welsh language blogs are there? Well, I would put a hugely optimistic top figure of 200, including blogs produced by learners, people from overseas, blogs which are pretty much inactive and so on. If we are talking about blogs produced from within Wales, written “well” or “very well” in Welsh and updated at least every couple of months, I would be surprised if there were more than 100.

Whichever way you look at it I reckon we are somewhere in the region of 46,000-60,000 blogs short. Now my first instinct is simply to think I have a decimal place wrong somewhere – embarrassing as that would be, I would actually be quite relieved! Feel free to point out the obvious error I have overlooked! However, you need to move that decimal point quite a lot to reduce the deficit to zero.

So how can we explain this apparent deficit?

1) Actually blogging isn’t evenly distributed across Britain – people in Wales don’t blog much.

2) Actually blogging isn’t evenly distributed across languages, Welsh speakers don’t blog much.

3) Actually 22% (or 17%) of Welsh speakers do blog, they just do it in English.

4) Actually my maths was just horribly wrong and there is no deficit.

5) Errors and differences in the statistics have been compounded and multiplied to give an inaccurate estimate.

6) Open to suggestions…

Even if I am only vaguely right on my numbers, and if the explanation is 1, 2, or 3, this seems remarkable. For me it certainly raises some interesting questions about the vitality of the Welsh language online.


Planet Wicipedia

…including some ideas on how to measure the “quality” of a Wikipedia, suggesting a measure of “depth” (described as a “rough index of its collaborativeness”) based on number of edits, number of users, and number of stub pages. I’m not entirely sold on this as a measure of “quality” per se, but it does have some resonance for me in terms of thinking about the community of people involved rather than simply about the content itself.

It also mentions a campaign organised via Wicipedia’s Caffi section, to boost the number of articles. Similar campaigns appear to have been successful for Catalan. However, simply boosting the number of articles doesn’t necessarily make Wicipedia any better (depending on how we define “better” I suppose).

I agree with Craig that Wicipedia is important – or at least it has the potential to be important. He suggests that “presence is power” and whilst I kind of agree with him, there is also the danger that presence may be purely symbolic if it is not combined with use. Beyond the issue of use is the issue of effect – if a Welsh speaking pupil goes to Wicipedia to read about the moon landings and the article is poorly written or insufficiently detailed, they are very likely to go to the equivalent Wikipedia article. Next time they want some information, will they bother with Wicipedia or just go straight to Wikipedia?

Wicipedia is just one of a number of opportunities for Welsh language use on the internet, many of which appear to hold potential for language maintenance. However, I am beginning to wonder about the extent to which this potential is actually being realised.


History mis-repeating itself

…yes on the 16th January 2006 the BBC carried a very similar story under the headline Pedestrian Sign’s Forked Tongue. Rather intriguingly the left and right were in the opposite languages in 2006, so with a little bit of careful peeling they could perhaps create two correctly translated signs. It does make you wonder what sort of quality control procedures are employed by the companies creating these signs – sigh!


Automatic translation from Welsh gets a boost from France!

Press Release
12 May 2009

Automatic translation from Welsh gets a boost from France!

High-quality Welsh-English machine translation will come a step closer when a
new initiative gets underway this month.

The multinational Apertium team, which released their Welsh-English translator
(http://www.cymraeg.org.uk [1]) in August 2008, has been accepted into the
fifth Google Summer of Code™ [2], and one of the projects to be funded will
be an improvement to that translator.

Apertium (http://www.apertium.org) is a Free Software [3] machine translation
platform. It was first developed to handle translation between related
languages in Spain, but over the last few years it has been extended to deal
with other languages. To date, translators for 17 language pairs have been
released, covering languages spoken by 1.1bn people, from English (est. 500m
speakers) to Aranese (est. 4,000 speakers). A similar number of other
language pairs are in development – these include Indian languages like Hindi
and Bengali, and Scandinavian languages like Norwegian and Sami.

Google Summer of Code offers student developers stipends to write code for
open-source projects, advised by mentors already working on the projects, and
has helped create millions of lines of code for dozens of projects. This was
the first year that Apertium applied for the program, and 9 Apertium projects
are being supported.

The Apertium Welsh-English translator works by applying grammatical rules to a
Welsh sentence to turn it into an English sentence. An alternative approach
(adopted by software like Moses [4]) is to use a large body of text to work
out what the likely translation of a given phrase is.

The Summer of Code student, Gabriel Synnaeve from Grenoble, France [5], will
be working on combining these two approaches, using techniques developed at
Carnegie-Mellon University in the USA [6]. The aim is to improve the quality
of the translation – in effect, the Apertium and Moses translations will be
compared, and the best bits of each will be used in the final translation.

For instance, take the Welsh sentence:
“Mae Heddlu’r De yn ymchwilio i farwolaeth dyn 41 oed o Abertawe.”
(South Wales Police are investigating the death of a 41-year-old man from
Swansea.)

Apertium currently produces:
“South Wales Police is investigating death man 41 years old from Swansea.”

Moses currently produces:
“the south wales police investigation into the death of a man 41 years
of age of abertawe.”

The aim is to combine the best chunks from each program, so that we get
something like:

  *[is investigating] +[the death of a man] *[41
    years old] *[from Swansea]
    Here, the chunks marked * come from Apertium, and the one marked + from
    Moses, and combining both improves the quality of the translation.
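
As a toy illustration only (not part of the release, and not the Carnegie-Mellon method, which works from explicit word matching between the engines' outputs), the idea of picking the better chunk per position can be sketched like this. The chunk boundaries are read off the two outputs quoted above; the scores are invented purely for the example and come from neither system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    engine: str    # "apertium" or "moses"
    text: str      # chunk taken from that engine's output above
    score: float   # hypothetical quality score, invented for this sketch


def combine(slots):
    """For each aligned slot, keep the candidate with the highest score."""
    return [max(candidates, key=lambda c: c.score) for candidates in slots]


# One list per aligned position in the sentence (alignment assumed, not computed).
slots = [
    [Candidate("apertium", "is investigating", 0.9),
     Candidate("moses", "investigation into", 0.4)],
    [Candidate("apertium", "death man", 0.3),
     Candidate("moses", "the death of a man", 0.9)],
    [Candidate("apertium", "41 years old", 0.8),
     Candidate("moses", "41 years of age", 0.6)],
    [Candidate("apertium", "from Swansea", 0.9),
     Candidate("moses", "of abertawe", 0.2)],
]

chosen = combine(slots)
print(" ".join(c.text for c in chosen))
print([c.engine for c in chosen])
```

A real system has to work out the alignment and the scoring itself; here both are supplied by hand.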

This is cutting-edge stuff, and has rarely been tried before. Prof Harold
Somers, in a 2004 report for the Welsh Language Board [7], suggested that a
medium-term goal for machine translation in Welsh would be “to integrate …
different [machine translation] engines into a single system”. Nothing has
been done on that to date, and Gabriel’s work will be the first attempt to
bring this vision of “multi-engine machine translation” for Welsh closer to
reality.

Francis Tyers [8], who will be mentoring Gabriel, said, “I was quite surprised
that we didn’t get any Welsh students applying, but this is a fantastic
opportunity to improve Welsh language technology. I have no doubt we’ll see
some real gains in the translation quality.”

Gabriel has already started work. “At the minute I’m fine-tuning the Moses
Welsh-English translator to make it as efficient as possible. The Apertium
community is very friendly, and I wanted to participate in a big open
source project, so I’m glad I went for it.”

Kevin Donnelly [9], who co-developed the Apertium Welsh-English translator
with Francis, noted that this was a big step forward for Welsh. “It is
wonderful that so many talented people are working on Apertium, and that they
are giving Welsh such a high priority. What we need now is for bodies
promoting Welsh here in Wales to step up to the plate and give whatever
encouragement and other support they can.”


[1] http://ufal.mff.cuni.cz/pbml-91-100.html. Francis Tyers and Kevin
Donnelly (2009): “apertium-cy – a collaboratively-developed free RBMT system
for Welsh to English”, Prague Bulletin of Mathematical Linguistics, 91.

[2] http://code.google.com/soc

[3] http://www.fsf.org/about/what-is-free-software. The Free Software
Foundation’s definition of “Free Software” is software that the user is free
to use, copy, change, and distribute.

[4] http://www.statmt.org/moses. Moses is an open-source statistical machine
translation system.

[5] Gabriel Synnaeve is a student at the École Nationale Supérieure
d’Informatique et de Mathématiques (http://ensimag.grenoble-inp.fr), a
leading informatics and mathematics centre. He will graduate in September
2009 and will then begin work on a doctorate on Bayesian machine learning.

[6] Alon Lavie (http://www.cs.cmu.edu/~alavie) is leading this work. See
also: http://www.cs.cmu.edu/~alavie/papers/EAMT-2005-MEMT.pdf. S. Jayaraman
and A. Lavie (2005): “Multi-Engine Machine Translation Guided by Explicit
Word Matching”, Proceedings of EAMT-2005.

[7] http://www.byig-wlb.org.uk/english/publications/publications/2302.doc.
Harold Somers (2004): “Machine translation and Welsh: the way forward.”,
Report for the WLB.

[8] Francis Tyers studied computer science at Aberystwyth, and is now a
language engineer for Prompsit Language Engineering, S.L. and a PhD student
at the Universitat d’Alacant. He is a key Apertium developer, with a special
interest in extending it to handle the Celtic languages.

[9] Kevin Donnelly has been working on Free Software in Welsh since 2003, and
developed the online Welsh dictionary Eurfa (http://www.eurfa.org.uk).

Kevin Donnelly, 01248-715925, kevin@dotmon.com


Datganiad i’r Wasg
12 Mai 2009

Cyfieithu awtomatig o’r Gymraeg yn cael hwb o Ffrainc!

Bydd cyfieithu peirianyddol o ansawdd da o Gymraeg i Saesneg yn dod yn agosach
pan gychwynnir ar broject newydd y mis yma.

Mae’r tîm rhyngwladol Apertium, a ryddhaodd eu cyfieithydd Cymraeg-Saesneg
(http://www.cymraeg.org.uk [1]) ym mis Awst 2008, wedi cael ei dderbyn i mewn
i’r pumed Google Summer of Code™ [2], a bydd gwelliannau i’r cyfieithydd hwn
yn cael ei ariannu fel un o’r projectau.

Platfform cyfieithu peirianyddol yw Apertium (http://www.apertium.org), sy’n
Feddalwedd Rhydd [3]. Datblygwyd yn y dechrau i gyfieithu rhwng ieithoedd
sy’n perthyn i’w gilydd yn Sbaen, ond dros y blynyddoedd diweddar estynnwyd y
rhaglen i drin ieithoedd eraill. Hyd yn hyn, mae cyfieithyddion ar gyfer 17
pâr o ieithoedd wedi eu rhyddhau, yn cynrychioli 1.1bn o bobl, o Saesneg (tua
500m o lefarwyr) i Araneg (tua 4,000 o lefarwyr). Mae nifer tebyg o barau
eraill yn cael eu datblygu, sy’n cynnwys ieithoedd Indeg megis Hindi a
Bengaleg, ac ieithoedd Scandinafaidd megis Norwyeg a Sami.

Mae Google Summer of Code yn cynnig lwfans i fyfyrwyr i ysgrifennu cod ar
gyfer projectau cod-agored, gyda chyngor gan fentoriaid sy’n gweithio eisoes
ar y projectau, ac mae o wedi helpu i greu miliynau o linellau o god ar gyfer
dwsinau o brojectau. Dyma’r flwyddyn cyntaf i Apertium wneud cais i’r
rhaglen, ac ariannir 9 o brojectau Apertium.

Mae’r cyfieithydd Cymraeg-Saesneg Apertium yn gweithio gan weithredu rheolau
gramadegol i frawddeg Gymraeg i’w throi hi’n frawddeg Saesneg. Ffordd arall
o wneud hyn (a ddefnyddir gan feddalwedd megis Moses [4]) yw defnyddio corff
mawr o destun i weithio allan beth yw’r cyfieithiad tebygol am unrhyw
ymadrodd.

Bydd y myfyriwr, Gabriel Synnaeve o Grenoble, Ffrainc [5], yn ceisio cyfuno’r
ddwy ffordd yma o weithio, gan ddefnyddio technegau a ddatblygwyd ym
Mhrifysgol Carnegie-Mellon yn yr UDA [6]. Yr amcan yw gwella ansawdd y
cyfieithiad – bydd y cyfieithiadau Apertium a Moses yn cael eu cymharu, a’r
darnau gorau o bob un yn cael eu defnyddio yn y cyfieithiad terfynol.

Er enghraifft, gweler y frawddeg Gymraeg:
“Mae Heddlu’r De yn ymchwilio i farwolaeth dyn 41 oed o Abertawe.”

Mae Apertium ar hyn o bryd yn cynhyrchu:
“South Wales Police is investigating death man 41 years old Swansea.”

Mae Moses ar hyn o bryd yn cynhyrchu:
“the south wales police investigation into the death of a man 41 years
of age of abertawe.”

Y bwriad yw cyfuno’r darnau gorau o bob rhaglen, i gynhyrchu rhywbeth fel:

  *[is investigating] +[the death of a man] *[41
    years old] +[of] *[Swansea]
    Yma, mae’r darnau a nodir gan * yn dod o Apertium, a’r rhai a nodir gan + o
    Moses, ac mae cyfuno’r ddau yn gwella ansawdd y cyfieithiad.

Dyma waith arloesol, heb ei wneud o’r blaen. Awgrymodd yr Athro Harold
Somers, mewn adroddiad ym 2004 ar gyfer Bwrdd yr Iaith [7], y dylai amcan
tymor-canol ar gyfer cyfieithu peirianyddol yn Gymraeg fod “to integrate …
different [machine translation] engines into a single system”. Nid oes unrhyw
beth wedi ei wneud hyd yn hyn, a gwaith Gabriel fydd y cais cyntaf i ddod â’r
syniad yma o “multi-engine machine translation” ar gyfer y Gymraeg yn agosach
i fodolaeth.

Dywedodd Francis Tyers [8], fydd yn rhoi cyngor i Gabriel, “Dipyn o siom oedd
hi nad oedden ni’n cael cais gan fyfyriwr Cymreig, ond mae hyn yn gyfle gwych
i wella technoleg iaith yn Gymraeg. Rydym ni’n siŵr o weld cynnydd o
safbwynt ansawdd y cyfieithu.”

Mae Gabriel wedi cychwyn ar y gwaith eisoes. “Ar hyn o bryd dwi’n gwneud
newidiadau mân i’r cyfieithydd Moses i’w wneud mor effeithlon â phosib.
Mae’r gymuned Apertium yn gyfeillgar iawn, ac roeddwn i eisiau cyfrannu i
broject mawr cod-agored, felly dwi’n falch nes i’r cais.”

Dywedodd Kevin Donnelly [9], a weithiodd gyda Francis i greu’r cyfieithydd
Cymraeg-Saesneg Apertium, fod hwn yn gam mawr i’r Gymraeg. “Mae’n
ardderchog cael cymaint o bobl dalentog yn gweithio ar Apertium, a braf yw hi
gweld eu bod nhw’n ystyried Cymraeg fel blaenoriaeth. Yr hyn sydd angen rŵan
yw ymdrech gan y mudiadau sy’n hybu Cymraeg yma yng Nghymru i annog a rhoi
cefnogaeth i’r gwaith yma.”


[1] http://ufal.mff.cuni.cz/pbml-91-100.html. Francis Tyers and Kevin
Donnelly (2009): “apertium-cy – a collaboratively-developed free RBMT system
for Welsh to English”, Prague Bulletin of Mathematical Linguistics, 91.

[2] http://code.google.com/soc

[3] http://www.fsf.org/about/what-is-free-software. Mae’r Free Software
Foundation yn diffinio “Meddalwedd Rhydd” fel meddalwedd y gellir ei
ddefnyddio, copïo, newid a dosbarthu gan y defnyddiwr.

[4] http://www.statmt.org/moses. System cyfieithu peirianyddol ystadegol yw
Moses – mae’n god-agored.

[5] Gabriel Synnaeve yw myfyriwr yn yr École Nationale Supérieure
d’Informatique et de Mathématiques (http://ensimag.grenoble-inp.fr), canolfan
bwysig ar gyfer mathemateg a thechnoleg gwybodaeth. Bydd o’n graddio ym mis
Medi 2009, ac yn cychwyn gwaith wedyn ar ddoethuriaeth ar ddysgu peirianyddol
Bayesaidd.

[6] Alon Lavie (http://www.cs.cmu.edu/~alavie) is leading this work. See
also: http://www.cs.cmu.edu/~alavie/papers/EAMT-2005-MEMT.pdf. S. Jayaraman
and A. Lavie (2005): “Multi-Engine Machine Translation Guided by Explicit
Word Matching”, Proceedings of EAMT-2005.

[7] http://www.byig-wlb.org.uk/english/publications/publications/2302.doc.
Harold Somers (2004): “Machine translation and Welsh: the way forward.”,
Report for the WLB.

[8] Astudiodd Francis Tyers wyddoniaeth cyfrifiadurol yn Aberystwyth, ac ar
hyn o bryd mae’n beiriannwr iaith gyda Prompsit Language Engineering, S.L. ac
yn fyfyriwr PhD ym Mhrifysgol Alacant. Mae’n un o’r datblygwyr blaenorol
Apertium, gyda diddordeb arbennig yn ei estyn i drin yr ieithoedd Celtaidd.

[9] Mae Kevin Donnelly wedi bod yn gweithio ar Feddalwedd Rhydd yn Gymraeg ers
2003, a datblygodd Eurfa, geiriadur arlein Cymraeg (http://www.eurfa.org.uk).

Cysylltwch â:
Kevin Donnelly, 01248-715925, kevin@dotmon.com


Is Google Street View bilingual?

Amid all the excitement and furore around Google’s Street View a colleague of mine (thanks Ceri) unearthed this little gem from the comments on dot.life – A blog about technology from BBC News.

“20. At 3:56pm on 20 Mar 2009, paulvilla wrote:

I had a look at Swansea and noticed that streetview had blurred the Welsh version on some of the roadsigns. I assume the numberplate detection got confused by the non-standard letter patterns. Curiously the English on the same sign is un-blurred. If nothing else, at least streetview brings a bit of relief from bi-lingual everything – if only briefly!”

I must confess to having tried to find some examples of this – without success. Can anyone with the necessary lack of a life find any examples of this? Has this affected other languages?

Ceri’s email had the subject line “Google tries to stamp out Welsh language shocker” which I was really tempted to use as the title of this post. However I have learned that humorous titles don’t always travel well 🙂