Transliterate Filenames From Cyrillic

11.01.2008
Recursively traverses the given directories and copies each file to a path whose file and directory names are transliterated from Cyrillic into Latin.

#!/usr/bin/python
# -*- coding: utf-8 -*-

from os import walk, mkdir
from os.path import isdir, exists
from sys import argv, exit, getfilesystemencoding
import shutil

conversion = {
        u'а' : 'a',
        u'б' : 'b',
        u'в' : 'v',
        u'г' : 'g',
        u'д' : 'd',
        u'е' : 'e',
        u'ё' : 'e',
        u'ж' : 'zh',
        u'з' : 'z',
        u'и' : 'i',
        u'й' : 'j',
        u'к' : 'k',
        u'л' : 'l',
        u'м' : 'm',
        u'н' : 'n',
        u'о' : 'o',
        u'п' : 'p',
        u'р' : 'r',
        u'с' : 's',
        u'т' : 't',
        u'у' : 'u',
        u'ф' : 'f',
        u'х' : 'h',
        u'ц' : 'c',
        u'ч' : 'ch',
        u'ш' : 'sh',
        u'щ' : 'sch',
        u'ъ' : "'",
        u'ы' : 'y',
        u'ь' : "'",
        u'э' : 'e',
        u'ю' : 'ju',
        u'я' : 'ja',
        u'А' : 'A',
        u'Б' : 'B',
        u'В' : 'V',
        u'Г' : 'G',
        u'Д' : 'D',
        u'Е' : 'E',
        u'Ё' : 'E',
        u'Ж' : 'ZH',
        u'З' : 'Z',
        u'И' : 'I',
        u'Й' : 'J',
        u'К' : 'K',
        u'Л' : 'L',
        u'М' : 'M',
        u'Н' : 'N',
        u'О' : 'O',
        u'П' : 'P',
        u'Р' : 'R',
        u'С' : 'S',
        u'Т' : 'T',
        u'У' : 'U',
        u'Ф' : 'F',
        u'Х' : 'H',
        u'Ц' : 'C',
        u'Ч' : 'CH',
        u'Ш' : 'SH',
        u'Щ' : 'SCH',
        u'Ъ' : "'",
        u'Ы' : 'Y',
        u'Ь' : "'",
        u'Э' : 'E',
        u'Ю' : 'JU',
        u'Я' : 'JA',
        }

def cyr2lat(s):
    retval = ""
    for c in s:
        try:
            c = conversion[c]
        except KeyError:
            pass
        retval += c
    return retval
    
if len(argv) == 1:
    print "Usage: %s <dirs>" % argv[0]
    exit(-1)

def recursive_walk(dir):
    # Generator; see http://docs.activestate.com/activepython/2.5/whatsnew/2.3/node6.html
    # Decode a byte-string argument with the filesystem encoding: a bare
    # unicode(dir) assumes ASCII and fails on Cyrillic directory names.
    if not isinstance(dir, unicode):
        dir = dir.decode(getfilesystemencoding() or 'utf-8')
    # walk() already descends into every subdirectory, so no explicit
    # recursion is needed; build each path from the directory being visited.
    for dirpath, dirnames, fnames in walk(dir, True):
        for fname in fnames:
            yield '%s/%s' % (dirpath, fname)

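if __name__ == "__main__":
    for dir in argv[1:]:
        for fpath in recursive_walk(dir):
            print fpath.encode('utf-8')
            path_elts = cyr2lat(fpath).split('/')
            # First make dirs
            for idx in range(1, len(path_elts)):
                subpath = '/'.join(path_elts[:idx])
                if not subpath:
                    # Leading '/' of an absolute path: nothing to create
                    continue
                # If the name is taken by a non-directory, try name0, name1, ...
                candidate = subpath
                i = 0
                while exists(candidate) and not isdir(candidate):
                    print '%s exists but is not a directory, will try again' % candidate.encode('utf-8')
                    candidate = subpath + str(i)
                    i += 1
                if not exists(candidate):
                    print 'Creating directory: %s' % candidate.encode('utf-8')
                    mkdir(candidate)
                path_elts[idx - 1] = candidate.split('/')[-1]
            # Rebuild the target so it reflects any directory renamed above
            new_fpath = '/'.join(path_elts)
            if new_fpath == fpath:
                # Nothing to transliterate; copying a file onto itself raises shutil.Error
                continue
            print 'Copying %s to %s' % (fpath.encode('utf-8'), new_fpath.encode('utf-8'))
            shutil.copyfile(fpath, new_fpath)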
    

Comments

Snippets Manager replied on Thu, 2010/05/13 - 7:39pm

If anyone also needs to remove "01 - "-style prefixes from the filenames, this can be added to the cyr2lat function just before the return (it needs import re at the top of the script): retval = re.sub('^[0-9 ,-=.]+','',retval)
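Two caveats about that pattern: inside a character class, `,-=` is a range running from the comma to the equals sign, so it also matches `/`, `:` and `;`; and because cyr2lat receives the whole path, `^` anchors at the start of the path, not at the file name. A minimal Python 3 sketch that strips the prefix from the final path component only. The helper name strip_track_prefix and the chosen set of separator characters are assumptions for illustration, not part of the posted script:

```python
import os.path
import re

# Digits followed by at least one separator, e.g. "01 - ", "07. " or "12_".
# The hyphen comes last in the class so it is a literal, not a range.
_PREFIX = re.compile(r'^[0-9]+[ ,=._-]+')

def strip_track_prefix(path):
    """Remove a leading "01 - "-style prefix from the file name only."""
    head, name = os.path.split(path)
    stem, ext = os.path.splitext(name)
    stripped = _PREFIX.sub('', stem)
    if not stripped:
        # Stripping would leave no name at all; keep the path unchanged.
        return path
    return os.path.join(head, stripped + ext)
```

Requiring at least one separator after the digits leaves names such as 2pac.mp3 or 01.mp3 untouched.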

Snippets Manager replied on Thu, 2010/05/13 - 7:39pm

The code has an error in the cyr2lat function: it passes along any character that is not in the table, whether ASCII or not. The fix is to look up only non-ASCII characters (ord(c) > 128 in the code) and to drop those that are not in the table (c='' in the exception handler). Here's the fixed version.

P.S. I also removed the source path from the print on the second-to-last line, as it causes problems for anyone whose default encoding is 'ascii' or anything other than 'utf-8'.

----------
#!/usr/bin/python
# -*- coding: utf-8 -*-

from os import walk, rename, unlink, mkdir
from os.path import isdir, exists
from sys import argv, exit, getfilesystemencoding
from shutil import copyfile
import shutil

conversion = {
        u' ' : ' ',
        u'0' : '0',
        u'1' : '1',
        u'2' : '2',
        u'3' : '3',
        u'4' : '4',
        u'5' : '5',
        u'6' : '6',
        u'7' : '7',
        u'8' : '8',
        u'9' : '9',
        u'а' : 'a',
        u'б' : 'b',
        u'в' : 'v',
        u'г' : 'g',
        u'д' : 'd',
        u'е' : 'e',
        u'ё' : 'e',
        u'ж' : 'zh',
        u'з' : 'z',
        u'и' : 'i',
        u'й' : 'j',
        u'к' : 'k',
        u'л' : 'l',
        u'м' : 'm',
        u'н' : 'n',
        u'о' : 'o',
        u'п' : 'p',
        u'р' : 'r',
        u'с' : 's',
        u'т' : 't',
        u'у' : 'u',
        u'ф' : 'f',
        u'х' : 'h',
        u'ц' : 'c',
        u'ч' : 'ch',
        u'ш' : 'sh',
        u'щ' : 'sch',
        u'ъ' : 'q',
        u'ы' : 'y',
        u'ь' : 'q',
        u'э' : 'e',
        u'ю' : 'ju',
        u'я' : 'ja',
        u'А' : 'A',
        u'Б' : 'B',
        u'В' : 'V',
        u'Г' : 'G',
        u'Д' : 'D',
        u'Е' : 'E',
        u'Ё' : 'E',
        u'Ж' : 'ZH',
        u'З' : 'Z',
        u'И' : 'I',
        u'Й' : 'J',
        u'К' : 'K',
        u'Л' : 'L',
        u'М' : 'M',
        u'Н' : 'N',
        u'О' : 'O',
        u'П' : 'P',
        u'Р' : 'R',
        u'С' : 'S',
        u'Т' : 'T',
        u'У' : 'U',
        u'Ф' : 'F',
        u'Х' : 'H',
        u'Ц' : 'C',
        u'Ч' : 'CH',
        u'Ш' : 'SH',
        u'Щ' : 'SCH',
        u'Ъ' : 'q',
        u'Ы' : 'Y',
        u'Ь' : 'q',
        u'Э' : 'E',
        u'Ю' : 'JU',
        u'Я' : 'JA',
        u',' : '-',
        }

def cyr2lat(s):
    retval = ""
    d = ''
    for c in s:
        if ord(c) > 128:
            try:
                c = conversion[c]
            except KeyError:
                c=''
        retval += c
    return retval

if len(argv) == 1:
    print "Usage: %s <dirs>" % argv[0]
    exit(-1)

processed = []

def recursive_walk(dir):
    # See http://docs.activestate.com/activepython/2.5/whatsnew/2.3/node6.html
    found = []
    dir = unicode(dir)
    for finfo in walk(dir, True):
        dirnames = finfo[1]
        fnames = finfo[2]
        for subdir in dirnames:
            subdir = "%s/%s" % (dir, subdir)
            if subdir in processed:
                continue
            for yield_val in recursive_walk(subdir):
                yield yield_val
        for fname in fnames:
            yield '%s/%s' % (dir, fname)
    raise StopIteration

if __name__ == "__main__":
    fs_enc = getfilesystemencoding()
    for dir in argv[1:]:
        for fpath in recursive_walk(dir):
            new_fpath = cyr2lat(fpath)
            print fpath.encode('utf-8')
            # First make dirs
            path_elts = new_fpath.split('/')
            for idx in range(len(path_elts))[1:]:
                subpath = '/'.join(path_elts[:idx])
                while True:
                    i = 0
                    if exists(subpath):
                        if not isdir(subpath):
                            print '%s exists but is not a directory, will try again' % subpath
                            subpath += str(i)
                            continue
                        else:
                            path_elts[idx - 1] = subpath.split('/')[-1]
                            break
                    else:
                        print 'Creating directory: %s' % subpath
                        mkdir(subpath)
                        break
            print 'Copying to %s' % new_fpath
            shutil.copyfile(fpath, new_fpath)
----------
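The difference between the two behaviours (pass unknown characters through, or drop them) can be sketched in Python 3, where every string is already Unicode. This is an illustration under assumptions, not the poster's code: the table is abbreviated, and the drop_unknown parameter is a hypothetical addition used here only to show both variants side by side.

```python
# Abbreviated table for illustration; the full mapping is in the script above.
CONVERSION = {
    'а': 'a', 'б': 'b', 'в': 'v', 'ж': 'zh', 'и': 'i', 'й': 'j',
    'к': 'k', 'р': 'r', 'с': 's', 'у': 'u', 'я': 'ja', 'Р': 'R',
}

def cyr2lat(s, drop_unknown=True):
    """Transliterate s; ASCII passes through, table entries are mapped.

    Non-ASCII characters missing from the table are dropped when
    drop_unknown is true and kept unchanged otherwise.
    """
    out = []
    for c in s:
        if ord(c) < 128:
            out.append(c)              # plain ASCII: keep as is
        elif c in CONVERSION:
            out.append(CONVERSION[c])  # known Cyrillic letter
        elif not drop_unknown:
            out.append(c)              # unknown non-ASCII: keep only on request
    return ''.join(out)
```

With this table, cyr2lat('Русский 01.mp3') gives 'Russkij 01.mp3', and a character such as '™' disappears unless drop_unknown=False is passed.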

Snippets Manager replied on Thu, 2011/03/31 - 4:06am

Thank you! This was exactly what I needed :)

Snippets Manager replied on Sun, 2012/01/22 - 11:35am

I'm trying the script on my Ubuntu laptop and am having some issues.

1) I get an error when I start the script on a directory whose name is in Cyrillic:

    python ../cyr2lat.py Русский/
    Traceback (most recent call last):
      File "../cyr2lat.py", line 130, in <module>
        for fpath in recursive_walk(dir):
      File "../cyr2lat.py", line 113, in recursive_walk
        dir = unicode(dir)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

2) Recursion is not very useful if the script doesn't read and convert directories named in Cyrillic. How can directory names be converted as well?

3) The script copies the files rather than renaming them. While this is a very safe strategy, if you need to convert a large collection of files, deleting the originals becomes a tedious task. How can I get it to rename instead?

4) The script doesn't seem to like nested (sub-sub) directories.

5) When a source filename is all Latin characters, the script quits with an error saying the file already exists:

    shutil.Error: `IrinaAllegrova//Grand collection (2002)/09 Suzheny, ryazheny.mp3` and `IrinaAllegrova//Grand collection (2002)/09 Suzheny, ryazheny.mp3` are the same file

If the script works I'll post a new version, correcting some mistakes in the translit table and adding characters from the Ukrainian / Bulgarian / Serbian alphabets. Thanks!
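For points 2 to 5, one possible approach is to rename in place while walking the tree bottom-up, so that a directory is renamed only after everything inside it has been handled. The following is a minimal Python 3 sketch, not a drop-in patch for the posted script; in Python 3, sys.argv and os.walk already work with Unicode strings, which also sidesteps the UnicodeDecodeError from point 1. The function name rename_tree and its translit parameter are hypothetical; translit is any function that maps a single name to its Latin form, such as cyr2lat.

```python
import os

def rename_tree(top, translit):
    """Rename files and directories under top using translit(name).

    Walks bottom-up, so paths stay valid while parents keep their old
    names. Returns the list of (old_path, new_path) pairs renamed.
    The top directory itself is not renamed.
    """
    renamed = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        for name in filenames + dirnames:
            new_name = translit(name)
            if not new_name or new_name == name:
                continue  # nothing to change (already Latin) or empty result
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if os.path.exists(new_path):
                continue  # never overwrite an existing entry
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Names that are already Latin are skipped, which avoids the "are the same file" error, and existing targets are left alone rather than overwritten. Running it on a copy of the collection first is a sensible precaution, since renaming cannot be undone automatically.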