Title:  Geonames Toponym Extractor Utility
Brief:  Simple script for extracting ASCII toponym fields from geonames datasets
Date:   1713683410
Tags:   Python, Script, Programming
CSS:    /style.css

[Link to code](https://codeberg.org/veclavtalica/geonames-extractor)

A small script I used to extract data for machine learning endeavors.

Usage:
```
dataset feature_class [feature_code] [--dirty] [--filter=mask]
```

From this invocation ...
```
./extractor.py datasets/UA.txt P PPL --filter=0123456789\"\'-\` > UA-prep.txt
```

... it produces a newline-separated list of relevant toponyms of a particular kind, such as:
```
Katerynivka
Vaniushkyne
Svistuny
Sopych
Shilova Balka
```
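
For illustration, the selection step could look roughly like the sketch below. It is not the actual code from the repository. It assumes the standard tab-separated GeoNames dump layout, where the ASCII name is the third column and the feature class and feature code are the seventh and eighth. The function name `extract_toponyms` and the sample rows (with placeholder ids and coordinates) are made up for the example.

```python
import csv

# Zero-based column positions in the tab-separated GeoNames dump.
ASCIINAME, FEATURE_CLASS, FEATURE_CODE = 2, 6, 7


def extract_toponyms(lines, feature_class, feature_code=None):
    """Yield the ASCII name of every record matching the given selectors.

    Hypothetical helper: `lines` is any iterable of raw dump lines.
    """
    # The dump is plain TSV without quoting, so quoting is switched off.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= FEATURE_CODE:
            continue  # skip short or malformed rows
        if row[FEATURE_CLASS] != feature_class:
            continue
        if feature_code is not None and row[FEATURE_CODE] != feature_code:
            continue
        if row[ASCIINAME]:
            yield row[ASCIINAME]


# Made-up rows in the dump layout; ids and coordinates are placeholders.
SAMPLE_ROWS = [
    "\t".join(["1", "Katerynivka", "Katerynivka", "", "0", "0", "P", "PPL", "UA"]),
    "\t".join(["2", "Sopych", "Sopych", "", "0", "0", "P", "PPLA", "UA"]),
    "\t".join(["3", "Some Stream", "Some Stream", "", "0", "0", "H", "STM", "UA"]),
]

for name in extract_toponyms(SAMPLE_ROWS, "P", "PPL"):
    print(name)
```

Leaving `feature_code` out selects every record of the given feature class, which mirrors the optional `[feature_code]` argument in the usage line.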

The `--filter=` option is there so that the alphabet size can be reduced for learning purposes,
as there are usually quite a lot of symbols that occur only a few times,
which leaves the data poorly balanced.
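
A rough sketch of the idea, under one assumption: the mask is read as a set of characters whose presence disqualifies a name. The actual script may handle the mask differently, for example by stripping those characters. `symbol_counts` and `drop_masked` are hypothetical helper names.

```python
from collections import Counter


def symbol_counts(names):
    """Count how often each character occurs across all names."""
    return Counter(ch for name in names for ch in name)


def drop_masked(names, mask):
    """Keep only names that contain none of the characters in `mask`.

    Assumption: a masked character rejects the whole name.
    """
    banned = set(mask)
    return [name for name in names if banned.isdisjoint(name)]


names = ["Sopych", "Shilova Balka", "CHAYKA-Transmitter, Ring Mast 4"]

# Rare symbols show up at the tail of the frequency table.
rare = [ch for ch, n in symbol_counts(names).items() if n == 1]

kept = drop_masked(names, "0123456789\"'-`")
```

Looking at the tail of `symbol_counts` first is one way to decide which characters are worth putting into the mask.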

The `--dirty` option reduces cases such as `Maydan (Ispas)` and `CHAYKA-Transmitter, Ring Mast 4` to `Maydan` and `CHAYKA-Transmitter`.
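
One way such a reduction could be done is to cut the name at the first opening parenthesis or comma. This is only a guess that reproduces the two examples above, not necessarily the rule the script applies. `strip_qualifier` is a hypothetical name.

```python
import re

# Everything from the first "(" or "," onward, plus whitespace before it.
_TAIL = re.compile(r"\s*[(,].*$")


def strip_qualifier(name):
    """Drop a trailing parenthetical or comma-separated qualifier."""
    return _TAIL.sub("", name).strip()


cleaned = [
    strip_qualifier("Maydan (Ispas)"),
    strip_qualifier("CHAYKA-Transmitter, Ring Mast 4"),
]
```

Names without a parenthesis or comma pass through unchanged.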

Duplicates are also removed.
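
A minimal sketch of deduplication that keeps the first occurrence of each name in its original order. Whether the script preserves order is an assumption here, and `unique_in_order` is a hypothetical name.

```python
def unique_in_order(names):
    """Remove repeated names, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(names))


deduped = unique_in_order(["Sopych", "Svistuny", "Sopych", "Katerynivka"])
```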