Title: Geonames Toponym Extractor Utility
Brief: Simple script for extracting ASCII toponym fields from geonames datasets
Date: 1713683410
Tags: Python, Script, Programming
CSS: /style.css

[Link to code](https://codeberg.org/veclavtalica/geonames-extractor)

Small script I used for extracting data for machine learning endeavors.

Usage:
```
dataset feature_class [feature_code] [--dirty] [--filter=mask]
```

From this invocation ...
```
./extractor.py datasets/UA.txt P PPL --filter=0123456789\"\'-\` > UA-prep.txt
```

... it produces a newline-separated list of toponyms of the requested kind, such as:
```
Katerynivka
Vaniushkyne
Svistuny
Sopych
Shilova Balka
```

The `--filter=` option is there so that the alphabet size can be reduced for learning purposes,
as there are usually quite a lot of symbols that are found only a few times,
which produces poor balancing.

The `--dirty` option reduces cases such as `Maydan (Ispas)` and `CHAYKA-Transmitter, Ring Mast 4` to `Maydan` and `CHAYKA-Transmitter`.

Duplicates are also removed.
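
For illustration, here is a rough sketch of how the steps described above could fit together. It is not the code from the linked repository: the names `extract` and `clean_dirty` are made up for this example, and two behaviours are assumptions, namely that `--dirty` cuts a name at the first `(` or `,`, and that `--filter` drops any name containing a character from the mask. The column positions follow the tab-separated layout of the GeoNames dump (asciiname, feature class and feature code are the 3rd, 7th and 8th fields).

```python
import re

# Zero-based column indices in a tab-separated GeoNames dump row.
ASCIINAME, FEATURE_CLASS, FEATURE_CODE = 2, 6, 7


def clean_dirty(name):
    """Cut a name at the first '(' or ',' (assumed --dirty behaviour)."""
    return re.split(r"[(,]", name, maxsplit=1)[0].strip()


def extract(lines, feature_class, feature_code=None, dirty=False, mask=""):
    """Return unique ASCII toponyms in first-seen order."""
    banned = set(mask)
    seen = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= FEATURE_CODE:
            continue  # skip malformed rows
        if cols[FEATURE_CLASS] != feature_class:
            continue
        if feature_code is not None and cols[FEATURE_CODE] != feature_code:
            continue
        name = cols[ASCIINAME]
        if dirty:
            name = clean_dirty(name)
        # Assumption: a name with any masked character is dropped entirely.
        if not name or banned.intersection(name):
            continue
        seen.setdefault(name, None)  # dict preserves insertion order
    return list(seen)
```

Under these assumptions, something like `extract(open("datasets/UA.txt", encoding="utf-8"), "P", "PPL", mask="0123456789")` would roughly correspond to the invocation shown earlier, minus the shell-escaped quote characters in the mask.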