Introduction
The original goal was simply to convert a dataset annotated in Doccano into the BIO tagging format (similar to the CoNLL-2003 dataset).
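To make the target format concrete, here is a minimal sketch of how character-offset spans map to BIO tags. It is not what doccano-transformer does internally. The function name `spans_to_bio`, the sample sentence, and the `(start, end, label)` span layout are assumptions for illustration, and spans are assumed to line up with whitespace token boundaries.

```python
import re

def spans_to_bio(text, spans):
    """Tag whitespace-separated tokens with BIO labels from character spans.

    `spans` is a list of (start, end, label) tuples with an exclusive end,
    assumed to align with token boundaries.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"  # default: token is outside every entity
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:
                # First token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical sample data, not taken from the original dataset.
text = "John Smith works at Acme Corp"
spans = [(0, 10, "PER"), (20, 29, "ORG")]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Each printed line pairs a token with its tag, which is the core of the CoNLL-style layout.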
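Summary
It may help to read this tutorial first: [Solved] How to convert Doccano-annotated text into the CoNLL 2003 format that NER models can process directly
Install the package: pip install doccano-transformer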
Error message
Running the following program raises an error:
from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

dataset = read_jsonl(filepath='NER.jsonl', dataset=NERDataset, encoding='utf-8')
gen = dataset.to_conll2003(tokenizer=str.split)
file_name = "CoNLL.txt"
with open(file_name, "w", encoding="utf-8") as file:
    for item in gen:
        file.write(item["data"] + "\n")
The error output is as follows:
Cell In[18], line 1
----> 1 from doccano_transformer.datasets import NERDataset
      2 from doccano_transformer.utils import read_jsonl

File ~/anaconda3/envs/nlp/lib/python3.9/site-packages/doccano_transformer/datasets.py:5
      2 import json
      3 from typing import Any, Callable, Iterable, Iterator, List, Optional, TextIO
----> 5 from doccano_transformer.examples import Example, NERExample
      8 class Dataset:
      9     def __init__(
     10         self,
     11         filepath: str,
     12         encoding: Optional[str] = 'utf-8',
     13         transformation_func: Optional[Callable[[TextIO], Iterable[Any]]] = None
     14     ) -> None:

File ~/anaconda3/envs/nlp/lib/python3.9/site-packages/doccano_transformer/examples.py:4
      1 from collections import defaultdict
      2 from typing import Callable, Iterator, List, Optional
----> 4 from spacy.gold import biluo_tags_from_offsets
      6 from doccano_transformer import utils
      9 class Example:

ModuleNotFoundError: No module named 'spacy.gold'
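Fixing the bug
The fix follows the information in the GitHub issues and pull request of the doccano-transformer project [1][2]. The import fails because spaCy v3 removed the spacy.gold module; the function now lives in spacy.training under the name offsets_to_biluo_tags.
The source file doccano_transformer/examples.py needs to be modified.
Use the error message to find the directory that contains examples.py: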
File ~/anaconda3/envs/nlp/lib/python3.9/site-packages/doccano_transformer/datasets.py:5
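From the error message, the author's examples.py is at the path below (the folder differs from machine to machine, so adjust it to your own environment):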
~/anaconda3/envs/nlp/lib/python3.9/site-packages/doccano_transformer/examples.py
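If the traceback is not at hand, the file can also be located from Python. The sketch below is one possible approach, and the helper name `locate_module_file` is made up for illustration. It uses `importlib.util.find_spec` on the top-level package name, which looks the package up without importing it, so the broken import in examples.py is not triggered.

```python
import importlib.util
from pathlib import Path
from typing import Optional

def locate_module_file(package: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` inside an installed package, or None."""
    # find_spec on a top-level name does not execute the package's code.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for directory in spec.submodule_search_locations:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# Prints the path of examples.py, or None if the package is not installed
# in the current environment.
print(locate_module_file("doccano_transformer", "examples.py"))
```

Run it with the same interpreter (for example the same conda environment) that produced the error, otherwise it may report a different installation or None.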
Make the following two changes:
- Change 1
Original code:
from spacy.gold import biluo_tags_from_offsets
Change to:
from spacy.training import offsets_to_biluo_tags
- Change 2
Original code:
tags = biluo_tags_from_offsets(tokens, label)
Change to:
tags = offsets_to_biluo_tags(tokens, label)
After making these changes, rerun the code and the error no longer occurs.
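For those who prefer not to edit the file by hand, the same two substitutions can be scripted. This is only a sketch: the helper names `patch_source` and `patch_file` are invented here, and it assumes the two original lines appear in examples.py exactly as quoted above. It writes a .bak copy before changing anything.

```python
from pathlib import Path

# The two substitutions described above, as (old, new) pairs.
REPLACEMENTS = [
    ("from spacy.gold import biluo_tags_from_offsets",
     "from spacy.training import offsets_to_biluo_tags"),
    ("tags = biluo_tags_from_offsets(tokens, label)",
     "tags = offsets_to_biluo_tags(tokens, label)"),
]

def patch_source(source: str) -> str:
    """Apply both substitutions to the given source text."""
    for old, new in REPLACEMENTS:
        source = source.replace(old, new)
    return source

def patch_file(path: Path) -> bool:
    """Patch the file in place; return True if anything changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_source(original)
    if patched == original:
        return False  # already patched, or the expected lines were not found
    # Keep a backup next to the file before overwriting it.
    path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
    path.write_text(patched, encoding="utf-8")
    return True

# Example call (adjust the path to your own environment):
# patch_file(Path("~/anaconda3/envs/nlp/lib/python3.9/site-packages/"
#                 "doccano_transformer/examples.py").expanduser())
```

Running it a second time leaves the file unchanged, because the old strings are no longer present.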
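Related reading
- Extracting entities from a BIO sequence (NER, named entity recognition)
That article converts BIO-tagged data into the following format: {'string': '我是李明,我愛中國,我來自呼和浩特', 'entities': [{'word': '中國', 'type': 'loc'}, {'word': '呼和浩特', 'type': 'loc'}]}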
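As a rough illustration of that conversion, the sketch below collects entities from per-character BIO tags and returns a dictionary in the same shape. It is not the linked article's implementation. The function name `bio_to_entities` and the way the tags are built for the sample sentence are assumptions made for this example.

```python
def bio_to_entities(text, tags):
    """Collect entities from per-character BIO tags such as 'B-loc' / 'I-loc' / 'O'."""
    entities = []
    word, ent_type = "", None
    for char, tag in zip(text, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != ent_type):
            # A new entity starts; an I- tag without a matching B- is
            # treated as a start as well.
            if word:
                entities.append({"word": word, "type": ent_type})
            word, ent_type = char, label
        elif prefix == "I":
            word += char  # continue the current entity
        else:
            # 'O' tag: close any open entity.
            if word:
                entities.append({"word": word, "type": ent_type})
            word, ent_type = "", None
    if word:
        entities.append({"word": word, "type": ent_type})
    return {"string": text, "entities": entities}

# Build character-level tags for the sentence quoted above.
text = "我是李明,我愛中國,我來自呼和浩特"
tags = ["O"] * len(text)
for entity in ("中國", "呼和浩特"):
    start = text.index(entity)
    tags[start] = "B-loc"
    for i in range(start + 1, start + len(entity)):
        tags[i] = "I-loc"

result = bio_to_entities(text, tags)
print(result)
```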
References
- [1] GitHub issue: https://github.com/doccano/doccano-transformer/issues/35
- [2] Pull request that fixes this bug: https://github.com/doccano/doccano-transformer/pull/38/files