...

In addition to general domain adaptation, many telcos face the challenge of multilingual datasets, in which a mix of languages (typically English plus one or more local languages) appears in the data (syslogs, wikis, tickets, chat conversations, etc.). While many options exist for both generative LLMs [11] and text embedding models [12], few foundation models have seen enough non-English data during training, so the choice of foundation model is somewhat restricted for operators working with non-English data. One workaround is to apply automated language detection and translation models to the data as a preprocessing step, as sketched below.
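As a minimal sketch of such a preprocessing step, the snippet below detects each record's language and routes non-English text through a translation model before indexing. The `langdetect` package and the Helsinki-NLP/opus-mt-de-en model from Hugging Face are illustrative choices rather than a prescribed stack, and the `to_english` helper is hypothetical.

```python
# Sketch of language-aware preprocessing: detect each record's language
# and translate non-English text to English before downstream use.
# langdetect and the Helsinki-NLP model are illustrative choices.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from transformers import pipeline

# Illustrative per-language translation models; extend as needed
# for whichever languages actually occur in the operator's data.
TRANSLATORS = {
    "de": pipeline("translation", model="Helsinki-NLP/opus-mt-de-en"),
}

def to_english(text: str) -> str:
    """Detect the language of a record and translate non-English text."""
    try:
        lang = detect(text)
    except LangDetectException:
        # Too short or ambiguous to detect; pass through unchanged.
        return text
    if lang == "en" or lang not in TRANSLATORS:
        return text
    return TRANSLATORS[lang](text)[0]["translation_text"]

# Example: normalize a mixed-language batch of tickets before indexing.
tickets = ["Link down on port 3", "Verbindung zum Router unterbrochen"]
normalized = [to_english(t) for t in tickets]
```

Translating at ingestion time keeps the index in a single language, so an English-only embedding model can be reused unchanged; the trade-off is that any translation errors propagate into every downstream task.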

In conclusion, while foundation models and transfer learning have been shown to work very well on general human language when pretraining is done on large corpora of human text (such as Wikipedia, or the Pile [29]), it remains an open question whether domain adaptation and downstream task adaptation work equally well on the kinds of domain-specific, semi-structured, mixed-modality datasets found in telecom networks. To enable such adaptation, telecoms should focus on standardization and data governance efforts, such as standardized and unified data collection policies and high-quality structured data.

...