Topic modelling and text classification models for applications within EFSA

VANDEVOORT, Brecht; BEX, Geert Jan; CREVECOEUR, Jonas; NEVEN, Frank

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41621

Full metadata record

DC Field	Value	Language
dc.contributor.author	VANDEVOORT, Brecht	-
dc.contributor.author	BEX, Geert Jan	-
dc.contributor.author	CREVECOEUR, Jonas	-
dc.contributor.author	NEVEN, Frank	-
dc.date.accessioned	2023-10-26T09:25:45Z	-
dc.date.available	2023-10-26T09:25:45Z	-
dc.date.issued	2023	-
dc.date.submitted	2023-10-16T11:22:24Z	-
dc.identifier.citation	EFSA Supporting Publications, 20 (8) (Art N° 8212E)	-
dc.identifier.uri	http://hdl.handle.net/1942/41621	-
dc.description.abstract	This report presents an overview of topic modelling and classification models in relation to four case studies in the EFSA project OC/EFSA/AMU/2020/02. As adequate document embeddings have a positive influence on the effectiveness of topic modelling as well as text classification, a wide range of options for word and document embeddings is discussed. It was found that a multitude of increasingly complex embeddings are readily available for off-the-shelf use. However, because they are trained on large but mostly general text corpora, their utility for domain-specific text varies. Fine-tuning or creating document embeddings from scratch is only feasible in the presence of enough data and has an associated computational cost. For some domains (like scientific articles), pretrained embeddings are available. For topic modelling, we discuss standard techniques like non-negative matrix factorization and latent Dirichlet allocation as well as more recent methods based on clustering of document embeddings like Top2Vec and BERTopic. For text classification, we consider hierarchical text classification approaches combined with established techniques for text classification via document embeddings. We propose a selection of techniques for each of the case studies, justify their choice, and present a plan for evaluation. Finally, we discuss our findings after having implemented and validated the selected techniques.	-
dc.language.iso	en	-
dc.publisher		-
dc.subject.other	Natural Language Processing	-
dc.subject.other	Topic Modelling	-
dc.subject.other	Text Classification	-
dc.title	Topic modelling and text classification models for applications within EFSA	-
dc.type	Journal Contribution	-
dc.identifier.issue	8	-
dc.identifier.volume	20	-
local.format.pages	112	-
local.bibliographicCitation.jcat	A3	-
local.type.refereed	Non-Refereed	-
local.type.specified	Article	-
local.bibliographicCitation.artnr	8212E	-
dc.identifier.doi	10.2903/sp.efsa.2023.EN-8212	-
dc.identifier.eissn		-
local.provider.type	Pdf	-
local.uhasselt.international	no	-
item.contributor	VANDEVOORT, Brecht	-
item.contributor	BEX, Geert Jan	-
item.contributor	CREVECOEUR, Jonas	-
item.contributor	NEVEN, Frank	-
item.fullcitation	VANDEVOORT, Brecht; BEX, Geert Jan; CREVECOEUR, Jonas & NEVEN, Frank (2023) Topic modelling and text classification models for applications within EFSA. In: EFSA Supporting Publications, 20 (8) (Art N° 8212E).	-
item.fulltext	With Fulltext	-
item.accessRights	Open Access	-
crisitem.journal.issn	2397-8325	-
Appears in Collections:	Research publications

Files in This Item:

File	Description	Size	Format
published_version.pdf	Published version	3.98 MB	Adobe PDF
