A balanced dataset should provide a comprehensive representation of the domain in which we carry out classification. It is not trivial, however, to make sure that the classifier gets the right amount of information for each class, in terms of not only quantity but also knowledge and content.
In the context of a terrorist incident classifier, I will talk about a case of overfitting on underrepresented features, our early attempts at mitigating features that carry an undesirably strong signal, and a suggestion for addressing the problem systematically by assessing the salience of words in the dataset.
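As a rough illustration of what assessing word salience could look like, the sketch below scores each word by a smoothed log-odds ratio between one class and the rest. The abstract does not say which measure the talk uses, so the scoring function, the name `word_salience`, the whitespace tokenizer, and the toy documents and labels are all assumptions made for this example.

```python
import math
from collections import Counter


def word_salience(docs, labels, target, alpha=1.0):
    """Smoothed log-odds of each word occurring in `target`-class text
    versus text from all other classes (hypothetical helper)."""
    in_class, out_class = Counter(), Counter()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        (in_class if label == target else out_class).update(tokens)

    vocab = set(in_class) | set(out_class)
    # Add-alpha smoothing so words unseen on one side do not give log(0).
    n_in = sum(in_class.values()) + alpha * len(vocab)
    n_out = sum(out_class.values()) + alpha * len(vocab)
    return {
        w: math.log((in_class[w] + alpha) / n_in)
        - math.log((out_class[w] + alpha) / n_out)
        for w in vocab
    }


# Toy data, invented purely for illustration.
docs = [
    "bomb attack reported in paris",
    "attack in paris injures several",
    "paris hosts film festival",
    "festival opens in berlin",
]
labels = ["incident", "incident", "other", "other"]

scores = word_salience(docs, labels, target="incident")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words at the top of `ranked` lean toward the target class. In a dataset where a place name appears mostly in incident reports, it would score high here as well, which is one way to spot a feature carrying an undesirably strong signal.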
Linguistics Resource Developer @ Recorded Future
Danila has a background in linguistics and studied language technology in Gothenburg. She currently works as a linguistics resource developer at Recorded Future. Like many other companies, Recorded Future struggles with creating good datasets for machine learning on vast amounts of text. Danila will talk about her experience with datasets at Recorded Future and with fighting issues such as imbalanced classes and overfitting in the context of a terrorist event classifier.