% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/curate_enntt_data.R
\name{curate_enntt_data}
\alias{curate_enntt_data}
\title{Curate ENNTT Data}
\usage{
curate_enntt_data(dir_path)
}
\arguments{
\item{dir_path}{A string. The path to the directory containing the ENNTT
data files. Must be an existing directory.}
}
\value{
A tibble containing the curated ENNTT data with columns:
\itemize{
\item session_id: Parliamentary session identifier
\item speaker_id: Speaker's MEP ID
\item state: Representative's state/country
\item session_seq: Sequential position in session
\item text: Speech content
\item type: Corpus type identifier
}
}
\description{
This function processes and curates ENNTT (European Parliament) data from a
specified directory.
It handles both .dat files (containing XML metadata) and .tok files
(containing text content).
}
\details{
The function expects a directory containing paired .dat and .tok files with
matching base names, as found in the raw ENNTT data release
(\url{https://github.com/senisioi/enntt-release}).
The .dat files should contain XML-formatted metadata with attributes:
\itemize{
\item session_id: Unique identifier for the parliamentary session
\item mepid: Member of European Parliament ID
\item state: Country or state representation
\item seq_speaker_id: Sequential ID within the session
}

The .tok files should contain the corresponding text content, one entry per
line.
}
\examples{
# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_enntt", package = "qtkit")
curated_data <- curate_enntt_data(example_data)

str(curated_data)
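
# A minimal sketch of the paired-file layout described in Details.
# Assumptions: the directory name "enntt_layout_demo" and the file names
# "sample" and "orphan" are hypothetical placeholders, and the files are left
# empty. This only checks that each .dat file has a matching .tok file; it
# does not call curate_enntt_data() and is not how the function itself
# validates its input.
demo_dir <- file.path(tempdir(), "enntt_layout_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("sample.dat", "sample.tok", "orphan.dat")))

# Collect base names (file name without extension) for each file type
dat_names <- tools::file_path_sans_ext(list.files(demo_dir, pattern = "[.]dat$"))
tok_names <- tools::file_path_sans_ext(list.files(demo_dir, pattern = "[.]tok$"))

# Base names that have a .dat file but no .tok partner
unpaired <- setdiff(dat_names, tok_names)
unpaired

# Remove the temporary demo directory
unlink(demo_dir, recursive = TRUE)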

}
