Lupa Digital is an initiative that applies AI and data mining to large-scale journalistic content, exploring the evolving news landscape through the lens of metajournalism.
Identify and classify news articles across a wide range of domains.
Extract relevant information and preprocess data to support analysis.
Detect trends and reveal patterns across topics, sources, and time periods.
Transform large volumes of archived news into structured, analyzable datasets, enabling a deeper understanding of entities, events and themes.
To support this pipeline, the following tools and techniques were used.
Data Extraction and Mining
Archived news pages were processed in chunks of 10,000, with checkpoints saved for fault tolerance and resumability. The stack included:
PySpark for parallelized processing across large datasets
BeautifulSoup for HTML parsing and web scraping
BloomFilter to efficiently detect and skip duplicate articles
ML model to classify documents as news or non-news
TextBlob and VADER for sentiment analysis
spaCy for lemmatization, tokenization, and NER for topic extraction
Probabilistic counters for scalable topic frequency estimation
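The chunked, duplicate-skipping part of this loop can be sketched in miniature. This is a hand-rolled Bloom filter plus a hypothetical dedup_chunk helper, illustrative only; the project itself runs the equivalent logic under PySpark with checkpointing.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: k hash probes into a bit array of m bits."""
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _probes(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

def dedup_chunk(urls, seen):
    """Keep only URLs not yet in the Bloom filter. A Bloom filter has no
    false negatives; its small false-positive rate means a few fresh
    articles may be skipped, which is acceptable at this scale."""
    fresh = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            fresh.append(u)
    return fresh
```

The filter uses constant memory regardless of how many articles have been seen, which is why it suits million-page archives better than an exact set.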
Example: Web Scraping
Raw text recovered from an archived Jornal de Negócios page. The site-navigation boilerplate ("Cotações Mercados …") and the "\xa0" non-breaking-space escape are kept to show what the scraper sees before cleaning; the headline translates as "Scenario of Greece leaving the euro is back".

Cenário de saída da Grécia do euro está de regresso - União Europeia - Jornal de Negócios Cotações Mercados Análise Fundamental Técnica Portfolio Gestor Accões Favoritas Estatísticas Taxas de Juro Câmbios Subscrever assinar Login Registe-se \xa0Cenário de saída da Grécia do euro está de regresso União Europeia Cenário de saída da Grécia do euro está de regresso O provável novo embaixador dos EUA junto da União Europeia antecipa que a Grécia saia do euro dentro de um ano ou ano e meio. Já o ministro alemão das Finanças voltou a alertar para esse risco caso o Governo grego não aplique as reformas tal como acordadas a…
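The project does this extraction with BeautifulSoup; as a dependency-free stand-in, the same idea (drop script/style content, collapse the rest to whitespace-normalized text) can be sketched with the standard library's HTMLParser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style>/<noscript> content.
    A stdlib stand-in for the BeautifulSoup parsing used in the project."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def page_to_text(html):
    """Reduce an archived page to one whitespace-normalized string,
    navigation text and all, as in the raw example above."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())
```

Note that, as the example above shows, this keeps menu and login text; separating boilerplate from article body is a later filtering step.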
Example: Data Mining
Topic Standardization
Number of unique topics: 1457234
+-------------------------+
|value |
+-------------------------+
|P2 Ípsilon Ímpar Fugas P3|
|laboratório |
|galp |
|Galp |
|O Banco de Portugal |
|Saúde Covid-19 |
+-------------------------+
Problems:
Inconsistent casing (galp vs Galp)
Leading definite articles (O Banco de Portugal)
Multiple concepts grouped into a single topic (Saúde Covid-19)
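The unique-topic count above (1457234) is the kind of quantity the "probabilistic counters" bullet refers to: distinct counts over millions of strings can be estimated in fixed memory instead of materializing the full set. A minimal pure-Python Durand-Flajolet LogLog sketch, illustrative only and not the project's implementation:

```python
import hashlib

def loglog_estimate(items, b=6):
    """LogLog distinct-count estimate: hash each item to 64 bits, use the
    top b bits to pick one of m = 2**b buckets, and track per bucket the
    max position of the first 1-bit in the remaining bits. Memory is m
    small integers, independent of the number of distinct items."""
    m = 1 << b
    buckets = [0] * m
    for it in items:
        h = int.from_bytes(hashlib.md5(str(it).encode()).digest()[:8], "big")
        j = h >> (64 - b)                      # bucket index from top b bits
        w = h & ((1 << (64 - b)) - 1)          # remaining 64-b bits
        rank = (64 - b) - w.bit_length() + 1   # position of first 1-bit
        buckets[j] = max(buckets[j], rank)
    alpha = 0.39701  # LogLog bias-correction constant for m >= 64
    return alpha * m * 2 ** (sum(buckets) / m)
```

With m = 64 buckets the standard error is roughly 1.30/sqrt(m), about 16%, which is adequate for order-of-magnitude reporting like the count above.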
Topic Standardization
Idea: Map each topic to a list of terms for normalization
# split topics with more than 2 spaces
df = df.withColumn("tokens", F.split(F.col("value"), " {2,}"))

# standardize topics using mapping
# either using a Portuguese dictionary (Natura Dictionary)
dictionary = spark.read.text("wordlist_utf8.txt")
...
word_map_broadcast = spark.sparkContext.broadcast(word_map)

# or the most common representation ("galp" vs "Galp")
tokens_df = tokens_df \
    .withColumn("std_token", F.lower(F.col("token"))) \
    .groupBy("std_token", "token") \
    .agg(F.count("*").alias("count")) \
    .withColumn("rank", F.row_number().over(window_spec)) \
    .filter(F.col("rank") == 1) \
    .orderBy("std_token") \
    .select("std_token", "token", "count")
...

# split topics into possible subtopics
def split_topics(token):
    """
    ex.: Banco de Portugal -> banco, Portugal, Banco de Portugal
         Leonor Pereira    -> Leonor Pereira
         laboratório       -> laboratório
         EDP Gás           -> EDP, gás, EDP Gás
    """
    ...

df = df.rdd.flatMap(lambda row: [[row[0], split_topics(row[1])]]) \
    .toDF(["topic", "token"])

# remove invalid topics using regex
def date_detection(token):
    months = bool(re.search(r"\b(?:Jan|Fev|Abr|Mai|Jun|Jul|Ago|Set|Out|Nov|Dez)\b", token))
    march = bool(re.search(r"\b(?:\d{1,2} Mar|Mar \d{2,4})\b", token))
    months2 = bool(re.search(r"^(janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro)\b", token, flags=re.IGNORECASE))
    dateD = bool(re.search(r"\b(Janeiro|Fevereiro|Março|Abril|Maio|Junho|Julho|Agosto|Setembro|Outubro|Novembro|Dezembro)[ ]?\d{2,4}\b", token, flags=re.IGNORECASE))
    dateY = bool(re.search(r"\b\d{1,2}(?:\s+de)?\s+(janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro)\b", token, flags=re.IGNORECASE))
    days = bool(re.search(r"\b(segunda-feira|terça-feira|quarta-feira|quinta-feira|sexta-feira|sábado|domingo)\b", token, flags=re.IGNORECASE))
    hours = bool(re.search(r"\b\d{1,2}:\d{2}\b", token)) or bool(re.search(r"\b\d{1,2}h\d{2}\b", token))
    minutes = bool(re.search(r"\b\d{1,2}m\d{2}\b", token))
    return months or march or months2 or dateD or dateY or days or hours or minutes

def invalid_tokens_detection(token):
    return date_detection(token) or ...

df = df.rdd.flatMap(lambda row: [(row[0], [token for token in row[1]
                                           if not invalid_tokens_detection(token)])])
# among other transformations
# e.g.: invalid characters, ...
df.write.mode("overwrite").json("topics.json")
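The body of split_topics is elided above. One plausible implementation consistent with its docstring examples, where the dictionary and stopword sets below are small illustrative stand-ins for the Natura wordlist, is:

```python
ARTICLES = {"o", "a", "os", "as"}
STOPWORDS = {"de", "da", "do", "das", "dos", "e"}
# illustrative stand-in for the Natura Dictionary of common nouns
DICTIONARY = {"banco", "gás", "laboratório"}

def split_topics(token):
    """Emit dictionary common nouns lowercased, keep non-dictionary words
    as-is, and append the full phrase; phrases made only of non-dictionary
    words (likely proper names) are kept whole."""
    parts = token.split()
    if parts and parts[0].lower() in ARTICLES:
        parts = parts[1:]                    # strip a leading article ("O Banco ...")
    phrase = " ".join(parts)
    content = [w for w in parts if w.lower() not in STOPWORDS]
    if len(content) <= 1:
        return [phrase] if phrase else []
    if all(w.lower() not in DICTIONARY for w in content):
        return [phrase]                      # proper name: keep only the full phrase
    sub = [w.lower() if w.lower() in DICTIONARY else w for w in content]
    return sub + [phrase]
```

This reproduces the four docstring cases, e.g. "O Banco de Portugal" becomes [banco, Portugal, Banco de Portugal] while "Leonor Pereira" stays whole.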
Topic Standardization
+-------------------------+--------------------------------------------------+
|token |topic |
+-------------------------+--------------------------------------------------+
|laboratório |[laboratório] |
|galp |[Galp] |
|Galp |[Galp] |
|O Banco de Portugal |[banco, Portugal, Banco de Portugal] |
|Saúde Covid-19 COVID     |[Saúde, Covid-19, Covid, Saúde Covid-19 COVID]    |
|José Mourinho |[José Mourinho] |
|11 de janeiro            |[]                                                |
+-------------------------+--------------------------------------------------+

New issues introduced during processing:

+-------------------------+--------------------------------------------------+
|António Costa            |[António, costa, António Costa]                   |
|CHEGA                    |[chega]                                           |
|25 de abril              |[]                                                |
|Revolução de 25 de abril |[Revolução, Revolução de 25 de abril]             |
+-------------------------+--------------------------------------------------+
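These new issues arise because dictionary words hiding inside proper names ("costa", "chega") get lowercased and split. One mitigation, sketched here under the assumption that a set of entity names is available from the spaCy NER pass mentioned earlier (the set below is an illustrative sample, not project data):

```python
# entity surface forms assumed to be harvested from the NER pass
KNOWN_ENTITIES = {"António Costa", "CHEGA", "Revolução de 25 de Abril"}
KNOWN_LOWER = {e.lower(): e for e in KNOWN_ENTITIES}

def safe_standardize(token, split_fn):
    """Return the canonical form for recognised named entities, bypassing
    splitting and lowercasing; otherwise fall back to the normal pipeline
    step passed in as split_fn (e.g. split_topics)."""
    canon = KNOWN_LOWER.get(token.strip().lower())
    if canon is not None:
        return [canon]
    return split_fn(token)
```

With such a guard, "CHEGA" would no longer be folded into the verb "chega", at the cost of maintaining the entity set.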
Data Post-processing
Remove duplicated links
Replace original topics with the cleaned versions
Generate a "significant token" column: lemmatized, lowercased, accent-free tokens with more than 4 mentions (used for topic-based querying)
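The significant-token step can be sketched as follows. The helper name is hypothetical, lemmatization (done with spaCy in the project) is omitted for brevity, and accent folding uses Unicode NFKD decomposition:

```python
import unicodedata
from collections import Counter

def significant_tokens(topic_lists, min_mentions=5):
    """From per-article topic lists, build lowercased, accent-free tokens
    and keep only those with more than 4 mentions (i.e. >= 5)."""
    def fold(tok):
        # NFKD splits accented letters into base + combining marks,
        # which are then dropped: "Saúde" -> "saude"
        norm = unicodedata.normalize("NFKD", tok.lower())
        return "".join(c for c in norm if not unicodedata.combining(c))

    counts = Counter(fold(t) for topics in topic_lists for t in topics)
    return {tok for tok, n in counts.items() if n >= min_mentions}
```

Accent folding matters here because queries like "saude" and "Saúde" should hit the same topic bucket.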