Exploring the Effectiveness of AWS Lambda and Knative in a Serverless Web Crawler: A Comparative Study

Pruscini, Davide (2024) Exploring the Effectiveness of AWS Lambda and Knative in a Serverless Web Crawler: A Comparative Study. [Master's thesis], Università di Bologna, Degree Programme in Computer Science [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under license: Creative Commons: Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0)

Abstract

The Internet has become a key resource for accessing and sharing information. However, not all content found on it can be considered legitimate, and tools such as web crawlers can help search for violations. In this thesis, carried out in collaboration with Kopjra, we develop a web crawler application capable of automatically visiting a website, extracting its URLs and indexing the HTML documents of its web pages, so as to enable keyword searches. We compare two serverless implementations, based on AWS Lambda and Knative, with a third, microservice-based implementation that exploits the resources made available by Kubernetes. It is also possible to choose between two search methods: HTTP requests or browser automation. To support the application, two microservices were developed, a backend and a frontend, and an Elasticsearch cluster was deployed, which is necessary for proper ingestion of web page content. A series of tests makes it possible to compare the different implementations and to identify the critical issues of each.

Document type
Degree thesis (Master's degree)
Thesis author
Pruscini, Davide
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM A: SOFTWARE TECHNIQUES
Degree programme regulations
DM270
Keywords
Cloud Computing, FaaS, Serverless, Docker, Kubernetes, AWS Lambda, AWS SQS, AWS SNS, Knative, RabbitMQ, CloudEvents, Web Crawler, Web Scraper, Browser Automation, Puppeteer, MongoDB, Elasticsearch, Amazon CloudWatch, Prometheus, Grafana, InfluxDB, Telegraf
Thesis defence date
14 March 2024
URI
