#1 #60 Databricks and Snowflake bark, HAProxy passes by and graphs the garbage collections

Hosted by Julien Durillon

In this landmark episode, hard as it is to number, we welcome Mathieu Ancelin and talk about: PlanetScale's fundraising, the war between Databricks and Snowflake, HAProxy's 20th anniversary, SQL query resource usage, the better performance of our old PS/2 keyboards, and an open-source Apple tool for analyzing garbage collection logs, before finishing with music... hint: it's not Mozart.
Clever Cloud

Timecodes & links:

00:00:00 Introducing the guests

00:02:00 PlanetScale is now generally available
https://planetscale.com/blog/ga
https://vitess.io/

  • $50M in Series C funding
  • Vitess’s maintainers (clustering system for MySQL)
    • Connection pooling
    • Query de-duping
    • Transaction rate manager
    • Virtually seamless dynamic re-sharding

00:06:29 The war between Databricks and Snowflake
https://databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html

  • Databricks, a Snowflake competitor (data platform)
  • TPC: Transaction Processing Performance Council
    • The 1980s were the era of the Wild West of database benchmarking
  • TPC-DS benchmark record for its data lakehouse technology
    • TPC-DS is a decision support benchmark with audited results.
    • 99 queries over 100TB: 3.108 seconds
    • 2.7x faster and between 7x and 12x better in terms of price performance
    • outperformed the previous record, held by Alibaba, by 2.2x
  • removing the DeWitt Clause from our service terms
    • a new provision that prohibits people (researchers, scientists, or competitors) from publishing any benchmarks of Oracle’s database systems.
    • It’s a primary reason you often see benchmarks comparing anonymous systems, sometimes referred to as DBMS-X, in research papers and why many benchmarks are completely absent.
    • Benchmark clause @ Google
      • a) must seek permission before disclosing results
      • b) must provide repro details
      • c) must allow Google to test my services

https://www.snowflake.com/blog/industry-benchmarks-and-competing-with-integrity/

  • Results fairly close to Databricks
  • Price is more like 267 compared to 1791
  • Sign up and try it with an already loaded dataset
  • Removed the DeWitt Clause

https://databricks.com/blog/2021/11/15/snowflake-claims-similar-price-performance-to-databricks-but-not-so-fast.html

  • New score from Snowflake includes a self-published prebaked data set
  • Using official TPC-DS dataset, time to execute 99 queries is doubled

00:13:00 Willy Tarreau on HAProxy at Its 20-Year Anniversary
https://www.haproxy.com/blog/willy-tarreau-on-haproxy-at-its-20-year-anniversary/

  • HAProxy is 20 years old
    • Happy birthday
    • Willy Tarreau, founder of HAProxy
  • Timeline (https://www.haproxy.com/history/)
    • 1999 – Zprox
      • Testing tool developed to gauge how an application would perform when facing lots of clients with 28 Kbps modems
    • 2000 – Zprox
      • Modified to include regex-based header rewriting, with a minimalistic config language.
      • Keywords introduced: listen, server
    • 2001 – HAProxy 1.0
      • Developed to offload traffic from hardware load balancers
    • 2002 – HAProxy 1.1
      • Simple round-robin scheduler
      • Simple health checks
      • Improved its logging capabilities
      • Cookie insertion
    • 2003 – HAProxy 1.2
      • IPv6 support on the client side
      • Replaced the wait-queue linked list with an rbtree
      • Introduced maxconn setting
      • Keywords introduced: except, forwardfor
    • 2006 – HAProxy 1.3
      • Elastic Binary Trees within the internal scheduler
      • TCP scripting
      • Explicit source port ranges
      • Interface binding
    • 2009 – HAProxy 1.4
      • RDP protocol support with server stickiness and user filtering
      • Client-side Keep-Alive
      • HTTP authentication support
      • ACL-based persistence
    • 2010 – HAProxy 1.5
      • SSL and compression
      • Data sampling
      • Server-side keep-alive
      • DDoS protection
    • 2015 – HAProxy 1.6
      • Lua scripting
      • Server-side connection multiplexing
      • Dynamic buffer allocation
      • Replaced zlib with an in-house stateless implementation
    • 2016 – HAProxy 1.7
      • HAProxy Runtime API
      • Server hot reconfiguration
      • SPOE (Stream Processing Offload Engine)
      • Introduced content processing agents & multi-type certs
    • 2017 – HAProxy 1.8
      • Improved HAProxy Runtime API
      • Introduced multithreading
      • Dynamic Cookies
      • New mux layer
    • 2018 – HAProxy 1.9
      • HTX – internal HTTP representation
      • End-to-End HTTP/2 (enabling gRPC)
      • Improved queue priority control
      • Improved the scalability of the multithreading feature
    • 2019 – HAProxy 2.0 & 2.1
      • Cloud-native threading and logging
      • HAProxy Kubernetes Ingress Controller
      • HAProxy Data Plane API
      • Prometheus exporter
      • Dynamic SSL Certificate Updates
      • FastCGI
      • Improved debugging
      • Native Protocol Tracing
    • 2020 – HAProxy 2.2 & 2.3
      • Fully Dynamic SSL Certificate Storage
      • Improved idle connection management
      • Native Response Generator
      • Health Check System Overhaul
      • Syslog Protocol (UDP/TCP)
      • OpenTracing (SPOE)
      • SSL/TLS Environments
      • Improved Cache
    • 2021 – HAProxy 2.4
      • HTTP/2 WebSockets
      • FIX & MQTT Protocols
      • Dynamic SSL Certificate Storage
      • Built-in OpenTracing
      • DNS TCP Resolution
  • Google Cloud Load Balancer outage

00:21:00 Forecasting SQL query resource usage with machine learning
https://blog.twitter.com/engineering/en_us/topics/insights/2021/forecasting-sql-query-resource-usage-with-machine-learning

  • SQL powered by Presto over Hadoop and Google cloud storage
  • Problems:
    • Avoid the system being overwhelmed by resource-consuming queries
    • Data system customers would like to know the resource consumption estimation of their queries.
    • Elastic scaling needs query resource usage forecasting.
  • Forecast typically done with query plans generated from SQL engines
  • The system:
    • learns from plain SQL statements
    • builds machine learning models from historical query request logs without dependency on any SQL engines or query plans.
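
The idea of learning from plain SQL text rather than from query plans can be sketched roughly as follows. This is only a minimal illustration and not the system described in the article: the bag-of-words features, the nearest-centroid classifier, the `low`/`high` resource buckets, and the `HISTORY` sample log are all assumptions made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sql):
    """Lowercase the statement and keep word tokens only (no SQL parser, no query plan)."""
    return re.findall(r"[a-z_]+", sql.lower())


def featurize(sql):
    """Bag-of-words feature vector: token -> number of occurrences."""
    return Counter(tokenize(sql))


def train(history):
    """Average the feature vectors of past queries for each resource bucket."""
    sums = defaultdict(Counter)
    counts = Counter()
    for sql, bucket in history:
        sums[bucket].update(featurize(sql))
        counts[bucket] += 1
    return {
        bucket: {tok: n / counts[bucket] for tok, n in vec.items()}
        for bucket, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(model, sql):
    """Return the bucket whose average vector is closest to the new statement."""
    features = featurize(sql)
    return max(model, key=lambda bucket: cosine(features, model[bucket]))


# Hypothetical historical request log: (SQL text, observed resource bucket).
HISTORY = [
    ("SELECT id FROM users WHERE id = 1", "low"),
    ("SELECT name FROM users WHERE id = 42 LIMIT 1", "low"),
    ("SELECT a.x, COUNT(*) FROM events a JOIN clicks b ON a.id = b.id GROUP BY a.x", "high"),
    ("SELECT user, SUM(v) FROM events JOIN sessions ON events.s = sessions.s "
     "GROUP BY user ORDER BY user", "high"),
]

model = train(HISTORY)
print(predict(model, "SELECT email FROM users WHERE id = 7"))
```

A real forecaster would need far more history and a stronger model; the point here is only that the input is the raw statement, with no dependency on a SQL engine.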

Facebook PCI card: https://engineering.fb.com/2021/08/11/open-source/time-appliance/

Spending $5K to learn how database indexes work
https://briananglin.me/posts/spending-5k-to-learn-how-database-indexes-work/

00:34:00 PS/2 keyboards perform better than USB
https://blogmotion.fr/systeme/les-claviers-ps2-plus-performants-que-usb-18944
https://www.youtube.com/watch?v=As44YzdnqqE&list=PLTbQvx84FrATz-mQ5-C6U7vr8shnC_C3i&index=70
https://www.youtube.com/watch?v=nXYXLuqsllY&list=PLTbQvx84FrATz-mQ5-C6U7vr8shnC_C3i&index=91

  • PS/2 keyboards send interrupts directly to the processor.
  • USB relies on regular polling. If you hammer the “right arrow” key between two polls (a few milliseconds apart, depending on whether your processor is loaded or not), only one press is registered.
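
The difference described above can be illustrated with a toy simulation. It assumes a deliberately simplified model in which the device keeps a single "pressed since the last poll" flag that the host reads and clears at each poll; the 8 ms interval and the press timestamps are made-up values, not measurements.

```python
def count_interrupt_presses(press_times_ms):
    """Interrupt-driven model: every key press raises its own event."""
    return len(press_times_ms)


def count_polled_presses(press_times_ms, poll_interval_ms):
    """Polling model: all presses that happen between two polls collapse
    into the single flag picked up by the next poll."""
    polls_that_saw_a_press = {
        # Index of the first poll at or after the press (integer ceiling).
        (t + poll_interval_ms - 1) // poll_interval_ms
        for t in press_times_ms
    }
    return len(polls_that_saw_a_press)


# Hypothetical burst: three quick presses, then two slower ones (milliseconds).
PRESSES = [1, 3, 6, 10, 20]

print("interrupts:", count_interrupt_presses(PRESSES))
print("polled every 8 ms:", count_polled_presses(PRESSES, 8))
```

In this model the three presses at 1, 3 and 6 ms all land before the poll at 8 ms and count as one, while the interrupt-driven count sees each of them.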

00:41:15 GCGC: Garbage Collection Graph Collector by Apple
https://github.com/apple/GCGC

  • Jupyter notebook interface to analyze GC log files.
  • 17 generated plots, which analyze latency, concurrent and stop-the-world events, heap information, allocation rates, frequencies of events, and event summaries
  • Because the tool relies on Jupyter notebooks for data visualization, the provided plots are easy to customize.
  • Supports Shenandoah/G1/ZGC (some edge cases are known and not handled automatically)
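
To give a feel for the kind of data such plots are built from, here is a minimal sketch that extracts stop-the-world pause durations from GC log lines. It does not use GCGC's code or API; the regular expression assumes the JDK unified logging layout produced by `-Xlog:gc`, and the `SAMPLE_LOG` lines are made up for illustration.

```python
import re

# Assumed line layout:
# [uptime][level][tags] GC(id) Pause <name> <before>M-><after>M(<heap>M) <time>ms
PAUSE_RE = re.compile(
    r"\[(?P<uptime>\d+\.\d+)s\]"
    r".*?GC\((?P<gc_id>\d+)\) "
    r"(?P<event>Pause .*?) "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<heap>\d+)M\) "
    r"(?P<millis>\d+\.\d+)ms"
)


def parse_pauses(lines):
    """Return one dict per pause line; lines that do not match are skipped."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append({
                "uptime_s": float(m["uptime"]),
                "gc_id": int(m["gc_id"]),
                "event": m["event"],
                "heap_before_mb": int(m["before"]),
                "heap_after_mb": int(m["after"]),
                "pause_ms": float(m["millis"]),
            })
    return pauses


def summarize(pauses):
    """Basic latency summary over the parsed pauses."""
    durations = [p["pause_ms"] for p in pauses]
    return {"count": len(durations), "max_ms": max(durations), "total_ms": sum(durations)}


# Illustrative lines only, not taken from a real application.
SAMPLE_LOG = [
    "[0.312s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 25M->3M(254M) 2.541ms",
    "[0.850s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 60M->5M(254M) 4.100ms",
    "[1.200s][info][gc] GC(2) Concurrent Cycle",
    "[1.420s][info][gc] GC(2) Pause Remark 12M->12M(254M) 1.250ms",
]

pauses = parse_pauses(SAMPLE_LOG)
print(summarize(pauses))
```

The resulting list of dicts is the sort of table that could then be loaded into a notebook and plotted against uptime.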

00:46:00 gentle closing music: MESHUGGAH – Bleed
https://youtu.be/qc98u-eGzlc?t=6
