Data
- Data lake (no structure enforced on ingest) vs. data warehouse (structured data, ready to be consumed)
- OLTP (transactional: inserts/updates) vs. OLAP (analytical queries)
- NoSQL: MongoDB, Cassandra
- Batch (bounded volume) vs. real-time/streaming (continuous)
- Data flow: ingestion > storing > processing > visualizing
- ETL: extract, transform, load vs. ELT: load to storage first, transform later (see the sketch after this list)
- Transformations: formatting, labeling, filtering, validating
- Sources and sinks (destination)
- Data modeling: conceptual (entities), logical (structures), physical (implementation)
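- A minimal ETL sketch in Python (file names and fields are hypothetical; in ELT the raw rows would be loaded to storage first and transformed inside the warehouse):
import csv, json

# Extract: read raw rows from the source (hypothetical file).
with open("orders_raw.csv", newline="") as source:
    rows = list(csv.DictReader(source))

# Transform: validate, filter, and format before loading (the "T" in ETL).
cleaned = [
    {"order_id": r["order_id"], "amount": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write to the sink (destination).
with open("orders_clean.json", "w") as sink:
    json.dump(cleaned, sink)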
Data Lifecycle
- Collect (Ingest) > Store > Process > Explore & Visualize
Data ingestion
- Realtime
- Large number of sources
- Requires low latency
- Pub/Sub -> Cloud Dataflow
- Batch
- Small number of sources
AIML
- AI > ML > Deep Learning (Neural network)
Services
Database types
- Relational vs. Non-relational/NoSQL.
- SQL (structured query language): consistency over performance.
- ACID: atomicity, consistency, isolation, durability (see the sketch after this list).
- Atomicity: all statements in a transaction commit together, or none do.
- Consistency: each transaction moves the database from one valid state to another (constraints stay consistent across tables).
- Isolation: concurrent transactions produce the same result as if they ran sequentially.
- Durability: once committed, data survives crashes and infrastructure failures.
- (+) Consistent, reliable, data integrity.
- (-) Scale, performance, unstructured data.
- NoSQL: performance over consistency.
- Unstructured, key-value, wide-column.
- Some offer ACID transactions (e.g. Datastore).
- Use cases: IoT, user profiles, high-speed analytics.
- Eventually consistent.
- Databases: Redis, MongoDB, Cassandra, HBase, Bigtable, RavenDB.
- (+) Scale, performance.
- (-) Consistency, data integrity.
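- A minimal sketch of atomicity using Python's built-in sqlite3 (the 'accounts' table is hypothetical): both updates of the transfer commit together, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # one transaction: both UPDATEs commit together, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on error the context manager has already rolled back (atomicity)

print(conn.execute("SELECT id, balance FROM accounts").fetchall())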
Persistent disk
- SAN
Cloud Storage
- Object storage
- Tiers:
- Multi-regional > Dual region > Regional
- Standard > Nearline > Coldline > Archive (same latency, higher retrieval cost for colder classes; see the sketch after this section)
- Versioning + redundancy
- Data lake
- Control
- IAM for broad (project/bucket) access; ACLs for granular (per-object) control
- Project > bucket > object level
- Data transfer
- Storage transfer service (sources: S3, HTTP, other GCS buckets)
- Data transfer appliance (100TB or 480TB - from corporate data center)
- Serving
- Static website
- Import to databases (e.g. Cloud SQL)
- Data analysis (e.g. BigQuery, CloudML, Cloud Dataproc)
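- A sketch with the google-cloud-storage Python client (bucket name, location, and object path are hypothetical); the storage class and location are set when the bucket is created:
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()  # uses Application Default Credentials

# Hypothetical bucket with the Nearline class (colder class = cheaper storage,
# same latency, higher retrieval cost).
bucket = client.bucket("my-example-bucket")
bucket.storage_class = "NEARLINE"
client.create_bucket(bucket, location="us-central1")

# Upload an object; it inherits the bucket's default storage class.
blob = bucket.blob("raw/2024/events.csv")
blob.upload_from_filename("events.csv")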
Logging and Monitoring
- Managed databases
- Built-in features: logs, audits, metrics: disk I/O, CPU, network
- No additional configuration needed
- No application-level data
- Unmanaged databases: running on GCE (e.g. MySQL, MongoDB, Cassandra)
- Custom features with Agents!
- Logging agent: stackdriver logging = application logs (configure with fluentd)
- Monitoring agent: stackdriver monitoring = application performance/metrics/alerts
- fluentd: open-source data collector; lets you unify data collection and consumption.
Cloud SQL: managed; regional; cross-region read replicas (RR)
- Relational database
- Regional
- Regional master/slave with regional persistent disk
- Regional or cross-regional read-replicas. Can failover/DR and promote read-replica in another region in case of failure.
- Binary logs: for point-in-time recovery, replication.
- Managed: backup, patches, upgrades (during maintenance window), monitoring and logging, auto disk size scale, failover (master/slave), network access control
- Storage up to ~10TB; regional read replicas
- Load through GCS
- SQL dump (.sql, tables/schema) or CSV (.csv) files; they can't contain triggers, views, or stored procedures.
- Compress to .gz to save cost
- Performance best practices
- IOPS scale with disk size
- Choose an instance with enough memory to hold your largest table (avoids page swapping)
- SQL best practices
- INNER JOIN vs. WHERE
- SELECT A, B, C vs. SELECT *
- Normalize tables
- A stopped instance isn't charged for compute, but storage is still charged.
- Transferring data to BigQuery goes through Cloud Storage (export to GCS, then load).
- Cloud SQL for PostgreSQL HA: active/standby instances share a regionally replicated persistent disk
Memorystore: managed redis
- Simple key-value pairs
Firestore: document store; no-ops; regional / multi-regional
- Serverless - Document store
- Collection - nested collection (sub-collection)
- Collection ‘crew’ document ‘jeanluc’
- Reference: to another collection
- Data is manipulated through the API (.add(), .update(); see the sketch after this list)
- Automatic single-field indexes
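- A sketch of those API calls with the google-cloud-firestore Python client (following the 'crew'/'jeanluc' example above; field names are hypothetical):
from google.cloud import firestore  # pip install google-cloud-firestore

db = firestore.Client()

# Collection 'crew', document 'jeanluc'; set() creates or overwrites the document.
jeanluc = db.collection("crew").document("jeanluc")
jeanluc.set({"rank": "captain", "ship": "enterprise"})

# update() changes individual fields; add() creates a document with an auto ID.
jeanluc.update({"rank": "admiral"})
db.collection("crew").add({"rank": "ensign", "name": "wesley"})

# Nested sub-collection under a document, and a reference to another document.
jeanluc.collection("missions").add({"name": "farpoint"})
jeanluc.update({"ship_ref": db.collection("ships").document("enterprise")})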
Datastore: ACID! Transactional! NoSQL! Replaced by Firestore
- Single Datastore database per project
- No-Ops (fully managed), highly scalable (NoSQL - multi-region/sharding/replication)
- Not suitable for:
- Analytics (use Bigtable/BigQuery instead)
- Extreme scale (10M+ reads/writes per second): use Bigtable; Datastore suits datasets under ~1TB
- If you don’t need ACID, use Bigtable
- Relational workloads / lift-and-shift migrations (use Cloud SQL)
- Near zero latency (use Cloud Memorystore)
- Multi-regional or regional: the location is set at creation and CANNOT be changed
- Changing it effectively means re-creating the database (historically, a new project)
- Datastore is transactional database vs. Bigtable is analytical database.
- Data organization
- Kind (category of entities) ~ table
- Entities have a hierarchy (entity groups within the same namespace) ~ rows
- Parent: referenced by the parent entity's key (e.g. Key(Users, 12345))
- Each entity has one or more properties and a key (auto-generated or a chosen column) ~ columns / primary key (see the sketch after this section)
- Indexes and querying
- Avoid: exploding indexes!
- Narrow index scope (selective, not all properties)
- Impact: storage and degraded performance
- Types: built-in (single property), and composite
- Composite indexes:
gcloud datastore create-indexes index.yaml
- Query
- Using the console’s GUI
- Using GQL (Google query language)
- Only indexed properties can be used in a query's WHERE clause
- Data consistency: performance vs. accuracy
- Strongly consistent: parallel processes see changes in order. Slower latency for transactions to complete. e.g. financial transactions (deposit / withdrawal)
- Eventually consistent: processes may see changes out of order, but eventually reach the correct state.
- Modes
- Cloud Firestore is the next generation of Cloud Datastore. You can create a Firestore database in Datastore mode.
- Native: Firestore mode, supports real-time synchronization
- Datastore: Datastore’s system behavior + Firestore’s storage layer.
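- A sketch of the kind/entity/key hierarchy with the google-cloud-datastore Python client (kind and property names are hypothetical):
from google.cloud import datastore  # pip install google-cloud-datastore

client = datastore.Client()

# Kind ~ table, entity ~ row, properties ~ columns, key ~ primary key.
parent_key = client.key("Users", 12345)            # Key(Users, 12345)
task_key = client.key("Tasks", parent=parent_key)  # child in the same entity group

task = datastore.Entity(key=task_key)
task.update({"description": "file report", "done": False})
client.put(task)

# Ancestor queries within an entity group are strongly consistent.
query = client.query(kind="Tasks", ancestor=parent_key)
query.add_filter("done", "=", False)
results = list(query.fetch())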
Bigtable: NoSQL, analytical; HBase-compatible; managed; cross-region clusters
- Single key
- The Bigtable paper inspired much of the NoSQL industry (e.g. HBase, Cassandra)
- Petabyte
- Ballpark cost: 1TB with 3 nodes ≈ $1.5K/mo
- Up to ~10,000 write queries per second per node
- Use cases
- Under ~1TB: use Datastore; at 1TB and beyond, consider Bigtable
- NoSQL, high throughput analytics
- Huge dataset
- Infrastructure
- Clients > Front-end server pool > Cluster (Nodes: store meta-data) > Colossus (SSTables: sharded) > Shared Log
- Separate storage (SSTables/tablets on Colossus) vs. compute (nodes)
- Configuration
- NOT a no-ops, must configure nodes
- Entire Bigtable project is called an “instance”, which has nodes and clusters
- Nodes are grouped into clusters
- 1 or more clusters per instance
- To achieve strong consistency, configure the app to use a single cluster
- Instance types
- DEV: single node, no replication
- PROD: 3+ nodes per cluster, replication, guaranteed throughput
- Can’t change HDD to SSD, need to copy data to a new instance
- Storage is auto-scaled
- Replication: data is synchronized between clusters, across-region
- Resizing: add/remove nodes/clusters with no downtime
- Changing disk type (HDD/SSD) requires new instance
- SSD gives better latency/IOPS; HDD is cheaper and acceptable for large, throughput-oriented scans.
- Interaction: cbt (cloud bigtable) tool or HBase shell
- Use Key Visualizer to identify hotspot node
- Data organization
- One big table: can have thousands of columns / billions of rows
- Table is sharded across tablets
- Table components
- Row Key: first column
- Columns are grouped into column families
- Each cell is addressed by {row key, column family:column qualifier} (e.g. cf1:cq1)
- Indexing and queries
- Only the row key is indexed
- Field promotion: move fields from column data to row key
- Schema design
- Per table: only the row key is indexed
- Keep all entity info in a single row. Related entities should be in adjacent rows.
- Well-defined row keys distribute load across nodes instead of a single node (avoids hotspotting)
- Combine all queried parameters into the row key; queries can only search on information in the row key (see the sketch after the cbt commands below)
- Best practices
- Reverse domain name: larger scope/qualifier > smaller scope/qualifier
- String identifier e.g. userID
- Timestamp (YYYYMMDD): should not be at front/alone
- For time series data: use tall+narrow instead of short+wide
- gcloud commands
gcloud components update
gcloud components install cbt
-- need to have a .cbtrc file with project name and instance name
cbt createtable TABLE-NAME
cbt ls
cbt createfamily TABLE-NAME FAMILY-NAME
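- A sketch of the row-key guidance above with the google-cloud-bigtable Python client (project, instance, table, and column names are hypothetical, and the table/family are assumed to already exist, e.g. via the cbt commands); note the reverse-domain + identifier + timestamp key, with the timestamp not leading:
import datetime
from google.cloud import bigtable  # pip install google-cloud-bigtable

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("metrics")  # assumes table exists

# Row key: combine the queried fields; keep the timestamp out of the leading
# position so sequential writes don't hotspot a single node.
ts = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
row_key = f"com.example#sensor42#{ts}".encode()

row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5")  # family cf1, qualifier 'temperature'
row.commit()

# Reads are served by the row key only (the single index).
got = table.read_row(row_key)
print(got.cells["cf1"][b"temperature"][0].value)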
Cloud Spanner: managed; region/multi-region
- Challenges the CAP theorem (consistency, availability, partition tolerance)
- May sacrifice Availability for Consistency/Partition Tolerance
- Is called NewSQL: No-Compromise Relational Database Service
- Regional / Multi-regional / Global
- Five 9s (99.999%) SLA, i.e. roughly 5 minutes of downtime per year
- Support standard SQL format (ANSI 2011)
- Horizontal: add more instances (nodes) to share the load. Vertical: upgrade single instance with more CPU/RAM.
- Is NOT MySQL/PostgreSQL compatible. Migration to/from requires work!
- Architecture: similar to Bigtable
- Nodes (compute) are separate from storage; storage can spread across zones/regions
- Per node: ~10K reads/s; ~2K writes/s; 2TB storage
- Add nodes if CPU utilization is too high (recommendation is < 45%)
- Data is replicated across zones/regions consistently
- Avoid hotspotting: colocate write-heavy workloads in the same region as the leader
- IAM
- Project, Instance or Database level
- roles/spanner.PERMISSION
- Admin: full access
- Database Admin: create/edit/delete databases, grant access to databases
- Database Reader: read/execute database/schema
- Viewer: view instances, database - CANNOT modify or read from database
- Data organization and schema
- RDBMS: table, row, column. Supports joins, queries, etc.
- Tables are handled differently
- Tables are parent/child relationship
- Interleaved data layout (nested tables): e.g. Singers(1) > Albums(1,1) > Songs(1,2,1); the parent key plays the role of a foreign key built into the child's key (see the DDL sketch after this section)
- The parent's primary key is used in the child table and carried along into grandchildren.
- Avoid hotspotting: ‘cos nodes are distributed by key space (key range)
- No sequential numbers
- No timestamp (‘cos sequential)
- Data export from Spanner to Cloud Storage is done through Dataflow
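- A sketch of the interleaved Singers > Albums layout, issued through the google-cloud-spanner Python client (instance and database IDs are hypothetical); the child table repeats the parent's primary key as the leading part of its own key:
from google.cloud import spanner  # pip install google-cloud-spanner

client = spanner.Client()
database = client.instance("my-instance").database("music")

ddl = [
    """CREATE TABLE Singers (
         SingerId INT64 NOT NULL,
         Name STRING(1024)
       ) PRIMARY KEY (SingerId)""",
    # Child rows are stored physically next to their parent row (interleaved);
    # SingerId from the parent leads the child's composite primary key.
    """CREATE TABLE Albums (
         SingerId INT64 NOT NULL,
         AlbumId INT64 NOT NULL,
         Title STRING(1024)
       ) PRIMARY KEY (SingerId, AlbumId),
       INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
]
database.update_ddl(ddl).result()  # long-running operation; wait for the schema change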
Cloud Pub/Sub: no-ops; global
- Realtime: unbounded data vs. batch (bounded data)
- Continuous flow of data
- Often requires real-time collection, processing, storing in 1-step for low latency.
- Tightly vs. loosely coupled systems (Pub/Sub decouples producers from consumers)
- Intermediary / buffer / shock absorber
- Global-scale, no-ops, global availability, auto-scaling
- Guaranteed at-least-once delivery
- m-to-n asynchronous: a topic can have more than one subscription.
- Components: Topics, Messages, Publishers, Subscribers, Message Store (queue per subscriber)
- Support both push and pull
- Push: lower latency, closer to real time. Requires subscribers to expose webhook endpoints that accept HTTPS POST.
- Pull: suited to large volumes of messages (batch); see the sketch at the end of this section.
- IAM: project, topic or subscription level
- Admin, Editor, Publisher, Subscriber
- Pricing: data volume (per GB per month)
- Message ordering: not guaranteed by default. Can be mitigated with Dataflow or a timestamp attribute in the message.
- vs Kafka
- Global support
- Hybrid workload: connect Kafka (on premises) with Pub/Sub (GCP)
- Using connector: open source plug-ins that connect Kafka to GCP (Pub/Sub, Dataflow, BigQuery)
- Source connector: upstream connector (to Kafka)
- Sink connector: downstream connector (from Kafka)
- Monitoring subscriber health: using Stackdriver
- Backlog in Pub/Sub queue (subscriptions): slow subscribers or error in acknowledging messages.
- Pub/Sub Lite: zonal service.
- Backup: point-in-time recovery with Snapshots. Good for deploying a new application version with the option to roll back.
- Advance
- Seek: allows for replaying / discarding of messages
- Retain: messages stay in Topic after being ACKed.
- Snapshot: point-in-time rollback
- Limitations
- Message ordering
- Message size: <10MB
- Message retention duration: 10 minutes to 7 days
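- A sketch of publish and synchronous pull with the google-cloud-pubsub Python client (project, topic, and subscription names are hypothetical):
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

project = "my-project"

# Publisher: message data is bytes, plus optional attributes (e.g. a timestamp
# attribute a subscriber or Dataflow can use to re-order messages).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "orders")
publisher.publish(topic_path, b'{"order_id": 42}', event_time="2024-01-01T00:00:00Z").result()

# Subscriber (pull): fetch a batch, process it, then acknowledge. Unacked
# messages are redelivered - at-least-once delivery.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "orders-sub")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
ack_ids = [m.ack_id for m in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})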
Cloud Dataflow: auto-scale, no-ops, stream and batch, regional
- Process both batch and real-time data. Solution: Beam + Cloud Dataflow
- Built on top of Apache Beam. Configuration is 100% code-based
- Integrates natively with other tools: Pub/Sub, BigQuery, AI Platform. Connectors: Bigtable, Apache Kafka
- IAM
- Only at the project level (i.e. all pipelines or none)
- Pipeline data access is separate from pipeline access
- Dataflow Admin / Developer / Viewer / Worker (Worker is used for service accounts): these roles alone don't grant the ability to view the pipeline's data.
- Workers use controller service account (GCE instance service account)
- Retrieve meta-data
- Can be replaced by user-managed controller service account
- Dataproc: managed Apache Hadoop/Spark service
- Choose Dataproc when migrating existing Hadoop, or when you prefer a DevOps (cluster) model over serverless
- Dataproc can’t process both stream and batch
- Key concepts
- Element: single entry of data (e.g. data row)
- PCollection: distributed data set, input / output
- Transform: data processing operation (step) in pipeline
- ParDo: parallel processing - transform applied to individual elements (e.g. extracting elements)
- DoFn: user-defined function used by ParDo
- Aggregation: e.g. GroupByKey / CoGroupByKey (merge keyed collections, like joining tables). See the pipeline sketch at the end of this section.
- Dealing with late/out of order data
- Resolved with: Windows + Watermarks + Triggers
- Windows: logically, by time span
- Watermarks: track event time vs. processing time (timestamp provided by the source and/or Pub/Sub) to estimate when a window's data has arrived
- Triggers: determine when results in window are emitted. Allow late-arriving data to be included.
- Best practices
- Handling errors: try-catch block. Output errors to new PCollection > Pub/Sub/GCS. Recycle ‘bad’ data.
- Know your window types:
- Global
- Fixed time
- Sliding time (overlapping fixed time windows)
- Session
- Can specify endpoint region: good for keeping data in a region (wrt regulatory requirements)
- MapReduce workload can be migrated to Dataflow
- Use DirectRunner to run jobs locally (good for DEV)
- Update Dataflow pipelines
- Update job (clone): creates a new job with the same name but a new jobId. Old transforms are mapped to new ones with a transform mapping; in-flight (buffered) data is transferred to the new job using that mapping.
- Cloud Dataflow SQL: Apache Beam SQL
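- A sketch of the Beam concepts above (PCollection, ParDo/DoFn, fixed windows, aggregation), runnable locally with the DirectRunner; the element values are hypothetical:
import apache_beam as beam  # pip install apache-beam
from apache_beam.transforms import window

class AddTimestamp(beam.DoFn):                      # DoFn: user code run by ParDo
    def process(self, element):
        user, value, event_time = element
        # Attach the event time so windowing uses it instead of processing time.
        yield window.TimestampedValue((user, value), event_time)

with beam.Pipeline() as p:                          # DirectRunner by default (good for DEV)
    (p
     | beam.Create([("ann", 3, 10), ("bob", 5, 70), ("ann", 2, 75)])  # a PCollection
     | beam.ParDo(AddTimestamp())                   # element-wise transform
     | beam.WindowInto(window.FixedWindows(60))     # 60-second fixed windows
     | beam.CombinePerKey(sum)                      # aggregation within each window
     | beam.Map(print))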
Cloud Dataproc: managed Hadoop/Spark clusters; not no-ops; regional
- Hadoop ecosystem: Hadoop, Spark (with MLlib for ML), Pig, Hive (data warehouse / analysis)
- MapReduce: take big data, distribute (map) to many workers, stitch (reduce) back results from individual/parallel processes/workers.
- Managed Apache Hadoop/Spark clusters (create cluster, deployment, job submission, manual scaling, monitoring)
- On-demand, managed but not no-ops
- Must configure cluster (manual scaling)
- Use case: migrate existing Hadoop/Spark system
- HDFS is available (but may not be optimal)
- IAM
- Project level only (primitive and predefined roles)
- Dataproc
- Editor: full
- Viewer: read only
- Worker: assigned to service accounts: read/write GCS, write to Cloud Logging.
- Configuration
- Master node > worker nodes > preemptible (PVM) worker nodes; HDFS lives on the worker nodes' persistent disks
- Master node contains YARN resource manager
- Autoscaling
- Optional secondary worker nodes with preemptible VMs
- Dataproc manages the preemptible VMs, re-adding them if they are preempted.
- Requires at least 2 standard worker nodes before adding preemptible nodes.
- Cost optimization: add preemptible nodes; prefer high-memory machines over SSDs
- More worker nodes + secondary nodes > faster processing
- Performance improvement: with local SSD (during cluster creation)
- Common initialization scripts through public GCS bucket (e.g. Kafka initialization)
- Pricing: underlying GCE instances (master nodes, worker nodes)
- Dataproc image versions (e.g. 2.0 on Debian or Ubuntu) include pre-packaged and optional libraries (Hadoop, Spark, Hive, Pig, Flink)
- Support custom Dataproc image with customized libraries
- To access the Dataproc Hadoop web GUI (port 8088, 9870).
- Open firewall port
- Using SOCKS proxy
- Migrating and optimizing
- Migrate data first, through GCS, Apache HBase to Bigtable, Apache Impala/Hive to BigQuery.
- POC with small data set first
- Optimizing cost:
- Delete the cluster once the jobs are done. Before deleting, move Hive/Impala data to BigQuery, HBase data to Bigtable, and HDFS data to GCS.
- Copy data to GCS: by installing connector or copy manually
- Update file prefix in scripts: from hdfs:// to gs://
- Use Dataproc, and run against/output to GCS.
- Consider migrating jobs to Dataflow/BigQuery/etc
- preemptible, faster processing, delete (compute) once done. Dataproc separates storage and compute.
- Optimise performance
- Use regional cloud storage bucket: data is nearer to the cluster -> better performance
- Use larger persistent disk; SSD over HDD
- Allocate more VMs
- Autoscaling
- Is NOT compatible with: HA clusters, HDFS storage, Spark structured streaming
- Native output connectors: GCS, Bigtable, BQ
BigQuery: serverless, no-ops, regional/multi-regional; the Hive equivalent on GCP
- Data warehouse: near real-time analysis of massive amount of data (PB)
- Storage and Analysis. Can process data on BQ or GCS
- Location: multi-region (US, EU), regional
- Support standard SQL (2011) or legacy SQL
- “3rd wave” of computing - Google Big Data Stack 2.0
- Storage (Colossus) and compute (Dremel) are separated, connected by a petabit network (Jupiter)
- How BQ works
- Jobs can scale up to thousands of CPU across many nodes.
- Types: Load, Export, Query, Copy
- Priority: Interactive, Batch
- Columnar data store (a column's data is stored in the same storage volume, so it is faster than going across different storage volumes). Not designed for updates.
- IAM
- Project, dataset, view, table
- Authorized view: allow users to access certain subset of data.
- Table is ACID compliant
- Admin: full
- Data Owner: full to owned datasets
- Data Editor: edit dataset
- Data Viewer: view datasets and tables
- Job User: run jobs
- User: run queries, and create datasets (but not tables)
- Use authorized views to grant access to query results without granting access to the underlying tables
- Dataset can be shared, or public (allUsers, allAuthenticatedUsers)
- Cache query: avoid querying original table again
- Per user only
- Standard SQL
project.dataset.table vs. legacy SQL [project:dataset.table]
- User defined function (UDF)
- Combine SQL with JavaScript/SQL functions / programming logic (e.g. loop)
- WebUI: only usable with Legacy SQL
- Load and export data: BigQuery transfer service
- Batch load: GCS
- Set Schema: “automatically detect”
- Streaming insert: Dataflow
- Read as external source: Drive, GCS, Bigtable
- Dataproc to BQ: pre-installed BigQuery connector (through GCS for staging)
- Export:
- To GCS
- Copy table from dataset to another dataset or clone
- Use a wildcard when exporting files > 1 GB (the export is split into multiple files)
- Query GCS:
- Supports: CSV, JSON, Avro, Parquet, Datastore backups, Firestore exports.
- Can load multiple files at once using a wildcard (*)
- BigQuery Transfer Service
- Import to BQ from other Google’s services such as Ads, DoubleClick, YouTube reports
- Best practices
- Avoid SELECT *
- De-normalize data: group data into a single table (good for reads, not for writes); use nested JSON (RECORD)
- LIMIT does NOT affect cost (cost is based on data scanned)
- Filter early with big WHERE clause
- Do biggest joins first
- Data format for import: best Avro compressed
- Partition instead of sharding into many tables (table-count limits, per-table overhead); usually by timestamp
- By ingestion time
- By custom column (timestamp or date)
- Clustered tables (require a partitioned table): act like an index, dividing reads by the specified column fields (see the sketch at the end of this section)
- The order of the clustering fields (the combined key) matters
- Further reduces data scanned, since blocks outside the clustering-key ranges are skipped
- Using slots (limit 2000 per project). Use flat-rate (pre-paid for slots) if you need more than 2000 slots.
- Multi-region dataset -> more reliable, but higher cost
- Backup: BQ keeps a built-in 7-day point-in-time snapshot history. Restore with the @<time> decorator in ms, e.g. table@-3600000
- Use approximation function instead of aggregation function (APPROX_COUNT_DISTINCT() vs. COUNT(DISTINCT))
- Preview tables and check the query price (dry run) before running
- Query processing details
- Wait (yellow), Read (purple), Compute (joins)(orange), Write (blue)
- WITH abc AS (...): named subquery (CTE), works like an inline view
- Schema
- Support nested table (table in a table) with RECORD type (JSON)
- Streaming inserts
- Streamed rows sit in the streaming buffer (up to ~90 min) before appearing in the preview/GUI, but they can be queried immediately.
- Monitoring and Logging
- Monitoring: resource, performance, capacity, alerts (for long queries), slot utilization
- Logging: has audit trails; contains query executed.
- Pricing
- Storage, query, streaming inserts
- Pay-as-you-go, flat rate (2000 slots), slots
- Data older than 90 days is automatically moved to long-term storage (cheaper). No need to move it to GCS.
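- A sketch with the google-cloud-bigquery Python client tying the best practices above together: create a column-partitioned, clustered table, then query only the needed columns with an early partition filter (project, dataset, table, and column names are hypothetical):
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.sales.orders"

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="created_at")  # partition by column
table.clustering_fields = ["customer_id"]                                # clustering key
client.create_table(table)

# Avoid SELECT *; filter early on the partition column to limit data scanned.
sql = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales.orders`
    WHERE created_at >= TIMESTAMP('2024-01-01')
    GROUP BY customer_id
"""
for row in client.query(sql).result():
    print(row.customer_id, row.total)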
Machine learning
- Supervised learning: data with labels
- Regression: continuous, numeric variables (e.g. stock price)
- Classification: categorized values (e.g. gender, negative/positive, yes/no)
- Unsupervised learning: clustering, finding patterns
- Reinforcement learning: uses negative/positive rewards (reinforcement) to learn to complete a task
- Data
- Label: e.g. housing price
- Features: e.g. size, #rooms, location (fed to the model as feature columns)
- MSE (mean squared error) = (1/n) Σ (y_i − ŷ_i)²
- Transfer learning: reuse a pre-trained model on a new task (the approach behind AutoML)
- Linear model: z = wx + b (w = weight, b = bias); loss = L(w), which training minimizes
- Feature engineering: process data
- Missing-data types: MCAR, MAR, NMAR
- Imputation: replaced by mean/median, fixed constant, most frequent values, K-nearest neighbor
- Outliers and feature clipping
- One-hot encoding
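- A sketch of the imputation, clipping, and one-hot encoding steps above using pandas (column names and values are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "size":  [70.0, None, 120.0, 95.0],       # numeric feature with a missing value
    "color": ["red", "blue", "red", "green"]  # categorical feature
})

# Imputation: replace missing numeric values with the column mean (could also be
# the median, a fixed constant, the most frequent value, or a KNN estimate).
df["size"] = df["size"].fillna(df["size"].mean())

# Feature clipping: cap outliers, here at the 95th percentile.
df["size"] = df["size"].clip(upper=df["size"].quantile(0.95))

# One-hot encoding: one 0/1 column per category.
df = pd.get_dummies(df, columns=["color"])
print(df)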
- Google knowledge graph search API
- AutoML Edge: export custom trained models, optimized for edge devices.
- Firebase for mobile device deployment
- AutoML Tables vs. BQML
- AutoML takes longer: it preprocesses the input data and tries different models
- Operating ML model
- Data drift / Concept drift
- IAM
- Project and models
- Models only
Neural network:
- An epoch = one pass through the training data
- Neural network adjusts
- Weights: multiplication to input values
- Bias: constant added to the weighted sum (the output when the weighted inputs contribute 0)
- Data: training, test, evaluation
- Parameters: variables the network adjusts automatically during training (e.g. weights, biases)
- Hyper-parameters: set before training; they control the training process (e.g. number of hidden layers) as well as how the data is fed in
- Types
- Model hyperparameters
- Algorithm hyperparameters
- Examples: batch size, training epochs, #hidden layers, regularization, learning rate
- Features: model inputs (e.g. x, x²). Feature engineering: combining/transforming features (e.g. feature crosses)
- Loss: error for training or testing data
- Learning rate: the size of each weight/parameter adjustment step against the loss; determines how fast (and how stably) training converges
- Deep and wide
- Wide - memorization: many features, fewer hidden layers
- Deep - generalization: many hidden layers
- Deep and wide combined: good for recommendation systems
- Over-fitting: too specific to training data
- Causes: not enough training data; too many features; model fitted to features unique to the training data (i.e. noise)
- Solution:
- more training data; variety of training data
- make model less complex: less but more relevant features (combine relevant features to one)
- remove noise: increase ‘regularization’ parameter
- Options
- Regularization
- Increase training data
- Feature selection
- Early stopping
- Cross-validation
- Dropout layers (neural networks): weights/neurons are randomly set to 0 during training
- Regularization:
- Add penalty if the model gets more complex
- Types
- L1: lasso regression - assigns greater importance to the most influential features and drives the rest toward zero. Good when there are many features
- L2: ridge regression - good for models whose features have roughly equal influence
- Under-fitting: error is high even on the training set (model too simple)
- Data source: GCS is recommended; BigQuery can also be used
- Platform: master node and worker nodes
- Prediction: single, batch
- ML API: encode input in base64 format
- AI platform
- Scale tier: defines master, worker, and parameter servers
- Basic, Standard, Premium, and Custom
- Local training:
gcloud ml-engine local train
Datalab: replaced by Notebooks
- Built on top of Jupyter (formerly iPython): interactive python
- Run on GCE (+VPC and Source Repository)
- Share notebook between users (i.e. share code)
Dataprep: offered in partnership with Trifacta (third party)
- Backed by Cloud Dataflow
- Process
- Import
- Transform with recipes
- Run
- Export results (GCS, BQ)
- Intelligent suggestions: for simple tasks
- IAM
- Dataprep User
- Dataprep Service Agent: gives Trifacta the necessary access to project resources
- Pricing: 1.16 * Dataflow job
- Can click on the black bar to select all missing values
- Join datasets: add a new Join step
Data Studio
- Web based BI - drag and drop report builder
- Use GWS identity for access/sharing permission, not IAM
- Files saved in Google Drive
- Connect to many services (Google, Google Cloud, and others)
- BQ, GCS, Cloud SQL, Spanner
- YouTube, Analytics, Sheets, AdWords
- Local upload
- Caching
- Query cache: remembers the queries issued by report components.
- Can NOT be turned off
- Will get from pre-fetch if not found
- Pre-fetch cache: 'smart cache'; predicts data that might be requested.
- Can be turned off
- Will get from live data set if not found
- Only active for data sources that use the owner's credentials for data access
- Cache expires in 12h
- Refresh: the Refresh Data button, or update the data-freshness setting on the connector
- Embed
- To website: using iframe
- Public report can be viewed by everyone. Private report can be viewed by users logged in with Google account
- Pricing
- Free, but data query (e.g. BigQuery query) is chargeable
Cloud Composer
- Fully managed Apache Airflow implementation
- Programmatically create, schedule, and monitor data workflow
- Cloud Composer = Apache Airflow + GKE + GCS
- workflows: Cloud Composer creates workflows using DAG files
- Orchestrate data pipelines
- Format: DAG (Directed Acyclic Graph), written in Python (see the sketch after this section)
- Setup
- Configure environment variables
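- A minimal Airflow DAG sketch of the kind Composer schedules (DAG ID, task IDs, and commands are hypothetical; operator import paths follow Airflow 2.x):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # Composer runs the Airflow scheduler for you
    catchup=False,
) as dag:
    # Tasks and their dependencies form the directed acyclic graph.
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> load               # 'extract' must finish before 'load' starts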
GAPS
- L1, L2 regularization
- Combining co-dependent and redundant features allows you to reduce the total number of features trained without sacrificing accuracy.
- L2 regularization is used to reduce weights for overfitted training models. L2 regularization is more relevant when all features have relatively equal weights/influence
- L1 regularization is able to reduce the weights of less important features to zero or near zero
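- A minimal numpy sketch of how the L1 and L2 penalties above enter the loss (the weights, base loss, and λ are hypothetical):
import numpy as np

w = np.array([2.0, -0.5, 0.0, 3.1])  # model weights
base_loss = 0.42                      # e.g. mean squared error on the training data
lam = 0.01                            # regularization strength (penalty grows with model complexity)

l1_loss = base_loss + lam * np.sum(np.abs(w))  # L1 (lasso): pushes small weights to zero
l2_loss = base_loss + lam * np.sum(w ** 2)     # L2 (ridge): shrinks all weights evenly
print(l1_loss, l2_loss)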