Understanding AI
What You Should Know Before You Trust It
A Practical Guide for Engineering Students
Prepared by:
Dr. Mohd Hanif Md Ramli
Senior Lecturer, Faculty of Mechanical Engineering
Universiti Teknologi MARA (UiTM)
February 2026
Preface
This document is based on an actual conversation between the author and Claude, an AI assistant developed by Anthropic. The questions posed during that conversation were deliberately chosen to probe the boundaries, limitations, and reliability of large language models (LLMs). The responses have been structured and expanded here for educational purposes.
The goal is not to discourage you from using AI. AI tools like ChatGPT, Claude, Gemini, and others are powerful and can accelerate your learning and productivity significantly. However, as engineering students, you are being trained to think critically, verify assumptions, and never accept results without understanding their source and validity. The same discipline applies when working with AI.
An AI that sounds confident is not the same as an AI that is correct. Your job as an engineer is to know the difference.
1. How AI Models Are Built
Large language models like Claude, ChatGPT, and Gemini are not databases. They do not store information in tables and retrieve it on demand. Instead, they learn patterns from massive volumes of text during a process called training. The training data is compressed and encoded into billions of numerical parameters (weights) within a neural network.
When you ask a question, the AI does not look up an answer. It generates a response by predicting the most likely sequence of words based on the patterns it learned during training. This is fundamentally different from searching a database or consulting a textbook.
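The prediction step described above can be sketched with a toy softmax over hand-picked scores. The vocabulary and logit values below are invented for illustration only; a real model computes its scores with billions of learned weights.

```python
# Minimal sketch of next-token prediction over a toy vocabulary.
# The logits here are hand-picked for illustration, NOT from a real model.
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical continuation candidates for the prompt
# "The boiling point of water is 100 ..."
vocab = ["degrees", "percent", "metres", "kelvin"]
logits = [4.0, 0.5, 0.2, 1.5]   # invented scores

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
# The model emits the highest-scoring token (or samples from the
# distribution) — it never "looks up" the fact in a database.
```

Note that nothing in this loop checks whether the chosen token is true; it only checks which continuation is statistically most likely.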
1.1 Where Does the Training Data Come From?
The training data for modern LLMs comes from several broad categories:
| Source Category | Description | Implication |
|---|---|---|
| Internet (Web Crawls) | Billions of web pages scraped from the open internet, including news sites, blogs, forums, Wikipedia, and social media. | Largest portion of training data. Quality varies enormously — from peer-reviewed content to personal blogs. |
| Books | Digitised books, both public domain and potentially copyrighted. Textbooks, novels, reference works. | Provides deeper, structured knowledge. Exact sources are not fully disclosed by any AI company. |
| Academic Papers | Open-access repositories like arXiv, PubMed, and institutional papers. | Enables technical reasoning in STEM fields, but coverage depends on what was openly available. |
| Code Repositories | GitHub, GitLab, and open-source codebases. | Enables code generation. Quality depends on which repositories were included. |
| Government Documents | Legal texts, patents, public records, and institutional publications. | Provides factual and regulatory knowledge, but coverage varies by country. |
1.2 What Is NOT Transparent
No major AI company — not Anthropic (Claude), OpenAI (ChatGPT), or Google (Gemini) — has published a complete breakdown of their training data. This means:
- You cannot verify exactly which books, papers, or websites were included.
- You do not know which sources were weighted more heavily than others.
- You cannot determine whether certain perspectives were filtered or boosted.
- Private or licensed datasets may have been used without public disclosure.
The training data composition directly shapes what the AI knows, what it is biased toward, and what it is blind to. You, the user, have no way to audit it.
2. The Truth Level Problem
A common assumption is that AI has a configurable "truth level" — a setting that determines how honest it is. This is not how it works. There is no honesty dial that providers adjust up or down.
However, AI behaviour is shaped in several important ways:
2.1 What Is Configured
| Aspect | How It Is Shaped |
|---|---|
| Tone and Personality | The AI is designed to be warm, helpful, and conversational. This is not neutral — it is a deliberate design choice. |
| Refusal Boundaries | Certain topics are restricted: weapons instructions, malicious code, copyrighted content. The AI will decline or redirect. |
| Framing of Sensitive Topics | Political, ethical, and social topics are presented in an even-handed manner rather than taking strong positions. |
| Uncertainty Expression | The AI is trained to say "I don’t know" rather than fabricate answers — but this does not always work as intended. |
2.2 The Fundamental Limitation
The AI can be honest within its boundaries, but it cannot see past them. If there is a blind spot in the training data or a constraint the AI is not aware of, it cannot disclose what it does not know it does not know.
This is not a theoretical concern. In practice, AI regularly:
- Presents outdated information as current fact.
- Blends peer-reviewed findings with low-quality blog content into the same confident tone.
- Generates citations that sound legitimate but do not actually exist (hallucination).
- Presents minority scientific positions as mainstream if they appeared frequently in training data.
An AI is honest within a frame that its developers built. The AI did not choose the frame. As a user, you must always be aware that the frame exists.
3. Language Bias
Not all languages are represented equally in AI training data. The disparity is significant and has real consequences for users in different linguistic communities.
3.1 The Language Hierarchy
| Coverage Level | Languages | Implication |
|---|---|---|
| Strong | English (dominant), French, German, Spanish, Chinese, Japanese, Portuguese, Korean | Good depth, nuance, idiom accuracy, and cultural context. English far exceeds all others. |
| Moderate | Bahasa Malaysia/Indonesia, Arabic, Thai, Vietnamese, Turkish, Hindi | Functional but reduced accuracy in idioms, cultural nuance, and domain-specific terminology. |
| Weak / Absent | Thousands of indigenous, regional, and minority languages (Iban, Kadazan, most African languages, etc.) | Severely degraded or non-functional. Users in these languages receive significantly worse service. |
3.2 Why This Happens
Training data comes predominantly from the internet. The internet is dominated by English and a handful of other high-resource languages. Languages with smaller digital footprints — fewer websites, fewer digitised books, fewer academic publications — are underrepresented.
There are approximately 7,000 languages in the world. Modern AI handles fewer than 100 with meaningful competence. This is not a deliberate exclusion — it is a structural consequence of how training data is collected.
3.3 What This Means for You
If you are working in Bahasa Malaysia, the AI can function, but its depth, precision, and cultural sensitivity will be noticeably lower than in English. Technical terminology, local idioms, and domain-specific Malay vocabulary may be handled imprecisely. For critical academic or professional work, always verify Malay-language AI outputs against authoritative sources.
4. Historical and Knowledge Bias
AI does not have balanced knowledge across all subjects, time periods, and regions. Its knowledge reflects what has been written about, digitised, and published — predominantly in English.
4.1 The Dominant Lens
Western history, European timelines, American politics, and major global events receive heavy coverage in training data. This creates a systematic bias:
- Middle Eastern Islamic history: Reasonable coverage, largely through Western academic scholarship.
- Southeast Asian Islamic history: Thin coverage compared to Middle Eastern traditions.
- Malay sultanate intellectual traditions: Very thin — limited digitized sources in English.
- Local Sufi traditions in Nusantara: Barely present in training data.
- Colonial-era Jawi manuscripts: Almost entirely absent.
4.2 The Ibn Arabi Example
To illustrate: an AI can discuss Muhyiddin Ibn Arabi (1165–1240 CE) with considerable depth — his concept of Wahdat al-Wujud (Unity of Existence), the Fusus al-Hikam, the Futūhāt al-Makkiyya. This might suggest balanced coverage of Islamic intellectual history.
It does not. Ibn Arabi is extensively covered because Western academics — scholars like William Chittick, Henry Corbin, and Toshihiko Izutsu — produced substantial English-language scholarship about his work. The AI’s knowledge of Ibn Arabi comes largely through the Western academic lens, not from balanced access to the broader Islamic scholarly tradition.
Scholars who were equally significant within the Islamic tradition but did not attract Western academic attention are far less represented. The AI knows what crossed into the Western academic gaze. The rest is largely invisible to it.
The AI’s depth on a topic is not proof of balanced coverage. It is proof that the topic was written about extensively in the languages and sources that dominated training data.
5. When to Trust AI — and When Not To
Not all AI outputs carry the same risk. The key is understanding which types of outputs are self-verifying and which require external validation.
5.1 The Trust Spectrum
| Category | Reliability | Why | Your Action |
|---|---|---|---|
| Code | HIGH — Verifiable | You run it: syntax and runtime errors surface immediately. Logic errors can still pass silently, so correctness requires test cases. | Test it against cases you can check by hand. The compiler does not care about bias. |
| Mathematical Calculations | HIGH — Verifiable | Results can be checked against known formulas and validated numerically. | Verify key steps. Cross-check with hand calculations. |
| Document Formatting & Structure | HIGH — Visible | You can see the output directly and verify it matches your requirements. | Review the output visually. |
| Technical Explanations | MODERATE — Plausible | Often correct in principle but may contain subtle errors, outdated information, or oversimplifications. | Cross-reference with textbooks or peer-reviewed sources. |
| Scientific Claims & Facts | MODERATE TO LOW | The AI blends sources of varying quality into one confident tone. Peer-reviewed findings and blog posts sound identical. | Always verify against primary sources. Never cite AI directly. |
| Citations & References | LOW — Unreliable | AI frequently generates plausible-sounding references that do not exist (hallucination). | Verify EVERY citation manually. Check DOIs, journal names, and author names. |
| Current Events & Dates | LOW — Time-limited | Training data has a cutoff date. Information may be outdated without the AI indicating this. | Check official or recent sources for anything time-sensitive. |
| Historical / Cultural Knowledge | VARIABLE | Depends heavily on whether the topic was well-represented in English-language training data. | Be especially cautious with non-Western, regional, or minority cultural topics. |
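As one example of the "verify key steps" discipline for code and calculations, the sketch below imagines an AI assistant has produced a cantilever tip-deflection function (δ = FL³/3EI, the standard small-deflection result for an end load) and cross-checks it against a hand calculation. The function name and numerical values are ours, chosen for illustration.

```python
# Sketch of the verify-before-trust loop for AI-generated code.
# Suppose an AI assistant suggested this function; we do not accept it
# until it reproduces a case we have worked by hand.

def tip_deflection(F, L, E, I):
    """Tip deflection of a cantilever under end load F (small-deflection theory)."""
    return F * L**3 / (3 * E * I)

# Hand calculation for one known case:
# F = 1 kN, L = 2 m, E = 200 GPa, I = 8e-6 m^4
# delta = 1000 * 8 / (3 * 200e9 * 8e-6) = 8000 / 4.8e6 ≈ 1.667e-3 m
hand_value = 1.667e-3
computed = tip_deflection(1000, 2.0, 200e9, 8e-6)
assert abs(computed - hand_value) / hand_value < 0.01  # agrees within 1%
```

If the assertion fails, the AI's code is wrong regardless of how confidently it was presented; this instant, objective feedback is exactly why code sits at the high end of the trust spectrum.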
6. Practical Guidelines for Students
6.1 The Golden Rule
Use AI as a tool, not as an oracle. It accelerates your work. It does not replace your judgment.
6.2 Do’s
- Use AI for drafting and structuring: Let it help you organise reports, generate outlines, and structure arguments. Then refine with your own knowledge.
- Use AI for code assistance: Syntax help, debugging, boilerplate generation — these are where AI excels because the output is immediately testable.
- Use AI for learning: Ask it to explain concepts in different ways. But verify the explanations against your lecture notes and textbooks.
- Use AI for brainstorming: Generate ideas, explore approaches, and consider angles you might not have thought of.
- Use AI for language improvement: Grammar checking, rephrasing, and translation assistance — but review cultural nuance yourself.
6.3 Don’ts
- Never submit AI-generated text as your own work without understanding and verifying every claim in it. Plagiarism aside, you are responsible for factual accuracy.
- Never trust AI citations blindly: Verify every reference. AI hallucination of citations is a documented and common problem.
- Never assume AI is current: Training data has a cutoff. For recent developments, regulations, or statistics, use official sources.
- Never use AI as your sole source for scientific claims: AI sounds confident regardless of whether its information is accurate, outdated, or fabricated.
- Never assume equal quality across languages: AI performance degrades significantly outside of English and a few other high-resource languages.
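For citation checking, a first-pass script can at least flag malformed DOIs before you verify each entry manually. This is a minimal sketch under a loud caveat: a well-formed DOI proves nothing by itself, since hallucinated citations often carry plausible-looking DOIs. Every reference must still be resolved at doi.org or on the journal site. The example DOI string below is invented for illustration.

```python
# First-pass sanity check on AI-supplied references.
# Passing this check does NOT mean the reference exists — it only
# filters strings that cannot be DOIs at all.
import re

# General DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(doi: str) -> bool:
    """Return True if the string matches the general DOI shape."""
    return bool(DOI_PATTERN.match(doi.strip()))

assert looks_like_doi("10.1016/j.example.2020.105678")  # invented, plausible shape
assert not looks_like_doi("not-a-doi")                  # rejected outright
```

Treat a shape-valid DOI only as "worth checking", never as "verified".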
7. The Bigger Picture
7.1 Who Controls the Knowledge System
AI systems are built by a small number of technology companies, predominantly based in the United States. The training data reflects what exists on the English-language internet and in digitised Western publications. This creates an inherent power asymmetry:
- Knowledge that exists in digitised, English-language form is amplified.
- Knowledge that exists in oral traditions, local languages, or non-digitised formats is invisible.
- The framing of historical events, cultural narratives, and scientific priorities reflects the dominant data sources.
This is not a conspiracy — it is a structural consequence of how these systems are built. But it is something you should be conscious of, particularly when working with topics related to Malaysian history, Islamic scholarship, Southeast Asian engineering standards, or any domain where local knowledge differs from global (Western) defaults.
7.2 AI as an Engineering Tool
As engineering students, you are trained to understand your instruments. You learn the accuracy, precision, range, and limitations of every measurement device you use. You would never trust a sensor reading without understanding its error margins and calibration status.
Apply the same discipline to AI. Understand what it is, where its data comes from, what it is good at, and where it fails. Use it for what it does well. Verify everything else.
Treat AI the way you treat any instrument in your lab: powerful when used correctly, dangerous when used blindly.
8. Summary
The following points summarise the key takeaways from this document:
| # | Key Takeaway |
|---|---|
| 1 | AI does not retrieve facts from a database. It generates responses by predicting word patterns based on training data. |
| 2 | Training data comes primarily from the internet, books, academic papers, and code. The exact composition is not publicly disclosed. |
| 3 | There is no "truth level" dial. AI behaviour is shaped by design choices, but the AI cannot see past its own boundaries. |
| 4 | Language coverage is deeply unequal. English dominates. Bahasa Malaysia and many other languages receive significantly less representation. |
| 5 | Historical and cultural knowledge is biased toward what was written about in English. Topics outside the Western academic gaze are underrepresented. |
| 6 | Code outputs are verifiable and reliable. Scientific claims, citations, and factual assertions require independent verification. |
| 7 | AI sounds equally confident whether it is correct, wrong, or fabricating. Your critical thinking is the only safeguard. |
| 8 | Use AI as an engineering tool: understand its specifications, calibrate your trust, and never accept output without verification. |
"The person who knows the limitation of a tool is wiser than the person who only knows how to use it."
End of Document