Thursday, 05 January 2017

Session 13 Paper: Architecture

NAME : DEVI SETYAWATI
STUDENT ID (NIM) : 41816310004
COURSE : E-BUSINESS ARCHITECTURE AND MANAGEMENT
LECTURER : Yulius Eka Agung Saputra, SE, M.Si
ROOM : D2-209


Basic Search Engine Optimization Techniques

  1. Determine what keywords you want to appear in SE results (this requires some research and analysis).
  2. Understand how search engine spiders work.
  3. Understand how search engines crawl and compile data on the Web and know what documents (html files) relate to which keywords and phrases.
Basically, search engines collect data about a unique Web site by sending an electronic spider to visit the site and copy its content which is stored in the search engine’s database. Generally known as ‘bots’ (robots), these spiders are designed to follow links from one document to the next. As they copy and assimilate content from one document, they record links and send other bots to make copies of content on those linked documents. This process continues ad infinitum. By sending out spiders and collecting information 24/7, the major search engines have established databases that measure their size in the tens of billions.
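
The link-following behaviour described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular search engine works: the "web" here is a hypothetical in-memory dictionary (`TOY_WEB`), and the URLs and the `crawl` function are invented for the example. A real spider would fetch pages over HTTP and store their content.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/home": ["a.example/about", "b.example/post"],
    "a.example/about": ["a.example/home"],
    "b.example/post": ["c.example/page"],
    "c.example/page": [],
    "d.example/orphan": ["a.example/home"],  # no page links to this one
}


def crawl(start_url, web):
    """Breadth-first crawl: record a page, then queue every unseen link on it."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # stands in for "copy the content into the database"
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("a.example/home", TOY_WEB)
print(visited)
```

In this toy graph the page `d.example/orphan` is never visited because no other page links to it, which mirrors the point that a spider finds a site only if a link to it exists somewhere.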

As spiders follow links and record everything in their paths, one can safely assume that if a link to a site exists, a spider will find that site. Webmasters and SEOs no longer need to manually or electronically submit their sites to the major search engines. The search spiders are perfectly capable of finding them on their own, provided a link to that site exists somewhere on the web. Google and Yahoo both have an uncanny ability to judge the topic or theme of documents they are examining, and use that ability to judge the topical relationship of documents that are linked together. The most valuable incoming links (and the only ones worth pursuing) come from sites that share topical themes. Offering spiders access to the areas of the site one wants them to access is half the battle. The other half is found in the site content. Search engines are supposed to provide their users with lists of documents that relate to the user's query.

After the URL of a site, there are four basic elements a search engine looks at when examining a document:

    1. the title of the page (the title tag)
    2. the Description meta tag
    3. the Keywords meta tag
    4. keywords in text and especially in the anchor text used in internal links

Page Titles should be written using the strongest keyword targets as the foundation. Some titles are written using two or three basic two-keyword phrases. A key to writing a good title is to remember that human readers will see the title as the reference link on the search engine results page (followed by the description).

The Description Meta tag is also fairly important. Search engines tend to use it to gather information on the topic or theme of the document. A well written Description is phrased in two or three complete sentences with the strongest keyword phrases woven early into each sentence. As with the title tag, some search engines will display the Description on the search results pages, generally using it in whole or in part to provide the text that appears under the reference link. Some search engines place minor weight on the Keywords Meta tag.
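
To make the three head elements concrete, the following sketch reads the title, Description and Keywords values from a page using Python's standard `html.parser` module. The sample page, the class name `HeadElementParser` and the function `extract_head_elements` are assumptions made for this illustration. It shows only how the values can be read, not how any search engine weights them.

```python
from html.parser import HTMLParser

# Hypothetical page used only for this illustration.
SAMPLE_PAGE = """<html><head>
<title>Blue Widgets | Example Shop</title>
<meta name="description" content="Hand-made blue widgets. Order blue widgets online.">
<meta name="keywords" content="blue widgets, widgets">
</head><body><p>Body text about blue widgets.</p></body></html>"""


class HeadElementParser(HTMLParser):
    """Collects the title text and the description/keywords meta values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.found = {"title": "", "description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.found[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self.in_title:
            self.found["title"] += data


def extract_head_elements(html):
    parser = HeadElementParser()
    parser.feed(html)
    parser.close()
    return parser.found


head = extract_head_elements(SAMPLE_PAGE)
print(head)
```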

Good content is the most important aspect of search engine optimization. The easiest and most basic rule of the trade is that search engine spiders can be relied upon to read basic body text 100% of the time. By providing a search engine spider with basic text content, SEOs offer the engines information in the easiest format for them to read. While some search engines can strip text and link content from Flash files, nothing beats basic body text when it comes to providing information to the spiders. Very good SEOs can almost always find a way to work basic body text into a site without compromising the designer’s intended look, feel and functionality.

The content itself should be thematically focused. In other words, keep it simple. Some documents cover multiple topics on each page, which is confusing for spiders and SEOs alike. The basic SEO rule here is if you need to express more than one topic on a page, you need more pages. Fortunately, creating new pages with unique topic-focused content is one of the most basic SEO techniques, making a site simpler for both live-users and electronic spiders. An important caveat is to avoid duplicate content and the temptation to construct doorway pages specifically designed for search placements.

Purpose - To explore the capabilities and limitations of blog search engines.
Design/methodology/approach - First, we describe the features of a range of current blog search engines. Second, we discuss and illustrate with examples the reliability and coverage limitations of blog searching.
Findings - Although blog searching is a useful new technique, the results are sensitive to the choice of search engine, the parameters used and the date of the search. The quantity of spam also varies by search engine and search type.
Research limitations/implications - The results illustrate blog search evaluation methods and do not use a full-scale scientific experiment.
Originality/value - Blog searching is a new technique, and one that is significantly different to web searching. Hence information professionals need to understand its strengths and weaknesses.
INTRODUCTION
The information sources available to librarians and other information professionals have expanded from the traditional shelves of books to a plethora of online repositories. In parallel, information retrieval techniques have developed from the card index system to keyword searching and the advanced Boolean interfaces available for the typical digital library and web search engines. Information professionals need to keep track of the new information sources and technologies, understanding what is available, how to access it, and how to interpret or evaluate the results.

Among these new sources, blog searching is one of the most unusual. Blogs are mini web sites containing entries in reverse chronological order. They are often updated daily or weekly and frequently take the form of a personal diary (Herring, Scheidt, Bonus, & Wright, 2004), a specialist information resource (e.g., theshiftedlibrarian.com) or a political commentary (Trammell & Keshelashvili, 2005). Although a few ‘A-list’ blogs are relatively authoritative, with readerships of hundreds of thousands for their timely political or technological commentaries (Trammell & Keshelashvili, 2005), the majority of blogs carry little authority and the content of most is probably trivial, or crass and opinionated (Weiss, 2004). Hence, from a traditional librarian’s perspective, blogs seem an information source to be mostly avoided. A follower of blogs may perhaps visit those of friends and a few trustworthy information blogs (Bar-Ilan, 2005) for professional or leisure interests, but would probably have little cause to use a general blog search engine such as blogsearch.google.com. Nevertheless, blogs do contain information that can be of value in some cases, such as for public opinion insights (Gruhl, Guha, Kumar, Novak, & Tomkins, 2005).
If a researcher is not looking for a specific fact or theory but is interested in attitudes or opinions towards an event or topic, then an appropriate blog search may well yield a set of relevant postings by a variety of individual bloggers. Hence understanding the potential of blog searching is (yet another) capability that information professionals may benefit from mastering.

The advertising industry has already recognised the potential of blogs and other ‘consumer-generated media’ (CGM) to gain insights into consumer opinions (Pikas, 2005). For example, Nielsen BuzzMetrics’ BrandPulse will track mentions of a company’s brand name online (http://www.nielsenbuzzmetrics.com/brandpulse.asp), and IBM and Microsoft (Gamon, Aue, Corston-Oliver, & Ringger, 2005; Gruhl, Guha, Liben-Nowell, & Tomkins, 2004) have similar projects to extract users’ opinions or comments from large quantities of comments. There are two main issues here. First, continually monitoring online sources allows trends and changes to be identified. For instance, a company may wish to know how a particular advertising campaign or news story has changed perceptions of its brand or products. Second, this is a passive activity. Consumers are not interviewed or sent a survey but are indirectly canvassed via their perhaps throwaway comments in blogs or email discussion lists. The unique advantage of this is that retrospective opinions can be sought even about unexpected events.

A few blog search engines provide a trend graph function for this purpose, typically reporting the daily proportion of blog postings that match the query. Any noticeable peak in such a graph may represent a burst of discussion around a specific topic. The debate can then typically be found by clicking on the peak in the graph, which produces a list of the posts on that day matching the search.

Although blog search engines have existed since at least 2001 with DayPop and have already been described briefly by various librarians (Bradley, 2003; Curling, 2001; Notess, 2002), their increasing power and an expanding blogspace make them more relevant now than ever before. In this paper we describe the capabilities of some common blog search engines and present an illustrative analysis of the reliability and coverage of their results. The purpose of these is not to give definitive information in either case, because rapid change seems likely, but to illustrate the types of blog search capabilities that are available and their likely shortcomings.

Blog Search Engines
Blog search engines are similar to web search engines like Google in that they automatically gather large quantities of information from the web and give a free interface to allow the public to search their databases. The main difference between the two is that blog search engines mainly index blogs and ignore the rest of the web. The special features of blogs give blog search engines some specific and unique attributes. First, since each blog posting is dated, blog search engines can report the date at which the posting was created. For normal web pages, search engines can only report the last updated date, and this is often not very reliable. Second, many blog search engines have a date-specific search capability. Again, some general search engines have this as an advanced search option, but only for the last modified date of pages.
Although blogs are web sites and hence use standard HyperText Markup Language (HTML) for their construction, blog search engines are designed differently to general search engines in order to take advantage of blog structures. The core of any blog is the list of individual blog postings, but these are typically presented to the blog visitor in a range of different formats. A blog search engine will try to understand the format of a blog and dissect and store just the individual blog postings, ignoring all the grouped pages. This operation needs to be coded for each blog format, so it is quite labour-intensive for computer programmers. A corollary is that blog search engines probably index only the most common blog formats and ignore minor or one-off formats, and that it is difficult to understand and process the format of blogs in foreign languages. There is, however, a fallback mechanism: the Rich Site Summary (RSS) format (Hammersley, 2005; Notess, 2002). This is a technology used by a minority of blogs to deliver their individual most recent postings to users. The standard format of RSS means that it is easy to process, and there is often no need to understand the language of a blog to correctly process its RSS feed. In summary, a typical blog search engine is likely to be constructed using a combination of comprehensive indexing of common blog formats and RSS feed processing as a fallback.
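
The point that a standard feed format is easy to process, whatever the language of the blog, can be illustrated with a short sketch. It assumes an RSS 2.0 style feed with `item`, `title`, `link` and `pubDate` elements. The sample feed, its URLs and the function name `parse_posts` are invented for the example, and the sketch does not describe how any of the engines discussed here actually ingest feeds.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed; the element names follow the common RSS 2.0 layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example/first</link>
      <pubDate>Tue, 11 Jul 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example/second</link>
      <pubDate>Wed, 12 Jul 2006 18:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_posts(feed_xml):
    """Return one dictionary per <item>, with the posting date parsed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return posts


posts = parse_posts(SAMPLE_FEED)
for post in posts:
    print(post["published"].date(), post["title"], post["link"])
```

Because each posting carries its own date, the same structure also supports the date-specific searching described earlier in this section.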

Table 1. Blog search engines (August 2006).

Search engine | URL | Content | Other
Bloglines | http://www.bloglines.com | Posts or feeds or others | Can add extra entries to the search options
Feedster | http://www.feedster.com | Blogs or news or podcasts or all | No boxes for search preferences - need syntax, instructions on site
Technorati | http://www.technorati.com | Posts or tags or blog directory | -
Icerocket | http://www.icerocket.com | Blogs or several other things | -
Blogdigger | http://www.blogdigger.com | Blogs | No instructions/help, just search box
Blogpulse | http://www.blogpulse.com | Blogs | -
A9 | http://a9.com/ | Blogs or several other things | Uses IceRocket search
Findory Blogs | http://www.findory.com/blogs | Blogs or News or Video or Podcasts or Web | Just a search box - no advanced preferences or instructions/help
Google Blog Search | http://blogsearch.google.com | Posts | -
BlogSearch-Engine | http://www.blogsearchengine.com | Blogs or moblogs | “Powered by” IceRocket
Bloogz | http://www.bloogz.com | Blogs | Can search blogs or URLs, not both at once
Gigablast | http://blogs.gigablast.com | Blogs or several other things | Also site clustering, summary excerpts, site restriction
Sphere | http://www.sphere.com | Blogs | -


Table 2 summarises the available advanced search facilities, including Boolean searches, language specific searches and word location limits (e.g., author/title/body). It is clear from the table that a variable range of capabilities is offered, with no engine being comprehensive.


Table 2. Blog search engine capabilities (August 2006).

Search engine | Boolean search | Date search | URL search | Time limits | Language selection | Word location | #Results selection | Sort choice
Bloglines | Partial | Yes | No | 2001 | Yes | Yes | 10,20,30,50,100 | Yes
Feedster | Full | Yes | Yes | No | No | Yes | No | Yes
Technorati | Full | No | Yes | No | No | No | No | No
Icerocket | Full | Yes | No | No | No | Yes | No | No
Blogdigger | Full | No | No | No | No | No | No | Yes
Blogpulse | Full | Yes | Yes | 180 days | No | No | 10,25,50 | Yes
Findory Blogs | Partial | No | No | No | No | No | No | No
Google Blog Search | Full | Yes | Yes | 2000 | Yes | Yes | 10,20,30,50,100 | No
Bloogz | Partial | No | Yes | No | Yes | No | No | Yes
Gigablast | Full | No | Yes | No | Yes | No | 10,20,30,50,100 | No
Sphere | Full | Yes | No | 4 mths. | Yes | Yes | No | Yes

Producing a trend graph for a query and looking for spikes in the graph is a good way of discovering relevant recent events. Below is a list of blog trend graph capabilities.
  • Blogpulse (submit a query and click on “trend this”): Graphs of the percentage of postings daily matching a query for the most recent 6 months. Can produce 3 simultaneous graphs and clicking on the graph gives a list of postings from the selected date.
  • Technorati (submit a query and click on the mini-graph): Graphs the total volume of postings daily for up to the most recent 360 days. A small Technorati graph can be added to a user’s web site.
  • IceRocket (submit a query and click on “trend it”): Graphs of the percentage of postings daily matching a query for the most recent 3 months. Can produce 3 simultaneous graphs.
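
As a rough illustration of what such a trend graph plots, the sketch below computes the daily percentage of postings matching a query and picks out the day with the highest value. The postings, the simple substring match and the function names `daily_match_percentage` and `peak_day` are all assumptions made for this example. The engines listed above do not publish their methods here.

```python
from collections import Counter
from datetime import date

# Hypothetical postings: (posting date, text).
POSTS = [
    (date(2006, 7, 10), "Quiet day at the library"),
    (date(2006, 7, 10), "Holiday photos"),
    (date(2006, 7, 10), "New recipe"),
    (date(2006, 7, 10), "Football results"),
    (date(2006, 7, 11), "Library funding cut announced"),
    (date(2006, 7, 11), "Protest at the city library"),
    (date(2006, 7, 11), "Why the library matters"),
    (date(2006, 7, 11), "Weather report"),
    (date(2006, 7, 12), "Library debate continues"),
    (date(2006, 7, 12), "Gardening tips"),
]


def daily_match_percentage(posts, query):
    """Percentage of each day's postings whose text contains the query."""
    totals = Counter()
    matches = Counter()
    needle = query.lower()
    for day, text in posts:
        totals[day] += 1
        if needle in text.lower():
            matches[day] += 1
    return {day: 100.0 * matches[day] / totals[day] for day in sorted(totals)}


def peak_day(trend):
    """Day with the highest matching percentage (a candidate discussion burst)."""
    return max(trend, key=trend.get)


trend = daily_match_percentage(POSTS, "library")
for day, percent in trend.items():
    print(day, f"{percent:.0f}%")
print("peak:", peak_day(trend))
```

Reporting a percentage rather than a raw count keeps days with unusually many or few postings comparable.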
Evaluation: Reliability and Coverage
Research into general search engines has shown that their coverage and reliability are imperfect (Bar-Ilan, 1999; Bar-Ilan & Peritz, 2004; Jasco, 2006; Lawrence & Giles, 1999; Mettrop & Nieuwenhuysen, 2001; Rousseau, 1999). The problems include differences in the results reported between search engines and even by the same search engine over time. In addition, different search engines can report different sets of results and rank their results in different ways. Hence it is logical to assume that the same would be true for blog search engines.

Coverage (results)

It is not possible to precisely describe the coverage of blog search engines. There is no single source of blog URLs, so each search engine probably has a different set of blog URLs and uses a different ad-hoc method to find new blogs. For example, methods to find new blogs include following links in existing blogs and automatically identifying blogs in a general crawl of the web (e.g., Google could do this). In addition, some search engines may collect blog data indirectly via RSS feeds.

Table 3 summarises the results for the four query words below, excluding the search engines in Table 1 that used IceRocket results.
  • Book (very common word)
  • Librarian (medium-usage word)
  • Timbuktu (low-usage word)
  • Citedness (rare word)

Table 3. The total number of hits reported in each search engine.
Search engine | book | librarian | Timbuktu | citedness
Google Blog Search (beta)* | 15,252,764 | 1,662 | 411 | 11
Technorati* | 11,048,316 | 151,474 | 12,497 | 32
Bloglines* | 5,486,000 | 191,600 | 6,930 | 27
IceRocket | 4,449,856 | 63,755 | 4,683 | 3
BlogPulse | 2,990,010 | 46,179 | 2,905 | 3
Feedster* | 1,404,746 | 25,429 | 816 | 3**
Blogdigger | 687,025 | 24,480 | 547 | 6
Gigablast | 458,742 | 13,726 | 667 | 3
Sphere* | 357,020 | 9,071 | 672 | 3
Bloogz | 48,478 | 1,769 | 54 | 0
Findory Blogs | 2,159 | 282 | 1 | 0
*Numbers change between pages of results. **Using the “search further back” option.

Table 4. Results of time-specific queries: from July 11 to 12, 2006.
Search engine | book | librarian | Timbuktu | citedness
IceRocket | 38,552 | 609 | 37 | 0
BlogPulse | 33,983 | 542 | 34 | 0
Sphere* | 11,640 | 298 | 30 | 0
Bloglines | 1,420 | 33 | 2 | 0
Feedster | 153 | 2 | 0 | 0
Google Blog Search* | 95 | 100 | 60 | 0

The results shown in Tables 3 and 4 for each query suggest that the search engines’ effective database sizes are significantly different. In some cases the results are unreliable and vary significantly between different pages of the result set and also for the same query submitted at different times. Google’s results seem rather low in Table 4, perhaps because it is a beta (pre-release) version, or perhaps because it uses only a subset of its database for time-specific queries.

Coverage (language)
The uneven coverage of non-English words shown in Table 5 would be consistent with the search engines (perhaps with the exception of Google) developing language-specific strategies.


Table 5. Coverage of Google translations of the word ‘library’ in several languages.
Search engine | library | Biblioteca (Italian, Portuguese, Spanish) | Bibliothèque (French) | Bibliothek (German) | المكتبه (Arabic) | 図書館 (Japanese) | (Korean) | 图书 (Chinese simplified)
Google | 4024970 | 186662 | 45669 | 17992 | 89 | 248160 | 4991 | 105563
Technorati | 2634679 | 193666 | 41424 | 22780 | 0 | 1161055 | 0 | 0
Bloglines | 2887000 | 7390 | 3750 | 2710 | 0 | 0 | 0 | 0
IceRocket | 1060191 | 48616 | 26 | 6684 | 141 | 431 | 861 | 4681
BlogPulse | 554482 | 23505 | 9505 | 2106 | 55 | 80690 | 0 | 143056
Feedster | 207112 | 533 | 99 | 91 | 2 | 247 | 1 | 126
Blogdigger | 175926 | 3935 | 2451 | 2358 | 0 | 1081 | 0 | 1131
Gigablast | 103760 | 3458 | 1809 | 662 | 0 | 231 | 6 | 992
Sphere | 83506 | 6810 | 251 | 390 | 2 | 7 | 1 | 5
Bloogz* | 16175 | 442 | 0 | 285 | 0 | 0 | 0 | 0
Findory Blogs* | 991 | 1 | 0 | 1 | 0 | 0 | 0 | 0

Coverage (bloggers)

Blogger demographics are an important issue for those wishing to know about the opinions of bloggers or to use blog searches for public opinion or trend identification. It is clear that bloggers are not typical citizens of the world.
Presumably blog search engines, like general search engines (Chakrabarti, 2003), identify new blogs to index by following links from known blogs, so that they tend to cover the more popular blogs and would not have an explicitly biased policy for the kind of blogs indexed. For example, if search results contain mainly right-wing blogs then this is unlikely to be the result of a coverage policy decision.

Internal Consistency

The results reported by general search engines have been shown to be internally inconsistent, in the sense that the same query may yield significantly different results when repeated a short while later (Mettrop & Nieuwenhuysen, 2001). Moreover, different numbers may be reported on different results pages. For blogs, an additional factor is that the total number of matches for a particular day may vary over time if results from spam blogs are removed, as the spam blogs are identified, or if additional blog postings are subsequently found (e.g., in a previously unknown blog).

The number of results reported by a search engine for a query may vary for two reasons. First, the search engine may perform the initial search over only a fraction of its database and then guess at the total number of results in the full database. Second, the search engine may perform the systematic elimination of duplicates or near-duplicates on a page-by-page basis, using the results to predict the total number of valid matches. This second reason explains why the order in which the results are sorted can have an impact on the apparent total number of results.
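
The first reason can be made concrete with a toy calculation. The sketch below assumes a hypothetical database of 1,000 documents in which matching documents are unevenly distributed, searches only the first part of it, and scales the count up to the full size. The database layout and the function `estimate_total` are invented for the illustration and are not taken from any search engine.

```python
# Hypothetical database of 1,000 documents, True where the document matches.
# Matches are deliberately uneven: dense in the first 100, sparse afterwards.
DATABASE = [
    (i % 2 == 0) if i < 100 else (i % 10 == 0)
    for i in range(1000)
]


def estimate_total(database, sample_size):
    """Search only the first sample_size documents, then extrapolate."""
    matches_in_sample = sum(database[:sample_size])
    return round(matches_in_sample * len(database) / sample_size)


for sample_size in (100, 500, 1000):
    print(sample_size, estimate_total(DATABASE, sample_size))
```

The three sample sizes give three different totals for the same query, and only the full scan gives the true count. This is one way that a reported total can shift between requests without the underlying database changing.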

Table 6 illustrates some of these issues. Most of the search engines report small changes in the search results when moving between different pages. Google Blog Search gives more significant changes, and previous experience has shown that it can sometimes give radically different results depending upon the order in which the results are sorted, and that the total number of results can change dramatically when a lot of spam is involved. Table 7 illustrates this phenomenon with some sample queries. In addition, simply pressing the refresh button in Google sometimes changes the results.
Table 6. Blog search engine result changes for the query “Library”.

Search engine | Page 1 | Page 2 | Page 10 | Page 1 again
Google | 3988495 | 3988376 | 4011573 | 4016487
Technorati | 2634843 | 2634888 | 2634888 | 2634888
Bloglines | 2888000 | 2888000 | 2888000 | 2888000
IceRocket | 1060317 | 1060321 | 1060321 | 1060321
BlogPulse | 554524 | 554524 | 554524 | 554524
Feedster | 207188 | 207205 | 207206 | 207206
Blogdigger | 175926 | 175926 | 175926 | 175926
Gigablast | 103760 | 103760 | 103760 | 103760
Sphere | 83506 | 83504 | 83503 | 83492
Bloogz* | 16175 | 16175 | 16175 | 16175
Findory Blogs* | 991 | 991 | 991 | 991
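
One simple way to compare the size of these changes across engines is to express the gap between the largest and smallest reported count as a percentage of the smallest. This "relative spread" is a measure chosen here for illustration only and is not used in the source. The figures are copied from four rows of Table 6, and the function name is an assumption.

```python
# Hit counts copied from Table 6 (Page 1, Page 2, Page 10, Page 1 again).
TABLE_6 = {
    "Google": [3988495, 3988376, 4011573, 4016487],
    "Technorati": [2634843, 2634888, 2634888, 2634888],
    "Bloglines": [2888000, 2888000, 2888000, 2888000],
    "Sphere": [83506, 83504, 83503, 83492],
}


def relative_spread_percent(counts):
    """(largest - smallest) / smallest, expressed as a percentage."""
    return 100.0 * (max(counts) - min(counts)) / min(counts)


for engine, counts in TABLE_6.items():
    print(f"{engine}: {relative_spread_percent(counts):.4f}%")
```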

Spam

Although spam does not seem to have attracted attention in cybermetrics research, it is an issue for blog search engine research (e.g., Han, Ahn, Moon, & Jeong, 2006; Narisawa, Yamada, Ikeda, & Takeda, 2006) because blog spam is prevalent. Spam blogs may be identified automatically or manually, and the different search engines may have differing levels of success in identifying and removing them. Table 8 reports some results of manual spam blog counting in some search engines. The relatively low quantity of spam is reassuring and in contrast to our earlier experience with news-related blog searching, which typically produced 50%-90% spam results from fake news blogs.

Table 8. Spam blog/non-blog results in the first 100 search matches.
Spam/Non-blogs | BlogPulse | Google | IceRocket
Book | 8/0 | 0/29 | 6/11
Librarian | 4/2 | 1/19 | 9/5
Timbuktu* | 2/5 | 7/35 | 11/5
citedness | 0/1 (3 hits) | 1/2 (11 hits) | 0/0 (3 hits)
*Noticeable repetition + non-English blogs

Table 7. Google blog search results for different pages and sort options.
Query | Page 1 date-sorted/relevance-sorted | Page 2 date-sorted/relevance-sorted | Page 3 date-sorted/relevance-sorted | Page 4 date-sorted/relevance-sorted
Book | 22,282,559/26,180,926 | 15,436,952/25,154,932 | 22,476,958/42,241,386 | 26,926,910/25,154,118
Librarian | 1,284/1,247 | 2,116/2,173 | 2,923/3,044 | 3,500/3,900
Timbuktu | 15,915/15,902 | 15,881/15,839 | 15,837/15,787 | 15,741/15,723
citedness | 11/11 | 11/11 | - | -

Precision

General search engines sometimes seem to make mistakes: i.e. returning pages not matching the query term. This may be because the page has changed between indexing and the time of the query. This should not happen for blogs, or only rarely, because blog postings tend not to be modified after being posted. A related issue is stemming – some information retrieval systems automatically stem words before matching them.
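
To show why stemming affects what counts as a match, the sketch below uses a deliberately naive suffix-stripping rule. It is a toy written for this illustration, not the Porter stemmer or the method of any engine discussed here, and the function names `naive_stem` and `matches` are assumptions.

```python
def naive_stem(word):
    """Toy stemmer: strip one common English suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, text, stem):
    """True if the query word occurs in the text, optionally after stemming."""
    tokens = text.lower().split()
    if stem:
        return naive_stem(query) in {naive_stem(token) for token in tokens}
    return query.lower() in tokens


posting = "Two librarians met"
print(matches("librarian", posting, stem=False))
print(matches("librarian", posting, stem=True))
```

With stemming switched on, a query for "librarian" also matches a posting that contains only "librarians". An evaluator checking precision by eye could therefore mistake such a result for an error.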

Overlaps and Ranking

Web search engines generally list results in order of decreasing relevance so that the most useful pages or sites are in the first few results. The ranking of web pages is typically performed using a combination of the text in a page and the number of links pointing to the page or site (Brin & Page, 1998; Chakrabarti, 2003). Hence the top results of search engines tend to overlap somewhat – there are online tools to explore this phenomenon (Jasco, 2005). Blog search engines, in contrast, seem not to rank results using links but present them by default in reverse chronological order, assuming that the searcher will be more interested in currency than relevance or authority. It seems unlikely that blog search engines will have a large overlap in results since the most recent posts will depend upon the blog checking order, which will vary by search engine (see Lewandowski, Wahlig, & Meyer-Bautor, 2006). A large overlap could only be expected for queries with few results and only if blog search engine databases significantly overlap.
We compared the top 50 results for the query ‘librarian’ in Google and BlogPulse, finding no overlaps at all, despite both reporting recent results first. We constructed a rare query “library of Timbuktu” to measure precise overlaps, illustrating the results for the biggest engines in Table 9. In addition, Bloogz found 3 results (1 overlap with Technorati); Sphere found the same result as IceRocket; Gigablast found 1 (unique) article; Blogdigger found 7 (3 overlapping with other engines) and Feedster does not allow phrase searches. Overall, it seems that there is a low degree of overlap between the search engines.

Table 9. Overlaps between search engines for the query “library of Timbuktu”.
Overlap | Google | Technorati | Bloglines | IceRocket | BlogPulse
Google (6 matches) | - | 1 | 3 | 0 | 0
Technorati (10) | 1 | - | 1 | 1 | 0
Bloglines (4) | 3 | 1 | - | 0 | 0
IceRocket (1) | 0 | 1 | 0 | - | 0
BlogPulse (2) | 0 | 0 | 0 | 0 | -
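
An overlap table of this kind can be produced by intersecting the sets of result URLs returned by each engine. The sketch below assumes the results have already been collected as sets. The engine names, URLs and the function `pairwise_overlap` are hypothetical and do not reproduce the data behind Table 9.

```python
from itertools import combinations

# Hypothetical result URLs per engine for one rare query.
RESULTS = {
    "EngineA": {"blog1.example/p1", "blog2.example/p7", "blog3.example/p2"},
    "EngineB": {"blog2.example/p7", "blog4.example/p9"},
    "EngineC": {"blog5.example/p3"},
}


def pairwise_overlap(results):
    """Number of URLs shared by each pair of engines."""
    return {
        (first, second): len(results[first] & results[second])
        for first, second in combinations(sorted(results), 2)
    }


overlaps = pairwise_overlap(RESULTS)
for (first, second), shared in overlaps.items():
    print(f"{first} / {second}: {shared} shared result(s)")
```

In practice the URLs would need normalising before comparison (for example, trailing slashes or tracking parameters), otherwise the same posting could be counted as two different results.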


CONCLUSIONS
Blog search engines are a source of new types of information, such as public opinion and expert commentaries. Based upon the experiments above, users should expect great variety between search engines and a lack of uniformity. Hence we make the following recommendations.
  • Try different search engines to find one with the most useful capabilities.
  • For low frequency queries a range of different search engines may be needed if one gives few results.
  • For non-English queries look for a blog search engine that gives good coverage of the language.

If the searches are to be used to predict public opinion or to use otherwise the total volume of hits for a query, then we make the following additional recommendations.
  • Don’t rely upon the “total results” estimates of most blog search engines but perform additional checking and use the results of several engines together.
  • Don’t assume that the results are unbiased by language or nation, or that bloggers are representative of the general population.
