This function is convenient when encoding a string to be used in the query part of a URL, as a handy way to pass variables to the next page.
Parameters
The string to be encoded.
Return values
Examples
Example #1 urlencode() example
Example #2 urlencode() and htmlentities() example
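The code for the two examples did not survive on this page, so here is a minimal sketch of what they describe. The script name mycgi and all variable values are assumptions made for illustration.

```php
<?php
// Hypothetical user input and parameter values, for illustration only.
$userinput = 'Tom & Jerry?';
$foo = 'a b';
$bar = 'c&d';

// Example #1 style: encode a single value placed after "=".
$link1 = '<a href="mycgi?foo=' . urlencode($userinput) . '">';

// Example #2 style: build the query string first, then escape it for HTML,
// so the "&" between parameters becomes "&amp;" in the markup.
$query_string = 'foo=' . urlencode($foo) . '&bar=' . urlencode($bar);
$link2 = '<a href="mycgi?' . htmlentities($query_string) . '">';

echo $link1, "\n", $link2, "\n";
```

The second form matters only when the URL is embedded in HTML; the URL itself still uses a plain "&".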
The urlencode and rawurlencode functions are mostly based on RFC 1738.
However, since 2005 the current standard for URIs has been RFC 3986.
Here is a function to encode URLs according to RFC 3986.
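The commenter's function is not reproduced on this page. As a rough sketch of the idea, and not the original code, a wrapper around rawurlencode() could look like this; the name rfc3986_encode is hypothetical.

```php
<?php
// Hypothetical helper: RFC 3986 style encoding on top of rawurlencode().
function rfc3986_encode(string $value): string
{
    // rawurlencode() leaves letters, digits, "-", "_" and "." alone (and "~"
    // on PHP >= 5.3). Restoring "~" gives the same result on older versions.
    return str_replace('%7E', '~', rawurlencode($value));
}

echo rfc3986_encode('a b~c/d'), "\n";
```

On current PHP versions the str_replace() call changes nothing, because rawurlencode() already leaves the tilde unencoded.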
I needed encoding and decoding for UTF8 urls, I came up with these very simple functions. Hope this helps!
Don’t use urlencode() or urldecode() if the text includes an email address, as it destroys the «+» character, a perfectly valid email address character.
Unless you’re certain that you won’t be encoding email addresses AND you need the readability provided by the non-standard «+» usage, instead always use rawurlencode() or rawurldecode().
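A small sketch of the problem described above, using a made-up address. The damage happens when text containing a literal «+» is run through urldecode().

```php
<?php
$email = 'first+tag@example.com'; // hypothetical address containing "+"

$decoded_plus = urldecode($email);    // "+" is read as an encoded space
$decoded_raw  = rawurldecode($email); // "+" is left alone

echo $decoded_plus, "\n", $decoded_raw, "\n";
```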
I needed a function in PHP to do the same job as the complete escape function in Javascript. I spent some time looking for one without finding it, so finally I decided to write my own code. So just to save time:
urlencode is useful when using certain URL shortener services.
The returned URL from the shortener may be truncated if not encoded. Ensure the URL is encoded before passing it to a shortener.
rawurlencode() does not encode «~» (tilde), while urlencode() does.
Below is our jsonform source code in MongoDB, which contains a lot of double quotes. We are able to pass this source code to the ajax form submit function by using PHP urlencode:
If you want to pass a URL with parameters as a value IN a URL AND through a JavaScript function, pass the URL value through the PHP urlencode() function twice.
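The original snippets are missing here, so the following is only a sketch of double encoding. The inner URL, the viewer.php script, and the openWindow JavaScript function are all hypothetical.

```php
<?php
// Hypothetical inner URL that must survive as a single parameter value.
$inner = 'http://example.com/page.php?a=1&b=2';

$once  = urlencode($inner); // safe as a value inside another URL
$twice = urlencode($once);  // survives one extra round of decoding

// Hypothetical link that hands the value to a JavaScript function.
$href = "javascript:openWindow('viewer.php?url=" . $twice . "')";

echo $once, "\n", $twice, "\n", $href, "\n";
```

Each layer that decodes the value removes one level of encoding, so two decodes return the original URL.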
However, some weird things happen when dealing with characters like ‼ ▐ ┐ and Θ (these are HTML entities).
If you try to pass one in Internet Explorer, IE will *disable* the submit button. Firefox, however, does something weirder: it will convert it to its HTML entity. It will display properly, but only when you don’t convert entities.
The point? Be careful with decorative characters.
This very simple function makes a valid parameters part of a URL. To me it looks like several of the other versions here are decoding wrongly, as they do not convert the & separating the variables into &amp;.
= ‘machine/generated/part’ ; $url_parameter1 = ‘this is a string’ ; $url_parameter2 = ‘special/weird «$characters»‘ ;
$link_label = «Click here & you’ll be » ;
In short:
- Use urlencode for all GET parameters (things that come after each «=»).
- Use rawurlencode for parts that come before «?».
- Use htmlspecialchars for HTML tag parameters and HTML text content.
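The three rules can be sketched together as follows. The path, parameter value, and link label are made up for illustration, and ENT_QUOTES is passed explicitly so the result does not depend on the PHP version's default flags.

```php
<?php
// Hypothetical values, for illustration only.
$path_segment = 'special offers';          // goes before the "?"
$param        = 'this is a string & more'; // goes after "="
$label        = 'Click here & see <more>'; // HTML text content

// rawurlencode() for the path, urlencode() for the GET parameter value.
$url = '/shop/' . rawurlencode($path_segment) . '?q=' . urlencode($param);

// htmlspecialchars() for the attribute value and for the text content.
$html = '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">'
      . htmlspecialchars($label, ENT_QUOTES) . '</a>';

echo $url, "\n", $html, "\n";
```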
look on index.php array (size=0) empty test-bla-bla-4%253E2-y-3%253C6
look on test-bla-bla-4%253E2-y-3%253C6 array (size=1) ‘token’ => string ‘bla-bla-4>2-y-3
Simple static class for array URL encoding
/** * * URL Encoding class * Use : urlencode_array::go() as function * */ class urlencode_array <
URL-encoding, also known as «percent-encoding», is a mechanism for encoding information in a Uniform Resource Identifier (URI). Although it is known as URL-encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such it is also used in the preparation of data of the «application/x-www-form-urlencoded» media type, as is often employed in the submission of HTML form data in HTTP requests.
Details of the URL-encoding
Types of URI characters
The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such special meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each new revision of specifications that govern URIs and URI schemes.
Other characters in a URI must be percent encoded.
Percent-encoding reserved characters
When a character from the reserved set (a «reserved character») has special meaning (a «reserved purpose») in a particular context and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character means converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits. The digits, preceded by a percent sign («%»), are then used in the URI in place of the reserved character. (For a non-ASCII character, it is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)
The reserved character «/», for example, if used in the «path» component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, «/» needs to be in a path segment, then the three characters «%2F» (or «%2f») must be used in the segment instead of a «/».
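The procedure described above (take the character's bytes, write each as a percent sign plus two hexadecimal digits) can be sketched in a few lines. The function name percent_encode_char is hypothetical; in practice rawurlencode() does this for you.

```php
<?php
// Hypothetical helper: percent-encode every byte of the given character.
function percent_encode_char(string $char): string
{
    $out = '';
    foreach (str_split($char) as $byte) {
        // "%" followed by the byte value as two uppercase hex digits.
        $out .= sprintf('%%%02X', ord($byte));
    }
    return $out;
}

echo percent_encode_char('/'), "\n";        // single ASCII byte
echo percent_encode_char("\xC3\xB1"), "\n"; // "ñ" as two UTF-8 bytes
```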
Reserved characters after percent-encoding
Character   Encoded
!           %21
#           %23
$           %24
&           %26
'           %27
(           %28
)           %29
*           %2A
+           %2B
,           %2C
/           %2F
:           %3A
;           %3B
=           %3D
?           %3F
@           %40
[           %5B
]           %5D
Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from other characters.
In the «query» component of a URI (the part after a «?» character), for example, «/» is still considered a reserved character but it normally has no reserved purpose (unless a particular URI scheme says otherwise). The character does not need to be percent-encoded when it has no reserved purpose.
URIs that differ only by whether a reserved character is percent-encoded or not are normally considered not equivalent (denoting the same resource) unless it is the case that the reserved characters in question have no reserved purpose. This determination is dependent upon the rules established for reserved characters by individual URI schemes.
Percent-encoding unreserved characters
Characters from the unreserved set never need to be percent-encoded.
URIs that differ only by whether an unreserved character is percent-encoded or not are equivalent by definition, but URI processors, in practice, may not always treat them equivalently. For example, URI consumers shouldn’t treat «%41» differently from «A» («%41» is the percent-encoding of «A») or «%7E» differently from «~», but some do. For maximum interoperability, URI producers are therefore discouraged from percent-encoding unreserved characters.
Percent-encoding the percent character
Because the percent («%») character serves as the indicator for percent-encoded octets, it must be percent-encoded as «%25» for that octet to be used as data within a URI.
Percent-encoding arbitrary data
Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often don’t, provide an explicit mapping between URI characters and all possible data values being represented by those characters.
Since the publication of RFC 1738 in 1994, it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above. Byte value 0F (hexadecimal), for example, should be represented by «%0F», but byte value 41 (hexadecimal) can be represented by «A», or «%41». The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred because it results in shorter URLs.
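The two byte values mentioned above can be checked directly with PHP's rawurlencode() and rawurldecode(); this is just an illustration of the rule, not part of the original text.

```php
<?php
$low     = rawurlencode("\x0F"); // not an unreserved character, so it is escaped
$letter  = rawurlencode("\x41"); // byte 0x41 is "A", which stays as-is
$decoded = rawurldecode('%41');  // the escaped form still decodes to "A"

echo $low, ' ', $letter, ' ', $decoded, "\n";
```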
The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web’s formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; many people assumed that characters and bytes mapped one-to-one and were interchangeable. However, the need to represent characters outside the ASCII range grew quickly and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities as well as difficulty interpreting URIs reliably.
For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to individual users to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
I finally found some time and decided to write this post, an idea I have had for a long while. It deals with something that seems simple, the URI, which gets surprisingly little detailed coverage on the Russian-language internet.
"Pfft, links are links, what is there to figure out?" you might say. Then let me ask a question:
Before starting, I would like to note that there is a post on a similar topic where everything is laid out more simply and a bit more clearly. My goal for this post is a deeper study of the subject and collecting information about URIs in one place, so as not to lose it. Well, almost in one place: the article will be split into two parts. For convenience, here is a table of contents, which works thanks to a URI feature that we will look at a bit later in this article.
Getting acquainted
1. URI
Uniform Resource Identifier, commonly known as URI. The most recent description of what these notorious URIs actually are dates all the way back to January 2005: RFC 3986, written by Tim Berners-Lee himself, the founding father of the web we all love. Summarizing section 1.1, the definition can be stated as follows:
Many of you have noticed that different resources call links either URLs or URIs, and have probably wondered which one is correct. The thing is that the URL appeared and was documented in 1990, while the URI was documented only in 1994. Until 2002, when RFC 3305 was published, both names were considered appropriate, which sometimes caused confusion. Section 2 of RFC 3305 states that the term URL is deprecated as applied to links and that URI is the correct name from then on; since that moment, all W3C documents use the term URI. So when you apply the term URL to the corresponding links you are not making a semantic error, but you are making a naming error.
It is also notable that until RFC 2396 was published, in 1998, URI stood for Universal Resource Identifier, as can be seen in RFC 1630.
1.1. Syntax
A URI is composed of a limited set of characters consisting of digits, letters, and a few graphic symbols, all of which fit within the US-ASCII (ASCII) encoding. A reserved subset of characters may be used to delimit syntax components within a URI, while the remaining characters, that is, the unreserved set plus those reserved characters that do not act as delimiters in the given URI component, define the identifying data of each component.
Reserved characters
Unreserved characters
For this case, according to ABNF:
ALPHA: any uppercase or lowercase ASCII letter (regex [A-Za-z])
DIGIT: any digit (regex [0-9])
HEXDIG: a hexadecimal digit (regex [0-9A-F])
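The character lists themselves are missing from this page. As a sketch, the ABNF classes above can be turned into regular-expression checks; the unreserved set used here (ALPHA, DIGIT, "-", ".", "_", "~") comes from RFC 3986 section 2.3, and the function names are hypothetical.

```php
<?php
// Hypothetical helpers built from the ABNF classes listed above.
function is_unreserved(string $char): bool
{
    // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    return preg_match('/^[A-Za-z0-9\-._~]$/D', $char) === 1;
}

function is_hexdig(string $char): bool
{
    // HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
    return preg_match('/^[0-9A-F]$/D', $char) === 1;
}

var_dump(is_unreserved('~'), is_unreserved('/'), is_hexdig('F'), is_hexdig('G'));
```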
Percent-encoding
Thus %20, for example, stands for a space.
1.2. URI components
where optional components are shown in square brackets
When you follow a link from the table of contents, the browser navigates to a secondary resource relative to this page, i.e. it scrolls down until the requested part appears on screen.
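To see the components in practice, PHP's parse_url() splits a URI into named parts. The URL below is made up for illustration; the fragment is the part a browser uses to scroll to a place within the page.

```php
<?php
// Hypothetical URL used only to show the component names.
$parts = parse_url('http://example.com:8080/docs/page.html?lang=en#section-2');

print_r($parts);
```

Components that are absent from the input are simply missing from the returned array.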
With that, the introduction to URIs can be wrapped up, and we can go deeper into the individual subtypes of URI, namely
2. URL
URLs are used to locate resources by providing an abstract identification of the resource location. Having located a resource, a system may perform a variety of operations on it, which can be characterized by words such as 'access', 'update', 'replace', and 'find attributes'. In general, only the access method needs to be specified for any URL scheme.
2.1. Structure
In general, a URL has a similar structure across all schemes, although for any individual scheme the structure may differ from the general template. Graphically it can be expressed as follows:
3. URN
Uniform Resource Names (URNs) are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN space. The URN syntax therefore provides a means to encode character data in a form that can be sent over existing protocols, typed on most keyboards, and so on.
3.1. Structure
Self-identifying URNs
Such URNs contain the name of a hash function in the NID and, in the NSS, the hash value computed for the identified object. They are used in magnet links and in the headers of the Gnutella2 p2p network. For example, a URN from a magnet link on a torrent tracker: magnet:?xt=urn:btih:c68abc1ba9b8c7c4bc373862cad1a8c01d69e53d.
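As a small sketch, the NID and NSS can be pulled out of the magnet link above with plain string functions. It assumes the link carries only the single xt parameter, as in the example.

```php
<?php
$magnet = 'magnet:?xt=urn:btih:c68abc1ba9b8c7c4bc373862cad1a8c01d69e53d';

// Take everything after "xt=" as the URN (assumes a single parameter).
$urn = substr($magnet, strpos($magnet, 'xt=') + 3);

// A URN has the form urn:<NID>:<NSS>; limit the split to 3 pieces
// because the NSS itself may contain ":".
list($scheme, $nid, $nss) = explode(':', $urn, 3);

echo $scheme, ' | ', $nid, ' | ', $nss, "\n";
```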
That is all for the theory. In the second part we will look at what can and should be done with URIs when processing them, namely normalization, parsing, and so on.
With that I take my leave. Thanks for reading, I hope it was not boring. Good luck!
rawurlencode: URL-encode a string according to RFC 3986
Description
Encodes the given string according to RFC 3986.
Parameters
The URL to be encoded.
Return values
Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent sign (%) followed by two hex digits. This is the encoding described in RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
Prior to PHP 5.3.0, rawurlencode encoded the tilde character (~).
Changelog
Version   Description
5.3.4     Tilde characters are no longer encoded when rawurlencode() is used with EBCDIC strings.
5.3.0     Now conforms to RFC 3986.
Examples
Example #1 Using rawurlencode() to include a password in an FTP URL
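The example code is missing from this page; a minimal sketch of the idea follows. The user name, password, and host are hypothetical.

```php
<?php
// Hypothetical credentials and host, for illustration only.
$password = 'foo @+%/';

// Encode the password on its own so that the ":" and "@" separators
// around it keep their meaning in the URL.
$link = '<a href="ftp://user:' . rawurlencode($password) . '@ftp.example.com/x.txt">';

echo $link, "\n";
```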
The above example will output:
Or, if you pass information as part of the URL:
Example #2 rawurlencode() example
The above example will output:
See also
Comments
PHP’s functions rawurlencode() and urlencode(), both encode the whole argument parameter string, making the result useless as a valid link.
The function listed here encodes a link string (e.g. http://www.domain.com/long_path/to\file.php?query=param#fragm) to a valid parameter string, preserving the original URI structure and the path given.
rawurlencode() MUST not be used on unparsed URLs.
rawurlencode() should not be used on host and domain name parts (that may include international characters encoded in each domain part with a «bq--» prefix followed by a special encoding of the international domain, currently in testbed).
rawurlencode() may be used on usernames and passwords separately (so that it won’t encode the ‘:’ and ‘@’ separators).
rawurlencode() must not be used on paths (that may contain ‘/’ separators): the [‘path’] element of a parsed URL must first be exploded into individual «directory» names. A directory or filename that contains a space must not be encoded with urlencode() but with this rawurlencode(), so that it will appear as a ‘%20’ hex sequence (not ‘+’)
rawurlencode() must not be used to encode the [‘query’] element of a parsed URL. Instead you must use the urlencode() function:
Typical queries often use the ‘&’ separator between each parameter. This ‘&’ separator however is just a convention, used in the www-url-encoded format for HTML forms using the default GET method. However, when an HTML page references a URL that contains static query parameters, these ‘&’ separators should be encoded in the HTML code as ‘&amp;’ for HTML conformance. This is not part of the URL specification, but of the HTML encapsulation! Some browsers forget this, and send ‘&amp;’ with their HTTP GET query. You may wish to substitute ‘&amp;’ by ‘&’ when parsing and validating URLs. This should be done BEFORE calling urlencode() on query parts.
The [‘fragment’] part of a parsed URL (after the first ‘#’ separator found in any URL) must not be encoded with this rawurlencode() function but instead by urlencode().
Validating a URL sent in a HTTP request is then more complicated than you may think. This must be done only on parsed URLs (where the basic elements of a URL have been split), and then you must explode the path components, and check for the presence of ‘&amp;’ sequences in the query or fragment parts.
The next thing to do is to check the URL scheme that you want to support (for example, only ‘http’, ‘https’, or ‘ftp’).
You may wish to check the [‘port’] part to see if it’s really a decimal integer between 1 and 65535. You may wish to remove the default port number used by the URL schemes you want to support (for example the port ’80’ for ‘http’, the port ’21’ for ‘ftp’, the port ‘443’ for ‘https’), and severely restrict all port numbers below 1024, or some critical ports below 140 (this includes DNS and NetBios ports).
This done, you must use the rawurlencode() function on the user name, password, and exploded path elements, and urlencode() on the query and fragment parts, as described above, to recreate a complete and validated URL.
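The rebuilding steps described in this comment can be sketched roughly as follows. This is not the commenter's code: the function name encode_url_parts is hypothetical, the host is left untouched as advised, and the query is rebuilt with http_build_query(), which applies urlencode()-style encoding by default.

```php
<?php
// Hypothetical helper: parse a URL, encode each part separately, reassemble.
function encode_url_parts(string $url): string
{
    $p = parse_url($url);
    $out = '';
    if (isset($p['scheme'])) {
        $out .= $p['scheme'] . '://';
    }
    if (isset($p['user'])) {
        // User name and password are encoded separately, then joined.
        $out .= rawurlencode($p['user']);
        if (isset($p['pass'])) {
            $out .= ':' . rawurlencode($p['pass']);
        }
        $out .= '@';
    }
    if (isset($p['host'])) {
        $out .= $p['host']; // host names are never percent-encoded here
    }
    if (isset($p['port'])) {
        $out .= ':' . $p['port'];
    }
    if (isset($p['path'])) {
        // Explode the path so the "/" separators are preserved.
        $out .= implode('/', array_map('rawurlencode', explode('/', $p['path'])));
    }
    if (isset($p['query'])) {
        parse_str($p['query'], $params);
        $out .= '?' . http_build_query($params);
    }
    if (isset($p['fragment'])) {
        $out .= '#' . urlencode($p['fragment']);
    }
    return $out;
}

echo encode_url_parts('http://example.com/my docs/a b.txt?q=x y#top'), "\n";
```

Note that parse_str() rewrites dots and spaces in parameter names to underscores, so this sketch is only suitable for simple parameter names.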
— 1) About «reserved» characters in URLS:
Beware that RFC 1738 specifies that the characters «{», «}», «|», «\», «^», «~», «[», «]», and «`» are all considered unsafe and SHOULD be URL-encoded with a «%xx» triplet within *ALL* URLs.
However, some HTTP URLs seem to use the «~» character as a prefix for a user account, for example: http://www.any.host.domain/~
This usage is acceptable, but the RFC specifies that «%7E» should be used instead of «~» in the path component. HTTP servers should accept «~» as being equivalent to «%7E», and according to the RFC, the «%7E» form should be the canonical one.
However, some HTTP servers do not fully comply with this RFC and treat «%7E» differently from «~» (i.e. they consider it as being part of a path component name, and search for a directory name containing a «~» character, instead of mapping the «~user» path component to a user’s directory). In that case, these non-compliant HTTP servers will not find the resource associated with that URL and may return a 404 error or other errors such as access denied.
When using rawurlencode() on such HTTP URLs, it’s best to account for this legacy usage by using str_replace() on the result to convert «/%7E» back to «/~», so that the URLs will correctly map to the legacy use of the «~» character by these servers. Compliant HTTP servers will treat the unsafe «~» character as equivalent to the recommended «%7E» form, so they will automatically canonicalize the «~».
— 2) Encoding of hostnames in URLs
Finally, beware that host domain names parts in URLs *MUST NOT* be encoded with rawurlencode(), as the «[» and «]» are valid delimiters that *MUST* be used to reference an IPv6 address or other hostnames that don’t fit to the restricted set of characters allowed in a host name (the «[» and «]» characters MUST be used if the hostname includes characters such as «:» which is typically used to specify an alternate non-default port number).
The encoding of host names uses another encoding, required to encode international domain names, with a base-64 encoding of Unicode characters and a «bq--» prefix. This encoding must be used only on individual subdomain parts (separated by «.» characters). This encoding does not use any «%xx» triplets.
So NEVER use urlencode() or rawurlencode() on an unparsed URL, unless this full URL is part of a query parameter string!
— 3) Encoding of username/passwords in URLs:
There is no standard to specify a password in a URL. In fact, there’s a legacy usage of the «:» character to separate a username from a password, but it is strongly discouraged. The RFC does not attempt to specify a semantic to the authentication part of an URL (before the «@» character and the hostname part).
If you need to encode a password, always use rawurlencode() on username and passwords separately, and then insert the «:» character to separate both components. Don’t use urlencode() (which could use a «+» to encode a space, and would not work because usernames and passwords consider «+» and spaces as being different!)
About the «;» reserved character in URLs:
rawurlencode() will encode it with a «%3B» triplet. When used on the path part of a URL, this will break the usage defined in URL RFCs, that allows specifying additional parameters to *EACH* element of a path (separated by «/»).
So if a path element contains a «;» character (some filesystems allow it, but this is not recommended) as part of a directory name, this character must be encoded so that it won’t be mixed with a parameter extension.
The generic format of a path element may include path elements such as: «/.» or «/..» or «/.specialname» or «/regularname». Each part may be followed by a «;» and other parameters separated by «;». These parameters can be either ordered or unordered. Unordered parameters have a symbolic name separated from their value with an equal sign.
Do not mix path element parameters with a query string: these parameters are directly attached to the individual path element, and this makes a difference when this path element is not the last one of the URL. These parameters are part of the resource name (unlike the query string), and the semantic of «.» and «..» apply to the full path element with its parameters, so that: «/subdir1/subdir2/page.html;charset=UTF-8/../index.html» will resolve to «/subdir1/index.html».
Note that: «/subdir1/subdir2/page.html;charset=UTF-8» designates a DISTINCT resource name from: «/subdir1/subdir2/page.html» It does not necessarily involves a query, and so it can be cached by default (unlike URLs that contain a query string).
When using path element parameters, their optional name and required value must be rawurlencode()’d separately before inserting «;» and «=» parameters and creating the path elements that will be imploded in the full path.
The consequence is that you MUST not urlencode() or rawurlencode() individual path elements, without first parsing them:
- first explode the path into its path elements separated by «/»
- then explode each path element into its name and parameters separated by «;» characters
- then split path element parameters that contain a «=» sign into a name/value pair.
- make sure that unordered path parameters (that have been cut according to «=» into a pair) are specified *after* ordered parameters (including the main path element name) in each path element, and that no two unordered parameters have the same name (this restriction does not apply to ordered, unnamed parameters which only supply a value).
- finally you can interpret rawurlencoded names and values that constitute each path element.
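The parsing steps listed above can be sketched as follows. The function name parse_path_elements and the shape of the returned array are assumptions made for this illustration; the checks on parameter order and duplicate names are left out.

```php
<?php
// Hypothetical helper: split a path into elements, then split each element
// into its name, ordered (unnamed) parameters and unordered (named) ones.
function parse_path_elements(string $path): array
{
    $elements = [];
    foreach (explode('/', trim($path, '/')) as $raw) {
        $parts = explode(';', $raw);
        // The first piece is the element name; decode only after splitting.
        $name = rawurldecode(array_shift($parts));
        $ordered = [];
        $named = [];
        foreach ($parts as $param) {
            if (strpos($param, '=') !== false) {
                [$key, $value] = explode('=', $param, 2);
                $named[rawurldecode($key)] = rawurldecode($value);
            } else {
                $ordered[] = rawurldecode($param);
            }
        }
        $elements[] = ['name' => $name, 'ordered' => $ordered, 'named' => $named];
    }
    return $elements;
}

print_r(parse_path_elements('/subdir1/subdir2/page.html;charset=UTF-8'));
```

Decoding happens only after the «/», «;» and «=» delimiters have been split off, so an encoded «%3B» inside a name is not mistaken for a parameter separator.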
Note also that some non-compliant HTTP servers consider that named parameters are ordered, and don’t add a semantic to the «;» and «=» used to break up the list of path element parameters. On client agents, when validating URLs, it’s best then not to try to interpret this list, and you should just split the main part of a path element and the parameters list by isolating the first «;» that introduces this list. However, the encoded parameter list cannot include any «/» parameter.
Caveats: note that path element parameters (introduced by «;») may be used on upper levels of a hierarchic URL, even before the final document name and its query parameters. When building lists of URLs, you should not separate URLs blindly with a «;» separator, as each URL may include a «;» character in its path part (the «;» character cannot occur safely in a query string). In that case, use a surrounding pair such as «<>» or quotes to enclose each URL in such a list.
Note that RFC 1738 has been amended: The «[» and «]» are no longer considered unsafe, but instead are now considered «reserved», meaning that they CAN be used in URLs!
Currently this usage has only been allowed in the hostname part, but there are some proposals to allow such use in some URL schemes. Similar extensions are now found that use the «<>» characters as «reserved» characters with special semantics, instead of «unsafe» characters that must be URL encoded.
Note also that some characters are currently «reserved» but should have instead been considered as «unsafe»: this includes the parenthesis «()» which are clearly unsafe when a URL is used in MIME headers.
Because of this, if a valid URL contains «()» characters, one should use an upper-level encoding to either enclose the URL with a pair of «unsafe» characters defined in the upper-level protocol (for example a «<>» pair in MIME headers, because these characters cannot be part of a valid URL).
note that if you implement your own server request engine in the HTTP manner like:
The microsoft URLEncode method ignores the documentation in RFC1738 which states that:
«... the special characters «$-_.+!*'(),», and reserved characters used for their reserved purposes may be used unencoded within a URL»
So for example, myaddress@mydomain.com becomes myaddress%40mydomain%2Ecom, whereas php and other languages would encode this as myaddress%40mydomain.com
This can be an issue when porting from asp or if you are doing string comparison of strings urlencoded on different platforms.
NB. php will correctly decode myaddress%40mydomain%2Ecom to myaddress@mydomain.com, it is only the encoding that differs
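A short sketch of the comparison problem, using the address from the comment. Comparing the encoded strings fails, while comparing the decoded values works.

```php
<?php
$address   = 'myaddress@mydomain.com';
$php_style = urlencode($address);          // PHP leaves "." unencoded
$asp_style = 'myaddress%40mydomain%2Ecom'; // form produced by the ASP method

$same_encoded = ($php_style === $asp_style);
$same_decoded = (urldecode($php_style) === urldecode($asp_style));

var_dump($php_style, $same_encoded, $same_decoded);
```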
This seems the correct way to encode ftp url which you could provide for your users:
Browsers mangle certain language characters
In addition to my last post I would like to add that, this function is for the «directories/somefile.ext» paths
In order to construct valid ftp url (with password added in it ) do this
Last function will encode path url so that language characters remain untouched and you get same file name for download after download dialog appears.
As peter@nospam said, Microsoft uses a different table to encode strings when sending data.
Here it is, for those who need to know what this table is.
The array index is the ord() of a character; use chr(index) to get the character, and replace it with the value.
Easier version to ‘rickyale at ig dot com dot br’ his example
NOTE: 142 and up in his array are language-specific ASCII characters, so the conversion to their Unicode (‘%C5%BD’) equivalent may or may not work for you. This needs a far more serious and bigger system to handle non-US tables.
On the comments of rickyale and djmaze: isn’t what you are trying to achieve a combination of UTF-8 and URL encoding, e.g.:
At least works for me, Jeroen Hofstee
You can encode paths using:
Note in regards to ‘rickyale at ig dot com dot br’ program:
Wouldn’t the whole issue be fixed by using charset=utf-8 in the HTML page?
Of course, I could have tested only a limited number of cases.
I had serious trouble with local Windows paths containing umlauts on my Apache 2 / Windows NT machine. Apache could not find any of those files if I just used rawurlencode. It’s not noted anywhere here, but you fix this by simply making your path utf8 first:
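The fix itself is not shown on this page. One way to sketch it, assuming the path is stored as ISO-8859-1 and the mbstring extension is available, is to convert to UTF-8 before encoding. The file name is made up.

```php
<?php
// Hypothetical file name stored as ISO-8859-1 bytes ("über.txt").
$latin1_name = "\xFCber.txt";

// Convert to UTF-8 first (assumes mbstring), then percent-encode.
$utf8_name = mb_convert_encoding($latin1_name, 'UTF-8', 'ISO-8859-1');
$encoded   = rawurlencode($utf8_name);

echo $encoded, "\n";
```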
What happens if you need to convert this %C3%B1 into this ‘ñ’ using rawurldecode()? Well, it doesn’t work as we’d wish. We’ll get this «Ã±». To fix this issue, I’ve made the following function:
Hope this is helpful; if you have an issue like this, try this function.
I have to mention something about javier’s post: the issue you are experiencing only happens because you are using ISO-8859-1 (a.k.a. ISO-LATIN-1) encoding, which is an extension of ASCII using the values 128-255 for latin specific characters (these characters are NOT part of ASCII). To say something like 0xF1 is the correct value for «ñ» in ASCII is wrong: any value equal or higher than 0x80 is invalid in ASCII; and there is no «correct» value for «ñ» in ASCII because the ASCII character set does not include that character. These encode/decode functions are designed to work on UTF-8, which is an ASCII-compatible encoding for Unicode, thus being able to represent the entire Unicode character range.
The main point is: the «Ã±» you get is the 0xC3 0xB1 sequence, interpreted as two single-byte ISO-8859-1 characters; but if you interpret them as UTF-8, they indeed represent «ñ». If you are working with the latin character set and encoding, then you are fine with your method (which is essentially a utf-8 => iso-latin-1 converter).
For anybody who is using UTF-8 encoding, check if there is any issue before you use a method like javier’s: these multi-byte values are actually the right way to represent any non-ASCII character in UTF-8.
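The point can be illustrated with a short sketch, assuming the mbstring extension is available: rawurldecode() returns the two UTF-8 bytes unchanged, and a conversion to ISO-8859-1 is needed only when the page really uses that encoding.

```php
<?php
$bytes = rawurldecode('%C3%B1'); // two bytes: 0xC3 0xB1

// The decoded bytes are valid UTF-8 for "ñ".
$is_utf8 = mb_check_encoding($bytes, 'UTF-8');

// Only for ISO-8859-1 output: convert to the single byte 0xF1.
$latin1 = mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');

var_dump(bin2hex($bytes), $is_utf8, bin2hex($latin1));
```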
For deeper details on the UTF-8 and ISO-8859-1 encodings, take a look at wikipedia: http://en.wikipedia.org/wiki/UTF-8 http://en.wikipedia.org/wiki/ISO-8859-1
I’ve written a simple function to convert an UTF-8 string to URL encoded string. All the given characters are converted!
If you, like me, sometimes have the misfortune of being forced to work with PHP4, here is a PHP implementation of http_build_query() that produces more or less the same output as this function, accepting the same arguments.
The only difference here is that the RFC selector argument does not behave precisely correctly. This implementation passes RFC1738 through urlencode() and RFC3986 through rawurlencode(), which is not 100% correct; see the manual pages of those functions for more information.
if (! function_exists ( ‘http_build_query’ )) <
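For reference, this is how the built-in http_build_query() behaves with the two RFC selectors on current PHP versions. The parameter values are made up for illustration.

```php
<?php
// Hypothetical parameters, including a nested array.
$params = ['name' => 'John Doe', 'tags' => ['a b', 'c']];

// Spaces become "+" (urlencode style).
$rfc1738 = http_build_query($params, '', '&', PHP_QUERY_RFC1738);

// Spaces become "%20" (rawurlencode style).
$rfc3986 = http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $rfc1738, "\n", $rfc3986, "\n";
```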
For those looking to strip all non-reserved characters from a URL according to RFC 3986, the code would look like:
So a basic «slug» generation routine might look like:
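The commenter's code is not reproduced here. As a sketch under the assumption that the goal is to keep only the RFC 3986 unreserved characters (letters, digits, "-", ".", "_", "~"), it might look like this; both function names are hypothetical.

```php
<?php
// Hypothetical helper: drop everything that is not an unreserved character.
function strip_to_unreserved(string $text): string
{
    return preg_replace('/[^A-Za-z0-9\-._~]/', '', $text);
}

// Hypothetical slug routine built on top of it.
function make_slug(string $title): string
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/\s+/', '-', $slug); // runs of whitespace become "-"
    return strip_to_unreserved($slug);
}

echo make_slug(' Hello, World! 4>2 '), "\n";
```

The result needs no percent-encoding at all, since every remaining character is unreserved.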
URL/URI encoding is a very complicated matter. For example:
(1) 'http://example.org:port/path1/path2/data?key1=value1&argument#fragment'
(2) 'scheme://user:password@example.com:port/path1/path2/data?key1=value1&key2=value2#fragment'
For instance, (2) should be encoded as:
'scheme://'.rawurlencode('user').':'.rawurlencode('password').'@example.com:port/'
.rawurlencode('path1').'/'.rawurlencode('path2').'/'.rawurlencode('data')
.'?'.htmlentities(urlencode('key1').'='.urlencode('value1').'&'.urlencode('key2').'='.urlencode('value2'))
.'#'.urlencode('fragment')
etc.
For easy encoding, I’ve written ‘toURI’ function, see https://gist.github.com/msegu/bf7160257037ec3e301e7e9c8b05b00a
URIs as: [scheme:][//authority][path][?query][#fragment] means [scheme:][//[user[:password]@]host[:port]][/path][?query][#fragment] or: scheme:[user@host][?query] (mailto: etc.)
toURI() short review:
toURI() using examples:
//Simple use, without special characters in query arguments/values